- 12 Dec, 2014 1 commit
-
-
Anton Mitrofanov authored
This did not affect output, because the incorrect values were either unused in the affected code path or produced the same results as the correct values. Also deduplicate the hpel_ref arrays.
-
- 01 Dec, 2014 1 commit
-
-
Anton Mitrofanov authored
-
- 29 Nov, 2014 1 commit
-
-
Henrik Gramner authored
It would previously report FAILED if any of the earlier plane_copy tests failed.
-
- 17 Oct, 2014 3 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Henrik Gramner authored
40->27 cycles on Haswell.
-
- 09 Oct, 2014 1 commit
-
-
Henrik Gramner authored
Improves the accuracy of benchmarks, especially in short functions. To quote the Intel 64 and IA-32 Architectures Software Developer's Manual: "The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC." RDTSCP would accomplish the same task, but it's only available since Nehalem. This change makes SSE2 a requirement to run checkasm.
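A minimal sketch of the LFENCE;RDTSC sequence described above, not the actual checkasm code. The names `read_time_serialized`, `measure`, and `example_workload` are hypothetical. It assumes GCC or Clang, whose `<x86intrin.h>` provides `_mm_lfence()` and `__rdtsc()`; on targets without SSE2 it falls back to `clock_gettime`, which returns nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime in the fallback path */
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__SSE2__)
#include <x86intrin.h>

/* LFENCE waits for all earlier instructions to complete locally,
 * so RDTSC is not executed ahead of the code that precedes it. */
static inline uint64_t read_time_serialized(void)
{
    _mm_lfence();
    return __rdtsc();
}
#else
#include <time.h>

/* Fallback (assumption): a monotonic clock in nanoseconds, not a cycle count. */
static inline uint64_t read_time_serialized(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical short function to benchmark. */
static void example_workload(void)
{
    volatile int sink = 0;
    for (int i = 0; i < 16; i++)
        sink += i;
}

/* Time one call of fn; the result is in counter ticks of whichever path was compiled. */
static uint64_t measure(void (*fn)(void))
{
    uint64_t start = read_time_serialized();
    fn();
    uint64_t stop = read_time_serialized();
    return stop - start;
}
```

Calling `measure(example_workload)` returns the elapsed ticks for one call. The fence matters most for short functions like this one, where a few reordered instructions are a large fraction of the measured time.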
-
- 29 Sep, 2014 1 commit
-
-
Vittorio Giovara authored
-
- 16 Sep, 2014 5 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
- 03 Sep, 2014 2 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
- 26 Aug, 2014 25 commits
-
-
Anton Mitrofanov authored
-
Henrik Gramner authored
Previously there was a limit of two cpuflags.
-
Henrik Gramner authored
Reduce the number of vector registers used from 7 to 5. Eliminate some moves in the AVX implementation. Avoid bypass delays for transitioning between int and float domains.
-
Henrik Gramner authored
Also drop the MMX version instead of doing a bunch of ifdeffery to support it after this change.
-
Anton Mitrofanov authored
-
Henrik Gramner authored
-
Janne Grunau authored
The chroma/luma deblock functions are based on libav's H.264 AArch64 NEON deblocking filter, which I ported from the existing ARM NEON asm. No additional people need to be asked for relicensing.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
9-19% faster on a Cortex-A9.
-
Janne Grunau authored
mc_weight_w4_*neon is also used for width 2, which does not guarantee a 4-byte-aligned destination. Fixes crashes caused by random memory corruption.
-
Janne Grunau authored
The memory clobber acts as a compiler barrier, preventing aggressive reordering of read_time calls. gcc 4.8 reorders some of the initial read_time calls after the second when targeting ARM.
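The effect of a "memory" clobber can be sketched in portable GCC/Clang C as below. This is an illustration under stated assumptions, not the project's read_time: `fake_counter` is a hypothetical stand-in for a hardware cycle counter so the sketch runs on any target, and `read_time_sketch`, `fill_bytes`, and `time_fill` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical stand-in for a hardware cycle counter. */
static uint32_t fake_counter;

static inline uint32_t read_time_sketch(void)
{
    uint32_t a = ++fake_counter;
    /* Empty asm template: emits no instructions. The "memory" clobber tells
     * the compiler that memory may be read or written here, so it must not
     * move loads or stores across this statement. */
    __asm__ volatile("" : "+r"(a) : : "memory");
    return a;
}

/* Hypothetical function under test: writes n bytes to dst. */
static void fill_bytes(uint8_t *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)i;
}

/* Without the clobber, the compiler would be free to move the stores in
 * fill_bytes before the first read or after the second one. */
static uint32_t time_fill(uint8_t *dst, int n)
{
    uint32_t start = read_time_sketch();
    fill_bytes(dst, n);
    uint32_t stop = read_time_sketch();
    return stop - start;
}
```

The clobber constrains only the compiler's ordering of memory accesses. It does not stop the CPU itself from reordering instructions, which is a separate concern.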
-
Janne Grunau authored
The integrated assembler in llvm trunk (to be released as 3.5) is otherwise capable enough to assemble the arm asm correctly.
-
Janne Grunau authored
-
Janne Grunau authored
The gas manual states "Repeat the sequence of lines between the .rept directive and the next .endr directive ...". GNU as seems to support instructions on the same line as .rept anyway, but the integrated assembler in llvm trunk (to be released as 3.5 in August 2014) does not.
-
Janne Grunau authored
-
Janne Grunau authored
-
Tristan Matthews authored
-