- 16 Dec, 2014 4 commits
-
-
35 times faster than C.
-
zigzag_scan_4x4_field_neon, zigzag_sub_4x4_field_neon, zigzag_sub_4x4ac_field_neon, zigzag_sub_4x4_frame_neon, zigzag_sub_4x4ac_frame_neon: more than 2 times faster.
zigzag_scan_8x8_frame_neon, zigzag_scan_8x8_field_neon, zigzag_sub_8x8_field_neon, zigzag_sub_8x8_frame_neon: 4-5 times faster.
zigzag_interleave_8x8_cavlc_neon: 6 times faster.
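For readers unfamiliar with what the zigzag functions named above compute, here is a minimal scalar sketch in C. It assumes a 4x4 block stored in raster order and the standard H.264 frame (progressive) zigzag order. The `_ref` function names and the signatures are hypothetical and chosen for illustration; the project's actual C reference code, coefficient layout, and NEON implementations may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Standard H.264 frame zigzag order for a 4x4 block in raster order
 * (index = y * 4 + x). Assumption: raster storage; a real encoder may
 * store coefficients transposed and use a different table. */
static const uint8_t zigzag4x4_frame[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Hypothetical reference: reorder 16 coefficients into scan order. */
void zigzag_scan_4x4_frame_ref(int16_t level[16], const int16_t coef[16])
{
    for (int i = 0; i < 16; i++)
        level[i] = coef[zigzag4x4_frame[i]];
}

/* Hypothetical reference for a "sub" variant: take the difference
 * between source and prediction pixels and emit it in scan order.
 * Returns 1 if any difference is nonzero, else 0. */
int zigzag_sub_4x4_frame_ref(int16_t level[16],
                             const uint8_t *src, int src_stride,
                             const uint8_t *pred, int pred_stride)
{
    int nz = 0;
    for (int i = 0; i < 16; i++) {
        int pos = zigzag4x4_frame[i];
        int x = pos & 3, y = pos >> 2;
        level[i] = (int16_t)(src[y * src_stride + x] - pred[y * pred_stride + x]);
        nz |= level[i];
    }
    return nz != 0;
}
```

The table lookup makes the access pattern irregular in scalar code, which is one plausible reason a SIMD permute-based version can be several times faster.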
-
~20% faster than calling pixel_sa8d_16x16 and pixel_satd_16x16 separately.
-
25% faster than the previous version.
-
- 12 Dec, 2014 4 commits
-
-
Henrik Gramner authored
All CPUs with AVX2 support FMA3 (but not the other way around).
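The one-way implication stated above could be modelled in a CPU-flag bitmask roughly as follows. This is only a sketch: the flag names, bit values, and helper functions are hypothetical and are not taken from the project's headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bits, for illustration only. */
enum {
    FLAG_AVX  = 1u << 0,
    FLAG_FMA3 = 1u << 1,
    FLAG_AVX2 = 1u << 2
};

/* Apply the implication from the commit message: AVX2 implies FMA3.
 * The reverse does not hold, so FMA3 alone never sets AVX2. */
uint32_t normalize_cpu_flags(uint32_t flags)
{
    if (flags & FLAG_AVX2)
        flags |= FLAG_FMA3 | FLAG_AVX;
    return flags;
}

/* Return 1 if every bit in `required` is present in `flags`. */
int cpu_has(uint32_t flags, uint32_t required)
{
    return (flags & required) == required;
}
```

With flags normalized this way, a function written for AVX2 can use FMA3 instructions without a separate capability check.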
-
Anton Mitrofanov authored
-
Henrik Gramner authored
-
Anton Mitrofanov authored
Didn't affect output due to the incorrect values either not being used in the code path or producing equal results compared to the correct values. Also deduplicate hpel_ref arrays.
-
- 01 Dec, 2014 1 commit
-
-
Anton Mitrofanov authored
-
- 29 Nov, 2014 1 commit
-
-
Henrik Gramner authored
It would previously report FAILED if any of the earlier plane_copy tests failed.
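The class of bug described above can be illustrated with a small sketch in which the per-test status is reset before each test instead of accumulating across tests. All names here are hypothetical; this is not the checkasm code itself.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical test bookkeeping: `ok` is the status of the current
 * test only, `any_failed` accumulates over the whole run. */
typedef struct {
    int ok;
    int any_failed;
} test_state;

/* Reset the per-test status so earlier failures do not leak in. */
void begin_test(test_state *s)
{
    s->ok = 1;
}

/* Record the outcome of one check within the current test. */
void record_check(test_state *s, int passed)
{
    if (!passed) {
        s->ok = 0;
        s->any_failed = 1;
    }
}

/* Report and return the status of the current test only. */
int end_test(const test_state *s, const char *name)
{
    printf("%s: %s\n", name, s->ok ? "OK" : "FAILED");
    return s->ok;
}
```

Keeping the two fields separate lets the final exit status still reflect any failure while each reported line reflects only its own test.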
-
- 17 Oct, 2014 3 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Henrik Gramner authored
40->27 cycles on Haswell.
-
- 09 Oct, 2014 1 commit
-
-
Henrik Gramner authored
Improves the accuracy of benchmarks, especially in short functions.
To quote the Intel 64 and IA-32 Architectures Software Developer's Manual: "The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC."
RDTSCP would accomplish the same task, but it's only available since Nehalem.
This change makes SSE2 a requirement to run checkasm.
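The LFENCE;RDTSC sequence can be sketched with compiler intrinsics as shown below. This assumes an x86 or x86-64 target with SSE2 and a GCC- or Clang-style `<x86intrin.h>`; the function names are hypothetical, and the project's own timer code may well be written as inline assembly instead.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* _mm_lfence() and __rdtsc() on GCC/Clang */

/* Read the time-stamp counter only after all earlier instructions
 * have completed locally, following the LFENCE;RDTSC sequence
 * quoted from the Intel manual. LFENCE requires SSE2. */
uint64_t read_tsc_fenced(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Hypothetical helper: time one call of `fn` in TSC ticks. */
uint64_t time_call(void (*fn)(void))
{
    uint64_t start = read_tsc_fenced();
    fn();
    uint64_t end = read_tsc_fenced();
    return end - start;
}
```

Without the fence, the counter read could be reordered relative to the code being measured, which matters most when the measured function takes only a few tens of cycles.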
-
- 29 Sep, 2014 1 commit
-
-
Vittorio Giovara authored
-
- 16 Sep, 2014 5 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
- 03 Sep, 2014 2 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
- 26 Aug, 2014 18 commits
-
-
Anton Mitrofanov authored
-
Henrik Gramner authored
Previously there was a limit of two cpuflags.
-
Henrik Gramner authored
Reduce the number of vector registers used from 7 to 5. Eliminate some moves in the AVX implementation. Avoid bypass delays for transitioning between int and float domains.
-
Henrik Gramner authored
Also drop the MMX version instead of doing a bunch of ifdeffery to support it after this change.
-
Anton Mitrofanov authored
-
Henrik Gramner authored
-
Janne Grunau authored
Deblock chroma/luma are based on libav's h264 aarch64 NEON deblocking filter which was ported by me from the existing ARM NEON asm. No additional persons to ask for a relicense.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
Ported from the ARM NEON asm.
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
-
Janne Grunau authored
9-19% faster on a Cortex-A9.
-
Janne Grunau authored
mc_weight_w4_*neon is also used for width 2 which does not guarantee 4-byte aligned destination. Fixes crashes caused by random memory corruption.
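As a C analogue of the alignment issue described above (the actual fix is in NEON assembly and is not shown here), a 4-byte store to a destination that may not be 4-byte aligned can be written with `memcpy` rather than through a cast to a 32-bit pointer. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store 4 pixel bytes to `dst` without assuming any alignment.
 * Casting `dst` to uint32_t* would assume 4-byte alignment, which
 * a width-2 caller does not guarantee; memcpy makes no such
 * assumption and typically compiles to a single unaligned store. */
void store4_any_alignment(uint8_t *dst, const uint8_t px[4])
{
    memcpy(dst, px, 4);
}
```

A call such as `store4_any_alignment(row + 2, px)` is then valid regardless of the address of `row`, provided 4 bytes are writable at the destination.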
-