- 16 Dec, 2014 19 commits
-
-
x264_mbtree_propagate_cost_neon is ~7 times faster. x264_mbtree_propagate_list_neon is 33% faster.
-
3.5 times faster.
-
All functions ~33% faster.
-
deblock_luma_intra[0]_neon is 2 times faster, deblock_luma_intra[1]_neon is ~4 times faster.
-
deblock_h_chroma_422 is 2.5 times faster.
-
deblock_chroma_420_mbaff_neon is 2 times faster.
-
deblock_h_chroma_420_intra, deblock_h_chroma_422_intra and x264_deblock_h_chroma_intra_mbaff_neon are ~3 times faster. deblock_chroma_intra[1] is ~4 times faster than C.
-
-
integral_init4h_neon and integral_init8h_neon are 3-4 times faster than C. integral_init8v_neon is 6 times faster and integral_init4v_neon is 10 times faster.
-
Between 10% and 40% faster than C.
-
decimate_score15 and decimate_score16 are 60% faster; decimate_score64 is 4 times faster than C.
-
4 times faster than C.
-
7 times faster than C.
-
pixel_sad_4x16_neon: 33% faster than C. pixel_satd_4x16_neon: 5 times faster. pixel_ssd_4x16_neon: 4 times faster.
-
13 times faster than C.
-
35 times faster than C.
-
zigzag_scan_4x4_field_neon, zigzag_sub_4x4_field_neon, zigzag_sub_4x4ac_field_neon, zigzag_sub_4x4_frame_neon and zigzag_sub_4x4ac_frame_neon are more than 2 times faster. zigzag_scan_8x8_frame_neon, zigzag_scan_8x8_field_neon, zigzag_sub_8x8_field_neon and zigzag_sub_8x8_frame_neon are 4-5 times faster. zigzag_interleave_8x8_cavlc_neon is 6 times faster.
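For context, a plain-C sketch of the 4x4 frame zigzag reorder these routines accelerate. The scan table is the standard H.264 zigzag order, given purely for illustration; x264's internal tables and indexing convention may differ.

```c
#include <stdint.h>

/* Standard H.264 4x4 zigzag (frame) scan order as raster indices.
 * Illustrative only; not necessarily x264's exact table or convention. */
static const uint8_t zigzag4x4_frame[16] = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

/* Reorder a raster-order 4x4 coefficient block into zigzag order. */
static void zigzag_scan_4x4_frame_ref( int16_t dst[16], const int16_t src[16] )
{
    for( int i = 0; i < 16; i++ )
        dst[i] = src[zigzag4x4_frame[i]];
}
```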
-
~20% faster than calling pixel_sa8d_16x16 and pixel_satd_16x16 separately.
-
25% faster than the previous version.
-
- 12 Dec, 2014 4 commits
-
-
Henrik Gramner authored
All CPUs with AVX2 support FMA3 (but not the other way around).
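A rough sketch of what that implication allows; the flag names below are placeholders for illustration, not x264's actual cpu-flag constants.

```c
#include <stdint.h>

/* Placeholder flag bits, not x264's real X264_CPU_* values. */
#define CPU_FMA3 (1u << 0)
#define CPU_AVX2 (1u << 1)

/* Every AVX2-capable CPU also implements FMA3, so detecting AVX2 lets the
 * FMA3 flag be set implicitly and AVX2 code paths can freely use FMA
 * instructions.  The reverse does not hold. */
static uint32_t imply_fma3( uint32_t cpu )
{
    if( cpu & CPU_AVX2 )
        cpu |= CPU_FMA3;
    return cpu;
}
```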
-
Anton Mitrofanov authored
-
Henrik Gramner authored
-
Anton Mitrofanov authored
This didn't affect the output because the incorrect values were either not used in the code path or produced the same results as the correct values. Also deduplicate the hpel_ref arrays.
-
- 01 Dec, 2014 1 commit
-
-
Anton Mitrofanov authored
-
- 29 Nov, 2014 1 commit
-
-
Henrik Gramner authored
It would previously report FAILED if any of the earlier plane_copy tests failed.
-
- 17 Oct, 2014 3 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Henrik Gramner authored
40->27 cycles on Haswell.
-
- 09 Oct, 2014 1 commit
-
-
Henrik Gramner authored
Improves the accuracy of benchmarks, especially in short functions.

To quote the Intel 64 and IA-32 Architectures Software Developer's Manual: "The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC."

RDTSCP would accomplish the same task, but it's only available since Nehalem. This change makes SSE2 a requirement to run checkasm.
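A minimal sketch of the LFENCE;RDTSC pattern described above, written with compiler intrinsics rather than checkasm's actual timing macros; the helper name is made up for illustration.

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang: _mm_lfence(), __rdtsc() */

/* Hypothetical timing helper: LFENCE ensures all previous instructions have
 * completed locally before RDTSC reads the time-stamp counter. */
static inline uint64_t read_cycle_counter( void )
{
    _mm_lfence();
    return __rdtsc();
}

/* Usage sketch:
 *   uint64_t t0 = read_cycle_counter();
 *   function_under_test();
 *   uint64_t t1 = read_cycle_counter();
 *   // t1 - t0 approximates the elapsed cycles
 */
```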
-
- 29 Sep, 2014 1 commit
-
-
Vittorio Giovara authored
-
- 16 Sep, 2014 5 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
- 03 Sep, 2014 2 commits
-
-
Anton Mitrofanov authored
-
Anton Mitrofanov authored
-
- 26 Aug, 2014 3 commits
-
-
Anton Mitrofanov authored
-
Henrik Gramner authored
Previously there was a limit of two cpuflags.
-
Henrik Gramner authored
Reduce the number of vector registers used from 7 to 5. Eliminate some moves in the AVX implementation. Avoid bypass delays when transitioning between the int and float domains.
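The bypass-delay point can be illustrated with a generic intrinsics sketch (not the actual x264 code): shuffling float data with an integer-domain instruction such as pshufd can add bypass latency on many x86 cores when the result feeds a floating-point instruction, whereas shufps keeps the value in the floating-point domain.

```c
#include <immintrin.h>

/* Reverse the four floats in a vector.  Both versions are correct; they
 * differ only in which execution domain the shuffle runs in. */

/* Integer-domain shuffle on float data: may incur a bypass delay when the
 * result is consumed by a floating-point instruction. */
static __m128 reverse_int_domain( __m128 x )
{
    return _mm_castsi128_ps( _mm_shuffle_epi32( _mm_castps_si128( x ),
                                                _MM_SHUFFLE( 0, 1, 2, 3 ) ) );
}

/* Float-domain shuffle: stays in the floating-point domain, avoiding the
 * bypass penalty. */
static __m128 reverse_float_domain( __m128 x )
{
    return _mm_shuffle_ps( x, x, _MM_SHUFFLE( 0, 1, 2, 3 ) );
}
```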
-