Shuffle both chroma components together as a 16-bit unit, and don't
write the unchanged columns (as in x264_deblock_h_luma_neon and in the
aarch64 version of the function).

This causes a minor slowdown for x264_deblock_v_chroma_neon, but it is
negligible compared to the speedup.

checkasm timing               Cortex-A7     A8     A9
deblock_chroma[1]_c                4817   4057   3601
deblock_chroma[1]_neon             1249    716    817  (before)
deblock_chroma[1]_neon             1249    766    845  (after)
deblock_h_chroma_420_c             3699   3275   2830
deblock_h_chroma_420_neon          2068   1414   1400  (before)
deblock_h_chroma_420_neon          1838   1355   1291  (after)
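For illustration only, the idea described above can be sketched in scalar C. This is not the NEON code from the commit. It assumes interleaved U/V bytes (which "shuffle both components together" implies), four U/V pairs per row across the vertical edge (p1 p0 | q0 q1), and that the filter changes only p0 and q0. The names uv_columns, load_transposed and store_middle are hypothetical and do not exist in x264.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 8 rows, 4 interleaved U/V pairs per row (p1 p0 | q0 q1). */
enum { ROWS = 8, PAIRS = 4 };

/* Hypothetical container: col[c][r] is the U/V pair at column c of row r. */
typedef struct { uint16_t col[PAIRS][ROWS]; } uv_columns;

/* Transpose rows into columns, moving each U/V pair as one 16-bit unit
 * so that U and V never have to be separated and re-interleaved. */
static void load_transposed(const uint8_t *pix, ptrdiff_t stride, uv_columns *t)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < PAIRS; c++)
            memcpy(&t->col[c][r], pix + r * stride + 2 * c, sizeof(uint16_t));
}

/* Write back only the p0 and q0 columns (indices 1 and 2).
 * p1 and q1 are assumed unchanged by the filter, so they are not stored. */
static void store_middle(uint8_t *pix, ptrdiff_t stride, const uv_columns *t)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 1; c <= 2; c++)
            memcpy(pix + r * stride + 2 * c, &t->col[c][r], sizeof(uint16_t));
}
```

In this sketch pix points at the p1 pair of the first row. Under these assumptions each row stores 4 bytes instead of 8, which is the kind of saving the message attributes to not writing the unchanged columns.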
89439b2c