    arm: Optimize x264_deblock_h_chroma_neon · 89439b2c
    Martin Storsjö authored and Henrik Gramner committed
    Shuffle both chroma components together as a 16 bit unit, and
    don't write the unchanged columns (like in x264_deblock_h_luma_neon
    and in the aarch64 version of the function).
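
A scalar C sketch of the idea above, not the NEON code from this commit. It assumes interleaved chroma (U and V bytes alternating within a row), so each U/V pair can be moved as one 16-bit unit, and it relies on the H.264 chroma filter modifying only p0 and q0. The names `uv_columns`, `gather_uv_columns`, `scatter_inner_columns` and `MAX_ROWS` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ROWS 16 /* assumption: upper bound on rows handled per edge */

/* Hypothetical container: one 16-bit U/V pair per row for each column. */
typedef struct {
    uint16_t p1[MAX_ROWS], p0[MAX_ROWS], q0[MAX_ROWS], q1[MAX_ROWS];
} uv_columns;

/* Gather the four U/V pairs around a vertical edge from every row.
 * pix points at the first byte right of the edge (the q0 pair).
 * Each pair is copied as a single 16-bit unit, so U and V stay together. */
static void gather_uv_columns(const uint8_t *pix, ptrdiff_t stride,
                              int rows, uv_columns *c)
{
    assert(rows >= 0 && rows <= MAX_ROWS);
    for (int y = 0; y < rows; y++) {
        const uint8_t *row = pix + y * stride;
        memcpy(&c->p1[y], row - 4, 2);
        memcpy(&c->p0[y], row - 2, 2);
        memcpy(&c->q0[y], row,     2);
        memcpy(&c->q1[y], row + 2, 2);
    }
}

/* Write back only the p0 and q0 pairs. The p1 and q1 columns are inputs
 * to the filter but are never changed, so storing them is wasted work. */
static void scatter_inner_columns(uint8_t *pix, ptrdiff_t stride,
                                  int rows, const uv_columns *c)
{
    assert(rows >= 0 && rows <= MAX_ROWS);
    for (int y = 0; y < rows; y++) {
        uint8_t *row = pix + y * stride;
        memcpy(row - 2, &c->p0[y], 2);
        memcpy(row,     &c->q0[y], 2);
    }
}
```

In this sketch the filter itself would run between the two calls, on the gathered columns. Each row then receives 4 bytes on write-back instead of 8.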
    
    This causes a minor slowdown for x264_deblock_v_chroma_neon, but
    it is negligible compared to the speedup for
    x264_deblock_h_chroma_neon.
    
    checkasm timing             Cortex-A7    A8    A9
    deblock_chroma[1]_c              4817  4057  3601
    deblock_chroma[1]_neon           1249   716   817  (before)
    deblock_chroma[1]_neon           1249   766   845  (after)

    deblock_h_chroma_420_c           3699  3275  2830
    deblock_h_chroma_420_neon        2068  1414  1400  (before)
    deblock_h_chroma_420_neon        1838  1355  1291  (after)