    arm: cdef: Do an 8 bit implementation for cases with all edges present · b33f46e8
    Martin Storsjö authored
    This increases the code size by around 3 KB on arm64.
    
    Before:
    ARM32:                    Cortex A7      A8      A9     A53     A72     A73
    cdef_filter_4x4_8bpc_neon:    807.1   517.0   617.7   506.6   429.9   357.8
    cdef_filter_4x8_8bpc_neon:   1407.9   899.3  1054.6   862.3   726.5   628.1
    cdef_filter_8x8_8bpc_neon:   2394.9  1456.8  1676.8  1461.2  1084.4  1101.2
    ARM64:
    cdef_filter_4x4_8bpc_neon:                            460.7   301.8   308.0
    cdef_filter_4x8_8bpc_neon:                            831.6   547.0   555.2
    cdef_filter_8x8_8bpc_neon:                           1454.6   935.6   960.4
    
    After:
    ARM32:                    Cortex A7      A8      A9     A53     A72     A73
    cdef_filter_4x4_8bpc_neon:    669.3   541.3   524.4   424.9   322.7   298.1
    cdef_filter_4x8_8bpc_neon:   1159.1   922.9   881.1   709.2   538.3   514.1
    cdef_filter_8x8_8bpc_neon:   1888.8  1285.4  1358.5  1152.9   839.3   871.2
    ARM64:
    cdef_filter_4x4_8bpc_neon:                            383.6   262.1   259.9
    cdef_filter_4x8_8bpc_neon:                            684.9   472.2   464.7
    cdef_filter_8x8_8bpc_neon:                           1160.0   756.8   788.0
    
    (The checkasm benchmark averages three different cases. The fully
    edged case is only one of those three, although it is the most common
    case in actual video, so the difference is much bigger when
    benchmarking only that particular case.)
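
As a rough illustration of the averaging remark above, the sketch below estimates how many cycles the fully edged case alone would have saved. It rests on two assumptions that are not stated in the commit message: that the three checkasm cases are weighted equally, and that the other two cases are unaffected by the change. The helper name `implied_case_saving` is hypothetical.

```python
# Averaged cycle counts copied from the ARM64 rows above.
CORES = ["A53", "A72", "A73"]
before = {"4x4": [460.7, 301.8, 308.0],
          "4x8": [831.6, 547.0, 555.2],
          "8x8": [1454.6, 935.6, 960.4]}
after = {"4x4": [383.6, 262.1, 259.9],
         "4x8": [684.9, 472.2, 464.7],
         "8x8": [1160.0, 756.8, 788.0]}

N_CASES = 3  # checkasm averages three edge configurations


def implied_case_saving(avg_before, avg_after, n_cases=N_CASES):
    """Cycles saved in the single changed case.

    ASSUMPTION: equal weighting of all cases, and the remaining
    cases cost exactly the same before and after the change.
    """
    return n_cases * (avg_before - avg_after)


for func in before:
    for core, b, a in zip(CORES, before[func], after[func]):
        saving = implied_case_saving(b, a)
        print(f"cdef_filter_{func}_8bpc_neon {core}: "
              f"average saves {b - a:.1f}, fully edged case ~{saving:.1f} cycles")
```

Under those assumptions, the saving in the fully edged case is three times the saving seen in the averaged figure.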
    
    This appears to make the code slightly slower for the w=4 cases on
    Cortex A8, while it's a significant speedup on all other cores.
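
The per-core effect can be checked directly from the ARM32 tables. The sketch below only recomputes before/after ratios from the numbers quoted in this message and lists the entries that got slower. The variable names are illustrative.

```python
# Averaged cycle counts copied from the ARM32 rows above.
CORES = ["A7", "A8", "A9", "A53", "A72", "A73"]
before = {"4x4": [807.1, 517.0, 617.7, 506.6, 429.9, 357.8],
          "4x8": [1407.9, 899.3, 1054.6, 862.3, 726.5, 628.1],
          "8x8": [2394.9, 1456.8, 1676.8, 1461.2, 1084.4, 1101.2]}
after = {"4x4": [669.3, 541.3, 524.4, 424.9, 322.7, 298.1],
         "4x8": [1159.1, 922.9, 881.1, 709.2, 538.3, 514.1],
         "8x8": [1888.8, 1285.4, 1358.5, 1152.9, 839.3, 871.2]}

# Speedup factor = cycles before / cycles after; below 1.0 is a slowdown.
speedup = {
    (func, core): b / a
    for func in before
    for core, b, a in zip(CORES, before[func], after[func])
}

regressions = sorted(key for key, ratio in speedup.items() if ratio < 1.0)

for (func, core), ratio in speedup.items():
    note = "  <- slower" if ratio < 1.0 else ""
    print(f"cdef_filter_{func}_8bpc_neon {core}: {ratio:.3f}x{note}")
```

With these figures, the only entries below 1.0 are the 4x4 and 4x8 functions on Cortex A8.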