    arm64: looprestoration: NEON implementation of SGR for 10 bpc · e3dbf926
    Martin Storsjö authored
    This only supports 10 bpc, not 12 bpc, as the sum and tmp buffers can
    be int16_t for 10 bpc, but need to be int32_t for 12 bpc.
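
The buffer-width argument above can be checked with a little arithmetic. The sketch below assumes that the limiting quantity is a box sum of raw pixel values over a 3x3 or 5x5 window; the commit message does not say which intermediate value drives the requirement, and the helper names are hypothetical, not dav1d functions.

```c
#include <stdint.h>

/* Hypothetical helper: largest possible sum of box_pixels pixels
 * when every pixel has the maximum value for the given bit depth. */
static int32_t example_max_box_sum(int bitdepth, int box_pixels) {
    const int32_t pixel_max = (1 << bitdepth) - 1;
    return pixel_max * box_pixels;
}

/* Hypothetical helper: 1 if v can be stored in an int16_t, else 0. */
static int example_fits_int16(int32_t v) {
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

Under that assumption, a 5x5 box at 10 bpc peaks at 25 * 1023 = 25575, which is below INT16_MAX (32767), while at 12 bpc even a 3x3 box peaks at 9 * 4095 = 36855, which is not.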
    
    Make actual templates out of the functions in looprestoration_tmpl.S,
    and add box3/5_h to looprestoration16.S.
    
    Extend dav1d_sgr_calc_abX_neon with a mandatory bitdepth_max parameter
    (which is passed even in 8bpc mode), and add a define to bitdepth.h for
    passing such a parameter in all modes. This makes this function
    a few instructions slower in 8bpc mode than it was before (overall impact
    seems to be around 1% of the total runtime of SGR), but allows using the
    same actual function instantiation for all modes, saving a bit of code
    size.
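
The trade-off described above can be illustrated in C. This is only a sketch of the idea, not the NEON code or the actual define in bitdepth.h: the macro and function names are hypothetical, and the rounding step is an assumed example of work that depends on bitdepth_max.

```c
#include <stdint.h>

/* Hypothetical stand-in for a per-mode define: in an 8 bpc build the
 * "parameter" is simply a constant that still gets passed along. */
#define EXAMPLE_BITDEPTH_MAX_8BPC 0xff

/* Number of significant bits in bitdepth_max (0xff -> 8, 0x3ff -> 10). */
static int example_bitdepth_from_max(int bitdepth_max) {
    int bits = 0;
    while (bitdepth_max > 0) {
        bits++;
        bitdepth_max >>= 1;
    }
    return bits;
}

/* One shared routine for all modes: round a sum down to 8-bit scale.
 * With bitdepth_max == 0xff the shift is 0 and the value passes through
 * unchanged, so the 8 bpc caller pays for a few operations it does not
 * strictly need, in exchange for a single function instantiation. */
static int32_t example_normalize_sum(int32_t sum, int bitdepth_max) {
    const int shift = example_bitdepth_from_max(bitdepth_max) - 8;
    const int32_t rounding = (1 << shift) >> 1;
    return (sum + rounding) >> shift;
}
```

An 8 bpc caller would pass EXAMPLE_BITDEPTH_MAX_8BPC and get its input back unchanged, while a 10 bpc caller would pass 0x3ff and get the sum divided by 4 with rounding.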
    
    Examples of checkasm runtimes:
                               Cortex A53        A72        A73
    selfguided_3x3_10bpc_neon:   516755.8   389412.7   349058.7
    selfguided_5x5_10bpc_neon:   380699.9   293486.6   254591.6
    selfguided_mix_10bpc_neon:   878142.3   667495.9   587844.6
    
    Corresponding 8 bpc numbers for comparison:
    selfguided_3x3_8bpc_neon:    491058.1   361473.4   347705.9
    selfguided_5x5_8bpc_neon:    352655.0   266423.7   248192.2
    selfguided_mix_8bpc_neon:    826094.1   612372.2   581943.1