
Buffer cdef input in 8 block wide chunks

Also implement a new CDEF for AVX2. The AVX2 implementation offsets the input pixels by 128 and interleaves the two chroma planes.
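As a scalar illustration of the data layout described above, here is a minimal sketch of offsetting 8-bit samples by 128 and interleaving two chroma planes. The function name `prep_uv_interleaved`, the flat-slice layout, and the `i8` output type are assumptions made for this example. It is not the code from this change, which does the work in AVX2 assembly and also handles block edges and padding.

```rust
/// Hypothetical helper (not part of this change): offset each 8-bit sample
/// by 128 and interleave the U and V planes as U0, V0, U1, V1, ...
fn prep_uv_interleaved(u: &[u8], v: &[u8]) -> Vec<i8> {
    assert_eq!(u.len(), v.len(), "chroma planes must have equal length");
    let mut out = Vec::with_capacity(u.len() * 2);
    for (&a, &b) in u.iter().zip(v.iter()) {
        // Flipping the top bit equals subtracting 128 modulo 256, which maps
        // the unsigned range 0..=255 onto the signed range -128..=127.
        out.push((a ^ 0x80) as i8);
        out.push((b ^ 0x80) as i8);
    }
    out
}
```

With this layout, one pass over the buffer touches both chroma planes, which is presumably what lets the `cdef_filter_uv_*` kernels filter two planes at once.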

Profiling results from Zen 2:

NEW:

(Filters 2 chroma planes at once)

cdef_filter_uv_4x4_01_8bpc_avx2: 102.2
cdef_filter_uv_4x4_10_8bpc_avx2: 79.7
cdef_filter_uv_4x4_11_8bpc_avx2: 199.6
cdef_filter_uv_4x8_01_8bpc_avx2: 171.9
cdef_filter_uv_4x8_10_8bpc_avx2: 128.3
cdef_filter_uv_4x8_11_8bpc_avx2: 251.6
cdef_filter_uv_8x8_01_8bpc_avx2: 294.7
cdef_filter_uv_8x8_10_8bpc_avx2: 240.6
cdef_filter_uv_8x8_11_8bpc_avx2: 436.2
cdef_filter_y_01_8bpc_avx2: 188.9
cdef_filter_y_10_8bpc_avx2: 112.8
cdef_filter_y_11_8bpc_avx2: 241.6

(Prepares the input buffer for cdef)

cdef_prep_uv_4x4_8bpc_avx2: 60.0
cdef_prep_uv_4x8_8bpc_avx2: 90.5
cdef_prep_uv_8x8_8bpc_avx2: 126.0
cdef_prep_y_8bpc_avx2: 81.6

Runtime on a WebRTC sample clip (about 10% of which is spent in CDEF):

  Time (mean ± σ):     881.6 ms ±   6.6 ms    [User: 873.3 ms, System: 6.7 ms]
  Range (min … max):   874.7 ms … 898.2 ms    20 runs

OLD:

cdef_filter_4x4_01_8bpc_avx2: 86.8
cdef_filter_4x4_10_8bpc_avx2: 80.3
cdef_filter_4x4_11_8bpc_avx2: 116.5
cdef_filter_4x8_01_8bpc_avx2: 122.8
cdef_filter_4x8_10_8bpc_avx2: 90.2
cdef_filter_4x8_11_8bpc_avx2: 176.2
cdef_filter_8x8_01_8bpc_avx2: 171.2
cdef_filter_8x8_10_8bpc_avx2: 122.2
cdef_filter_8x8_11_8bpc_avx2: 247.7

  Time (mean ± σ):     909.1 ms ±   4.8 ms    [User: 900.6 ms, System: 6.1 ms]
  Range (min … max):   902.5 ms … 921.7 ms    20 runs
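To put the two runtime measurements side by side, the sketch below computes the relative reduction in mean wall time. The function `relative_reduction` is a hypothetical helper written for this comparison. Only the two mean values (909.1 ms old, 881.6 ms new) come from the results above.

```rust
/// Hypothetical helper: fraction of the old runtime saved by the new code.
fn relative_reduction(old_ms: f64, new_ms: f64) -> f64 {
    (old_ms - new_ms) / old_ms
}

/// Mean runtimes reported above, in milliseconds.
const OLD_MEAN_MS: f64 = 909.1;
const NEW_MEAN_MS: f64 = 881.6;

/// Relative reduction for the reported means, as a percentage.
fn reported_reduction_percent() -> f64 {
    relative_reduction(OLD_MEAN_MS, NEW_MEAN_MS) * 100.0
}
```

For the reported means this is a difference of 27.5 ms, or roughly 3% of the old runtime, which is several times larger than either reported standard deviation (6.6 ms and 4.8 ms).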
Edited by Kyle Siefring
