
x86: Improve AVX2 generate_grain asm

Henrik Gramner requested to merge gramner/dav1d:gen_grain_avx2 into master
                                           HSW                    SKL
                                      old       new          old       new

gen_grain_y_ar0_8bpc_avx2:         19298.9   15342.1      18711.9   17842.5
gen_grain_y_ar1_8bpc_avx2:         63983.7   56378.3      59358.9   57213.2
gen_grain_y_ar2_8bpc_avx2:         86822.1   78599.7      92137.2   90092.4
gen_grain_y_ar3_8bpc_avx2:         88543.4   80883.7      97682.3   94461.6

gen_grain_uv_ar0_8bpc_420_avx2:     5742.2    4976.4       6157.3    6049.9
gen_grain_uv_ar1_8bpc_420_avx2:    15999.0   15003.4      15211.3   15038.6
gen_grain_uv_ar2_8bpc_420_avx2:    21082.1   19496.9      22533.4   22401.5
gen_grain_uv_ar3_8bpc_420_avx2:    23159.6   19810.1      25008.9   22810.5

gen_grain_uv_ar0_8bpc_422_avx2:    11475.5    9300.6      11404.1   11278.6
gen_grain_uv_ar1_8bpc_422_avx2:    31161.1   29267.6      29484.3   29482.6
gen_grain_uv_ar2_8bpc_422_avx2:    41179.1   38221.5      44523.1   44358.9
gen_grain_uv_ar3_8bpc_422_avx2:    46002.3   39058.1      49497.1   45007.2

gen_grain_uv_ar0_8bpc_444_avx2:    20684.6   16200.5      21090.6   20429.6
gen_grain_uv_ar1_8bpc_444_avx2:    62772.6   56551.0      58890.7   57936.0
gen_grain_uv_ar2_8bpc_444_avx2:    80320.2   74349.6      87507.6   86792.5
gen_grain_uv_ar3_8bpc_444_avx2:    89649.4   76560.7      97022.6   88563.1


gen_grain_y_ar0_16bpc_avx2:        19713.1   15822.2      18787.0   17809.5
gen_grain_y_ar1_16bpc_avx2:        61425.5   58064.0      58335.0   57371.9
gen_grain_y_ar2_16bpc_avx2:        78416.5   74290.5      87864.8   85194.7
gen_grain_y_ar3_16bpc_avx2:        81198.8   75390.7      91357.7   87434.1

gen_grain_uv_ar0_16bpc_420_avx2:    6176.9    5290.4       6105.0    5997.7
gen_grain_uv_ar1_16bpc_420_avx2:   16173.8   15294.2      15064.0   15205.6
gen_grain_uv_ar2_16bpc_420_avx2:   20498.5   19281.6      22811.8   22490.3
gen_grain_uv_ar3_16bpc_420_avx2:   21811.7   19930.5      23423.7   22575.6

gen_grain_uv_ar0_16bpc_422_avx2:   11959.1    9805.9      11332.1   11142.3
gen_grain_uv_ar1_16bpc_422_avx2:   31821.6   29928.0      29457.0   29673.6
gen_grain_uv_ar2_16bpc_422_avx2:   39711.1   38179.8      44745.5   44483.1
gen_grain_uv_ar3_16bpc_422_avx2:   41998.4   38771.1      46371.9   44546.4

gen_grain_uv_ar0_16bpc_444_avx2:   21107.8   17165.8      21581.3   20343.0
gen_grain_uv_ar1_16bpc_444_avx2:   61445.1   58489.6      58594.0   58647.9
gen_grain_uv_ar2_16bpc_444_avx2:   78127.2   74867.5      87758.6   87106.2
gen_grain_uv_ar3_16bpc_444_avx2:   81489.5   76197.8      91190.6   87434.1

gen_grain_uv_ar1_16bpc is very marginally slower on Skylake after the changes because it now uses scalar loads instead of gathers, but everything else is a win across the board. The change should be even more beneficial on AMD CPUs, which have notoriously poor gather performance.
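For readers unfamiliar with the trade-off, here is a minimal C sketch of the two access patterns: a table lookup done with one AVX2 gather versus the same lookup done with plain scalar loads. This is not the asm from this merge request. `demo_table`, `lookup8_scalar` and `lookup8_gather` are hypothetical names made up for illustration, the table contents are arbitrary, and the gather path is only compiled when the compiler targets AVX2 (otherwise it falls back to the scalar path).

```c
#include <assert.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical 16-entry table of signed 16-bit values (arbitrary data, not
 * dav1d's). One padding entry is appended because the gather below reads
 * 32 bits per index and would otherwise read past the last element. */
enum { TABLE_SIZE = 16 };
static const int16_t demo_table[TABLE_SIZE + 1] = {
      56,  -12,  340, -275,   91,   -3, 1024, -640,
    -128,   77,   -1,  512,  -90,   15, -2000, 333,
       0 /* padding */
};

/* Scalar loads: one ordinary 16-bit load per index. */
static void lookup8_scalar(const int16_t *table, const uint16_t idx[8],
                           int16_t out[8])
{
    for (int i = 0; i < 8; i++) {
        assert(idx[i] < TABLE_SIZE);
        out[i] = table[idx[i]];
    }
}

#if defined(__AVX2__)
/* Gather: a single instruction fetches all eight entries, but each lane
 * loads 32 bits, so the low 16 bits are sign-extended and repacked. */
static void lookup8_gather(const int16_t *table, const uint16_t idx[8],
                           int16_t out[8])
{
    /* Zero-extend the eight 16-bit indices to 32-bit lanes. */
    __m256i vidx = _mm256_cvtepu16_epi32(
        _mm_loadu_si128((const __m128i *)idx));
    /* Scale 2: byte address = table + 2 * index. */
    __m256i v = _mm256_i32gather_epi32((const int *)table, vidx, 2);
    /* Keep the low 16 bits of each lane, sign-extended (little-endian). */
    v = _mm256_srai_epi32(_mm256_slli_epi32(v, 16), 16);
    /* Pack 32-bit lanes to 16-bit within each 128-bit half, then move
     * the two useful 64-bit quarters into the low 128 bits. */
    v = _mm256_packs_epi32(v, v);
    v = _mm256_permute4x64_epi64(v, 0x08);
    _mm_storeu_si128((__m128i *)out, _mm256_castsi256_si128(v));
}
#else
/* Without AVX2 at compile time, reuse the scalar path so the sketch
 * still builds and behaves identically. */
static void lookup8_gather(const int16_t *table, const uint16_t idx[8],
                           int16_t out[8])
{
    lookup8_scalar(table, idx, out);
}
#endif
```

Both functions produce the same output for in-range indices. Which one is faster depends on the microarchitecture's gather implementation, which is the effect the HSW and SKL columns above reflect.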

Partially addresses #377 (closed), since the grain generation functions no longer use gathers. The issue remains for the main film grain functions, though.
