riscv64/mc: Add 8bpc w_mask RVV function

Bogdan Gligorijević requested to merge BogdanW3/dav1d:w_mask_8bpc into master

The function is split into two cases, w < 32 and w >= 32, to reduce the performance penalty that large vector register groups incur at small widths. Otherwise it is mostly a 1:1 rewrite of the C code in RVV assembly. Some parts of the function are commented; I can remove the comments if they don't match the repo's conventions.
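
For readers unfamiliar with the scalar code being ported, the per-pixel work can be sketched roughly as below. This is only an illustration, not the code from this MR or the exact dav1d source: the function and helper names (`sketch_w_mask_444_8bpc`, `sketch_min`, `sketch_clip_u8`) are hypothetical, only the 4:4:4 layout is shown (no mask subsampling, no `sign` handling), and the constants assume 8bpc with 4 intermediate bits (blend shift 10, rounding 512, mask shift 8, mask rounding 8).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helpers, named only for this sketch. */
static int sketch_min(int a, int b) { return a < b ? a : b; }
static int sketch_clip_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/*
 * Scalar sketch of an 8bpc weighted-mask blend, 4:4:4 layout only.
 * tmp1/tmp2 hold intermediate predictions (assumed: pixel << 4),
 * stored as h rows of w values with no padding. mask is w*h bytes.
 */
void sketch_w_mask_444_8bpc(uint8_t *dst, ptrdiff_t dst_stride,
                            const int16_t *tmp1, const int16_t *tmp2,
                            int w, int h, uint8_t *mask)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* Blend weight grows with the difference between the two
             * predictions; it is clamped to the range [38, 64]. */
            const int diff = abs(tmp1[x] - tmp2[x]);
            const int m = sketch_min(38 + ((diff + 8) >> 8), 64);

            /* Weighted average of both predictions, rounded and scaled
             * back to 8 bits. */
            const int v = (tmp1[x] * m + tmp2[x] * (64 - m) + 512) >> 10;
            dst[x] = (uint8_t)sketch_clip_u8(v);
            mask[x] = (uint8_t)m;
        }
        tmp1 += w;
        tmp2 += w;
        mask += w;
        dst += dst_stride;
    }
}
```

Each pixel is independent of its neighbours in this layout, which is why the loop maps almost directly onto vector instructions. The choice of vector group size per width is specific to the RVV implementation and is not modelled here.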

Benchmarks:

Kendryte K230 (first set of results), then SpacemiT K1 (second set of results):
w_mask_420_w4_8bpc_c:        776.4 ( 1.00x)
w_mask_420_w4_8bpc_rvv:      445.3 ( 1.74x)
w_mask_420_w8_8bpc_c:       2331.1 ( 1.00x)
w_mask_420_w8_8bpc_rvv:      706.7 ( 3.30x)
w_mask_420_w16_8bpc_c:      7492.2 ( 1.00x)
w_mask_420_w16_8bpc_rvv:    2218.5 ( 3.38x)
w_mask_420_w32_8bpc_c:     29651.4 ( 1.00x)
w_mask_420_w32_8bpc_rvv:    4595.4 ( 6.45x)
w_mask_420_w64_8bpc_c:     72406.5 ( 1.00x)
w_mask_420_w64_8bpc_rvv:   11303.3 ( 6.41x)
w_mask_420_w128_8bpc_c:   181788.8 ( 1.00x)
w_mask_420_w128_8bpc_rvv:  28050.0 ( 6.48x)
w_mask_422_w4_8bpc_c:        739.9 ( 1.00x)
w_mask_422_w4_8bpc_rvv:      434.5 ( 1.70x)
w_mask_422_w8_8bpc_c:       2253.1 ( 1.00x)
w_mask_422_w8_8bpc_rvv:      685.3 ( 3.29x)
w_mask_422_w16_8bpc_c:      7202.7 ( 1.00x)
w_mask_422_w16_8bpc_rvv:    2183.1 ( 3.30x)
w_mask_422_w32_8bpc_c:     28612.1 ( 1.00x)
w_mask_422_w32_8bpc_rvv:    4451.0 ( 6.43x)
w_mask_422_w64_8bpc_c:     70351.0 ( 1.00x)
w_mask_422_w64_8bpc_rvv:   11135.4 ( 6.32x)
w_mask_422_w128_8bpc_c:   176852.9 ( 1.00x)
w_mask_422_w128_8bpc_rvv:  27691.7 ( 6.39x)
w_mask_444_w4_8bpc_c:        793.6 ( 1.00x)
w_mask_444_w4_8bpc_rvv:      407.5 ( 1.95x)
w_mask_444_w8_8bpc_c:       2475.5 ( 1.00x)
w_mask_444_w8_8bpc_rvv:      649.8 ( 3.81x)
w_mask_444_w16_8bpc_c:      8070.2 ( 1.00x)
w_mask_444_w16_8bpc_rvv:    2024.1 ( 3.99x)
w_mask_444_w32_8bpc_c:     32181.4 ( 1.00x)
w_mask_444_w32_8bpc_rvv:    4540.4 ( 7.09x)
w_mask_444_w64_8bpc_c:     77966.4 ( 1.00x)
w_mask_444_w64_8bpc_rvv:   11522.2 ( 6.77x)
w_mask_444_w128_8bpc_c:   193108.6 ( 1.00x)
w_mask_444_w128_8bpc_rvv:  28573.0 ( 6.76x)
w_mask_420_w4_8bpc_c:        767.7 ( 1.00x)
w_mask_420_w4_8bpc_rvv:      418.4 ( 1.83x)
w_mask_420_w8_8bpc_c:       2267.4 ( 1.00x)
w_mask_420_w8_8bpc_rvv:      668.2 ( 3.39x)
w_mask_420_w16_8bpc_c:      7277.0 ( 1.00x)
w_mask_420_w16_8bpc_rvv:    1093.7 ( 6.65x)
w_mask_420_w32_8bpc_c:     28808.6 ( 1.00x)
w_mask_420_w32_8bpc_rvv:    4485.5 ( 6.42x)
w_mask_420_w64_8bpc_c:     69491.7 ( 1.00x)
w_mask_420_w64_8bpc_rvv:    6015.6 (11.55x)
w_mask_420_w128_8bpc_c:   171952.1 ( 1.00x)
w_mask_420_w128_8bpc_rvv:  15042.3 (11.43x)
w_mask_422_w4_8bpc_c:        731.3 ( 1.00x)
w_mask_422_w4_8bpc_rvv:      415.1 ( 1.76x)
w_mask_422_w8_8bpc_c:       2202.2 ( 1.00x)
w_mask_422_w8_8bpc_rvv:      661.5 ( 3.33x)
w_mask_422_w16_8bpc_c:      7042.0 ( 1.00x)
w_mask_422_w16_8bpc_rvv:    1074.7 ( 6.55x)
w_mask_422_w32_8bpc_c:     27826.3 ( 1.00x)
w_mask_422_w32_8bpc_rvv:    4408.7 ( 6.31x)
w_mask_422_w64_8bpc_c:     67104.2 ( 1.00x)
w_mask_422_w64_8bpc_rvv:    5987.9 (11.21x)
w_mask_422_w128_8bpc_c:   166266.5 ( 1.00x)
w_mask_422_w128_8bpc_rvv:  15161.8 (10.97x)
w_mask_444_w4_8bpc_c:        773.0 ( 1.00x)
w_mask_444_w4_8bpc_rvv:      404.4 ( 1.91x)
w_mask_444_w8_8bpc_c:       2425.0 ( 1.00x)
w_mask_444_w8_8bpc_rvv:      647.6 ( 3.74x)
w_mask_444_w16_8bpc_c:      7936.0 ( 1.00x)
w_mask_444_w16_8bpc_rvv:    1058.3 ( 7.50x)
w_mask_444_w32_8bpc_c:     31188.3 ( 1.00x)
w_mask_444_w32_8bpc_rvv:    4401.3 ( 7.09x)
w_mask_444_w64_8bpc_c:     75513.6 ( 1.00x)
w_mask_444_w64_8bpc_rvv:    6172.5 (12.23x)
w_mask_444_w128_8bpc_c:   187070.1 ( 1.00x)
w_mask_444_w128_8bpc_rvv:  15719.6 (11.90x)
