riscv64/mc: Add bidir functions

This code compromises between the performance of a dedicated kernel per
VLEN/width pair, and the flexibility of a fully VLEN-dynamic loop, by
using a single special case for w=4, and subdividing the rest into the
unrolled four line fast path, and the general-purpose slow path (for
large width on small VLEN).

Kendryte K230
avg_w4_8bpc_c:           346.8 ( 1.00x)
avg_w4_8bpc_rvv:          50.3 ( 6.90x)
avg_w8_8bpc_c:          1054.9 ( 1.00x)
avg_w8_8bpc_rvv:         139.1 ( 7.58x)
avg_w16_8bpc_c:         3396.3 ( 1.00x)
avg_w16_8bpc_rvv:        350.6 ( 9.69x)
avg_w32_8bpc_c:        13734.3 ( 1.00x)
avg_w32_8bpc_rvv:       1226.3 (11.20x)
avg_w64_8bpc_c:        33260.9 ( 1.00x)
avg_w64_8bpc_rvv:       3869.4 ( 8.60x)
avg_w128_8bpc_c:       83441.3 ( 1.00x)
avg_w128_8bpc_rvv:      9765.1 ( 8.54x)
w_avg_w4_8bpc_c:         444.3 ( 1.00x)
w_avg_w4_8bpc_rvv:        75.8 ( 5.86x)
w_avg_w8_8bpc_c:        1365.6 ( 1.00x)
w_avg_w8_8bpc_rvv:       208.8 ( 6.54x)
w_avg_w16_8bpc_c:       4420.8 ( 1.00x)
w_avg_w16_8bpc_rvv:      570.7 ( 7.75x)
w_avg_w32_8bpc_c:      18010.9 ( 1.00x)
w_avg_w32_8bpc_rvv:     2074.4 ( 8.68x)
w_avg_w64_8bpc_c:      43050.4 ( 1.00x)
w_avg_w64_8bpc_rvv:     5799.5 ( 7.42x)
w_avg_w128_8bpc_c:    107153.6 ( 1.00x)
w_avg_w128_8bpc_rvv:   14272.0 ( 7.51x)
mask_w4_8bpc_c:          497.6 ( 1.00x)
mask_w4_8bpc_rvv:         88.5 ( 5.63x)
mask_w8_8bpc_c:         1528.5 ( 1.00x)
mask_w8_8bpc_rvv:        253.1 ( 6.04x)
mask_w16_8bpc_c:        4953.8 ( 1.00x)
mask_w16_8bpc_rvv:       679.0 ( 7.30x)
mask_w32_8bpc_c:       20298.3 ( 1.00x)
mask_w32_8bpc_rvv:      3012.9 ( 6.74x)
mask_w64_8bpc_c:       49718.8 ( 1.00x)
mask_w64_8bpc_rvv:      7291.7 ( 6.82x)
mask_w128_8bpc_c:     126740.3 ( 1.00x)
mask_w128_8bpc_rvv:    18351.1 ( 6.91x)
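
For readers who want a picture of the three-way split described in the commit message (w=4 special case, unrolled four-line fast path, general slow path), here is a rough scalar C model. It is only a sketch, not the RVV assembly from the patch. The names `pick_path`, `avg_model` and the `PATH_*` labels are hypothetical, and so is the threshold: it assumes the fast path applies whenever one row of 16-bit intermediates fits in a single LMUL=8 register group. The 8bpc rounding `(a + b + 16) >> 5` and the assumption that h is a multiple of 4 are likewise assumptions of this model, not statements taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical path labels; these names do not come from the patch. */
enum bidir_path { PATH_W4, PATH_FAST_4ROWS, PATH_SLOW_STRIPMINE };

/* Assumed rule: a row of w 16-bit intermediates fits in one LMUL=8
 * register group when w * 16 <= vlen_bits * 8. */
static enum bidir_path pick_path(int w, int vlen_bits)
{
    if (w == 4)
        return PATH_W4;
    if (w * 16 <= vlen_bits * 8)
        return PATH_FAST_4ROWS;
    return PATH_SLOW_STRIPMINE;
}

/* Assumed 8bpc averaging: round, shift by 5, clamp to [0, 255].
 * Relies on arithmetic right shift for negative sums. */
static uint8_t avg_px(int a, int b)
{
    int v = (a + b + 16) >> 5;
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Stands in for one vector load/add/narrow/store of n elements. */
static void avg_row(uint8_t *dst, const int16_t *t1, const int16_t *t2, int n)
{
    for (int x = 0; x < n; x++)
        dst[x] = avg_px(t1[x], t2[x]);
}

/* Scalar model of the dispatch. The intermediates are assumed to be
 * packed with a row pitch of w elements. Returns the path taken. */
static enum bidir_path avg_model(uint8_t *dst, ptrdiff_t stride,
                                 const int16_t *t1, const int16_t *t2,
                                 int w, int h, int vlen_bits)
{
    enum bidir_path path = pick_path(w, vlen_bits);

    if (path == PATH_SLOW_STRIPMINE) {
        /* Row too wide for one register group: strip-mine each row. */
        int chunk = vlen_bits * 8 / 16;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x += chunk) {
                int n = w - x < chunk ? w - x : chunk;
                avg_row(dst + y * stride + x, t1 + y * w + x, t2 + y * w + x, n);
            }
    } else if (path == PATH_FAST_4ROWS) {
        /* Whole row fits: handle four lines per loop iteration. */
        assert(h % 4 == 0);
        for (int y = 0; y < h; y += 4) {
            avg_row(dst + (y + 0) * stride, t1 + (y + 0) * w, t2 + (y + 0) * w, w);
            avg_row(dst + (y + 1) * stride, t1 + (y + 1) * w, t2 + (y + 1) * w, w);
            avg_row(dst + (y + 2) * stride, t1 + (y + 2) * w, t2 + (y + 2) * w, w);
            avg_row(dst + (y + 3) * stride, t1 + (y + 3) * w, t2 + (y + 3) * w, w);
        }
    } else {
        /* w == 4: the message does not say what the dedicated kernel
         * does differently, so the model just walks the rows. */
        for (int y = 0; y < h; y++)
            avg_row(dst + y * stride, t1 + y * 4, t2 + y * 4, 4);
    }
    return path;
}
```

Under the assumed threshold, a 128-bit VLEN would take the fast path up to w=64 and fall back to the slow path only for w=128, which matches the "large width on small VLEN" wording, but the real cut-off in the assembly may differ.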