riscv64/mc: warp_8x8 and warp_8x8t 8bpc

Benchmarks:
- Kendryte K230:
warp_8x8_8bpc_c: 4549.7 ( 1.00x)
warp_8x8_8bpc_rvv: 2504.7 ( 1.82x)
warp_8x8t_8bpc_c: 4414.7 ( 1.00x)
warp_8x8t_8bpc_rvv: 2465.7 ( 1.79x)
- Banana Pi BPI-F3:
warp_8x8_8bpc_c: 4431.2 ( 1.00x)
warp_8x8_8bpc_rvv: 3297.4 ( 1.34x)
warp_8x8t_8bpc_c: 4299.3 ( 1.00x)
warp_8x8t_8bpc_rvv: 3255.7 ( 1.32x)
Because these functions rely on segmented indexed loads, they currently gain less than they otherwise might: current hardware appears to penalize these loads. Newer implementations may benefit more.
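For readers unfamiliar with the load type mentioned above, here is a rough scalar model of what an indexed segment load does, assuming it refers to the RVV indexed segment load family (vluxseg/vloxseg) with 8-bit elements. The function name, the field count of 8 and the MAX_VL bound are hypothetical, chosen only for illustration; this is not the code in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NF = 8, MAX_VL = 16 }; /* hypothetical: 8 fields per segment, up to 16 elements */

/* Scalar model of an indexed segment load with 8-bit elements:
 * for element i, read NF consecutive bytes starting at base + byte_off[i]
 * and place field f into destination register group f, lane i. */
static void seg_indexed_load_model(int8_t dst[NF][MAX_VL], const int8_t *base,
                                   const uint16_t *byte_off, size_t vl)
{
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++)
        for (int f = 0; f < NF; f++)
            dst[f][i] = base[byte_off[i] + f];
}
```

Each lane fetches its own small segment from an unrelated address, so one instruction expands into vl * NF element accesses with little locality for the hardware to exploit. That would be consistent with the modest speedups above, though the exact cause on each core is not established here.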