avcodec/riscv: add h264 dc idct rvv
- h264_idct4_add_dc_8bpp_c: 1.7
- h264_idct4_add_dc_8bpp_rvv_i64: 1.2
- h264_idct4_add_dc_9bpp_c: 1.5
- h264_idct4_add_dc_9bpp_rvv_i64: 0.7
- h264_idct4_add_dc_10bpp_c: 1.5
- h264_idct4_add_dc_10bpp_rvv_i64: 0.7
- h264_idct4_add_dc_12bpp_c: 1.5
- h264_idct4_add_dc_12bpp_rvv_i64: 17.7
- h264_idct4_add_dc_14bpp_c: 1.7
- h264_idct4_add_dc_14bpp_rvv_i64: 0.7
- h264_idct8_add_dc_8bpp_c: 6.2
- h264_idct8_add_dc_8bpp_rvv_i64: 2.2
- h264_idct8_add_dc_9bpp_c: 6.0
- h264_idct8_add_dc_9bpp_rvv_i64: 1.2
- h264_idct8_add_dc_10bpp_c: 6.0
- h264_idct8_add_dc_10bpp_rvv_i64: 1.2
- h264_idct8_add_dc_12bpp_c: 6.2
- h264_idct8_add_dc_12bpp_rvv_i64: 1.2
- h264_idct8_add_dc_14bpp_c: 6.2
- h264_idct8_add_dc_14bpp_rvv_i64: 1.5
Signed-off-by: J. Dekker <jdek@itanimul.li>
- libavcodec/riscv/h264dsp_rvv.S 0 → 100644 (lines 38-74; review threads on lines 53, 59 and 74, each changed in version 3 of the diff)

    .endif
            addi    a3, a3, 32
            srai    a3, a3, 6
    .if \depth == 8
            sh      zero, 0(a1)
    .else
            sw      zero, 0(a1)
    .endif
            add     t2, a2, a2
            mv      t1, a4
            add     t2, t2, a2
    1:
    .if \depth == 8
            vsetvli zero, a4, e8, m1
            vle8.v  v0, (a0)
            add     a0, a0, a2
            vle8.v  v1, (a0)
            add     a0, a0, a2
            vle8.v  v2, (a0)
            add     a0, a0, a2
            vle8.v  v3, (a0)
            vwcvtu.x.x.v v4, v0
            vwcvtu.x.x.v v6, v1
            vwcvtu.x.x.v v8, v2
            vwcvtu.x.x.v v10, v3
            vsetvli zero, a4, e16, m1
    .else
            vsetvli zero, a4, e16, m1
            vle16.v v4, (a0)
            add     a0, a0, a2
            vle16.v v6, (a0)
            add     a0, a0, a2
            vle16.v v8, (a0)
            add     a0, a0, a2
            vle16.v v10, (a0)
    .endif
            vadd.vx v4, v4, a3
- Resolved by J. Dekker
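For readers less familiar with the operation the assembly above vectorises, here is a minimal scalar sketch in C of a DC-only IDCT add. It is an illustration, not the FFmpeg source: the function names, the `size` and `depth` parameters, and the use of a stride counted in pixels (rather than bytes) are assumptions made for the example. It mirrors the visible steps of the excerpt: round and shift the DC coefficient (`addi`/`srai`), clear it (`sh`/`sw zero`), then add it to every pixel with clipping to the valid range.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp v to [0, max]. */
static int clip_range(int v, int max)
{
    return v < 0 ? 0 : (v > max ? max : v);
}

/* Hypothetical 8-bit variant: 16-bit coefficients, 8-bit pixels.
 * stride is counted in pixels here (an assumption of this sketch). */
static void idct_dc_add_8_sketch(uint8_t *dst, int16_t *block,
                                 ptrdiff_t stride, int size)
{
    /* Assumes arithmetic right shift for negative values, as on
     * all common compilers. */
    int dc = (block[0] + 32) >> 6;
    block[0] = 0;                       /* coefficient is consumed */
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint8_t)clip_range(dst[x] + dc, 255);
        dst += stride;
    }
}

/* Hypothetical high-bit-depth variant: 32-bit coefficients,
 * 16-bit pixel storage, clipping to (1 << depth) - 1. */
static void idct_dc_add_high_sketch(uint16_t *dst, int32_t *block,
                                    ptrdiff_t stride, int size, int depth)
{
    int dc = (block[0] + 32) >> 6;
    int max = (1 << depth) - 1;
    block[0] = 0;
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            dst[x] = (uint16_t)clip_range(dst[x] + dc, max);
        dst += stride;
    }
}
```

With `size` set to 4 or 8 this covers the 4x4 and 8x8 cases benchmarked in the commit message; the split into a low and a high bit depth function corresponds to the `.if \depth == 8` branches in the assembly.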
added 368 commits
- c6c755da...77d971c3 - 358 earlier commits
- 7904ec2d - avcodec/vvcdec: refact, remove hf_idx and vf_idx from mc_xxx's param list
- cae0b012 - avcodec/vvcdec: increase edge_emu_buffer for RPR
- 1b33c9a5 - avcodec/vvcdec: support Reference Picture Resampling
- b8eb8b4f - Changelog: add DVB compatible information for VVC decoder
- a9dc7dd7 - checkasm: vvc_alf: Limit benchmarking to a reasonable subset of functions
- b1adf6d1 - checkasm: add runs argument to adjust during bench
- d43e1238 - checkasm: print bench runs when benchmarking
- 60933671 - checkasm: h264dsp: Avoid out of buffer writes when benchmarking
- a1e620db - avcodec/riscv: add h264 dc idct rvv
- d16e9826 - wip
I tried to use `m4` to reduce the number of widens/adds/narrows, but according to tests they were slower than just using mf2/m1 alone and reducing the total number of vsetvlis.

At first the idea was to do two functions which would cover low and high bit depth of 4x4 and 8x8; to me it seemed very vector-y. See `ff_h264_idct4_dc_add_8_rvv_new` for the idea behind doubling the number of functions (one each for low4, low8, high4, high8); this reads a lot like traditional SIMD implementations though. From some (very noisy) benchmarks it seems to still be reasonably faster overall.

I've tried `rdtime` and `rdcycle` with varying number of runs. I'm going to try `clock_gettime()` again since @unlord says that it's fine in dav1d checkasm.

Using `clock_gettime()`:

    user@canaan ~/ffmpeg $ ./tests/checkasm/checkasm --test=h264dsp --bench --runs=17
    benchmarking with native FFmpeg timers
    nop: 64.4
    checkasm: using random seed 3650982421
    checkasm: bench runs 131072 (1 << 17)
    RVVi64:
     - h264dsp.idct [OK]
    checkasm: all 10 tests passed
    h264_idct4_add_dc_8bpp_c: 57.9
    h264_idct4_add_dc_8bpp_rvv_i64: 30.1
    h264_idct4_add_dc_9bpp_c: 57.9
    h264_idct4_add_dc_9bpp_rvv_i64: 30.1
    h264_idct4_add_dc_10bpp_c: 57.9
    h264_idct4_add_dc_10bpp_rvv_i64: 20.9
    h264_idct4_add_dc_12bpp_c: 48.6
    h264_idct4_add_dc_12bpp_rvv_i64: 21.1
    h264_idct4_add_dc_14bpp_c: 57.9
    h264_idct4_add_dc_14bpp_rvv_i64: 20.9
    h264_idct8_add_dc_8bpp_c: 224.6
    h264_idct8_add_dc_8bpp_rvv_i64: 57.9
    h264_idct8_add_dc_9bpp_c: 224.6
    h264_idct8_add_dc_9bpp_rvv_i64: 39.4
    h264_idct8_add_dc_10bpp_c: 224.6
    h264_idct8_add_dc_10bpp_rvv_i64: 39.4
    h264_idct8_add_dc_12bpp_c: 224.6
    h264_idct8_add_dc_12bpp_rvv_i64: 48.6
    h264_idct8_add_dc_14bpp_c: 224.6
    h264_idct8_add_dc_14bpp_rvv_i64: 48.6

    user@canaan ~/ffmpeg $ ./tests/checkasm/checkasm --test=h264dsp --bench --runs=17
    benchmarking with native FFmpeg timers
    nop: 51.8
    checkasm: using random seed 666058969
    checkasm: bench runs 131072 (1 << 17)
    RVVi64:
     - h264dsp.idct [OK]
    checkasm: all 10 tests passed
    h264_idct4_add_dc_8bpp_c: 51.8
    h264_idct4_add_dc_8bpp_rvv_i64: 42.5
    h264_idct4_add_dc_9bpp_c: 61.0
    h264_idct4_add_dc_9bpp_rvv_i64: 33.3
    h264_idct4_add_dc_10bpp_c: 61.3
    h264_idct4_add_dc_10bpp_rvv_i64: 24.0
    h264_idct4_add_dc_12bpp_c: 61.0
    h264_idct4_add_dc_12bpp_rvv_i64: 24.0
    h264_idct4_add_dc_14bpp_c: 61.0
    h264_idct4_add_dc_14bpp_rvv_i64: 24.0
    h264_idct8_add_dc_8bpp_c: 227.5
    h264_idct8_add_dc_8bpp_rvv_i64: 61.0
    h264_idct8_add_dc_9bpp_c: 227.8
    h264_idct8_add_dc_9bpp_rvv_i64: 51.8
    h264_idct8_add_dc_10bpp_c: 227.8
    h264_idct8_add_dc_10bpp_rvv_i64: 42.5
    h264_idct8_add_dc_12bpp_c: 218.5
    h264_idct8_add_dc_12bpp_rvv_i64: 51.8
    h264_idct8_add_dc_14bpp_c: 227.8
    h264_idct8_add_dc_14bpp_rvv_i64: 42.5
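As a rough illustration of what a `clock_gettime()`-based timer looks like, here is a minimal sketch in C. It is not the checkasm implementation: `read_time_ns`, `bench_avg_ns` and `count_call` are hypothetical names invented for this example, and the choice of `CLOCK_MONOTONIC` is an assumption. Only `clock_gettime()` and `struct timespec` are the real POSIX interface. The helper averages over `1 << runs_log2` calls, in the spirit of the `--runs=17` argument used above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic time in nanoseconds, read through POSIX clock_gettime(). */
static uint64_t read_time_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: average cost in nanoseconds of one fn(ctx) call,
 * measured over (1 << runs_log2) back-to-back calls. */
static double bench_avg_ns(void (*fn)(void *), void *ctx, unsigned runs_log2)
{
    uint64_t runs  = UINT64_C(1) << runs_log2;
    uint64_t start = read_time_ns();
    for (uint64_t i = 0; i < runs; i++)
        fn(ctx);
    uint64_t stop  = read_time_ns();
    return (double)(stop - start) / (double)runs;
}

/* Trivial workload used only to exercise the helper: counts its calls. */
static void count_call(void *ctx)
{
    (*(volatile uint64_t *)ctx)++;
}
```

Timing a whole batch and dividing keeps the per-call timer overhead out of the result, at the price of including loop overhead; with functions as short as the 4x4 DC add, either overhead can be a large share of the measurement, which may be part of why the numbers above are noisy.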