arm64: ipred: 16 bpc NEON implementation of the Z1 and Z3 functions
As usual, there's a handful of minor things to fix in the 8 bpc case that I noticed after looking closer at it again.
Overall relative speedup over C code:
                              Cortex A53     A55     A72     A73     A76  Apple M1
intra_pred_z1_w4_16bpc_neon:        3.49    2.63    2.83    3.85    3.14      9.00
intra_pred_z1_w8_16bpc_neon:        6.19    4.39    3.65    6.58    4.99      6.50
intra_pred_z1_w16_16bpc_neon:       6.65    4.64    3.97    7.78    4.87      7.00
intra_pred_z1_w32_16bpc_neon:       7.76    5.49    5.17    7.83    5.59      8.24
intra_pred_z1_w64_16bpc_neon:       8.02    5.80    5.33    8.41    5.77      8.70
intra_pred_z3_w4_16bpc_neon:        3.06    2.87    2.17    1.97    2.33      7.75
intra_pred_z3_w8_16bpc_neon:        3.90    3.94    2.97    3.16    2.93      4.43
intra_pred_z3_w16_16bpc_neon:       4.08    4.48    3.31    4.68    3.13      5.00
intra_pred_z3_w32_16bpc_neon:       4.43    4.85    3.50    4.02    3.33      5.62
intra_pred_z3_w64_16bpc_neon:       4.68    5.30    3.72    3.96    3.52      5.78
Activity
added ARM performance labels
added 11 commits
- 16c94348 - 1 commit from branch videolan:master
- 50a89b63 - arm: ipred: Fix a misindented line in the C wrapper
- da9602a3 - arm64: ipred: Fix a misindented operand in the assembly
- ab6977bc - arm64: ipred: Rename a misnamed local label in the assembly
- 8ee450cb - arm64: ipred: Remove leftover instructions at the start of z3_fill2
- 92d93f4b - arm64: ipred: Optimize the 3tap filter padding in z1_filter_edge
- 7be5347c - arm64: ipred: Improve accumulation ordering in 8bpc z1
- 6f5bf165 - arm64: ipred: Use fewer registers for table lookups in w=8 in z3_fill1 for 8bpc
- ec38062a - arm: ipred: Make a SIMD pixel_set function for padding
- 2eb92391 - arm64: ipred: 16 bpc NEON implementation of the Z1 function
- e75caab9 - arm64: ipred: 16 bpc NEON implementation of the Z3 function
enabled an automatic merge when the pipeline for e75caab9 succeeds
mentioned in issue #215
changed milestone to %1.2.0