arm64: ipred: 16 bpc NEON implementation of the Z1 and Z3 functions
- Mar 21, 2023
Martin Storsjö authored
Relative speedup over the C code:
                              Cortex A53     A55     A72     A73     A76  Apple M1
intra_pred_z3_w4_16bpc_neon:        3.06    2.87    2.17    1.97    2.33      7.75
intra_pred_z3_w8_16bpc_neon:        3.90    3.94    2.97    3.16    2.93      4.43
intra_pred_z3_w16_16bpc_neon:       4.08    4.48    3.31    4.68    3.13      5.00
intra_pred_z3_w32_16bpc_neon:       4.43    4.85    3.50    4.02    3.33      5.62
intra_pred_z3_w64_16bpc_neon:       4.68    5.30    3.72    3.96    3.52      5.78
e75caab9 -
Martin Storsjö authored
Relative speedup over the C code:
                              Cortex A53     A55     A72     A73     A76  Apple M1
intra_pred_z1_w4_16bpc_neon:        3.49    2.63    2.83    3.85    3.14      9.00
intra_pred_z1_w8_16bpc_neon:        6.19    4.39    3.65    6.58    4.99      6.50
intra_pred_z1_w16_16bpc_neon:       6.65    4.64    3.97    7.78    4.87      7.00
intra_pred_z1_w32_16bpc_neon:       7.76    5.49    5.17    7.83    5.59      8.24
intra_pred_z1_w64_16bpc_neon:       8.02    5.80    5.33    8.41    5.77      8.70
2eb92391 -
Martin Storsjö authored
For 8 bpc, there's probably not much difference to a decent memset, but for 16 bpc, there might be a bigger difference.
ec38062a -
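As a rough illustration of why 16 bpc may gain more than 8 bpc here: memset replicates a single byte, so it can fill 8-bit pixels directly, while an arbitrary 16-bit pixel value needs an explicit store loop. This is a scalar C sketch only; the names `fill8` and `fill16` are hypothetical and do not come from the codebase.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 8 bpc: every pixel is one byte, so memset does the job directly. */
static void fill8(uint8_t *dst, uint8_t px, size_t n) {
    memset(dst, px, n);
}

/* 16 bpc: memset can only replicate one byte value, so a general
 * 16-bit pixel needs an explicit loop (hypothetical helper). */
static void fill16(uint16_t *dst, uint16_t px, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = px;
}
```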
Martin Storsjö authored
Add comments explaining the exact dimensions of the gather tables used currently. That reasoning shows that the w=8 case can do with one register less.

Before:                       Cortex A53     A55     A72     A73     A76  Apple M1
intra_pred_z3_w8_8bpc_neon:        356.2   376.2   218.9   246.4   176.1       0.6
After:
intra_pred_z3_w8_8bpc_neon:        339.6   357.3   205.6   232.3   160.0       0.5
6f5bf165 -
Martin Storsjö authored
Start out the multiplication/accumulation with a register that is available sooner.

Before:                       Cortex A53     A55     A72     A73     A76  Apple M1
intra_pred_z1_w8_8bpc_neon:        266.3   268.9   146.6   155.3   103.9       0.4
intra_pred_z1_w16_8bpc_neon:       528.6   574.4   333.9   364.3   209.1       0.7
intra_pred_z1_w32_8bpc_neon:      1149.3  1245.4   752.3   811.5   503.4       1.3
intra_pred_z1_w64_8bpc_neon:      2198.4  2360.6  1462.9  1575.0  1007.6       2.4
After:
intra_pred_z1_w8_8bpc_neon:        266.3   269.1   146.6   155.0   100.1       0.4
intra_pred_z1_w16_8bpc_neon:       528.6   573.3   347.9   352.4   204.3       0.7
intra_pred_z1_w32_8bpc_neon:      1149.2  1245.3   763.4   759.6   474.8       1.3
intra_pred_z1_w64_8bpc_neon:      2198.8  2360.6  1430.0  1417.4   943.5       2.3
7be5347c -
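For context on the multiply/accumulate pair: each Z1 output pixel is a weighted blend of two adjacent edge pixels. The scalar C sketch below assumes the usual 6-bit fraction form of directional prediction, with weights summing to 64. The name `z1_blend` is hypothetical, and the ordering shown is only one possibility. Since the two products are independent, the sum can start from whichever operand is ready first.

```c
#include <stdint.h>

/* Blend two adjacent edge pixels a and b with a 6-bit fraction
 * (weights 64 - frac and frac), rounding to nearest. */
static uint16_t z1_blend(uint16_t a, uint16_t b, int frac) {
    /* Start with one product, then accumulate the other. */
    uint32_t acc = (uint32_t)b * (uint32_t)frac;
    acc += (uint32_t)a * (uint32_t)(64 - frac);
    return (uint16_t)((acc + 32) >> 6);
}
```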
Martin Storsjö authored
The second register will at most contain one valid pixel, the padding pixel. Thus skip padding the register and just fill it with the padding pixel.
92d93f4b -
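The padding shortcut can be pictured in scalar C as follows. This is a sketch under assumptions: the lane count of 8 and both function names are hypothetical. It shows only that when a register holds at most one valid pixel, and that pixel equals the padding pixel, a plain broadcast produces the same contents as loading and then padding.

```c
#include <stdint.h>

enum { LANES = 8 };  /* assumed register width in pixels */

/* General path: keep `valid` pixels from src, pad the remainder. */
static void load_and_pad(uint16_t out[LANES], const uint16_t *src,
                         int valid, uint16_t pad) {
    for (int i = 0; i < LANES; i++)
        out[i] = i < valid ? src[i] : pad;
}

/* Shortcut: the only possibly valid pixel is the padding pixel itself,
 * so broadcasting it yields the same register contents. */
static void fill_with_pad(uint16_t out[LANES], uint16_t pad) {
    for (int i = 0; i < LANES; i++)
        out[i] = pad;
}
```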
Martin Storsjö authored
There were redundant leftovers from copypasting bits when writing this function.
8ee450cb -
Martin Storsjö authored
This is for cases with h >= 16.
ab6977bc -
Martin Storsjö authored
da9602a3 -
Martin Storsjö authored
50a89b63 -