- May 25, 2024
-
-
Jean-Baptiste Kempf authored
-
- May 20, 2024
-
-
Use a slightly shorter series of instructions to compute the cdf update rate.
-
Henrik Gramner authored
Error out early instead of producing bogus mismatch errors, e.g. in the case of an incorrect CPU mask.
-
- May 19, 2024
-
-
Martin Storsjö authored
The ldr instruction can take an immediate offset which is a multiple of the loaded element size. If the ldr instruction is given an immediate offset which isn't a multiple of the element size, most assemblers implicitly generate a "ldur" instruction instead. Older versions of MS armasm64.exe don't do this, but instead error out with "error A2518: operand 2: Memory offset must be aligned". (Current versions don't do this but correctly generate "ldur" implicitly.) Switch this instruction to an explicit "ldur", like we do elsewhere, to fix building with these older tools.
-
- May 18, 2024
-
-
NDK 26 dropped support for API versions 19 and 20 (KitKat, Android 4.4). The minimum supported API is now 21 (Lollipop, Android 5.0).
-
- May 14, 2024
-
-
Kyle Siefring authored
Changes stem from redesigning the reduction stage of the multisymbol decode function.
* No longer use adapt4 for 5 possible symbol values
* Specialize reduction for 4/8/16 decode functions
* Modify control flow

+------------------------+--------------+--------------+---------------+
|                        | Neoverse V1  | Neoverse N1  | Cortex A72    |
|                        | (Graviton 3) | (Graviton 2) | (Graviton 1)  |
+------------------------+-------+------+-------+------+-------+-------+
|                        | Old   | New  | Old   | New  | Old   | New   |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_neon       | 13.0  | 12.9 | 14.9  | 14.0 | 39.3  | 29.0  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_adapt_neon | 15.4  | 15.6 | 17.5  | 16.8 | 41.6  | 33.5  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_bool_equi_neon  | 11.3  | 12.0 | 14.0  | 12.2 | 35.0  | 26.3  |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_c        | 73.7  | 57.8 | 73.4  | 60.5 | 130.1 | 103.9 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_hi_tok_neon     | 63.3  | 48.2 | 65.2  | 51.2 | 119.0 | 105.3 |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 28.6  | 22.5 | 28.4  | 23.5 | 67.8  | 55.1  |
| adapt4_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 29.5  | 26.6 | 29.0  | 28.8 | 76.6  | 74.0  |
| adapt8_neon            |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
| decode_symbol_\        | 31.6  | 31.2 | 33.3  | 33.0 | 77.5  | 68.1  |
| adapt16_neon           |       |      |       |      |       |       |
+------------------------+-------+------+-------+------+-------+-------+
-
Optimize the widening copy part of subpel filters (the prep_neon function). In this patch we combine widening shifts with widening multiplications in the inner loops to get maximum throughput. The change will increase .text by 36 bytes.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  mct_w4:   0.795x
  mct_w8:   0.913x
  mct_w16:  0.912x
  mct_w32:  0.838x
  mct_w64:  1.025x
  mct_w128: 1.002x
Cortex-A510:
  mct_w4:   0.760x
  mct_w8:   0.636x
  mct_w16:  0.640x
  mct_w32:  0.854x
  mct_w64:  0.864x
  mct_w128: 0.995x
Cortex-A72:
  mct_w4:   0.616x
  mct_w8:   0.854x
  mct_w16:  0.756x
  mct_w32:  1.052x
  mct_w64:  1.044x
  mct_w128: 0.702x
Cortex-A76:
  mct_w4:   0.837x
  mct_w8:   0.797x
  mct_w16:  0.841x
  mct_w32:  0.804x
  mct_w64:  0.948x
  mct_w128: 0.904x
Cortex-A78:
  mct_w16:  0.542x
  mct_w32:  0.725x
  mct_w64:  0.741x
  mct_w128: 0.745x
Cortex-A715:
  mct_w16:  0.561x
  mct_w32:  0.720x
  mct_w64:  0.740x
  mct_w128: 0.748x
Cortex-X1:
  mct_w32:  0.886x
  mct_w64:  0.882x
  mct_w128: 0.917x
Cortex-X3:
  mct_w32:  0.835x
  mct_w64:  0.803x
  mct_w128: 0.808x
-
Save a complex arithmetic instruction in the jump table address calculation of the prep_neon function.
-
Move the BTI landing pads out of the inner loops of the prep_neon function. Only the width=4 and width=8 cases are affected. When BTI is enabled, moving AARCH64_VALID_JUMP_TARGET out of the inner loops gives better execution speed on Cortex-A510 relative to the original (lower is better):
  w4: 0.969x
  w8: 0.722x
Out-of-order cores are not affected.
-
- May 13, 2024
-
-
Arpad Panyik authored
Optimize the copy part of subpel filters (the put_neon function). For small block sizes (<16) the usage of general purpose registers is usually the best way to do the copy.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  w2:   0.991x
  w4:   0.992x
  w8:   0.999x
  w16:  0.875x
  w32:  0.775x
  w64:  0.914x
  w128: 0.998x
Cortex-A510:
  w2:   0.159x
  w4:   0.080x
  w8:   0.583x
  w16:  0.588x
  w32:  0.966x
  w64:  1.111x
  w128: 0.957x
Cortex-A76:
  w2:   0.903x
  w4:   0.683x
  w8:   0.944x
  w16:  0.948x
  w32:  0.919x
  w64:  0.855x
  w128: 0.991x
Cortex-A78:
  w32:  0.867x
  w64:  0.820x
  w128: 1.011x
Cortex-A715:
  w32:  0.834x
  w64:  0.778x
  w128: 1.000x
Cortex-X1:
  w32:  0.809x
  w64:  0.762x
  w128: 1.000x
Cortex-X3:
  w32:  0.733x
  w64:  0.720x
  w128: 0.999x
-
Arpad Panyik authored
Save a complex arithmetic instruction in the jump table address calculation of the put_neon function.
-
Arpad Panyik authored
Move the BTI landing pads out of the inner loops of the put_neon function; the only exception is the width=16 case, where it is already outside the loops. When BTI is enabled, the relative performance of omitting AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 (lower is better):
  w2:   0.981x
  w4:   0.991x
  w8:   0.612x
  w32:  0.687x
  w64:  0.813x
  w128: 0.892x
Out-of-order CPUs are mostly unaffected.
-
Henrik Gramner authored
-
Henrik Gramner authored
Both POSIX and the C standard place several environmental limits on setjmp() invocations: essentially anything beyond comparing the return value with a constant as a simple branch condition is UB. We were previously performing a function call using the setjmp() return value as an argument, which is technically not allowed even though it happened to work correctly in practice. Some systems may loosen those restrictions and allow more flexible usage, but we shouldn't rely on that.
-
- May 12, 2024
-
-
Remove some unnecessary vector register copies from the initial horizontal filter parts of the HV subpel filters. The performance improvements are larger for the smaller filter block sizes. The narrowing shifts at the end of *filter8* were also rewritten, because the old sequence was only beneficial on Cortex-A55 among the DotProd-capable CPU cores; on other out-of-order or newer CPUs the UZP1+SHRN instruction combination is better.

Relative performance of micro benchmarks (lower is better):

Cortex-A55:
  mct regular w4:  0.980x
  mct regular w8:  1.007x
  mct regular w16: 1.007x
  mct sharp w4:    0.983x
  mct sharp w8:    1.012x
  mct sharp w16:   1.005x
Cortex-A510:
  mct regular w4:  0.935x
  mct regular w8:  0.984x
  mct regular w16: 0.986x
  mct sharp w4:    0.927x
  mct sharp w8:    0.983x
  mct sharp w16:   0.987x
Cortex-A78:
  mct regular w4:  0.974x
  mct regular w8:  0.988x
  mct regular w16: 0.991x
  mct sharp w4:    0.971x
  mct sharp w8:    0.987x
  mct sharp w16:   0.979x
Cortex-A715:
  mct regular w4:  0.958x
  mct regular w8:  0.993x
  mct regular w16: 0.998x
  mct sharp w4:    0.974x
  mct sharp w8:    0.991x
  mct sharp w16:   0.997x
Cortex-X1:
  mct regular w4:  0.983x
  mct regular w8:  0.993x
  mct regular w16: 0.996x
  mct sharp w4:    0.974x
  mct sharp w8:    0.990x
  mct sharp w16:   0.995x
Cortex-X3:
  mct regular w4:  0.953x
  mct regular w8:  0.993x
  mct regular w16: 0.997x
  mct sharp w4:    0.981x
  mct sharp w8:    0.993x
  mct sharp w16:   0.995x
-
- May 10, 2024
-
-
Luca Barbato authored
It relies on vec_absd and vec_xst_len.
-
Luca Barbato authored
-
Luca Barbato authored
Will be used to gate code using vec_absd and other useful instructions.
-
- May 09, 2024
-
-
Arpad Panyik authored
Rewrite the accumulator initializations of the horizontal part of the 2D filters with zero register fills. This can improve performance on out-of-order CPUs, which can zero vector registers with zero latency. Zeroed accumulators imply the use of rounding shifts at the end of the filters. The only exception is the very short *hv_filter4*, where the longer latency of the rounding shift could decrease performance. The *filter8* function uses a different (alternating) dot product computation order for the DotProd+ feature level, which gives better overall performance on out-of-order and some in-order CPU cores. The i8mm version does not need to use a bias for the loaded samples, so a different instruction scheduling is beneficial, mostly affecting the order of TBL instructions in the 8-tap case.

Relative performance of micro benchmarks (lower is better):

Cortex-X3:
  mct_8tap_regular_w16_hv_8bpc_i8mm: 0.982x
  mct_8tap_sharp_w16_hv_8bpc_i8mm:   0.979x
  mct_8tap_regular_w8_hv_8bpc_i8mm:  0.972x
  mct_8tap_sharp_w8_hv_8bpc_i8mm:    0.969x
  mct_8tap_regular_w4_hv_8bpc_i8mm:  0.942x
  mct_8tap_sharp_w4_hv_8bpc_i8mm:    0.935x
  mc_8tap_regular_w16_hv_8bpc_i8mm:  0.988x
  mc_8tap_sharp_w16_hv_8bpc_i8mm:    0.982x
  mc_8tap_regular_w8_hv_8bpc_i8mm:   0.981x
  mc_8tap_sharp_w8_hv_8bpc_i8mm:     0.975x
  mc_8tap_regular_w4_hv_8bpc_i8mm:   0.998x
  mc_8tap_sharp_w4_hv_8bpc_i8mm:     0.996x
  mc_8tap_regular_w2_hv_8bpc_i8mm:   1.006x
  mc_8tap_sharp_w2_hv_8bpc_i8mm:     0.993x
Cortex-A715:
  mct_8tap_regular_w16_hv_8bpc_i8mm: 0.883x
  mct_8tap_sharp_w16_hv_8bpc_i8mm:   0.931x
  mct_8tap_regular_w8_hv_8bpc_i8mm:  0.882x
  mct_8tap_sharp_w8_hv_8bpc_i8mm:    0.928x
  mct_8tap_regular_w4_hv_8bpc_i8mm:  0.969x
  mct_8tap_sharp_w4_hv_8bpc_i8mm:    0.934x
  mc_8tap_regular_w16_hv_8bpc_i8mm:  0.881x
  mc_8tap_sharp_w16_hv_8bpc_i8mm:    0.925x
  mc_8tap_regular_w8_hv_8bpc_i8mm:   0.879x
  mc_8tap_sharp_w8_hv_8bpc_i8mm:     0.925x
  mc_8tap_regular_w4_hv_8bpc_i8mm:   0.917x
  mc_8tap_sharp_w4_hv_8bpc_i8mm:     0.976x
  mc_8tap_regular_w2_hv_8bpc_i8mm:   0.915x
  mc_8tap_sharp_w2_hv_8bpc_i8mm:     0.972x
Cortex-A510:
  mct_8tap_regular_w16_hv_8bpc_i8mm: 0.994x
  mct_8tap_sharp_w16_hv_8bpc_i8mm:   0.949x
  mct_8tap_regular_w8_hv_8bpc_i8mm:  0.987x
  mct_8tap_sharp_w8_hv_8bpc_i8mm:    0.947x
  mct_8tap_regular_w4_hv_8bpc_i8mm:  1.002x
  mct_8tap_sharp_w4_hv_8bpc_i8mm:    0.999x
  mc_8tap_regular_w16_hv_8bpc_i8mm:  0.989x
  mc_8tap_sharp_w16_hv_8bpc_i8mm:    1.003x
  mc_8tap_regular_w8_hv_8bpc_i8mm:   0.986x
  mc_8tap_sharp_w8_hv_8bpc_i8mm:     1.000x
  mc_8tap_regular_w4_hv_8bpc_i8mm:   1.007x
  mc_8tap_sharp_w4_hv_8bpc_i8mm:     1.000x
  mc_8tap_regular_w2_hv_8bpc_i8mm:   1.005x
  mc_8tap_sharp_w2_hv_8bpc_i8mm:     1.000x
-
- May 08, 2024
-
-
Arpad Panyik authored
Replace the accumulator initializations of the vertical subpel filters with register fills by zeros (usually zero-latency operations in this feature class); this implies the use of rounding shifts at the end in the prep cases. Out-of-order CPU cores can benefit from this change. The width=16 case uses a simpler register duplication scheme that relies on MOV instructions for the subsequent shuffles. This approach loads the data into a different register for better instruction scheduling and data dependency chains.

Relative performance of micro benchmarks (lower is better):

Cortex-X3:
  mct_8tap_sharp_w16_v_8bpc_i8mm: 0.910x
  mct_8tap_sharp_w8_v_8bpc_i8mm:  0.986x
  mc_8tap_sharp_w16_v_8bpc_i8mm:  0.864x
  mc_8tap_sharp_w8_v_8bpc_i8mm:   0.882x
  mc_8tap_sharp_w4_v_8bpc_i8mm:   0.933x
  mc_8tap_sharp_w2_v_8bpc_i8mm:   0.926x
Cortex-A715:
  mct_8tap_sharp_w16_v_8bpc_i8mm: 0.855x
  mct_8tap_sharp_w8_v_8bpc_i8mm:  0.784x
  mct_8tap_sharp_w4_v_8bpc_i8mm:  1.069x
  mc_8tap_sharp_w16_v_8bpc_i8mm:  0.850x
  mc_8tap_sharp_w8_v_8bpc_i8mm:   0.779x
  mc_8tap_sharp_w4_v_8bpc_i8mm:   0.971x
  mc_8tap_sharp_w2_v_8bpc_i8mm:   0.975x
Cortex-A510:
  mct_8tap_sharp_w16_v_8bpc_i8mm: 1.001x
  mct_8tap_sharp_w8_v_8bpc_i8mm:  0.979x
  mct_8tap_sharp_w4_v_8bpc_i8mm:  0.998x
  mc_8tap_sharp_w16_v_8bpc_i8mm:  0.998x
  mc_8tap_sharp_w8_v_8bpc_i8mm:   1.004x
  mc_8tap_sharp_w4_v_8bpc_i8mm:   1.003x
  mc_8tap_sharp_w2_v_8bpc_i8mm:   0.996x
-
Replace the accumulator initializations of the horizontal prep filters with register fills by zeros. Most i8mm-capable CPUs can do these with zero latency, but this also requires rounding shifts at the end of the filter. We see better performance with this change on out-of-order CPUs.

Relative performance of micro benchmarks (lower is better):

Cortex-X3:
  mct_8tap_sharp_w32_h_8bpc_i8mm: 0.914x
  mct_8tap_sharp_w16_h_8bpc_i8mm: 0.906x
  mct_8tap_sharp_w8_h_8bpc_i8mm:  0.877x
Cortex-A715:
  mct_8tap_sharp_w32_h_8bpc_i8mm: 0.819x
  mct_8tap_sharp_w16_h_8bpc_i8mm: 0.805x
  mct_8tap_sharp_w8_h_8bpc_i8mm:  0.779x
Cortex-A510:
  mct_8tap_sharp_w32_h_8bpc_i8mm: 0.999x
  mct_8tap_sharp_w16_h_8bpc_i8mm: 1.001x
  mct_8tap_sharp_w8_h_8bpc_i8mm:  0.996x
  mct_8tap_sharp_w4_h_8bpc_i8mm:  0.915x
-
- May 06, 2024
-
-
Nathan E. Egge authored
-
- May 01, 2024
-
-
Similar to 4796b59f.
-
-
- Apr 29, 2024
-
-
Henrik Gramner authored
-
Henrik Gramner authored
-
- Apr 26, 2024
-
-
Martin Storsjö authored
The --cpumask flag only takes a single flag name; one can't set a combination like neon+dotprod. Therefore, apply the same pattern as for x86 by adding mask values that contain all the implied lower-level flags. This is somewhat complicated, as the set of features isn't entirely linear - in particular, SVE implies neither dotprod nor i8mm, and SVE2 implies dotprod but not i8mm. This makes sure that "dav1d --cpumask dotprod" actually uses any SIMD at all; previously it only set the dotprod flag but not neon, which essentially opted out of all SIMD.
-
Arpad Panyik authored
Add an Armv8.6-A i8mm code path for standard bitdepth convolutions. Only horizontal-vertical (HV) convolutions have 6-tap specialisations of their vertical passes. All other convolutions are 4- or 8-tap filters which fit well with the 4-element USDOT instruction. Benchmarks show 4-9% FPS increase relative to the Armv8.4-A code path depending on the input video and the CPU used. This patch will increase the .text by around 5.7 KiB.

Relative performance to the C reference on some Cortex CPU cores:

                          Cortex-A715  Cortex-X3  Cortex-A510
regular w4 hv neon:          7.20x      11.20x      4.40x
regular w4 hv dotprod:      12.77x      18.35x      6.21x
regular w4 hv i8mm:         14.50x      21.42x      6.16x
sharp w4 hv neon:            6.24x       9.77x      3.96x
sharp w4 hv dotprod:         9.76x      14.02x      5.20x
sharp w4 hv i8mm:           10.84x      16.09x      5.42x
regular w8 hv neon:          2.17x       2.46x      3.17x
regular w8 hv dotprod:       3.04x       3.11x      3.03x
regular w8 hv i8mm:          3.57x       3.40x      3.27x
sharp w8 hv neon:            1.72x       1.93x      2.75x
sharp w8 hv dotprod:         2.49x       2.54x      2.62x
sharp w8 hv i8mm:            2.80x       2.79x      2.70x
regular w16 hv neon:         1.90x       2.17x      2.02x
regular w16 hv dotprod:      2.59x       2.64x      1.93x
regular w16 hv i8mm:         3.01x       2.85x      2.05x
sharp w16 hv neon:           1.51x       1.72x      1.74x
sharp w16 hv dotprod:        2.17x       2.22x      1.70x
sharp w16 hv i8mm:           2.42x       2.42x      1.72x
regular w32 hv neon:         1.80x       1.96x      1.81x
regular w32 hv dotprod:      2.43x       2.36x      1.74x
regular w32 hv i8mm:         2.83x       2.51x      1.83x
sharp w32 hv neon:           1.42x       1.54x      1.56x
sharp w32 hv dotprod:        2.07x       2.00x      1.55x
sharp w32 hv i8mm:           2.29x       2.16x      1.55x
regular w64 hv neon:         1.82x       1.89x      1.70x
regular w64 hv dotprod:      2.43x       2.25x      1.65x
regular w64 hv i8mm:         2.84x       2.39x      1.73x
sharp w64 hv neon:           1.43x       1.47x      1.49x
sharp w64 hv dotprod:        2.08x       1.91x      1.49x
sharp w64 hv i8mm:           2.30x       2.07x      1.48x
regular w128 hv neon:        1.77x       1.84x      1.75x
regular w128 hv dotprod:     2.37x       2.18x      1.70x
regular w128 hv i8mm:        2.76x       2.33x      1.78x
sharp w128 hv neon:          1.40x       1.45x      1.42x
sharp w128 hv dotprod:       2.04x       1.87x      1.43x
sharp w128 hv i8mm:          2.24x       2.02x      1.42x
regular w8 h neon:           3.16x       3.51x      3.43x
regular w8 h dotprod:        4.97x       7.43x      4.95x
regular w8 h i8mm:           7.28x      10.38x      5.69x
sharp w8 h neon:             2.71x       2.77x      3.10x
sharp w8 h dotprod:          4.92x       7.14x      4.94x
sharp w8 h i8mm:             7.21x      10.11x      5.70x
regular w16 h neon:          2.79x       2.76x      3.53x
regular w16 h dotprod:       3.81x       4.77x      3.13x
regular w16 h i8mm:          5.21x       6.04x      3.56x
sharp w16 h neon:            2.31x       2.38x      3.12x
sharp w16 h dotprod:         3.80x       4.74x      3.13x
sharp w16 h i8mm:            5.20x       5.98x      3.56x
regular w64 h neon:          2.49x       2.46x      2.94x
regular w64 h dotprod:       3.17x       3.60x      2.41x
regular w64 h i8mm:          4.22x       4.40x      2.72x
sharp w64 h neon:            2.07x       2.06x      2.60x
sharp w64 h dotprod:         3.16x       3.58x      2.40x
sharp w64 h i8mm:            4.20x       4.38x      2.71x
regular w8 v neon:           6.11x       8.05x      4.07x
regular w8 v dotprod:        5.45x       8.15x      4.01x
regular w8 v i8mm:           7.30x       9.46x      4.19x
sharp w8 v neon:             4.23x       5.46x      3.09x
sharp w8 v dotprod:          5.43x       7.96x      4.01x
sharp w8 v i8mm:             7.26x       9.12x      4.19x
regular w16 v neon:          3.44x       4.33x      2.40x
regular w16 v dotprod:       3.20x       4.53x      2.85x
regular w16 v i8mm:          4.09x       5.27x      2.87x
sharp w16 v neon:            2.50x       3.14x      1.82x
sharp w16 v dotprod:         3.20x       4.52x      2.86x
sharp w16 v i8mm:            4.09x       5.15x      2.86x
regular w64 v neon:          2.74x       3.11x      1.53x
regular w64 v dotprod:       2.63x       3.30x      1.84x
regular w64 v i8mm:          3.31x       3.73x      1.84x
sharp w64 v neon:            2.01x       2.29x      1.16x
sharp w64 v dotprod:         2.61x       3.27x      1.83x
sharp w64 v i8mm:            3.29x       3.68x      1.84x
-
- Apr 25, 2024
-
-
Arpad Panyik authored
Simplify the DotProd code path of the 2D (horizontal-vertical) subpel filters. It contains some instruction reordering and some macro simplifications to be more similar to the upcoming i8mm version. These changes have negligible effect on performance.

Cortex-A510:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 8.3769 -> 8.3380
  mc_8tap_sharp_w2_hv_8bpc_dotprod:   9.5441 -> 9.5457
  mc_8tap_regular_w4_hv_8bpc_dotprod: 8.3422 -> 8.3444
  mc_8tap_sharp_w4_hv_8bpc_dotprod:   9.5441 -> 9.5367
  mc_8tap_regular_w8_hv_8bpc_dotprod: 9.9852 -> 9.9666
  mc_8tap_sharp_w8_hv_8bpc_dotprod:   12.5554 -> 12.5314
Cortex-A55:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 6.4504 -> 6.4892
  mc_8tap_sharp_w2_hv_8bpc_dotprod:   7.5732 -> 7.6078
  mc_8tap_regular_w4_hv_8bpc_dotprod: 6.5088 -> 6.4760
  mc_8tap_sharp_w4_hv_8bpc_dotprod:   7.5796 -> 7.5763
  mc_8tap_regular_w8_hv_8bpc_dotprod: 9.3384 -> 9.3078
  mc_8tap_sharp_w8_hv_8bpc_dotprod:   11.1159 -> 11.1401
Cortex-A78:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 1.4122 -> 1.4250
  mc_8tap_sharp_w2_hv_8bpc_dotprod:   1.7696 -> 1.7821
  mc_8tap_regular_w4_hv_8bpc_dotprod: 1.4243 -> 1.4243
  mc_8tap_sharp_w4_hv_8bpc_dotprod:   1.7866 -> 1.7863
  mc_8tap_regular_w8_hv_8bpc_dotprod: 2.5304 -> 2.5171
  mc_8tap_sharp_w8_hv_8bpc_dotprod:   3.0815 -> 3.0632
Cortex-X1:
  mc_8tap_regular_w2_hv_8bpc_dotprod: 0.8195 -> 0.8194
  mc_8tap_sharp_w2_hv_8bpc_dotprod:   1.0092 -> 1.0081
  mc_8tap_regular_w4_hv_8bpc_dotprod: 0.8197 -> 0.8166
  mc_8tap_sharp_w4_hv_8bpc_dotprod:   1.0089 -> 1.0068
  mc_8tap_regular_w8_hv_8bpc_dotprod: 1.5230 -> 1.5166
  mc_8tap_sharp_w8_hv_8bpc_dotprod:   1.8683 -> 1.8625
-
Arpad Panyik authored
Simplify the load sequences in *hv_filter* functions (ldr + add -> ld1) to be more uniform and smaller. Performance is not affected.
-
Arpad Panyik authored
Simplify the TBL usages in the small block size (2, 4) parts of the 2D (horizontal-vertical) put subpel filters. The 2-register TBLs are replaced with the 1-register form because we only need the lower 64 bits of the result, which can be extracted from a single source register. Performance is not affected by this change.
-
Arpad Panyik authored
Simplify the inner loops of the DotProd code path of horizontal subpel filters to avoid using 2-register TBL instructions. The store part of block size 16 of the horizontal put case is also simplified (str + add -> st1). This patch can improve performance mostly on small cores like Cortex-A510 and newer. Other CPUs are mostly unaffected.

Cortex-A510:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 2.77x -> 3.13x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 2.32x -> 2.56x
Cortex-A55:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 3.89x -> 3.89x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.35x -> 3.35x
Cortex-A715:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 3.79x -> 3.78x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.30x -> 3.30x
Cortex-A78:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 4.30x -> 4.31x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.79x -> 3.80x
Cortex-X3:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 4.74x -> 4.75x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.89x -> 3.91x
Cortex-X1:
  mct_8tap_sharp_w16_h_8bpc_dotprod: 4.61x -> 4.62x
  mct_8tap_sharp_w32_h_8bpc_dotprod: 3.67x -> 3.66x
-
Arpad Panyik authored
Simplify the accumulator initializations of the DotProd code path of vertical subpel filters. This also makes it possible for some CPUs to use zero-latency vector register moves. The load is also simplified (ldr + add -> ld1) in the inner loop of the vertical filter for block size 16.
-
Arpad Panyik authored
Add a \dot parameter to the filter_8tap_fn macro in preparation for extending it with an i8mm code path. This patch also contains string fixes and some instruction reordering along with some register renaming to make the code more uniform. These changes don't affect performance but simplify the code a bit.
-
- Apr 22, 2024
-
-
Martin Storsjö authored
Manually add a padding 0 entry to make the odd number of .hword entries align with the instruction size. This fixes assembling with GAS with the --gdwarf2 option, where it previously produced the error message "unaligned opcodes detected in executable segment". The message is slightly misleading, as the error is printed even if no opcodes actually are misaligned, since the jump table is the last thing within the .text section. The issue can be reproduced with an input as small as this, assembled with "as --gdwarf2 -c test.s":

    .text
    nop
    .hword 0

See a6228f47 for earlier cases of the same error - although in those cases, we actually did have more code and labels following the unaligned jump tables. This error is present with binutils 2.39 and earlier; in binutils 2.40, this input is no longer considered an error, fixed in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.
-
- Apr 21, 2024
-
-
One addressing optimization, plus some missing changes from a previous commit that ported improvements from the hi tok function to the other decode tok functions.
-
- Apr 16, 2024
-
-
Matthias Dressel authored
Since dav1d was the only user of these crossfiles, it was agreed to remove them from the image [0] and move them into dav1d directly [1].

[0] docker-images!293
[1] docker-images!294 (comment 434720)
-
- Apr 15, 2024
-
-
Port improvements from the hi token functions to the rest of the symbol adaptation functions. These weren't originally ported since they didn't work with arbitrary padding. In practice, zero padding is already used and only the tests need to be updated.

Results - Neoverse N1

Old:
  msac_decode_symbol_adapt4_c:     41.4 ( 1.00x)
  msac_decode_symbol_adapt4_neon:  31.0 ( 1.34x)
  msac_decode_symbol_adapt8_c:     54.5 ( 1.00x)
  msac_decode_symbol_adapt8_neon:  32.2 ( 1.69x)
  msac_decode_symbol_adapt16_c:    85.6 ( 1.00x)
  msac_decode_symbol_adapt16_neon: 37.5 ( 2.28x)
New:
  msac_decode_symbol_adapt4_c:     41.5 ( 1.00x)
  msac_decode_symbol_adapt4_neon:  27.7 ( 1.50x)
  msac_decode_symbol_adapt8_c:     55.7 ( 1.00x)
  msac_decode_symbol_adapt8_neon:  30.1 ( 1.85x)
  msac_decode_symbol_adapt16_c:    82.4 ( 1.00x)
  msac_decode_symbol_adapt16_neon: 35.2 ( 2.34x)
-
Henrik Gramner authored
6-tap filtering is only performed vertically, due to the use of VNNI instructions, which process 4 pixels per instruction horizontally.
-