- Sep 03, 2021
-
-
Jean-Baptiste Kempf authored
-
Jean-Baptiste Kempf authored
-
Jean-Baptiste Kempf authored
and links
-
Matthias Dressel authored
-
Matthias Dressel authored
-
Martin Storsjö authored
Relative speedup over C code:
Cortex                               A7     A8     A9    A53    A72    A73
gen_grain_uv_ar0_16bpc_420_neon:   5.05   6.71   5.42   4.95   6.45   9.59
gen_grain_uv_ar0_16bpc_422_neon:   5.54   7.18   6.29   5.45   6.55   8.80
gen_grain_uv_ar0_16bpc_444_neon:   6.64   8.07   6.70   6.89   7.16   9.98
gen_grain_uv_ar1_16bpc_420_neon:   3.22   2.16   2.58   3.51   3.16   4.68
gen_grain_uv_ar1_16bpc_422_neon:   3.24   2.26   2.73   3.83   3.36   4.65
gen_grain_uv_ar1_16bpc_444_neon:   3.48   2.41   2.85   4.32   3.69   4.90
gen_grain_uv_ar2_16bpc_420_neon:   3.29   2.90   2.92   4.14   3.48   4.59
gen_grain_uv_ar2_16bpc_422_neon:   3.35   3.01   3.13   4.50   3.61   4.50
gen_grain_uv_ar2_16bpc_444_neon:   3.66   3.55   3.32   5.15   3.87   4.93
gen_grain_uv_ar3_16bpc_420_neon:   3.39   3.79   3.60   4.67   4.04   4.70
gen_grain_uv_ar3_16bpc_422_neon:   3.39   4.04   3.96   4.93   4.16   4.65
gen_grain_uv_ar3_16bpc_444_neon:   3.79   4.47   4.36   5.54   4.59   5.07
gen_grain_y_ar0_16bpc_neon:        5.05   5.26   6.97   5.47   5.95   8.59
gen_grain_y_ar1_16bpc_neon:        2.35   1.72   2.07   3.53   3.16   3.47
gen_grain_y_ar2_16bpc_neon:        3.02   2.70   2.88   4.19   3.57   4.03
gen_grain_y_ar3_16bpc_neon:        3.49   3.18   3.69   5.01   3.99   4.50
-
Martin Storsjö authored
-
Martin Storsjö authored
This makes it correctly hit some conditions that avoid duplicated code, shrinking the text section by 1524 bytes.
-
- Sep 02, 2021
-
-
-
Allows for sharing more common code
-
- Sep 01, 2021
-
-
Martin Storsjö authored
Relative speedup over C code:
Cortex                              A7     A8     A9    A53    A72    A73
gen_grain_uv_ar0_8bpc_420_neon:   6.13   7.81   8.17   6.78   6.62  11.13
gen_grain_uv_ar0_8bpc_422_neon:   6.34   7.64   8.00   6.83   6.93  10.31
gen_grain_uv_ar0_8bpc_444_neon:   7.09   8.29   8.55   7.95   7.89  11.05
gen_grain_uv_ar1_8bpc_420_neon:   3.39   2.26   3.06   4.13   3.41   4.95
gen_grain_uv_ar1_8bpc_422_neon:   3.40   2.23   3.02   4.18   3.36   4.73
gen_grain_uv_ar1_8bpc_444_neon:   3.46   2.18   2.95   4.46   3.57   4.91
gen_grain_uv_ar2_8bpc_420_neon:   3.88   3.00   3.32   4.74   3.57   5.31
gen_grain_uv_ar2_8bpc_422_neon:   3.92   3.04   3.36   4.82   3.57   5.06
gen_grain_uv_ar2_8bpc_444_neon:   4.32   3.14   3.62   5.56   3.90   5.43
gen_grain_uv_ar3_8bpc_420_neon:   4.35   3.53   4.05   5.35   4.44   5.56
gen_grain_uv_ar3_8bpc_422_neon:   4.38   3.49   4.17   5.41   4.48   5.36
gen_grain_uv_ar3_8bpc_444_neon:   4.84   3.70   4.36   5.95   4.87   5.82
gen_grain_y_ar0_8bpc_neon:        5.18   5.57   7.65   5.93   7.13   9.01
gen_grain_y_ar1_8bpc_neon:        2.64   1.66   2.48   3.32   3.15   3.77
gen_grain_y_ar2_8bpc_neon:        3.57   2.64   3.21   4.59   3.68   4.64
gen_grain_y_ar3_8bpc_neon:        4.27   3.93   4.12   5.41   4.63   5.17

(A73 is benched against C code compiled with a different C compiler, which can explain the slightly differing numbers there.)

Absolute numbers:
Cortex                                 A7        A8        A9       A53       A72       A73
gen_grain_uv_ar0_8bpc_420_neon:   19614.6   13396.4   12320.4   15030.7    8288.1    8754.4
gen_grain_uv_ar0_8bpc_422_neon:   34660.9   24315.5   22225.3   26809.2   14549.8   15804.6
gen_grain_uv_ar0_8bpc_444_neon:   55625.6   39914.5   37100.2   44658.3   22917.3   27369.6
gen_grain_uv_ar1_8bpc_420_neon:   50049.5   63179.4   44793.1   36406.7   22690.3   25401.9
gen_grain_uv_ar1_8bpc_422_neon:   93289.5  117755.0   82815.4   67081.4   43133.1   46698.0
gen_grain_uv_ar1_8bpc_444_neon:  170880.0  223259.2  156241.5  122760.0   78655.6   85604.9
gen_grain_uv_ar2_8bpc_420_neon:   68185.5   78123.2   61457.3   47886.7   31526.2   36519.6
gen_grain_uv_ar2_8bpc_422_neon:  129195.2  148653.9  114133.2   89822.7   60242.6   70160.1
gen_grain_uv_ar2_8bpc_444_neon:  233133.7  272277.4  214108.7  161589.5  109069.3  127763.7
gen_grain_uv_ar3_8bpc_420_neon:   96374.4   94372.2   79663.8   70832.0   43065.3   50593.9
gen_grain_uv_ar3_8bpc_422_neon:  186324.8  184321.8  151490.1  136200.1   83758.0   98378.7
gen_grain_uv_ar3_8bpc_444_neon:  335596.6  336811.6  279755.5  247251.5  151657.2  178906.0
gen_grain_y_ar0_8bpc_neon:        46109.3   36022.2   28476.2   36478.5   18740.1   20660.4
gen_grain_y_ar1_8bpc_neon:       165054.2  217090.4  152578.9  118409.4   74357.2   83794.5
gen_grain_y_ar2_8bpc_neon:       226576.9  268320.3  210924.6  157829.4  105956.5  124293.2
gen_grain_y_ar3_8bpc_neon:       328337.2  330421.3  275110.1  242097.3  148538.7  177270.8

Corresponding numbers for the original arm64 version:
Cortex                                A53       A72       A73
gen_grain_uv_ar0_8bpc_420_neon:   14874.7    7765.5    8536.0
gen_grain_uv_ar0_8bpc_422_neon:   26510.9   13685.3   15308.2
gen_grain_uv_ar0_8bpc_444_neon:   43189.6   21565.3   24312.0
gen_grain_uv_ar1_8bpc_420_neon:   33715.7   21669.8   22758.3
gen_grain_uv_ar1_8bpc_422_neon:   63955.3   41581.4   42852.5
gen_grain_uv_ar1_8bpc_444_neon:  117390.1   76503.5   78446.4
gen_grain_uv_ar2_8bpc_420_neon:   42779.0   27794.3   29677.9
gen_grain_uv_ar2_8bpc_422_neon:   82283.8   53446.7   58232.2
gen_grain_uv_ar2_8bpc_444_neon:  147773.8   98492.7  103754.1
gen_grain_uv_ar3_8bpc_420_neon:   56698.8   35697.1   40695.9
gen_grain_uv_ar3_8bpc_422_neon:  110132.4   69829.1   79196.8
gen_grain_uv_ar3_8bpc_444_neon:  196642.7  124174.9  141812.5
gen_grain_y_ar0_8bpc_neon:        36461.0   17782.0   19827.0
gen_grain_y_ar1_8bpc_neon:       113202.7   72457.7   75995.8
gen_grain_y_ar2_8bpc_neon:       142894.0   94450.9  100304.5
gen_grain_y_ar3_8bpc_neon:       191697.7  120674.9  137223.8
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Martin Storsjö authored
This should improve scheduling on in-order cores.
-
- Aug 31, 2021
-
-
Henrik Gramner authored
Requires meson 0.51 or newer (older versions will just keep the SSE).
-
- Aug 30, 2021
-
-
Also tweak w_mask_420 to simplify sharing of common code.
-
- Aug 26, 2021
-
-
Henrik Gramner authored
Silences warnings when building using recent meson versions.
-
- Aug 24, 2021
-
-
Martin Storsjö authored
Relative speedup over C code, for arm64:
Cortex                   A53       A72       A73  Apple M1
splat_mv_w1_neon:       1.09      0.95      1.22         -
splat_mv_w2_neon:       1.76      1.32      1.74         -
splat_mv_w4_neon:       2.78      2.19      2.19     15.00
splat_mv_w8_neon:       3.59      2.06      2.59     12.00
splat_mv_w16_neon:      4.12      1.72      2.53      3.14
splat_mv_w32_neon:      4.07      1.60      2.40      3.00

(The resolution of the timer used on Apple M1 isn't enough to measure the small versions of this function.)

Relative speedup over C code, for arm32:
Cortex                 A7     A8     A9    A53    A72    A73
splat_mv_w1_neon:    0.70   1.12   0.91   0.65   1.01   1.06
splat_mv_w2_neon:    0.94   2.16   2.01   0.99   2.52   1.63
splat_mv_w4_neon:    1.27   2.04   1.49   1.52   1.75   2.18
splat_mv_w8_neon:    1.75   2.47   1.16   2.88   1.95   2.58
splat_mv_w16_neon:   2.00   2.44   1.12   3.25   1.85   2.65
splat_mv_w32_neon:   1.43   2.28   1.19   3.55   1.77   2.65
-
- Aug 23, 2021
-
-
Henrik Gramner authored
Add optimized code paths for pri-only and sec-only filter strengths. Also implement the missing 4x8 version (for 4:2:2 chroma subsampling).
-
Henrik Gramner authored
Equivalent since the secondary strength is always a power-of-two.
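The message above does not say which two operations are equivalent. One plausible reading, which is an assumption here and not something the log confirms, is that a floor-log2 of the secondary strength is being replaced by a trailing-zero count. A minimal sketch of why those two agree for powers of two (the helper names `ulog2` and `ctz` are illustrative only):

```python
def ulog2(v: int) -> int:
    """Index of the highest set bit, i.e. floor(log2(v)) for v > 0."""
    return v.bit_length() - 1


def ctz(v: int) -> int:
    """Number of trailing zero bits, i.e. index of the lowest set bit (v > 0)."""
    return (v & -v).bit_length() - 1


# A power of two has exactly one set bit, so the highest and the lowest
# set bit are the same bit and both helpers return its index.
for v in (1, 2, 4, 8):
    assert ulog2(v) == ctz(v)

# For any other value the two results differ.
assert ulog2(6) == 2 and ctz(6) == 1
```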
-
- Aug 19, 2021
-
-
-
-
-
-
Reduces size from 16B to 12B, while maintaining a 4-byte alignment.
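The log does not show which structure was changed or how. As a generic illustration only, the sketch below uses two hypothetical `ctypes` layouts (the field names and types are made up) to show one way a 16-byte struct can shrink to 12 bytes while keeping 4-byte alignment: grouping the narrow fields together removes interior padding.

```python
import ctypes


class Padded(ctypes.Structure):
    # Hypothetical layout: each uint8 is followed by 3 padding bytes
    # so that the next uint32 (or the struct end) stays 4-byte aligned.
    _fields_ = [
        ("a", ctypes.c_uint32),
        ("b", ctypes.c_uint8),
        ("c", ctypes.c_uint32),
        ("d", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Same fields, with the two uint8 members placed next to each other,
    # so only 2 bytes of tail padding are needed.
    _fields_ = [
        ("a", ctypes.c_uint32),
        ("c", ctypes.c_uint32),
        ("b", ctypes.c_uint8),
        ("d", ctypes.c_uint8),
    ]


print(ctypes.sizeof(Padded), ctypes.alignment(Padded))
print(ctypes.sizeof(Reordered), ctypes.alignment(Reordered))
```

A smaller element size means more elements per cache line, which is usually the motivation for this kind of change.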
-
- Aug 17, 2021
-
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
Ronald S. Bultje authored
-
- Aug 16, 2021
-
-
-
Only the primary strength can ever be large enough to result in a negative shift value that requires clipping to zero.
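A rough sketch of the reasoning, assuming the shift is computed as `damping - floor(log2(strength))`. The value ranges used below (damping of at least 2, secondary strengths of 1, 2, or 4, primary strengths up to 15) are illustrative assumptions and are not stated in the log:

```python
def ulog2(v: int) -> int:
    """floor(log2(v)) for v > 0."""
    return v.bit_length() - 1


def pri_shift(damping: int, strength: int) -> int:
    # The primary strength can exceed 2**damping, so clip at zero.
    return max(0, damping - ulog2(strength))


def sec_shift(damping: int, strength: int) -> int:
    # No clipping: under the assumed ranges this never goes negative.
    return damping - ulog2(strength)


# Assumed ranges: damping in 2..6, secondary strength in {1, 2, 4}.
for damping in range(2, 7):
    for sec in (1, 2, 4):
        assert sec_shift(damping, sec) >= 0

# A large primary strength with small damping would go negative
# without the clip: 2 - ulog2(15) = 2 - 3 = -1.
assert 2 - ulog2(15) == -1
assert pri_shift(2, 15) == 0
```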
-
- Aug 13, 2021
-
-
Martin Storsjö authored
Relative speedup over C code:
Cortex                                 A53       A72       A73  Apple M1
gen_grain_uv_ar0_16bpc_420_neon:      2.90      4.13      5.43      5.80
gen_grain_uv_ar0_16bpc_422_neon:      3.23      4.51      5.52      5.83
gen_grain_uv_ar0_16bpc_444_neon:      4.01      4.97      6.08      5.87
gen_grain_uv_ar1_16bpc_420_neon:      2.94      2.80      3.56      3.48
gen_grain_uv_ar1_16bpc_422_neon:      3.14      3.07      3.68      3.47
gen_grain_uv_ar1_16bpc_444_neon:      3.54      3.51      3.93      2.61
gen_grain_uv_ar2_16bpc_420_neon:      3.92      3.69      4.40      3.98
gen_grain_uv_ar2_16bpc_422_neon:      4.13      3.96      4.42      3.92
gen_grain_uv_ar2_16bpc_444_neon:      4.69      4.33      4.84      3.25
gen_grain_uv_ar3_16bpc_420_neon:      5.05      5.39      5.42      4.74
gen_grain_uv_ar3_16bpc_422_neon:      5.25      5.68      5.57      4.67
gen_grain_uv_ar3_16bpc_444_neon:      6.02      6.33      6.35      4.38
gen_grain_y_ar0_16bpc_neon:           4.67      5.23      5.22     10.11
gen_grain_y_ar1_16bpc_neon:           3.32      3.03      3.28      2.24
gen_grain_y_ar2_16bpc_neon:           4.59      3.95      4.64      3.52
gen_grain_y_ar3_16bpc_neon:           5.89      5.93      6.36      4.79

Absolute numbers:
Cortex                                 A53       A72       A73  Apple M1
gen_grain_uv_ar0_16bpc_420_neon:   19797.2    9725.0    9234.0      29.7
gen_grain_uv_ar0_16bpc_422_neon:   34899.4   16875.3   17021.6      57.7
gen_grain_uv_ar0_16bpc_444_neon:   53776.6   28470.1   28773.1     107.8
gen_grain_uv_ar1_16bpc_420_neon:   37998.2   24631.2   24754.0      84.2
gen_grain_uv_ar1_16bpc_422_neon:   70817.5   44642.5   46323.1     166.3
gen_grain_uv_ar1_16bpc_444_neon:  123333.0   77316.4   83523.1     427.5
gen_grain_uv_ar2_16bpc_420_neon:   49115.8   33053.7   33249.9      93.6
gen_grain_uv_ar2_16bpc_422_neon:   92965.3   59663.8   64741.9     187.9
gen_grain_uv_ar2_16bpc_444_neon:  160899.7  108845.6  115422.4     441.8
gen_grain_uv_ar3_16bpc_420_neon:   65786.6   41924.3   45562.1     108.1
gen_grain_uv_ar3_16bpc_422_neon:  126232.3   78691.6   87351.5     217.6
gen_grain_uv_ar3_16bpc_444_neon:  218702.6  140197.8  151294.8     454.3
gen_grain_y_ar0_16bpc_neon:        35867.9   17653.6   20770.7     108.0
gen_grain_y_ar1_16bpc_neon:       118781.8   74777.1   81338.6     426.0
gen_grain_y_ar2_16bpc_neon:       155919.9  102145.8  109698.1     438.5
gen_grain_y_ar3_16bpc_neon:       213348.1  133054.8  144726.0     447.9

Corresponding numbers for 8bpc:
Cortex                                 A53       A72       A73  Apple M1
gen_grain_uv_ar0_8bpc_420_neon:    15086.1    8384.7    8556.6      29.4
gen_grain_uv_ar0_8bpc_422_neon:    26800.6   14354.4   15526.5      56.6
gen_grain_uv_ar0_8bpc_444_neon:    43749.6   22408.6   24627.9     108.3
gen_grain_uv_ar1_8bpc_420_neon:    33706.3   21892.6   22835.9      87.1
gen_grain_uv_ar1_8bpc_422_neon:    63897.0   41820.1   43468.9     171.8
gen_grain_uv_ar1_8bpc_444_neon:   117345.1   76372.5   79938.3     370.0
gen_grain_uv_ar2_8bpc_420_neon:    42808.8   28493.8   29932.8      92.2
gen_grain_uv_ar2_8bpc_422_neon:    82282.5   53969.4   58191.1     181.8
gen_grain_uv_ar2_8bpc_444_neon:   147641.4   98136.4  103157.6     430.2
gen_grain_uv_ar3_8bpc_420_neon:    56784.3   36342.0   40812.3     102.2
gen_grain_uv_ar3_8bpc_422_neon:   110249.7   70215.6   79716.0     200.5
gen_grain_uv_ar3_8bpc_444_neon:   196461.7  125802.8  141781.5     440.1
gen_grain_y_ar0_8bpc_neon:         36451.7   17794.4   19839.3     109.5
gen_grain_y_ar1_8bpc_neon:        113155.6   71811.9   77296.8     370.2
gen_grain_y_ar2_8bpc_neon:        142812.3   95042.4  100434.4     431.8
gen_grain_y_ar3_8bpc_neon:        191608.6  121199.5  136946.4     437.2
-
Martin Storsjö authored
No difference in generated code, but more than 210 fewer lines of duplicated source code.
-
Martin Storsjö authored
No practical difference in generated code (or the size of it), but less source code to handle.
-
Martin Storsjö authored
These are never executed as they come after an unconditional branch.
-
Martin Storsjö authored
This shrinks the code section by 288 bytes.
-
Martin Storsjö authored
-
Martin Storsjö authored
-
Henrik Gramner authored
On failure, improve the error message to specify which registers have been clobbered.
-