  1. May 25, 2024
  2. May 20, 2024
  3. May 19, 2024
    • arm64: msac: Explicitly use the ldur instruction · 9469e184
      Martin Storsjö authored
      The ldr instruction can take an immediate offset which is a multiple
      of the loaded element size. If the ldr instruction is given an
      immediate offset which isn't a multiple of the element size,
      most assemblers implicitly generate a "ldur" instruction instead.
      
      Older versions of MS armasm64.exe don't do this, but instead error
      out with "error A2518: operand 2: Memory offset must be aligned".
      (Current versions don't do this but correctly generate "ldur"
      implicitly.)
      
      Switch this instruction to an explicit "ldur", like we do elsewhere,
      to fix building with these older tools.
      9469e184
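      To illustrate the distinction described above, here is a minimal
      sketch (registers and offsets are placeholders, not taken from the
      actual msac code): the scaled-immediate ldr form requires the offset
      to be a multiple of the element size, while ldur takes an unscaled
      offset.

          ldr  w1, [x0, #4]    // ok: 4-byte load, offset is a multiple of 4
          ldur w1, [x0, #2]    // unscaled offset; writing "ldr" here relies on
                               // the assembler rewriting it, which old
                               // armasm64.exe versions reject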
  4. May 18, 2024
  5. May 14, 2024
    • ARM64: Various optimizations for symbol decode · 7f68f23c
      Kyle Siefring authored
      Changes stem from redesigning the reduction stage of the multisymbol
      decode function.
      * No longer use adapt4 for 5 possible symbol values
      * Specialize reduction for 4/8/16 decode functions
      * Modify control flow
      
      +------------------------+--------------+--------------+---------------+
      |                        |  Neoverse V1 |  Neoverse N1 |   Cortex A72  |
      |                        | (Graviton 3) | (Graviton 2) |  (Graviton 1) |
      +------------------------+-------+------+-------+------+-------+-------+
      |                        |  Old  |  New |  Old  |  New |  Old  |  New  |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_neon       |  13.0 | 12.9 |  14.9 | 14.0 |  39.3 |  29.0 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_adapt_neon |  15.4 | 15.6 |  17.5 | 16.8 |  41.6 |  33.5 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_bool_equi_neon  |  11.3 | 12.0 |  14.0 | 12.2 |  35.0 |  26.3 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_hi_tok_c        |  73.7 | 57.8 |  73.4 | 60.5 | 130.1 | 103.9 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_hi_tok_neon     |  63.3 | 48.2 |  65.2 | 51.2 | 119.0 | 105.3 |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  28.6 | 22.5 |  28.4 | 23.5 |  67.8 |  55.1 |
      | adapt4_neon            |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  29.5 | 26.6 |  29.0 | 28.8 |  76.6 |  74.0 |
      | adapt8_neon            |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      | decode_symbol_\        |  31.6 | 31.2 |  33.3 | 33.0 |  77.5 |  68.1 |
      | adapt16_neon           |       |      |       |      |       |       |
      +------------------------+-------+------+-------+------+-------+-------+
      7f68f23c
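      As background for the reduction stage mentioned above, one common way
      to turn a vector of CDF comparisons into a symbol index on NEON is a
      compare followed by a horizontal add. This is only an illustrative
      sketch with placeholder registers, not the exact sequence used in
      this commit:

          // v0.8h: CDF entries, v1.8h: broadcast decoder threshold (placeholders)
          cmhs v2.8h, v0.8h, v1.8h   // 0xffff in every lane where cdf >= threshold
          addv h2, v2.8h             // sum all lanes; each hit contributes -1
          smov w0, v2.h[0]
          neg  w0, w0                // negate to get the number of hits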
    • AArch64: Optimize prep_neon function · d835c6bf
      Arpad Panyik authored and Martin Storsjö committed
      Optimize the widening copy part of subpel filters (the prep_neon
      function). In this patch we combine widening shifts with widening
      multiplications in the inner loops to get maximum throughput.
      
      The change will increase .text by 36 bytes.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        mct_w4:   0.795x
        mct_w8:   0.913x
        mct_w16:  0.912x
        mct_w32:  0.838x
        mct_w64:  1.025x
        mct_w128: 1.002x
      
      Cortex-A510:
        mct_w4:   0.760x
        mct_w8:   0.636x
        mct_w16:  0.640x
        mct_w32:  0.854x
        mct_w64:  0.864x
        mct_w128: 0.995x
      
      Cortex-A72:
        mct_w4:   0.616x
        mct_w8:   0.854x
        mct_w16:  0.756x
        mct_w32:  1.052x
        mct_w64:  1.044x
        mct_w128: 0.702x
      
      Cortex-A76:
        mct_w4:   0.837x
        mct_w8:   0.797x
        mct_w16:  0.841x
        mct_w32:  0.804x
        mct_w64:  0.948x
        mct_w128: 0.904x
      
      Cortex-A78:
        mct_w16:  0.542x
        mct_w32:  0.725x
        mct_w64:  0.741x
        mct_w128: 0.745x
      
      Cortex-A715:
        mct_w16:  0.561x
        mct_w32:  0.720x
        mct_w64:  0.740x
        mct_w128: 0.748x
      
      Cortex-X1:
        mct_w32:  0.886x
        mct_w64:  0.882x
        mct_w128: 0.917x
      
      Cortex-X3:
        mct_w32:  0.835x
        mct_w64:  0.803x
        mct_w128: 0.808x
      d835c6bf
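      A minimal sketch of the idea of mixing a widening shift with a
      widening multiply in one row (registers are placeholders, and v31 is
      assumed to be preloaded with the constant 16, since multiplying by 16
      equals shifting left by 4):

          ld1    {v0.16b}, [x1], x2       // load one row of pixels
          ushll  v16.8h, v0.8b,  #4       // widening shift for the low 8 pixels
          umull2 v17.8h, v0.16b, v31.16b  // widening multiply for the high 8 pixels
          st1    {v16.8h, v17.8h}, [x0], #32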
    • AArch64: Optimize jump table calculation of prep_neon · f0e779bc
      Arpad Panyik authored and Martin Storsjö committed
      Save a complex arithmetic instruction in the jump table address
      calculation of prep_neon function.
      f0e779bc
    • AArch64: Optimize BTI landing pads of prep_neon · 1790e132
      Arpad Panyik authored and Martin Storsjö committed
      Move the BTI landing pads out of the inner loops of prep_neon
      function. Only the width=4 and width=8 cases are affected.
      
      With BTI enabled, moving AARCH64_VALID_JUMP_TARGET out of the inner
      loops gives better execution speed on Cortex-A510 relative to the
      original (lower is better):
        w4: 0.969x
        w8: 0.722x
      
      Out-of-order cores are not affected.
      1790e132
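      A rough sketch of the pattern, using dav1d's AARCH64_VALID_JUMP_TARGET
      macro; the label name and loop body are hypothetical:

          prep_w8:
              AARCH64_VALID_JUMP_TARGET   // landing pad once, at the indirect-branch target
          8:                              // the loop is entered by a direct branch,
              ld1   {v0.8b},  [x1], x2    // so no landing pad is needed inside it
              ushll v16.8h, v0.8b, #4
              st1   {v16.8h}, [x0], #16
              subs  w4, w4, #1
              b.gt  8b
              ret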
  6. May 13, 2024
    • AArch64: Optimize put_neon function · 8141546d
      Arpad Panyik authored
      Optimize the copy part of subpel filters (the put_neon function).
      For small block sizes (<16), using general purpose registers is
      usually the best way to do the copy.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        w2:   0.991x
        w4:   0.992x
        w8:   0.999x
        w16:  0.875x
        w32:  0.775x
        w64:  0.914x
        w128: 0.998x
      
      Cortex-A510:
        w2:   0.159x
        w4:   0.080x
        w8:   0.583x
        w16:  0.588x
        w32:  0.966x
        w64:  1.111x
        w128: 0.957x
      
      Cortex-A76:
        w2:   0.903x
        w4:   0.683x
        w8:   0.944x
        w16:  0.948x
        w32:  0.919x
        w64:  0.855x
        w128: 0.991x
      
      Cortex-A78:
        w32:  0.867x
        w64:  0.820x
        w128: 1.011x
      
      Cortex-A715:
        w32:  0.834x
        w64:  0.778x
        w128: 1.000x
      
      Cortex-X1:
        w32:  0.809x
        w64:  0.762x
        w128: 1.000x
      
      Cortex-X3:
        w32: 0.733x
        w64: 0.720x
        w128: 0.999x
      8141546d
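      A minimal sketch of the general-purpose-register copy for a small
      width (here width=4, two rows per iteration; the registers follow no
      particular calling convention and are placeholders):

          4:
              ldr  w8, [x2]            // 4 source pixels
              ldr  w9, [x2, x3]        // 4 pixels from the next row
              add  x2, x2, x3, lsl #1
              str  w8, [x0]
              str  w9, [x0, x1]
              add  x0, x0, x1, lsl #1
              subs w4, w4, #2
              b.gt 4b
              ret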
    • AArch64: Optimize jump table calculation of put_neon · 645d1f9f
      Arpad Panyik authored
      Save a complex arithmetic instruction in the jump table address
      calculation of put_neon function.
      645d1f9f
    • AArch64: Optimize BTI landing pads of put_neon · 83452c6e
      Arpad Panyik authored
      Move the BTI landing pads out of the inner loops of the put_neon
      function; the only exception is the width=16 case, where the landing
      pad is already outside of the loops.
      
      With BTI enabled, the relative performance of omitting
      AARCH64_VALID_JUMP_TARGET from the inner loops on Cortex-A510 is
      (lower is better):
        w2:   0.981x
        w4:   0.991x
        w8:   0.612x
        w32:  0.687x
        w64:  0.813x
        w128: 0.892x
      
      Out-of-order CPUs are mostly unaffected.
      83452c6e
    • checkasm: Avoid UB in setjmp() invocations · 471549f2
      Henrik Gramner authored
      Both POSIX and the C standard place several environmental limits on
      setjmp() invocations, with essentially anything beyond comparing the
      return value with a constant as a simple branch condition being UB.
      
      We were previously performing a function call using the setjmp()
      return value as an argument, which is technically not allowed
      even though it happened to work correctly in practice.
      
      Some systems may loosen those restrictions and allow for more
      flexible usage, but we shouldn't be relying on that.
      471549f2
  7. May 12, 2024
    • AArch64: Optimize the init of DotProd+ 2D subpel filters · a6d57b11
      Arpad Panyik authored and Jean-Baptiste Kempf committed
      Remove some unnecessary vector register copies from the initial
      horizontal filter parts of the HV subpel filters. The performance
      improvements are larger for the smaller filter block sizes.
      
      The narrowing shifts at the end of *filter8* were also rewritten,
      because the old sequence was only beneficial on the Cortex-A55 among
      the DotProd-capable CPU cores. On other out-of-order or newer CPUs
      the UZP1+SHRN instruction combination is better.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-A55:
        mct regular w4:  0.980x
        mct regular w8:  1.007x
        mct regular w16: 1.007x
      
        mct sharp w4:    0.983x
        mct sharp w8:    1.012x
        mct sharp w16:   1.005x
      
      Cortex-A510:
        mct regular w4:  0.935x
        mct regular w8:  0.984x
        mct regular w16: 0.986x
      
        mct sharp w4:    0.927x
        mct sharp w8:    0.983x
        mct sharp w16:   0.987x
      
      Cortex-A78:
        mct regular w4:  0.974x
        mct regular w8:  0.988x
        mct regular w16: 0.991x
      
        mct sharp w4:    0.971x
        mct sharp w8:    0.987x
        mct sharp w16:   0.979x
      
      Cortex-A715:
        mct regular w4:  0.958x
        mct regular w8:  0.993x
        mct regular w16: 0.998x
      
        mct sharp w4:    0.974x
        mct sharp w8:    0.991x
        mct sharp w16:   0.997x
      
      Cortex-X1:
        mct regular w4:  0.983x
        mct regular w8:  0.993x
        mct regular w16: 0.996x
      
        mct sharp w4:    0.974x
        mct sharp w8:    0.990x
        mct sharp w16:   0.995x
      
      Cortex-X3:
        mct regular w4:  0.953x
        mct regular w8:  0.993x
        mct regular w16: 0.997x
      
        mct sharp w4:    0.981x
        mct sharp w8:    0.993x
        mct sharp w16:   0.995x
      a6d57b11
  8. May 10, 2024
  9. May 09, 2024
    • AArch64: Optimize 2D i8mm subpel filters · 643195f5
      Arpad Panyik authored
      Rewrite the accumulator initializations of the horizontal part of the
      2D filters to use zero register fills. This can improve performance
      on out-of-order CPUs, which can zero vector registers with zero
      latency. Zeroed accumulators imply the use of rounding shifts at the
      end of the filters.
      
      The only exception is the very short *hv_filter4*, where the longer
      latency of the rounding shift could decrease performance.
      
      The *filter8* function uses a different (alternating) dot product
      computation order at the DotProd+ feature level, which gives better
      overall performance on out-of-order and some in-order CPU cores.
      
      The i8mm version does not need to bias the loaded samples, so a
      different instruction scheduling is beneficial, mostly affecting the
      order of TBL instructions in the 8-tap case.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-X3:
        mct_8tap_regular_w16_hv_8bpc_i8mm:  0.982x
        mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.979x
        mct_8tap_regular_w8_hv_8bpc_i8mm:   0.972x
        mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.969x
        mct_8tap_regular_w4_hv_8bpc_i8mm:   0.942x
        mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.935x
        mc_8tap_regular_w16_hv_8bpc_i8mm:   0.988x
        mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.982x
        mc_8tap_regular_w8_hv_8bpc_i8mm:    0.981x
        mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.975x
        mc_8tap_regular_w4_hv_8bpc_i8mm:    0.998x
        mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.996x
        mc_8tap_regular_w2_hv_8bpc_i8mm:    1.006x
        mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.993x
      
      Cortex-A715:
        mct_8tap_regular_w16_hv_8bpc_i8mm:  0.883x
        mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.931x
        mct_8tap_regular_w8_hv_8bpc_i8mm:   0.882x
        mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.928x
        mct_8tap_regular_w4_hv_8bpc_i8mm:   0.969x
        mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.934x
        mc_8tap_regular_w16_hv_8bpc_i8mm:   0.881x
        mc_8tap_sharp_w16_hv_8bpc_i8mm:     0.925x
        mc_8tap_regular_w8_hv_8bpc_i8mm:    0.879x
        mc_8tap_sharp_w8_hv_8bpc_i8mm:      0.925x
        mc_8tap_regular_w4_hv_8bpc_i8mm:    0.917x
        mc_8tap_sharp_w4_hv_8bpc_i8mm:      0.976x
        mc_8tap_regular_w2_hv_8bpc_i8mm:    0.915x
        mc_8tap_sharp_w2_hv_8bpc_i8mm:      0.972x
      
      Cortex-A510:
        mct_8tap_regular_w16_hv_8bpc_i8mm:  0.994x
        mct_8tap_sharp_w16_hv_8bpc_i8mm:    0.949x
        mct_8tap_regular_w8_hv_8bpc_i8mm:   0.987x
        mct_8tap_sharp_w8_hv_8bpc_i8mm:     0.947x
        mct_8tap_regular_w4_hv_8bpc_i8mm:   1.002x
        mct_8tap_sharp_w4_hv_8bpc_i8mm:     0.999x
        mc_8tap_regular_w16_hv_8bpc_i8mm:   0.989x
        mc_8tap_sharp_w16_hv_8bpc_i8mm:     1.003x
        mc_8tap_regular_w8_hv_8bpc_i8mm:    0.986x
        mc_8tap_sharp_w8_hv_8bpc_i8mm:      1.000x
        mc_8tap_regular_w4_hv_8bpc_i8mm:    1.007x
        mc_8tap_sharp_w4_hv_8bpc_i8mm:      1.000x
        mc_8tap_regular_w2_hv_8bpc_i8mm:    1.005x
        mc_8tap_sharp_w2_hv_8bpc_i8mm:      1.000x
      643195f5
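      A hedged sketch of the accumulator-zeroing idea (registers and the
      final shift amount are placeholders; the real filters accumulate more
      taps before the shift):

          movi  v16.4s, #0                // zeroing is typically free on big OoO cores
          movi  v17.4s, #0
          usdot v16.4s, v2.16b, v30.16b   // i8mm: unsigned samples x signed coefficients
          usdot v17.4s, v3.16b, v30.16b
          srshr v16.4s, v16.4s, #2        // rounding shift replaces a preloaded bias
          srshr v17.4s, v17.4s, #2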
  10. May 08, 2024
    • AArch64: Optimize vertical i8mm subpel filters · b2eca1ac
      Arpad Panyik authored
      Replace the accumulator initializations of the vertical subpel
      filters with zero register fills (usually zero-latency operations in
      this feature class); this implies the use of rounding shifts at the
      end in the prep cases. Out-of-order CPU cores can benefit from this
      change.
      
      The width=16 case uses a simpler register duplication scheme that
      relies on MOV instructions for the subsequent shuffles. This approach
      loads the data into a different register, which improves instruction
      scheduling and shortens the data dependency chain.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-X3:
      mct_8tap_sharp_w16_v_8bpc_i8mm:  0.910x
      mct_8tap_sharp_w8_v_8bpc_i8mm:   0.986x
      
      mc_8tap_sharp_w16_v_8bpc_i8mm:   0.864x
      mc_8tap_sharp_w8_v_8bpc_i8mm:    0.882x
      mc_8tap_sharp_w4_v_8bpc_i8mm:    0.933x
      mc_8tap_sharp_w2_v_8bpc_i8mm:    0.926x
      
      Cortex-A715:
      mct_8tap_sharp_w16_v_8bpc_i8mm:  0.855x
      mct_8tap_sharp_w8_v_8bpc_i8mm:   0.784x
      mct_8tap_sharp_w4_v_8bpc_i8mm:   1.069x
      
      mc_8tap_sharp_w16_v_8bpc_i8mm:   0.850x
      mc_8tap_sharp_w8_v_8bpc_i8mm:    0.779x
      mc_8tap_sharp_w4_v_8bpc_i8mm:    0.971x
      mc_8tap_sharp_w2_v_8bpc_i8mm:    0.975x
      
      Cortex-A510:
      mct_8tap_sharp_w16_v_8bpc_i8mm:  1.001x
      mct_8tap_sharp_w8_v_8bpc_i8mm:   0.979x
      mct_8tap_sharp_w4_v_8bpc_i8mm:   0.998x
      
      mc_8tap_sharp_w16_v_8bpc_i8mm:   0.998x
      mc_8tap_sharp_w8_v_8bpc_i8mm:    1.004x
      mc_8tap_sharp_w4_v_8bpc_i8mm:    1.003x
      mc_8tap_sharp_w2_v_8bpc_i8mm:    0.996x
      b2eca1ac
    • AArch64: Optimize horizontal i8mm prep filters · d1bdf4f1
      Arpad Panyik authored and Jean-Baptiste Kempf committed
      Replace the accumulator initializations of the horizontal prep
      filters with zero register fills. Most i8mm-capable CPUs can do
      these with zero latency, but rounding shifts are then needed at the
      end of the filter. This change gives better performance on
      out-of-order CPUs.
      
      Relative performance of micro benchmarks (lower is better):
      
      Cortex-X3:
      mct_8tap_sharp_w32_h_8bpc_i8mm:  0.914x
      mct_8tap_sharp_w16_h_8bpc_i8mm:  0.906x
      mct_8tap_sharp_w8_h_8bpc_i8mm:   0.877x
      
      Cortex-A715:
      mct_8tap_sharp_w32_h_8bpc_i8mm:  0.819x
      mct_8tap_sharp_w16_h_8bpc_i8mm:  0.805x
      mct_8tap_sharp_w8_h_8bpc_i8mm:   0.779x
      
      Cortex-A510:
      mct_8tap_sharp_w32_h_8bpc_i8mm:  0.999x
      mct_8tap_sharp_w16_h_8bpc_i8mm:  1.001x
      mct_8tap_sharp_w8_h_8bpc_i8mm:   0.996x
      mct_8tap_sharp_w4_h_8bpc_i8mm:   0.915x
      d1bdf4f1
  11. May 06, 2024
  12. May 01, 2024
  13. Apr 29, 2024
  14. Apr 26, 2024
    • tools: Make ARM cpu flags imply relevant lower level flags · 236e1d19
      Martin Storsjö authored
      The --cpumask option only takes a single flag name; one can't set
      a combination like neon+dotprod.
      
      Therefore, apply the same pattern as for x86, by adding mask values
      that contain all the implied lower level flags.
      
      This is somewhat complicated, as the set of features isn't entirely
      linear - in particular, SVE doesn't imply either dotprod or i8mm,
      and SVE2 only implies dotprod, but not i8mm.
      
      This makes sure that "dav1d --cpumask dotprod" actually uses any
      SIMD at all, as it previously only set the dotprod flag but not
      neon, which essentially opted out from all SIMD.
      236e1d19
    • AArch64: Add basic i8mm support for convolutions · 1776c45a
      Arpad Panyik authored
      Add an Armv8.6-A i8mm code path for standard bitdepth convolutions.
      Only horizontal-vertical (HV) convolutions have 6-tap specialisations
      of their vertical passes. All other convolutions are 4- or 8-tap
      filters which fit well with the 4-element USDOT instruction.
      
      Benchmarks show 4-9% FPS increase relative to the Armv8.4-A
      code path depending on the input video and the CPU used.
      
      This patch will increase the .text by around 5.7 KiB.
      
      Relative performance to the C reference on some Cortex CPU cores:
      
                             Cortex-A715   Cortex-X3  Cortex-A510
      regular w4 hv neon:          7.20x      11.20x        4.40x
      regular w4 hv dotprod:      12.77x      18.35x        6.21x
      regular w4 hv i8mm:         14.50x      21.42x        6.16x
      
        sharp w4 hv neon:          6.24x       9.77x        3.96x
        sharp w4 hv dotprod:       9.76x      14.02x        5.20x
        sharp w4 hv i8mm:         10.84x      16.09x        5.42x
      
      regular w8 hv neon:          2.17x       2.46x        3.17x
      regular w8 hv dotprod:       3.04x       3.11x        3.03x
      regular w8 hv i8mm:          3.57x       3.40x        3.27x
      
        sharp w8 hv neon:          1.72x       1.93x        2.75x
        sharp w8 hv dotprod:       2.49x       2.54x        2.62x
        sharp w8 hv i8mm:          2.80x       2.79x        2.70x
      
      regular w16 hv neon:         1.90x       2.17x        2.02x
      regular w16 hv dotprod:      2.59x       2.64x        1.93x
      regular w16 hv i8mm:         3.01x       2.85x        2.05x
      
        sharp w16 hv neon:         1.51x       1.72x        1.74x
        sharp w16 hv dotprod:      2.17x       2.22x        1.70x
        sharp w16 hv i8mm:         2.42x       2.42x        1.72x
      
      regular w32 hv neon:         1.80x       1.96x        1.81x
      regular w32 hv dotprod:      2.43x       2.36x        1.74x
      regular w32 hv i8mm:         2.83x       2.51x        1.83x
      
        sharp w32 hv neon:         1.42x       1.54x        1.56x
        sharp w32 hv dotprod:      2.07x       2.00x        1.55x
        sharp w32 hv i8mm:         2.29x       2.16x        1.55x
      
      regular w64 hv neon:         1.82x       1.89x        1.70x
      regular w64 hv dotprod:      2.43x       2.25x        1.65x
      regular w64 hv i8mm:         2.84x       2.39x        1.73x
      
        sharp w64 hv neon:         1.43x       1.47x        1.49x
        sharp w64 hv dotprod:      2.08x       1.91x        1.49x
        sharp w64 hv i8mm:         2.30x       2.07x        1.48x
      
      regular w128 hv neon:        1.77x       1.84x        1.75x
      regular w128 hv dotprod:     2.37x       2.18x        1.70x
      regular w128 hv i8mm:        2.76x       2.33x        1.78x
      
        sharp w128 hv neon:        1.40x       1.45x        1.42x
        sharp w128 hv dotprod:     2.04x       1.87x        1.43x
        sharp w128 hv i8mm:        2.24x       2.02x        1.42x
      
      regular w8 h neon:           3.16x       3.51x        3.43x
      regular w8 h dotprod:        4.97x       7.43x        4.95x
      regular w8 h i8mm:           7.28x      10.38x        5.69x
      
        sharp w8 h neon:           2.71x       2.77x        3.10x
        sharp w8 h dotprod:        4.92x       7.14x        4.94x
        sharp w8 h i8mm:           7.21x      10.11x        5.70x
      
      regular w16 h neon:          2.79x       2.76x        3.53x
      regular w16 h dotprod:       3.81x       4.77x        3.13x
      regular w16 h i8mm:          5.21x       6.04x        3.56x
      
        sharp w16 h neon:          2.31x       2.38x        3.12x
        sharp w16 h dotprod:       3.80x       4.74x        3.13x
        sharp w16 h i8mm:          5.20x       5.98x        3.56x
      
      regular w64 h neon:          2.49x       2.46x        2.94x
      regular w64 h dotprod:       3.17x       3.60x        2.41x
      regular w64 h i8mm:          4.22x       4.40x        2.72x
      
        sharp w64 h neon:          2.07x       2.06x        2.60x
        sharp w64 h dotprod:       3.16x       3.58x        2.40x
        sharp w64 h i8mm:          4.20x       4.38x        2.71x
      
      regular w8 v neon:           6.11x       8.05x        4.07x
      regular w8 v dotprod:        5.45x       8.15x        4.01x
      regular w8 v i8mm:           7.30x       9.46x        4.19x
      
        sharp w8 v neon:           4.23x       5.46x        3.09x
        sharp w8 v dotprod:        5.43x       7.96x        4.01x
        sharp w8 v i8mm:           7.26x       9.12x        4.19x
      
      regular w16 v neon:          3.44x       4.33x        2.40x
      regular w16 v dotprod:       3.20x       4.53x        2.85x
      regular w16 v i8mm:          4.09x       5.27x        2.87x
      
        sharp w16 v neon:          2.50x       3.14x        1.82x
        sharp w16 v dotprod:       3.20x       4.52x        2.86x
        sharp w16 v i8mm:          4.09x       5.15x        2.86x
      
      regular w64 v neon:          2.74x       3.11x        1.53x
      regular w64 v dotprod:       2.63x       3.30x        1.84x
      regular w64 v i8mm:          3.31x       3.73x        1.84x
      
        sharp w64 v neon:          2.01x       2.29x        1.16x
        sharp w64 v dotprod:       2.61x       3.27x        1.83x
        sharp w64 v i8mm:          3.29x       3.68x        1.84x
      1776c45a
  15. Apr 25, 2024
    • AArch64: Simplify DotProd path of 2D subpel filters · fbf23637
      Arpad Panyik authored
      Simplify the DotProd code path of the 2D (horizontal-vertical) subpel
      filters. It contains some instruction reordering and some macro
      simplifications to be more similar to the upcoming i8mm version.
      
      These changes have negligible effect on performance.
      
      Cortex-A510:
      mc_8tap_regular_w2_hv_8bpc_dotprod:   8.3769 ->  8.3380
      mc_8tap_sharp_w2_hv_8bpc_dotprod:     9.5441 ->  9.5457
      mc_8tap_regular_w4_hv_8bpc_dotprod:   8.3422 ->  8.3444
      mc_8tap_sharp_w4_hv_8bpc_dotprod:     9.5441 ->  9.5367
      mc_8tap_regular_w8_hv_8bpc_dotprod:   9.9852 ->  9.9666
      mc_8tap_sharp_w8_hv_8bpc_dotprod:    12.5554 -> 12.5314
      
      Cortex-A55:
      mc_8tap_regular_w2_hv_8bpc_dotprod:  6.4504  ->  6.4892
      mc_8tap_sharp_w2_hv_8bpc_dotprod:    7.5732  ->  7.6078
      mc_8tap_regular_w4_hv_8bpc_dotprod:  6.5088  ->  6.4760
      mc_8tap_sharp_w4_hv_8bpc_dotprod:    7.5796  ->  7.5763
      mc_8tap_regular_w8_hv_8bpc_dotprod:  9.3384  ->  9.3078
      mc_8tap_sharp_w8_hv_8bpc_dotprod:   11.1159  -> 11.1401
      
      Cortex-A78:
      mc_8tap_regular_w2_hv_8bpc_dotprod:  1.4122  ->  1.4250
      mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.7696  ->  1.7821
      mc_8tap_regular_w4_hv_8bpc_dotprod:  1.4243  ->  1.4243
      mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.7866  ->  1.7863
      mc_8tap_regular_w8_hv_8bpc_dotprod:  2.5304  ->  2.5171
      mc_8tap_sharp_w8_hv_8bpc_dotprod:    3.0815  ->  3.0632
      
      Cortex-X1:
      mc_8tap_regular_w2_hv_8bpc_dotprod:  0.8195  ->  0.8194
      mc_8tap_sharp_w2_hv_8bpc_dotprod:    1.0092  ->  1.0081
      mc_8tap_regular_w4_hv_8bpc_dotprod:  0.8197  ->  0.8166
      mc_8tap_sharp_w4_hv_8bpc_dotprod:    1.0089  ->  1.0068
      mc_8tap_regular_w8_hv_8bpc_dotprod:  1.5230  ->  1.5166
      mc_8tap_sharp_w8_hv_8bpc_dotprod:    1.8683  ->  1.8625
      fbf23637
    • AArch64: Simplify loads in *hv_filter* of DotProd path · a40301b3
      Arpad Panyik authored
      Simplify the load sequences in *hv_filter* functions (ldr + add -> ld1)
      to be more uniform and smaller. Performance is not affected.
      a40301b3
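      The ldr + add -> ld1 rewrite mentioned above boils down to replacing
      a load plus a separate pointer increment with one post-indexed
      structure load (operands are placeholders):

          // before
          ldr  q0, [x8]
          add  x8, x8, x9
          // after
          ld1  {v0.16b}, [x8], x9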
    • AArch64: Simplify TBL usage in 2D DotProd filters · b0685c38
      Arpad Panyik authored
      Simplify the TBL usages in the small block size (2, 4) parts of the
      2D (horizontal-vertical) put subpel filters. The 2-register TBLs are
      replaced with the 1-register form, because we only need the lower
      64 bits of the result, which can be extracted from a single source
      register. Performance is not affected by this change.
      b0685c38
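      A small sketch of the two TBL forms (register numbers are
      placeholders): the 2-register form indexes up to 32 table bytes,
      while the 1-register form is sufficient when every byte selected for
      the lower 64-bit result comes from a single register.

          tbl  v4.8b, {v0.16b, v1.16b}, v28.8b   // 2-register table
          tbl  v4.8b, {v0.16b}, v28.8b           // 1-register table is enough here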
    • AArch64: Simplify DotProd path of horizontal subpel filters · ad7938d5
      Arpad Panyik authored
      Simplify the inner loops of the DotProd code path of horizontal
      subpel filters to avoid using 2-register TBL instructions. The
      store part of block size 16 of the horizontal put case is also
      simplified (str + add -> st1). This patch can improve performance
      mostly on small cores like Cortex-A510 and newer. Other CPUs are
      mostly unaffected.
      
      Cortex-A510:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  2.77x -> 3.13x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  2.32x -> 2.56x
      
      Cortex-A55:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  3.89x -> 3.89x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.35x -> 3.35x
      
      Cortex-A715:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  3.79x -> 3.78x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.30x -> 3.30x
      
      Cortex-A78:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  4.30x -> 4.31x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.79x -> 3.80x
      
      Cortex-X3:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  4.74x -> 4.75x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.89x -> 3.91x
      
      Cortex-X1:
      mct_8tap_sharp_w16_h_8bpc_dotprod:  4.61x -> 4.62x
      mct_8tap_sharp_w32_h_8bpc_dotprod:  3.67x -> 3.66x
      ad7938d5
    • AArch64: Simplify DotProd path of vertical subpel filters · 317a94c6
      Arpad Panyik authored
      Simplify the accumulator initializations of the DotProd code path of
      vertical subpel filters. This also makes it possible for some CPUs to
      use zero-latency vector register moves. The load is also simplified
      (ldr + add -> ld1) in the inner loop of the vertical filter for
      block size 16.
      317a94c6
    • AArch64: Add \dot parameter to filter_8tap_fn macro · 7eee4a20
      Arpad Panyik authored
      Add a \dot parameter to the filter_8tap_fn macro in preparation for
      extending it with an i8mm code path. This patch also contains string
      fixes, some instruction reorderings, and some register renaming to
      make the code more uniform. These changes don't affect performance
      but simplify the code a bit.
      7eee4a20
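      A hypothetical shape of such a parameterised macro in GAS syntax,
      only to illustrate how a \dot argument can select the instruction
      set; this is not the actual filter_8tap_fn definition:

          .macro filter_8tap_fn type, dot
              // ... shared setup ...
          .ifc \dot, i8mm
              usdot v16.4s, v2.16b, v30.16b   // i8mm path: unbiased samples
          .else
              sdot  v16.4s, v2.16b, v30.16b   // DotProd path: samples biased to signed
          .endif
              // ... shared epilogue ...
          .endm

          filter_8tap_fn put,  i8mm
          filter_8tap_fn prep, dotprod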
  16. Apr 22, 2024
    • aarch64: Avoid unaligned jump tables · cb8151c9
      Martin Storsjö authored
      Manually add a padding 0 entry to make the odd number of .hword
      entries align with the instruction size.
      
      This fixes assembling with GAS, with the --gdwarf2 option, where
      it previously produced the error message "unaligned opcodes detected
      in executable segment".
      
      The message is slightly misleading, as the error is printed even
      if no opcodes actually are misaligned, since the jump table is the
      last thing within the .text section. The issue can
      be reproduced with an input as small as this, assembled with
      "as --gdwarf2 -c test.s".
      
              .text
              nop
              .hword 0
      
      See a6228f47 for earlier cases of
      the same error - although in those cases, we actually did have more
      code and labels following the unaligned jump tables.
      
      This error is present with binutils 2.39 and earlier; in
      binutils 2.40, this input is no longer considered an error, fixed
      in https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=6f6f5b0adc9efd103c434fd316e8c880a259775d.
      cb8151c9
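      Continuing the reproducer above, the fix amounts to keeping the
      number of 2-byte table entries even; the values and the label are
      placeholders:

          jtbl:
              .hword 12, 34, 56   // an odd number of .hword entries ends mid-word
              .hword 0            // padding entry restores 4-byte alignment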
  17. Apr 21, 2024
  18. Apr 16, 2024
  19. Apr 15, 2024
    • ARM64: Port msac improvements to more functions · 37d52435
      Kyle Siefring authored and Henrik Gramner committed
      Port improvements from the hi token functions to the rest of the symbol
      adaption functions. These weren't originally ported since they didn't
      work with arbitrary padding. In practice, zero padding is already used
      and only the tests need to be updated.
      
      Results - Neoverse N1
      
      Old:
      msac_decode_symbol_adapt4_c:         41.4 ( 1.00x)
      msac_decode_symbol_adapt4_neon:      31.0 ( 1.34x)
      msac_decode_symbol_adapt8_c:         54.5 ( 1.00x)
      msac_decode_symbol_adapt8_neon:      32.2 ( 1.69x)
      msac_decode_symbol_adapt16_c:        85.6 ( 1.00x)
      msac_decode_symbol_adapt16_neon:     37.5 ( 2.28x)
      
      New:
      msac_decode_symbol_adapt4_c:         41.5 ( 1.00x)
      msac_decode_symbol_adapt4_neon:      27.7 ( 1.50x)
      msac_decode_symbol_adapt8_c:         55.7 ( 1.00x)
      msac_decode_symbol_adapt8_neon:      30.1 ( 1.85x)
      msac_decode_symbol_adapt16_c:        82.4 ( 1.00x)
      msac_decode_symbol_adapt16_neon:     35.2 ( 2.34x)
      37d52435
    • x86: Add 6-tap variants of 8bpc mc AVX-512 (Ice Lake) functions · 5b539991
      Henrik Gramner authored
      6-tap filtering is only performed vertically, due to the use of VNNI
      instructions processing 4 pixels per instruction horizontally.
      5b539991