Commits · master · Cameron Cawley / dav1d

Sep 06, 2024

Improve density of group context setting macros · 4385e7e1

Kyle Siefring authored 5 months ago and Ronald S. Bultje committed 4 months ago

Shared object binary size reduction:
x86_64           : 16112 bytes
ARM64            : 16008 bytes
ARM64(+Os)       : 21592 bytes
ARMv7(+Os+mthumb): 18480 bytes

Size reduction of symbols:
x86_64           : 15712 bytes
ARM64            : 18688 bytes
ARM64(+Os)       : 18404 bytes
ARMv7(+Os+mthumb): 17322 bytes

Builds were done with clang version 18.1.8, and symbol sizes were
obtained by running nm on the shared object.

Provides speedups on older ARM64 CPUs with very little impact on other
CPUs.

Speedup:

c7i (skylake)
 Nature1080p      : x0.999
 Chimera          : x0.998

odroid C4
 Nature1080p      : x1.007
 Chimera          : x1.016
 Models1080p      : x1.005
 MountainBike1080p: x1.009
 Balloons1080p    : x1.008

Raspberry Pi 4
 Nature1080p      : x1.005
 Chimera          : x0.999
 Models1080p      : x0.999
 MountainBike1080p: x1.004
 Balloons1080p    : x1.003

Raspberry Pi 2 (Cortex-A7):
 (using size optimized build)
 Nature1080p      : x1.003
 Models1080p      : x0.997

tests: Add an option to dav1d_argon.bash for using a wrapper tool · 166e1df5
Martin Storsjö authored 4 months ago
```
This allows executing all the tools within e.g. valgrind.

This matches the "meson test --wrap <tool>" feature.
```

AArch64: New method for calculating sgr table · 79db1624

Kyle Siefring authored 5 months ago and Martin Storsjö committed 4 months ago

For the 3x3 part, double the width of the vertical loop. This is done to
provide more latency in the new sgr calculation.

Initial (master):   Cortex A53       A55       A72       A73       A76  Apple M1
sgr_3x3_8bpc_neon:    387702.8  383154.2  295742.4  302100.1  185420.7     472.2
sgr_5x5_8bpc_neon:    261725.1  256919.8  194205.1  197585.6  128311.3     332.9
sgr_mix_8bpc_neon:    628085.0  593664.2  453551.8  450553.8  281956.0     711.2

Current:
sgr_3x3_8bpc_neon:    368331.4  363949.7  275499.0  272056.3  169614.4     432.7
sgr_5x5_8bpc_neon:    257866.7  255265.5  195962.5  199557.8  120481.3     319.2
sgr_mix_8bpc_neon:    598234.1  572896.4  418500.4  438910.7  258977.7     659.3

Include a minor improvement that gets rid of a dup instruction.

AArch64: Optimize lane load/store in MC functions · ec5c3052

Arpad Panyik authored 4 months ago and Martin Storsjö committed 4 months ago

Partial register writes can create long dependency chains, which can
reduce performance on out-of-order CPUs. This patch removes most of
these kinds of problems in MC functions by filling the full register
before other lane loading instructions.

Most lane extracting stores can also be optimized using FP scalar
stores when the 0th lane would be extracted.

Relative runtime of micro benchmarks after this patch on some Neoverse
and Cortex CPU cores:

8bpc neon           V2      V1      X3      X1    A715     A78     A76
avg w8:         0.942x  1.030x  0.936x  0.935x  1.000x  0.877x  0.976x
w_avg w8:       0.908x  0.913x  0.919x  0.914x  0.999x  0.905x  0.910x
mask w8:        0.937x  0.905x  0.929x  0.907x  1.009x  0.921x  0.868x
w_mask 420 w4:  0.969x  0.968x  0.951x  0.962x  0.995x  0.976x  0.958x
w_mask 420 w8:  0.979x  0.935x  0.936x  0.935x  0.996x  0.948x  0.959x
blend w4:       0.721x  0.841x  0.764x  0.822x  0.772x  0.826x  0.883x
blend w8:       0.692x  0.733x  0.686x  0.730x  0.828x  0.723x  0.762x
blend h w2:     0.738x  0.776x  0.746x  0.775x  0.683x  0.827x  0.851x
blend h w4:     0.858x  0.942x  0.880x  0.933x  0.784x  0.924x  0.965x
blend h w8:     0.804x  0.807x  0.806x  0.805x  0.814x  0.810x  0.748x
blend v w2:     0.898x  0.931x  0.903x  0.949x  0.784x  0.867x  0.875x
blend v w4:     0.935x  0.905x  0.933x  0.922x  0.763x  0.777x  0.807x
blend v w8:     0.803x  0.802x  0.804x  0.815x  0.674x  0.677x  0.678x

16bpc neon          V2      V1      X3      X1    A715     A78     A76
avg w4:         0.899x  0.967x  0.897x  0.948x  1.002x  0.901x  0.884x
w_avg w4:       0.952x  0.951x  0.936x  0.946x  0.997x  0.937x  0.925x
mask w4:        0.893x  0.958x  0.887x  0.948x  1.003x  0.938x  0.934x
w_mask 420 w4:  0.933x  0.932x  0.932x  0.939x  1.000x  0.910x  0.955x
w_mask 420 w8:  0.966x  0.962x  0.967x  0.961x  1.000x  0.990x  1.010x
blend w4:       0.367x  0.361x  0.370x  0.352x  0.418x  0.394x  0.476x
blend h w2:     0.365x  0.445x  0.369x  0.437x  0.416x  0.576x  0.699x
blend h w4:     0.343x  0.402x  0.342x  0.398x  0.418x  0.525x  0.603x
blend v w2:     0.464x  0.460x  0.460x  0.447x  0.494x  0.446x  0.503x
blend v w4:     0.432x  0.424x  0.437x  0.416x  0.433x  0.427x  0.534x
blend v w8:     0.936x  0.847x  0.949x  0.848x  1.007x  0.811x  0.785x

bilinear 8bpc neon      V2      V1      X3      X1    A715     A78     A76
mct w4 0:           0.982x  0.983x  0.955x  1.029x  0.784x  0.817x  0.814x
mc w2 h:            0.277x  0.333x  0.275x  0.325x  0.299x  0.435x  0.518x
mct w4 h:           0.835x  0.862x  0.814x  0.887x  1.074x  0.899x  0.884x
mc w2 v:            0.887x  0.966x  0.894x  0.945x  0.808x  0.953x  0.997x
mc w4 v:            0.762x  0.899x  0.766x  0.867x  0.695x  0.915x  1.017x
mct w4 v:           0.700x  0.812x  0.740x  0.777x  0.777x  0.824x  0.853x
mc w2 hv:           0.928x  0.985x  0.929x  0.978x  0.789x  0.969x  1.010x
mct w4 hv:          0.887x  0.913x  0.912x  0.920x  1.001x  0.922x  0.937x

bilinear 16bpc neon      V2      V1      X3      X1    A715     A78     A76
mc w2 0:             0.991x  1.032x  0.993x  0.970x  0.878x  0.925x  0.999x
mct w4 0:            0.811x  0.730x  0.797x  0.680x  0.808x  0.711x  0.805x
mc w4 h:             0.885x  0.901x  0.895x  0.905x  1.003x  0.909x  0.910x
mct w4 h:            0.902x  0.914x  0.898x  0.896x  1.000x  0.897x  0.934x
mc w2 v:             0.888x  0.966x  0.913x  0.955x  0.824x  0.958x  1.005x
mc w4 v:             0.897x  0.894x  0.903x  0.902x  1.001x  0.895x  0.895x
mct w4 v:            0.924x  0.908x  0.921x  0.901x  1.001x  0.904x  0.918x
mc w4 hv:            0.927x  0.925x  0.924x  0.933x  1.000x  0.936x  0.959x
mct w4 hv:           0.923x  0.944x  0.923x  0.944x  0.999x  0.931x  0.956x

8tap 8bpc neon          V2      V1      X3      X1    A715     A78     A76
mct regular w4 0:   0.829x  0.854x  0.735x  0.861x  0.769x  0.766x  0.840x
mc regular w2 h:    0.984x  1.008x  0.983x  1.012x  0.986x  0.989x  0.995x
mc sharp w2 h:      0.987x  1.008x  0.986x  1.011x  0.985x  0.989x  0.995x
mc regular w4 h:    0.907x  0.911x  0.916x  0.908x  0.997x  0.936x  0.932x
mc sharp w4 h:      0.916x  0.914x  0.918x  0.913x  0.999x  0.939x  0.905x
mct regular w4 h:   0.992x  0.979x  0.993x  0.971x  1.000x  0.986x  0.976x
mct sharp w4 h:     0.991x  0.979x  0.989x  0.984x  1.001x  0.979x  0.983x
mc regular w2 v:    1.002x  1.001x  1.005x  1.000x  1.000x  0.998x  0.983x
mc sharp w2 v:      1.005x  1.001x  1.009x  0.998x  0.994x  0.997x  0.989x
mc regular w4 v:    0.985x  0.998x  0.991x  0.998x  1.000x  1.000x  0.983x
mc sharp w4 v:      1.005x  1.002x  1.006x  1.002x  0.998x  0.991x  0.999x
mct regular w4 v:   0.966x  0.967x  0.961x  0.974x  0.996x  0.954x  0.982x
mct sharp w4 v:     0.970x  0.944x  0.967x  0.944x  0.997x  0.951x  0.966x
mc regular w2 hv:   0.993x  0.993x  0.994x  0.987x  0.993x  0.985x  0.999x
mc sharp w2 hv:     0.994x  0.996x  0.992x  0.998x  0.997x  0.999x  0.999x
mc regular w4 hv:   0.964x  0.958x  0.964x  0.960x  0.982x  0.938x  0.958x
mc sharp w4 hv:     0.982x  0.981x  0.980x  0.982x  0.995x  0.986x  0.941x
mct regular w4 hv:  0.993x  0.994x  0.992x  0.994x  0.996x  0.992x  0.988x
mct sharp w4 hv:    0.993x  0.996x  0.991x  0.996x  0.954x  0.992x  1.011x

8tap 16bpc neon         V2      V1      X3      X1    A715     A78     A76
mc regular w2 0:    0.869x  1.059x  0.874x  0.956x  0.883x  0.932x  1.000x
mct regular w4 0:   0.348x  0.369x  0.354x  0.377x  0.560x  0.409x  0.648x
mc regular w2 h:    0.996x  0.988x  0.992x  0.985x  0.989x  0.991x  1.006x
mc sharp w2 h:      0.996x  0.989x  0.979x  0.991x  0.987x  0.988x  0.997x
mc regular w4 h:    0.957x  0.937x  0.957x  0.948x  0.961x  0.927x  0.994x
mc sharp w4 h:      0.966x  0.940x  0.962x  0.954x  0.985x  0.929x  0.970x
mct regular w4 h:   0.922x  0.942x  0.932x  0.933x  1.007x  0.938x  0.905x
mct sharp w4 h:     0.919x  0.943x  0.919x  0.931x  0.971x  0.943x  0.929x
mc regular w2 v:    1.000x  0.997x  1.001x  1.003x  1.001x  0.999x  0.984x
mc sharp w2 v:      1.000x  0.999x  1.000x  0.999x  1.000x  1.000x  0.993x
mc regular w4 v:    0.936x  0.941x  0.936x  0.939x  0.999x  0.928x  0.981x
mc sharp w4 v:      0.955x  0.961x  0.949x  0.956x  0.999x  0.947x  0.953x
mct regular w4 v:   0.977x  0.966x  0.979x  0.968x  0.990x  0.972x  0.972x
mct sharp w4 v:     0.973x  0.965x  0.981x  0.963x  0.994x  0.977x  0.974x
mc regular w2 hv:   0.995x  1.001x  0.995x  0.995x  0.995x  1.000x  0.981x
mc sharp w2 hv:     0.993x  1.012x  0.993x  0.988x  0.996x  0.992x  1.008x
mc regular w4 hv:   0.938x  0.943x  0.939x  0.943x  0.986x  0.943x  0.997x
mc sharp w4 hv:     0.969x  0.959x  0.970x  0.974x  0.986x  0.993x  0.997x
mct regular w4 hv:  0.942x  0.970x  0.951x  0.960x  0.977x  0.958x  1.018x
mct sharp w4 hv:    0.923x  0.958x  0.934x  0.955x  0.973x  0.946x  0.986x

AArch64: Optimize Armv8.0 Neon path of SBD H/HV 6-tap filters · a992a9be

Arpad Panyik authored 4 months ago and Martin Storsjö committed 4 months ago

The 6-tap horizontal filters and the horizontal parts of the 6-tap HV
subpel filters can be further improved with some pointer arithmetic and
by saving some instructions (EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

SBD mct h         X1     A78     A76     A72     A55
 regular  w8:  0.878x  0.894x  0.990x  0.923x  0.944x
 regular w16:  0.962x  0.931x  0.943x  0.949x  0.949x
 regular w32:  0.937x  0.937x  0.972x  0.938x  0.947x
 regular w64:  0.920x  0.965x  0.992x  0.936x  0.944x

SBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.931x  0.970x  0.951x  0.950x  0.971x
 regular w16:  0.940x  0.971x  0.941x  0.952x  0.967x
 regular w32:  0.943x  0.972x  0.946x  0.961x  0.974x
 regular w64:  0.943x  0.973x  0.952x  0.944x  0.975x

AArch64: Optimize Armv8.0 Neon path of HBD HV 6-tap filters · 2d808de1

Arpad Panyik authored 4 months ago and Martin Storsjö committed 4 months ago

The horizontal parts of 6-tap HV subpel filters can be further
improved with some pointer arithmetic and by saving some instructions
(EXTs) in their data rearrangement code.

Relative runtime of micro benchmarks after this patch on Cortex CPU
cores:

HBD mct hv        X1     A78     A76     A72     A55
 regular  w8:  0.952x  0.989x  0.924x  0.973x  0.976x
 regular w16:  0.961x  0.993x  0.928x  0.952x  0.971x
 regular w32:  0.964x  0.996x  0.930x  0.973x  0.972x
 regular w64:  0.963x  0.997x  0.930x  0.969x  0.974x

AArch64: Optimize Armv8.0 Neon path of HBD horizontal 6-tap filters · 93339ce8

Arpad Panyik authored 4 months ago and Martin Storsjö committed 4 months ago

The 6-tap horizontal subpel filters can be further improved with some
pointer arithmetic and by saving some instructions (EXTs) in their data
rearrangement code.

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w8:  0.915x   0.937x   0.900x   0.982x
 mc w16:  0.917x   0.947x   0.911x   0.971x
 mc w32:  0.914x   0.938x   0.873x   0.961x
 mc w64:  0.918x   0.932x   0.882x   0.964x

AArch64: Optimize Armv8.0 Neon path of HBD horizontal filters · 109b2427

Arpad Panyik authored 4 months ago and Martin Storsjö committed 4 months ago

The reduction parts of the horizontal HBD MC filters use SRSHL+SQXTUN+
SRSHL instruction sequences. In the horizontal case this can be
rewritten using a single SQSHRUN instruction with an additional
rounding value (34 for 10-bit and 40 for 12-bit).
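
The rounding values follow from folding two rounding shifts into one
truncating shift. The Python model below is only a sketch: it assumes the
two-step sequence shifts by 6 - intermediate_bits and then by
intermediate_bits, with intermediate_bits being 4 for 10-bit and 2 for
12-bit (these parameters are an assumption, not stated in this message),
and it ignores the upper saturation bound. The helper names are
hypothetical.

```python
def rshift_round(value: int, shift: int) -> int:
    """Rounding arithmetic shift right: add half the divisor, then shift."""
    return (value + (1 << (shift - 1))) >> shift


def two_step(total: int, intermediate_bits: int) -> int:
    """Model of the rounding shift, unsigned narrow, rounding shift sequence."""
    first = rshift_round(total, 6 - intermediate_bits)
    first = max(first, 0)  # unsigned saturation; the upper bound is ignored here
    return rshift_round(first, intermediate_bits)


def one_step(total: int, bias: int) -> int:
    """Model of a single truncating shift by 6 after adding a combined bias."""
    return max((total + bias) >> 6, 0)


# Assumed parameters per bit depth: (intermediate_bits, combined rounding value).
CASES = {10: (4, 34), 12: (2, 40)}

for bitdepth, (bits, bias) in CASES.items():
    for total in range(-5000, 70000):
        assert two_step(total, bits) == one_step(total, bias)
```

Under these assumptions, 34 is 2 + (8 << 2) and 40 is 8 + (2 << 4): the
first rounding term plus the second rounding term scaled by the first
shift.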

Relative runtime of micro benchmarks after this patch on some Cortex
CPU cores:

regular:     X1      A78      A76      A55
 mc  w2:  0.847x   0.864x   0.822x   0.859x
 mc  w4:  0.889x   0.994x   0.868x   0.917x
 mc  w8:  0.857x   0.911x   0.915x   0.978x
 mc w16:  0.890x   0.982x   0.868x   0.974x
 mc w32:  0.904x   0.991x   0.873x   0.967x
 mc w64:  0.919x   1.003x   0.860x   0.970x

Sep 05, 2024

Support using C11 aligned_alloc for dav1d_alloc_aligned · d2687884
Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago

meson: fix include directories when building as subproject · 7629402b

Kacper Michajłow authored 4 months ago and Ronald S. Bultje committed 4 months ago

This makes `#include <dav1d/dav1d.h>` work correctly as we point to the
parent include directory, same as in the normal installation.

This also fixes a conflict when including "version.h", which may already
exist in the parent project or in another subproject, by being more
specific about the headers. A normal build works, but when building as a
subproject, version.h is generated in the build directory, so it is no
longer prioritized when included from dav1d.h, and another header with
the same name may be included instead.

Sep 04, 2024
- Allow software renderers with placebo-gl · 507b697e
  Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago
  
- Disable the mouse cursor in dav1dplay · 312972d6
  Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago
  
- Allow quitting dav1dplay with the escape key · b9cc27d5
  Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago
  
- Allow playing videos in full-screen mode · 2f9fc727
  Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago
  
- dav1dplay: Ensure that SDL is shut down when the application quits · 4e1a8b45
  Cameron Cawley authored 1 year ago and Ronald S. Bultje committed 4 months ago
  
Sep 01, 2024
- Allow getopt fallback to compile on non-Windows platforms · cc6eb3d5
  Cameron Cawley authored 4 months ago
  
Aug 30, 2024
- picture: copy HDR10+ and T35 metadata only to visible frames · bdef2997
  Cosmin Stejerean authored 6 months ago and James Almer committed 4 months ago
  
Aug 29, 2024

Check for sys/types.h before using it · 6b3c489a
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

Only include unistd.h and pthread.h when necessary · 7490d986
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

Remove unused sys/stat.h includes · a796f66e
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

Allow compile time CPU detection to be used when trim_dsp is disabled · 41040189
Cameron Cawley authored 4 months ago and Martin Storsjö committed 4 months ago

aarch64: Split the jump tables to a separate const section · 41511bf1

Martin Storsjö authored 6 months ago

This should allow executing in environments where the executable
memory isn't readable.

Use 4 byte entries instead of 2; most object file formats support
relocations for a 4 byte symbol difference across sections, which
allows keeping the rest of the table lookup code similar to what
it was before.

Referencing a symbol in an arbitrary location in the executable
requires a two instruction sequence (adrp+add, via the movrel
macro).

Thus, the cost of this rewrite is doubling the size of the jump
tables (which were quite small so far), and adding one instruction
in each jump table setup prologue. On an ELF build, the .text section
shrinks by 1176 bytes, and the .rodata section grows by 3136 bytes,
i.e. a 1960 byte increase.

While refactoring, prefer doing sign extension during the load
(using ldrsw rather than ldr, to avoid using the "sxtw" modifier on
the add instruction), as ALU instructions with an extended operand
have a higher latency.
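
As an illustration of the table layout described above, the Python sketch
below models 4-byte signed entries that are sign-extended on load and
added to the table address. The addresses and the choice of storing
"target minus table" are assumptions for illustration only; the last
lines just recheck the size accounting quoted above.

```python
import struct


def build_table(table_addr: int, targets: list[int]) -> bytes:
    """Pack each target as a signed 32-bit difference from the table address."""
    return b"".join(struct.pack("<i", t - table_addr) for t in targets)


def lookup(table: bytes, table_addr: int, index: int) -> int:
    """Load a 4-byte entry with sign extension (as ldrsw does), then add the base."""
    (offset,) = struct.unpack_from("<i", table, 4 * index)
    return table_addr + offset


# Hypothetical addresses: the code sits below the table, so offsets are negative.
TABLE_ADDR = 0x20000
TARGETS = [0x1000, 0x1040, 0x10C0]
table = build_table(TABLE_ADDR, TARGETS)
assert [lookup(table, TABLE_ADDR, i) for i in range(3)] == TARGETS

# Size accounting from the text: .rodata growth minus .text shrinkage.
net_growth = 3136 - 1176
assert net_growth == 1960
```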

MS armasm64 doesn't seem to support calculating symbol differences
across sections (see [1]), so keep the jump tables in the text
section there, to let the assembler calculate it at assembly time
instead. (Keeping the condition as _WIN32 for simplicity, as we don't
interact directly with armasm64, but it is wrapped in gas-preprocessor.)

[1] https://developercommunity.visualstudio.com/t/armasm64-unable-to-create-cross-section/10722340

Fix the macro parameter name for the CHECK_SIZE macro · 0d8abee5
Martin Storsjö authored 4 months ago

Ensure that the refmvs_refpair union is packed · 0255c2b2
Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago

Detect availability of pthread_setname_np and pthread_set_name_np · 033a0909
Cameron Cawley authored 4 months ago and Ronald S. Bultje committed 4 months ago

Aug 26, 2024

aarch64: Enable detection of SVE/SVE2 on Windows · ccb02ddf

Martin Storsjö authored 4 months ago

WinSDK 10.0.26100 added these processor feature constants.

Unfortunately, no constant was added for I8MM, but if SVE_I8MM
is available, we can at least be sure that regular I8MM is
available too.

Aug 24, 2024

aarch64: Fix a label typo · 27491dd9

Martin Storsjö authored 4 months ago

Apparently, this case isn't actually ever executed, at least in most
checkasm runs, but some tools could complain about the relocation
against 160b, which pointed elsewhere than intended.

Aug 23, 2024

aarch64: Avoid looping through the BTI instructions · e560d2ba
Martin Storsjö authored 4 months ago
```
This does the same optimizations as 3329f8d1 and 1790e132 on the rest
of the code.
```

aarch64: ipred: Use the right fill width loop in ipred_z3_fill_padding_neon · 5a33c5c6

Martin Storsjö authored 4 months ago

This makes the code behave as intended, when filling a rectangle
with arbitrary width (filling with the largest power of two width
until filled); previously, it accidentally fell back on writing 4
pixel wide stripes immediately.

No measurable effect on checkasm benchmarks though.

Aug 22, 2024

AArch64: SVE MS armasm64 fix of HBD subpel filters · 472b31f8

Arpad Panyik authored 5 months ago and Martin Storsjö committed 5 months ago

MS armasm64 cannot compile some SVE instructions with immediate
operands, e.g.:
  sub  z0.h, z0.h, #8192

The proper form is:
  sub  z0.h, z0.h, #32, lsl #8

This patch contains the needed fixes.
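
The two forms denote the same value, since the immediate is an 8-bit
number with an optional left shift by 8. A small Python sketch of that
split (the helper names are hypothetical):

```python
def split_sve_imm(value: int) -> tuple[int, int]:
    """Return (imm8, shift) with value == imm8 << shift and shift in {0, 8}."""
    if 0 <= value <= 0xFF:
        return value, 0
    if value & 0xFF == 0 and 0 < (value >> 8) <= 0xFF:
        return value >> 8, 8
    raise ValueError(f"{value} does not fit an 8-bit immediate with optional lsl #8")


def format_sub(reg: str, value: int) -> str:
    """Write the subtraction with the immediate in its explicit shifted form."""
    imm8, shift = split_sve_imm(value)
    suffix = f", lsl #{shift}" if shift else ""
    return f"sub  {reg}, {reg}, #{imm8}{suffix}"


assert format_sub("z0.h", 8192) == "sub  z0.h, z0.h, #32, lsl #8"
```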

aarch64: mc16: Optimize the BTI landing pads in put/prep_neon · 3329f8d1

Martin Storsjö authored 5 months ago

Don't include the BTI landing pad instruction in the loops.

If built with BTI enabled, AARCH64_VALID_JUMP_TARGET expands to
a no-op instruction that indicates that indirect jumps can land
there. But there's no need for the loops to include that instruction.

AArch64: Add HBD subpel filters using 128-bit SVE2 · 01558f3f

Arpad Panyik authored 5 months ago and Martin Storsjö committed 5 months ago

Add an Armv9.0-A SVE2 code path for high bitdepth convolutions. Only
2D convolutions have 6-tap specialisations of their vertical passes.
All other convolutions are 4- or 8-tap filters which fit well with
the 4-element 16-bit SDOT instruction of SVE2.
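
Why 4- and 8-tap filters fit a 4-element dot product can be seen in a
scalar Python model: one accumulator lane takes the sum of four 16-bit
products, so a 4-tap filter needs one such step and an 8-tap filter needs
two. The coefficients and pixel values below are hypothetical.

```python
def sdot4(acc: int, a: list[int], b: list[int]) -> int:
    """Model one accumulator lane: acc plus the dot product of four pairs."""
    assert len(a) == len(b) == 4
    return acc + sum(x * y for x, y in zip(a, b))


def filter_4tap(pixels: list[int], coeffs: list[int]) -> int:
    """A 4-tap filter is a single 4-element dot product."""
    return sdot4(0, pixels, coeffs)


def filter_8tap(pixels: list[int], coeffs: list[int]) -> int:
    """An 8-tap filter is two chained 4-element dot products."""
    acc = sdot4(0, pixels[0:4], coeffs[0:4])
    return sdot4(acc, pixels[4:8], coeffs[4:8])


# Hypothetical data, only for checking the model against a direct sum.
COEFFS = [-1, 4, -10, 39, 39, -10, 4, -1]
PIXELS = [100, 200, 300, 400, 500, 600, 700, 800]
assert filter_8tap(PIXELS, COEFFS) == sum(p * c for p, c in zip(PIXELS, COEFFS))
```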

This patch renames HBD prep/put_neon to prep/put_16bpc_neon and
exports put_16bpc_neon.

Benchmarks show up to a 17% FPS increase, depending on the input video
and the CPU used.

This patch increases the .text section by around 8 KiB.

Relative performance to the C reference on some Cortex-A/X CPUs:

    regular     A715    A720      X3      X4    A510    A520
 w4 hv neon:    3.93x   4.10x   5.21x   5.17x   3.57x   5.27x
 w4 hv sve2:    4.99x   5.14x   6.00x   6.05x   4.33x   3.99x
 w8 hv neon:    1.72x   1.67x   1.98x   2.18x   2.95x   2.94x
 w8 hv sve2:    2.12x   2.29x   2.52x   2.62x   2.60x   2.60x
w16 hv neon:    1.59x   1.53x   1.83x   1.89x   2.35x   2.24x
w16 hv sve2:    1.94x   2.12x   2.33x   2.18x   2.06x   2.06x
w32 hv neon:    1.49x   1.50x   1.66x   1.76x   2.10x   2.16x
w32 hv sve2:    1.81x   2.09x   2.11x   2.09x   1.84x   1.87x
w64 hv neon:    1.52x   1.50x   1.55x   1.71x   1.95x   2.05x
w64 hv sve2:    1.84x   2.08x   1.97x   1.98x   1.74x   1.77x

 w4 h neon:     5.35x   5.47x   7.39x   5.78x   3.92x   5.19x
 w4 h sve2:     7.91x   8.35x  11.95x  10.33x   5.81x   5.42x
 w8 h neon:     4.49x   4.43x   6.50x   4.87x   7.18x   6.17x
 w8 h sve2:     6.09x   6.22x   9.59x   7.70x   7.89x   6.83x
w16 h neon:     2.53x   2.52x   2.34x   1.86x   2.71x   2.75x
w16 h sve2:     3.41x   3.47x   3.53x   3.25x   2.89x   2.96x
w32 h neon:     2.07x   2.08x   1.97x   1.56x   2.17x   2.21x
w32 h sve2:     2.76x   2.84x   2.94x   2.75x   2.24x   2.29x
w64 h neon:     1.86x   1.86x   1.76x   1.41x   1.87x   1.88x
w64 h sve2:     2.47x   2.54x   2.65x   2.46x   1.94x   1.94x

 w4 v neon:     5.22x   5.17x   6.36x   5.60x   4.23x   7.30x
 w4 v sve2:     5.86x   5.90x   7.81x   7.16x   4.86x   4.15x
 w8 v neon:     4.83x   4.79x   6.96x   6.45x   4.74x   8.40x
 w8 v sve2:     5.25x   5.23x   7.76x   6.79x   4.84x   4.13x
w16 v neon:     2.59x   2.60x   2.93x   2.47x   1.80x   4.16x
w16 v sve2:     2.85x   2.88x   3.36x   2.73x   1.86x   2.00x
w32 v neon:     2.12x   2.13x   2.33x   2.03x   1.34x   3.11x
w32 v sve2:     2.36x   2.40x   2.73x   2.32x   1.41x   1.48x
w64 v neon:     1.94x   1.92x   2.02x   1.78x   1.12x   2.59x
w64 v sve2:     2.16x   2.15x   2.37x   2.03x   1.17x   1.22x

 w4 0 neon:     1.75x   1.71x   1.44x   1.56x   3.18x   2.87x
 w4 0 sve2:     4.28x   4.39x   5.72x   6.42x   5.50x   4.68x
 w8 0 neon:     3.05x   3.04x   4.44x   4.64x   3.84x   3.52x
 w8 0 sve2:     3.85x   3.80x   5.45x   6.01x   4.92x   4.26x
w16 0 neon:     2.92x   2.93x   3.82x   3.23x   4.58x   4.44x
w16 0 sve2:     4.29x   4.27x   4.25x   4.15x   5.58x   5.29x
w32 0 neon:     2.73x   2.76x   3.50x   2.67x   4.44x   4.26x
w32 0 sve2:     4.09x   4.10x   3.75x   3.39x   5.67x   5.22x
w64 0 neon:     2.73x   2.70x   3.27x   3.14x   4.57x   4.68x
w64 0 sve2:     4.06x   3.97x   3.54x   3.18x   6.36x   6.25x

      sharp     A715    A720      X3      X4    A510    A520
 w4 hv neon:    3.54x   3.64x   4.43x   4.45x   3.03x   4.72x
 w4 hv sve2:    4.30x   4.55x   5.38x   5.26x   4.04x   3.76x
 w8 hv neon:    1.30x   1.25x   1.51x   1.60x   2.44x   2.43x
 w8 hv sve2:    1.86x   2.06x   2.09x   2.18x   2.37x   2.39x
w16 hv neon:    1.19x   1.16x   1.43x   1.36x   1.95x   1.98x
w16 hv sve2:    1.68x   1.91x   1.94x   1.84x   1.89x   1.94x
w32 hv neon:    1.13x   1.12x   1.30x   1.29x   1.75x   1.81x
w32 hv sve2:    1.58x   1.84x   1.75x   1.74x   1.70x   1.76x
w64 hv neon:    1.13x   1.13x   1.21x   1.25x   1.65x   1.69x
w64 hv sve2:    1.57x   1.84x   1.62x   1.67x   1.62x   1.65x

 w4 h neon:     5.38x   5.49x   7.46x   5.74x   3.93x   5.23x
 w4 h sve2:     7.86x   8.37x  11.99x  10.38x   5.81x   5.40x
 w8 h neon:     3.46x   3.49x   5.36x   4.64x   6.40x   5.62x
 w8 h sve2:     5.95x   6.23x   9.61x   7.76x   7.86x   6.89x
w16 h neon:     1.99x   1.97x   2.07x   1.91x   2.43x   2.51x
w16 h sve2:     3.42x   3.46x   3.75x   3.23x   2.89x   2.98x
w32 h neon:     1.67x   1.62x   1.66x   1.63x   1.95x   2.01x
w32 h sve2:     2.86x   2.84x   2.94x   2.72x   2.21x   2.29x
w64 h neon:     1.45x   1.45x   1.51x   1.48x   1.69x   1.70x
w64 h sve2:     2.47x   2.54x   2.64x   2.46x   1.93x   1.95x

 w4 v neon:     4.07x   4.01x   5.15x   4.74x   3.38x   6.56x
 w4 v sve2:     5.88x   5.86x   7.81x   7.15x   4.85x   4.39x
 w8 v neon:     3.64x   3.59x   5.38x   4.92x   3.59x   7.23x
 w8 v sve2:     5.23x   5.19x   7.77x   6.66x   4.81x   4.13x
w16 v neon:     1.93x   1.95x   2.25x   1.92x   1.35x   3.46x
w16 v sve2:     2.85x   2.88x   3.36x   2.71x   1.86x   1.94x
w32 v neon:     1.57x   1.58x   1.78x   1.60x   1.01x   2.67x
w32 v sve2:     2.36x   2.39x   2.73x   2.35x   1.41x   1.50x
w64 v neon:     1.44x   1.42x   1.54x   1.43x   0.85x   2.19x
w64 v sve2:     2.17x   2.15x   2.37x   2.06x   1.18x   1.25x

Aug 21, 2024

AArch64: Add USMMLA impl. for SBD 6-tap H/HV filters · 713c076d

Arpad Panyik authored 5 months ago

Add 6-tap variant of standard bit-depth horizontal subpel filters
using the Armv8.6 I8MM USMMLA matrix multiply instruction. This patch
also extends the HV filter with 6-tap horizontal pass using USMMLA.
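
A rough Python model of how a 6-tap filter can map onto a 2x8 by 8x2
matrix multiply follows. The operand layout (the filter and a copy
shifted by one element, applied to pixel windows starting at x and x + 2)
is an assumption for illustration and is not taken from the patch; the
taps and pixels are hypothetical.

```python
def usmmla(acc, n_rows, m_rows):
    """Model of the 2x2 += (2x8 unsigned) * (8x2 signed) matrix multiply."""
    assert all(0 <= u <= 255 for row in n_rows for u in row)
    assert all(-128 <= s <= 127 for row in m_rows for s in row)
    return [
        [acc[i][j] + sum(u * s for u, s in zip(n_rows[i], m_rows[j]))
         for j in range(2)]
        for i in range(2)
    ]


def filter_6tap_x4(pixels, taps):
    """Four adjacent 6-tap outputs from one multiply (assumed operand layout)."""
    m_rows = [taps + [0, 0], [0] + taps + [0]]  # filter, and filter shifted by one
    n_rows = [pixels[0:8], pixels[2:10]]        # windows starting at x and x + 2
    acc = usmmla([[0, 0], [0, 0]], n_rows, m_rows)
    return [acc[0][0], acc[0][1], acc[1][0], acc[1][1]]


# Hypothetical data, only for checking the model against a direct convolution.
TAPS = [2, -14, 76, 76, -14, 2]
PIXELS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
direct = [sum(t * PIXELS[x + k] for k, t in enumerate(TAPS)) for x in range(4)]
assert filter_6tap_x4(PIXELS, TAPS) == direct
```

In this model, six taps leave two zero slots in each 8-element row, which
is what lets the shifted copy share the same pixel window.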

Benchmarks show up to a 6-7% FPS increase, depending on the input video
and the CPU used.

This patch increases the .text section by around 1.2 KiB.

Relative runtime of micro benchmarks after this patch on Neoverse
and Cortex CPU cores:

regular      V2      V1      X3    A720    A715    A520    A510
  w8 hv:  0.860x  0.895x  0.870x  0.896x  0.896x  0.938x  0.936x
 w16 hv:  0.829x  0.886x  0.865x  0.908x  0.906x  0.946x  0.944x
 w32 hv:  0.837x  0.883x  0.862x  0.914x  0.915x  0.953x  0.949x
 w64 hv:  0.840x  0.883x  0.862x  0.914x  0.914x  0.955x  0.952x

  w8 h:   0.746x  0.754x  0.747x  0.723x  0.724x  0.874x  0.866x
 w16 h:   0.749x  0.764x  0.745x  0.731x  0.731x  0.858x  0.852x
 w32 h:   0.739x  0.754x  0.738x  0.729x  0.729x  0.839x  0.837x
 w64 h:   0.736x  0.749x  0.733x  0.725x  0.726x  0.847x  0.836x

Aug 12, 2024

AArch64: Fix typo in SBD 6-tap 2D/HV subpel filter · 287e90a3

Arpad Panyik authored 5 months ago

The macro parameter \xmy of filter_8tap_fn was incorrectly used as a
pointer instead of \lsrc. They refer to the same register, but in
different contexts.

Aug 04, 2024
- decode_coefs: Optimize index offset calculations · 5ef6b241
  Kyle Siefring authored 5 months ago
```
Performance Impact on Sapphire Rapids:

Chimera: 0.46% Faster
```
Jun 26, 2024

AArch64: Move constants of DotProd subpel filters to .rodata · 2355eeb8

Arpad Panyik authored 6 months ago

The constants used for the subpel filters were placed in the .text
section for simplicity and peak performance, but this does not work on
systems with execute only .text sections (e.g.: OpenBSD).

The performance cost of moving the constants to the .rodata section
is small and mostly within the measurable noise.

Jun 25, 2024

aarch64: Explicitly use the ldur instruction where relevant in mc_dotprod.S · 7fbcdc6d

Martin Storsjö authored 6 months ago

The ldr instruction can only handle offsets that are a multiple
of the element size; most assemblers implicitly produce the ldur
instruction when a non-aligned offset is provided.

Older versions of MS armasm64, however, error out on this. Since
MSVC 2022 17.8, armasm64 can implicitly produce ldur, but 2022 17.7
and earlier require explicitly writing the instruction as ldur.
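
The choice an assembler makes here can be sketched in Python. The model
covers only the plain immediate-offset forms (a scaled unsigned 12-bit
offset for ldr, an unscaled signed 9-bit offset for ldur); the function
name is hypothetical.

```python
def load_mnemonic(offset: int, size: int) -> str:
    """Pick the load form for an immediate byte offset and access size in bytes."""
    if offset >= 0 and offset % size == 0 and offset // size <= 4095:
        return "ldr"   # unsigned offset, scaled by the access size
    if -256 <= offset <= 255:
        return "ldur"  # signed 9-bit offset, unscaled
    raise ValueError("offset needs a separate address calculation")


assert load_mnemonic(16, 16) == "ldr"
assert load_mnemonic(8, 16) == "ldur"
```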

Despite this, even older versions still fail to build the mc_dotprod.S
sources, with errors like this:

    src\libdav1d.a.p\mc_dotprod.obj.asm(556) : error A2513: operand 2: Constant value out of range
        mov             x10, (((0*15-1)<<7)|(3*15-1))

This happens on MSVC 2022 17.1 and older, while 17.2 and newer
accept the negative value expression here.
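
For reference, the constant expression in the error above evaluates to a
negative number, which a quick Python check confirms:

```python
# The operand from the failing mov, evaluated with the same operators.
value = ((0 * 15 - 1) << 7) | (3 * 15 - 1)
assert value == -84

# The same value as a 64-bit register bit pattern.
assert value & 0xFFFFFFFFFFFFFFFF == 0xFFFFFFFFFFFFFFAC
```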

In practice, HAVE_DOTPROD doesn't get enabled by the Meson configure
script at the moment, as it uses inline assembly to test for external
assembler features.

Add Arm OpenBSD run-time CPU feature detection support · 431f4fb2
Brad Smith authored 7 months ago and Martin Storsjö committed 6 months ago
```
Add run-time CPU feature detection for DotProd and i8mm on AArch64.
```
x86: Add 6-tap variants of high bit-depth mc SSSE3 functions · 32bf6cde
Henrik Gramner authored 7 months ago

Jun 17, 2024
- itx: restrict number of columns iterated over based on EOB · ca83ee6d
  Ronald S. Bultje authored 7 months ago
  