- Oct 19, 2023
-
-
- Oct 05, 2023
-
-
Ronald S. Bultje authored
-
- Oct 04, 2023
-
-
- Oct 03, 2023
-
-
Jean-Baptiste Kempf authored
-
- Sep 08, 2023
-
-
André Kempe authored
Amend call type in refmvs. Because these blocks are reached via blr x11, they need to be annotated. Add missing BTI landing pads in ipred.S and ipred16.S. Because the subroutines are called via a br from register, they need annotation with 'bti j' (AARCH64_VALID_JUMP_TARGET).
-
- Aug 18, 2023
-
-
- Jul 25, 2023
-
-
Niklas Haas authored
Avoiding this hard-coded round-and-shift allows FGS to continue working when modifying FG_BLOCK_SIZE (for whatever reason), and is better style (no magic constants).
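As a rough illustration of the difference, the sketch below contrasts a hard-coded round-and-shift with a count derived from the constant. The value 32 for FG_BLOCK_SIZE and both helper names are assumptions made for this example only; the commit message itself only names the constant.

```c
#include <assert.h>

/* Assumed value for illustration; the commit message only names the constant. */
#define FG_BLOCK_SIZE 32

/* Hard-coded form: only correct while the block size is exactly 32. */
static int num_blocks_hardcoded(const int len) {
    return (len + 31) >> 5;
}

/* Derived form: keeps working if FG_BLOCK_SIZE is changed. */
static int num_blocks(const int len) {
    assert(len >= 0);
    return (len + FG_BLOCK_SIZE - 1) / FG_BLOCK_SIZE;
}
```

With the block size at 32 both helpers agree; only the second one stays correct for other sizes.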
-
Niklas Haas authored
Makes this (globally available) constant more descriptive.
-
- Jul 18, 2023
-
-
Reduces memory usage (by 3 kB per sb128 for 4:2:0) when decoding streams with subsampled chroma when frame threading is enabled. This also simplifies the logic for calculating cbi indices. Both entropy decoding and reconstruction access the elements in the same order, so calculating block x/y positions is redundant and we can instead just store values sequentially and increase the pointer by one every time it's accessed.
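The sequential-access idea can be sketched as below. BlockInfo, BlockCursor and the two helpers are hypothetical names for this example and do not reflect dav1d's actual structures; the point is only that a writer and a reader visiting blocks in the same order need no x/y index calculation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-block record, not dav1d's actual layout. */
typedef struct { uint8_t value; } BlockInfo;

/* Hypothetical cursor into a per-superblock buffer. */
typedef struct { BlockInfo *next; } BlockCursor;

/* Writer side (e.g. entropy decoding): store, then advance by one. */
static void put_block(BlockCursor *const c, const BlockInfo v) {
    *c->next++ = v;
}

/* Reader side (e.g. reconstruction): same visiting order, so it also
 * just reads and advances by one. */
static BlockInfo get_block(BlockCursor *const c) {
    return *c->next++;
}
```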
-
- Jul 12, 2023
-
-
Matthias Dressel authored
Integrates --bench-c into --bench to simplify benchmarks.
-
- Jul 07, 2023
-
-
Martin Storsjö authored
Windows RC files can have strings expressed either as narrow chars in a specific codepage, or as wide Unicode strings. Regardless of which way they are expressed, they are converted into Unicode strings in the compiled resource files. When using narrow strings, even if using escaped chars like \251, those chars are interpreted according to a specific codepage. The codepage can be specified with arguments to the RC/windres tool (or with a pragma, but not all tools support the pragmas), but when no codepage is specified, the exact interpretation varies. llvm-rc takes a hard stance, defaulting to only accepting ANSI chars unless something else has been specified (and pragmas aren't supported). llvm-windres defaults to CP 850 though, for compatibility with what most people probably intend. However, GNU windres and MS rc.exe actually default to the system's current default codepage. That means that if the resource file is built on a machine with e.g. Japanese as the default locale, the file gets built differently, with a different Unicode character than what was intended. By converting the strings to wide strings, it is unambiguous that \251 refers to the Unicode code point U+00A9 (octal 0251), i.e. the copyright sign. This fixes building the RC files with llvm-rc. With GNU windres, llvm-windres and rc.exe, the files still generate the bitwise exact same output as before.
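C string literals behave in a comparable way, which makes the arithmetic easy to check. The sketch below is plain C, not RC syntax, and the helper names are made up for this example: the octal escape \251 in a wide literal is the value 0xA9 itself, while in a narrow literal it is just a byte whose meaning depends on whichever codepage is later applied.

```c
#include <assert.h>
#include <wchar.h>

/* Wide literal: the escape is the code unit value itself (0251 octal = 0xA9). */
static wchar_t wide_copyright(void) {
    return L"\251"[0];
}

/* Narrow literal: just the byte 0xA9; which character it denotes depends
 * on the codepage used to interpret it. */
static unsigned char narrow_byte(void) {
    return (unsigned char)"\251"[0];
}
```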
-
Matthias Dressel authored
-
- Jul 06, 2023
-
-
Regression introduced in 72e9c7c0.
-
-
-
-
Pack two indices into each byte instead of storing them separately. Reduces memory usage by up to 16 kB per sb128 in streams that use screen content tools when frame-threading is enabled, at the cost of some additional computational overhead for packing/unpacking.
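A minimal sketch of this kind of packing, assuming every index fits in 4 bits. The function names and the low-nibble-first order are choices made for this example, not necessarily what the commit does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack n indices (each 0..15) into (n + 1) / 2 bytes, low nibble first. */
static void pack_indices(uint8_t *const dst, const uint8_t *const src,
                         const size_t n)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2)
        dst[i >> 1] = (uint8_t)(src[i] | (src[i + 1] << 4));
    if (i < n) /* odd count: the last index occupies a low nibble alone */
        dst[i >> 1] = src[i];
}

/* Inverse operation: expand the packed bytes back to one index per byte. */
static void unpack_indices(uint8_t *const dst, const uint8_t *const src,
                           const size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i >> 1] >> ((i & 1) * 4)) & 0xF;
}
```

The packed buffer is half the size, and every access pays for a shift and a mask, which is the overhead the commit message mentions.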
-
Reduces memory usage by 6 kB per sb128 in 8bpc streams that use screen content tools when frame-threading is enabled.
-
Only one of the sign or no-sign 4:4:4 tables is ever used for any given wedge index, so there's no point in having both. Reduces the table size by around 50 kB.
-
Replace pointers with 16-bit relative offsets and remove entries for unused block sizes (only 8x8..32x32 are relevant). Reduces the table size by around 17 kB.
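The pointer-to-offset idea could look roughly like this. The table contents, sizes and names are hypothetical; only the technique (16-bit offsets relative to one base array instead of full pointers) comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical payload: three variable-length entries stored back to back. */
static const uint8_t mask_data[] = { 10, 11, 12, 13, 20, 21, 30 };

/* Instead of an array of `const uint8_t *` (8 bytes per entry on typical
 * 64-bit targets), store 2-byte offsets relative to mask_data. */
static const uint16_t mask_offsets[3] = { 0, 4, 6 };

/* Resolve an entry by adding its offset to the base pointer. */
static const uint8_t *get_mask(const int idx) {
    assert(idx >= 0 && idx < 3);
    return mask_data + mask_offsets[idx];
}
```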
-
- Jul 01, 2023
-
-
Martin Storsjö authored
Add an explicit align before the jump table; this avoids armasm bugs in how label differences are calculated. This matches how all other jump tables are written in our 32 bit arm assembly.
-
- Jun 30, 2023
-
-
Victorien Le Couviour--Tuffet authored
-
Martin Storsjö authored
Relative speedup compared to C:
                 Cortex A7    A8    A9   A53   A72   A73
save_tmvs_neon:       1.20  1.42  1.25  1.58  1.26  1.99
-
Martin Storsjö authored
Also improve scheduling in the prologue and fix a few cases of inconsistent indentation.
Before:         Cortex A53      A55      A72      A73      A76  Apple M1
save_tmvs_neon:    73657.2  74470.9  72238.1  56095.4  34135.7     207.9
After:
save_tmvs_neon:    72187.2  74434.6  71068.9  56043.9  33237.4     201.0
(The changes to the M1 numbers are mostly measurement noise though.)
-
- Jun 28, 2023
-
-
Martin Storsjö authored
Binutils and LLVM assemblers can infer that this str instruction must be stur (and implicitly assemble it into that instruction), while MS armasm64 errored out with this message:
src\libdav1d.a.p\refmvs.obj.asm(673) : error A2518: operand 2: Memory offset must be aligned
        str q2, [x3, #(8*5-16)]
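The arithmetic behind the error can be checked with a small sketch. The helper names are made up for this example; the encoding limits are the AArch64 ones: the unsigned-offset form of str for a 128-bit register scales a 12-bit immediate by 16, while stur takes an unscaled signed 9-bit offset.

```c
#include <assert.h>
#include <stdbool.h>

/* STR Qt, [Xn, #imm] (unsigned offset): imm must be a multiple of 16
 * in the range 0..65520. */
static bool fits_str_q(const int off) {
    return off >= 0 && off <= 4095 * 16 && off % 16 == 0;
}

/* STUR Qt, [Xn, #imm]: unscaled offset in the range -256..255. */
static bool fits_stur(const int off) {
    return off >= -256 && off <= 255;
}
```

The offset in the message is 8*5-16 = 24, which is not a multiple of 16, so only the stur form can encode it.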
-
- Jun 26, 2023
-
-
Martin Storsjö authored
Before:         Cortex A53      A55      A72      A73      A76  Apple M1
save_tmvs_neon:    79184.7  79889.9  54720.2  54522.6  29919.6     216.4
After:
save_tmvs_neon:    73780.0  74339.2  70414.1  59102.0  35028.4     213.9
The benefit from this is marginal on Cortex A53 and A55, and Apple M1, while this change actually makes the code notably slower on Cortex A72, A73 and A76.
-
Martin Storsjö authored
                Cortex A53      A55      A72      A73      A76  Apple M1
save_tmvs_c:      116768.4 122653.1  82587.7  90445.0  45386.8     242.1
save_tmvs_neon:    79184.7  79889.9  54720.2  54522.6  29919.6     216.4
Relative speedup compared with C:
                Cortex A53      A55      A72      A73      A76  Apple M1
save_tmvs_neon:       1.47     1.54     1.51     1.66     1.52      1.12
-
- Jun 22, 2023
-
-
Martin Storsjö authored
Make them operate in a more cache friendly manner, interleaving the various passes, and merging some of the functions that operate on data in similar patterns.
This reduces the amount of stack used from 207 KB to 14 KB for sgr_3x3, from 207 KB to 16 KB for sgr_5x5 and from 255 KB to 33 KB for sgr_mix.
This does however increase the size of the binary by about 12 KB. (The executable code generated from assembly actually shrinks by a little, but the higher level logic in C is quite nontrivial.)
This is somewhat similar to what was done for x86 in fe2bb774.
Benchmarks from checkasm:
Before:             Cortex A53      A55      A72      A73      A76  Apple M1
sgr_3x3_8bpc_neon:    493005.0 483133.2 365056.3 345197.9 202819.1     537.3
sgr_5x5_8bpc_neon:    353152.6 349614.3 268962.2 248431.8 142302.4     385.9
sgr_mix_8bpc_neon:    829903.9 815910.9 622858.5 577238.0 333362.9     881.7
sgr_3x3_10bpc_neon:   504778.6 499851.6 379203.1 346695.2 199738.7     537.0
sgr_5x5_10bpc_neon:   363111.9 362489.7 267903.1 247506.5 138417.2     351.3
sgr_mix_10bpc_neon:   853053.7 846768.8 628349.6 584553.8 328399.5     843.6
After:
sgr_3x3_8bpc_neon:    387949.9 384216.4 294423.7 301968.2 184643.1     492.4
sgr_5x5_8bpc_neon:    259854.7 257233.2 193983.7 198388.4 128497.0     341.2
sgr_mix_8bpc_neon:    606401.5 595661.3 457209.7 462721.8 281906.7     738.6
sgr_3x3_10bpc_neon:   392472.7 394100.5 296048.1 304339.4 184271.4     471.3
sgr_5x5_10bpc_neon:   257248.3 257651.1 197552.5 199655.1 130739.7     322.9
sgr_mix_10bpc_neon:   605263.3 611197.4 441789.3 461339.2 286320.1     721.4
Speedup vs before:      27-41%   25-40%   23-42%   13-26%    5-18%     8-19%
-
Martin Storsjö authored
This issue isn't caught by checkasm, since these functions are internal to the SGR implementation, and checkasm only affects the parameters on the external DSP function interface. This could potentially trigger errors with future compilers.
-
- Jun 12, 2023
-
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
-
James Almer authored
Signed-off-by: James Almer <jamrial@gmail.com>
-
- Jun 09, 2023
-
-
James Almer authored
Missed in 31de9d50.
-
- Jun 07, 2023
-
-
Always-enabled basic sanity checks in API functions are reasonable, but within internal functions assert() is more appropriate when it comes to checking for "should never happen" conditions.
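The distinction might be sketched as follows. Ctx, internal_threads and api_get_threads are hypothetical names invented for this example and are not part of the library's API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical context type for illustration. */
typedef struct { int n_threads; } Ctx;

/* Internal helper: a NULL context here would be a bug in the library
 * itself, so an assert() that can compile out in release builds suffices. */
static int internal_threads(const Ctx *const c) {
    assert(c != NULL);
    return c->n_threads;
}

/* Public entry point: callers are outside our control, so the check is
 * always enabled and reports an error instead of aborting. */
int api_get_threads(const Ctx *const c, int *const out) {
    if (!c || !out) return -EINVAL;
    *out = internal_threads(c);
    return 0;
}
```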
-
-
-
Martin Storsjö authored
After 8f320d59, MSVC started producing this warning:
[63/123] Compiling C object src/libdav1d.a.p/obu.c.obj
../src/obu.c(708): warning C4244: '=': conversion from 'uint16_t' to 'uint8_t', possible loss of data
-
-
- Jun 06, 2023
-
-
-
James Almer authored
There's no reason to be so strict by ensuring the tool only works with a library built from the exact same git snapshot, when the only thing that matters is API availability and ABI compatibility.
Signed-off-by: James Almer <jamrial@gmail.com>
-