Commits on Source (91)
-
0d02b5e4
-
bsr has 3 cycles of latency on modern x86 processors. For this function, it's possible to obtain the number of bits to shift by alternative means. I'd estimate roughly a 0.2% decrease in CPU usage, based on the percentages associated with function symbols in perf report. Benchmarks were run on a Ryzen 5 3600 (Zen 2). The clip used was the original 1080p Chimera.
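A minimal sketch of the general idea of deriving a shift count without a bit-scan instruction; this is only an illustration (the helper names and the table-based variant are not from the commit):

    #include <stdint.h>

    /* bsr/lzcnt form: what __builtin_clz typically compiles to. */
    static inline int ulog2_bsr(unsigned x) {
        return 31 - __builtin_clz(x);
    }

    /* Alternative for 16-bit inputs: two range checks plus a 16-entry table. */
    static inline int ulog2_table(unsigned x) {
        static const uint8_t log_tab[16] = {
            0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3
        };
        int n = 0;
        if (x >= 1 << 8) { x >>= 8; n += 8; }
        if (x >= 1 << 4) { x >>= 4; n += 4; }
        return n + log_tab[x];
    }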
5dc55af6 -
5d7aa26e
-
James Almer authored
Ensure that both the allocator and release callbacks point to the default functions and that no cookie was provided. This prevents the user from configuring a mix of custom and default callbacks.
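A minimal sketch of the check being described, assuming the default callbacks are the ones exposed in dav1d's public picture.h (dav1d_default_picture_alloc / dav1d_default_picture_release); the actual validation code may differ:

    #include <stddef.h>
    #include <dav1d/picture.h>

    /* Treat the allocator as "default" only when every field still holds the
     * library defaults, so custom and default callbacks can't be mixed. */
    static int allocator_is_default(const Dav1dPicAllocator *a) {
        return a->alloc_picture_callback   == dav1d_default_picture_alloc &&
               a->release_picture_callback == dav1d_default_picture_release &&
               a->cookie == NULL;
    }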
-
Martin Storsjö authored
This corresponds to what the x86 assembly does right now. This allows removing a fair bit of code, and allows marking the stores as aligned. (Previously, the writes of the narrow slice temp buffer were unaligned.)
4d1d479d -
Martin Storsjö authored
This gives a speedup of around one cycle.
6ca0e228 -
Martin Storsjö authored
Also fix the location of one comment, to be consistent with other similar comments.
01842ed3 -
Martin Storsjö authored
62f2ec41
-
Martin Storsjö authored
2df07ed7
-
Martin Storsjö authored
ab4ec8bc
-
Martin Storsjö authored
This might cause a slowdown of around one cycle on some cores, as the instructions were previously placed in a latency bubble, but it simplifies the code by moving them to the header where they'd normally be.
39dbabeb -
Martin Storsjö authored
e08b784e
-
Martin Storsjö authored
Samples of some checkasm benchmarks:
                                  Cortex A7      A8     A53     A72     A73
cfl_ac_420_w4_16bpc_neon:             258.2   130.0   187.8    88.1    99.9
cfl_ac_420_w8_16bpc_neon:             396.3   192.3   278.0   134.1   148.1
cfl_ac_420_w16_16bpc_neon:            705.9   341.5   508.4   231.2   263.0
intra_pred_filter_w32_10bpc_neon:    3450.6  3279.7  1505.6  1716.8  1631.0
intra_pred_filter_w32_12bpc_neon:    5075.2  2467.3  2027.9  1605.7  1556.0
intra_pred_paeth_w64_16bpc_neon:     7850.6  4682.9  4538.4  4640.4  4952.4
intra_pred_smooth_w64_16bpc_neon:    6807.7  4044.0  4001.4  3001.9  3131.5
Corresponding numbers for arm64:
                                 Cortex A53     A72     A73
cfl_ac_420_w4_16bpc_neon:             154.8    87.1    81.6
cfl_ac_420_w8_16bpc_neon:             235.6   124.8   133.0
cfl_ac_420_w16_16bpc_neon:            428.8   206.5   234.9
intra_pred_filter_w32_10bpc_neon:    1333.2  1485.9  1468.3
intra_pred_filter_w32_12bpc_neon:    1839.1  1429.0  1439.7
intra_pred_paeth_w64_16bpc_neon:     3691.1  3091.8  3289.7
intra_pred_smooth_w64_16bpc_neon:    3776.8  3124.4  2827.1
168c5d5e -
Konstantin Pavlov authored
Also specify amd64 to be future-proof for when we have Big Sur+ builders.
864b1995 -
3ccfc25a
-
Remove half of the masks since they are only used for cdef at an 8x8 level of granularity. Load the mask and combine the 16-bit sections into the 32-bit sections outside of the inner cdef loop. This should save some registers. Results in mild performance improvements.
0bd57c6b -
Relative speed-ups compared with gcc-9.2.0:
                                   Before     After
mc_8tap_regular_w2_h_16bpc_c:       276.6     219.9
mc_8tap_regular_w4_h_16bpc_c:       489.5     374.5
mc_8tap_regular_w8_h_16bpc_c:       897.7     686.8
mc_8tap_regular_w16_h_16bpc_c:     2573.7    2314.2
mc_8tap_regular_w32_h_16bpc_c:     7647.3    7012.4
mc_8tap_regular_w64_h_16bpc_c:    28163.8   25057.4
mc_8tap_regular_w128_h_16bpc_c:   77678.4   73570.0
b12229cc -
Victorien Le Couviour--Tuffet authored
493d2b91
-
Victorien Le Couviour--Tuffet authored
Closes #203.
63a918b4 -
Victorien Le Couviour--Tuffet authored
05d05f97
-
Before:              Cortex A53    A55    A72    A73
cdef_dir_8bpc_neon:       400.0  391.2  269.7  282.9
cdef_dir_16bpc_neon:      417.7  413.0  303.8  313.6
After:
cdef_dir_8bpc_neon:       369.0  360.2  248.4  273.4
cdef_dir_16bpc_neon:      388.7  384.0  272.2  290.7
11cb2efa -
Janne Grunau authored
oss-fuzz uses '-Denable_tools=false'.
dd32acea -
On Zen 2 and 3, vpermq is slower than vperm2i128. In some assembly we use the former to swap the lanes of a vector when we could be using the latter. On Zen 1 the relative cost is reversed, so this patch will be slower there. On current Intel CPUs these instructions are equally expensive, so there should be no impact there.
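For reference, the two lane-swap forms expressed as intrinsics; a minimal sketch, not the actual dav1d assembly:

    #include <immintrin.h>

    /* Swap the two 128-bit lanes of a 256-bit vector. */
    static __m256i swap_lanes_vpermq(__m256i v) {
        return _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1, 0, 3, 2)); /* vpermq */
    }

    static __m256i swap_lanes_vperm2i128(__m256i v) {
        return _mm256_permute2x128_si256(v, v, 0x01); /* vperm2i128 */
    }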
a0e9a2e3 -
Victorien Le Couviour--Tuffet authored
5686e835
-
SGR uses edge detection to decide which pixels to modify, but if the input is pure random noise there aren't going to be many (if any) edges. As a result, the entire function call often ends up doing nothing, which isn't ideal when we want to test the code for correctness. Change the input randomization algorithm to generate a checkerboard pattern with limited noise applied to the flat areas.
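A minimal sketch of such an input pattern; the block size, levels and noise source are illustrative, not the actual checkasm harness:

    #include <stdint.h>
    #include <stdlib.h>

    /* Checkerboard of flat 8x8 blocks with limited noise, so SGR's edge
     * detection finds real edges instead of pure random noise. */
    static void fill_checkerboard(uint8_t *buf, int w, int h, int stride) {
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                const int flat  = ((x >> 3) ^ (y >> 3)) & 1 ? 192 : 64;
                const int noise = (rand() & 15) - 8;
                const int v = flat + noise;
                buf[y * stride + x] = v < 0 ? 0 : v > 255 ? 255 : v;
            }
    }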
f539111b -
f54cf173
-
9463c9f5
-
Matthias Dressel authored
bb3539a9
-
Matthias Dressel authored
Remnants of the code restructuring in 89ea92ba.
ecb00748 -
Victorien Le Couviour--Tuffet authored
Calling dav1d_close already takes care of flushing the internal state, so calling flush just before it is superfluous.
66c8a1ec -
Victorien Le Couviour--Tuffet authored
4db73f11
-
Victorien Le Couviour--Tuffet authored
549086e4
-
Martin Storsjö authored
52c09394
-
Martin Storsjö authored
Use a variable mask for inserting padding, instead of fixed code paths for different padding widths. This allows simplifying the filtering logic to simply always process 8 pixels at a time. Also improve scheduling of the loop subtract instruction in all these cases.
9c1f276d -
Martin Storsjö authored
This gives a minor speedup on 8 bpc and a bit bigger speedup on 16 bpc. Sample speedups from arm64:
Before:                   Cortex A53       A72       A73
wiener_7tap_8bpc_neon:      143885.7  101571.5   96187.2
wiener_7tap_10bpc_neon:     171210.8  119410.4  122447.8
After:
wiener_7tap_8bpc_neon:      142985.0   94400.8   89959.3
wiener_7tap_10bpc_neon:     168818.4  113980.2  116662.0
55e9f7a4 -
Martin Storsjö authored
Only doing this for 8bpc; for higher bitdepths, adding the input coefficients can overflow a signed 16-bit element.
Before:                   Cortex A53      A72      A73
wiener_7tap_8bpc_neon:      142985.0  94400.8  89959.3
After:
wiener_7tap_8bpc_neon:      136614.4  88828.3  86997.0
24f9304e -
Replace checks for INTER or SWITCH frames with a simple macro for increased readability and maintainability.
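An illustrative form of such a macro, built on the public frame-type enum; the actual name and location in the tree may differ:

    #include <dav1d/headers.h>

    /* INTER and SWITCH frames are handled identically in many places. */
    #define IS_INTER_OR_SWITCH(frame_hdr) \
        ((frame_hdr)->frame_type == DAV1D_FRAME_TYPE_INTER || \
         (frame_hdr)->frame_type == DAV1D_FRAME_TYPE_SWITCH)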
54747d42 -
Should make the code more readable.
6361e88d -
Signed-off-by: James Almer <jamrial@gmail.com>
7c316a70 -
Victorien Le Couviour--Tuffet authored
288ed4b8
-
Martin Storsjö authored
6660fd00
-
Use mla (8-bit -> 8-bit) instead of smlal (8-bit -> 16-bit).
Before:                        Cortex A53    A72    A73
cdef_filter_4x4_8bpc_neon:          389.7  264.0  261.7
cdef_filter_4x8_8bpc_neon:          687.2  476.2  465.5
cdef_filter_8x8_8bpc_neon:         1152.9  752.1  789.5
After:
cdef_filter_4x4_8bpc_neon:          385.2  263.4  259.2
cdef_filter_4x8_8bpc_neon:          677.5  473.8  459.8
cdef_filter_8x8_8bpc_neon:         1134.4  744.6  774.6
38d4d0bd -
Kyle Siefring authored
95c43101
-
Avoid moving between 8- and 16-bit vectors where possible.
833382b3 -
- Reorder loads of filters to benefit in-order cores.
- Use full 128-bit vectors to transpose 8x8 bytes. zip1 is called in the first stage, which will hurt performance on some older big cores.
- Rework the horizontal stage for 8-bit mode:
  * Use smull instead of mul
  * Replace existing narrow and long instructions
  * Replace the mov after calling with a right shift
Before:                 Cortex A55     A53     A72     A73
warp_8x8_8bpc_neon:         1683.2  1860.6  1065.0  1102.6
warp_8x8t_8bpc_neon:        1673.2  1846.4  1057.0  1098.4
warp_8x8_16bpc_neon:        1870.7  2031.7  1147.3  1220.7
warp_8x8t_16bpc_neon:       1848.0  2006.2  1121.6  1188.0
After:
warp_8x8_8bpc_neon:         1267.2  1446.2   807.0   871.5
warp_8x8t_8bpc_neon:        1245.4  1422.0   810.2   868.4
warp_8x8_16bpc_neon:        1769.8  1929.3  1132.0  1238.2
warp_8x8t_16bpc_neon:       1747.3  1904.1  1101.5  1207.9
a3b8157e -
Martin Storsjö authored
505e9990
-
If the postfilter task allocation fails, a deadlock would occur.
8b1a96e4 -
These functions are not thread-safe on GL, because they are not called from the thread holding the GL context. Work around this by simply disabling it. Not very optimal, but better than crashing.
06e8ed37 -
Upstream libplacebo added support for dav1d integration directly, allowing us to vastly simplify all of this code. In order to take advantage of new optimizations, I had to allow update_frame to unref the Dav1dPicture. (This is fine, since a double unref is a no-op.) In addition, some of the functions we use were deprecated in recent libplacebo versions, so since we're taking a new dependency we might as well fix the deprecation warnings.
61b65456 -
The current playback loop triggers a repaint on any single event, including spammy events such as SDL_MOUSEMOTION. Fix this by only repainting on SDL_WINDOWEVENT_EXPOSED, which is defined as the event sent when the window was damaged and needs to be repainted, as well as on new frames. Fixes videolan/dav1d#356
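In SDL2 terms the fix amounts to filtering the event loop roughly as below; repaint(), render_new_frame() and new_frame_event are placeholders for the player's own code:

    #include <SDL.h>

    void repaint(void);           /* placeholder: redraw the last frame */
    void render_new_frame(void);  /* placeholder: draw a newly decoded frame */

    static void event_loop(Uint32 new_frame_event) {
        SDL_Event e;
        while (SDL_WaitEvent(&e)) {
            if (e.type == SDL_QUIT)
                break;
            if (e.type == SDL_WINDOWEVENT &&
                e.window.event == SDL_WINDOWEVENT_EXPOSED)
                repaint();              /* window was damaged, must redraw */
            else if (e.type == new_frame_event)
                render_new_frame();     /* custom event posted per frame */
            /* other events (e.g. SDL_MOUSEMOTION) no longer repaint */
        }
    }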
eab4ef6a -
The arm32 version is less generic and has a bit more caveats, but still belongs as a shared utility in a header.
2dca9b28 -
Henrik Gramner authored
ffc4e01c
-
Janne Grunau authored
93319cef
-
Matthias Dressel authored
Verification should not succeed if the given string is too short to be a real hash. Fixes videolan/dav1d#361
061ac9ae -
The required 'xxhash.h' header can either be in a system include directory or be copied to 'tools/output'. The xxh3_128bits-based muxer shows no significant slowdown compared to the null muxer.
Decoding times for Chimera-AV1-8bit-1920x1080-6736kbps.ivf with 4 frame and 4 tile threads on a Core i7-8550U (turbo boost disabled):
null: 72.5 s
md5:  99.8 s
xxh3: 73.8 s
Decoding Chimera-AV1-10bit-1920x1080-6191kbps.ivf with 6 frame and 4 tile threads on an M1 Mac mini:
null:  27.8 s
md5:  105.9 s
xxh3:  28.3 s
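For reference, the streaming XXH3 128-bit API the muxer builds on looks roughly like this; a sketch only, as the real verify muxer hashes each plane of each frame and handles errors:

    #include <stdio.h>
    #include <stdint.h>
    #include <xxhash.h>

    static void print_xxh3_128(const uint8_t *data, size_t size) {
        XXH3_state_t *state = XXH3_createState();
        XXH3_128bits_reset(state);
        XXH3_128bits_update(state, data, size); /* called per plane in practice */
        const XXH128_hash_t hash = XXH3_128bits_digest(state);
        printf("%016llx%016llx\n", (unsigned long long)hash.high64,
                                   (unsigned long long)hash.low64);
        XXH3_freeState(state);
    }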
e6168525 -
Martin Storsjö authored
This fixes bus errors due to missing alignment, when built with GCC 9 for arm32 with -mfpu=neon.
0a577fd2 -
Martin Storsjö authored
This silences the following warning: tools/output/xxhash.c(127): warning C4244: '=': conversion from 'unsigned long' to 'unsigned char', possible loss of data
95884615 -
We currently run 'git describe --match' to obtain the current version, but meson doesn't properly quote/escape the pattern string on Windows. As a result, "fatal: Not a valid object name .ninja_log" is printed when compiling on Windows systems. Compilation still works, but the warning is annoying and misleading. Currently we don't actually need the pattern matching functionality (which is why things still work), so simply remove it as a workaround.
69268d3a -
Additionally, reschedule the load instructions to reduce stalls on in-order cores. This applies the changes from a3b8157e to the arm32 version.
Before:                Cortex A7      A8      A9     A53     A72     A73
warp_8x8_8bpc_neon:       3659.3  1746.0  1931.9  2128.8  1173.7  1188.9
warp_8x8t_8bpc_neon:      3650.8  1724.6  1919.8  2105.0  1147.7  1206.9
warp_8x8_16bpc_neon:      4039.4  2111.9  2337.1  2462.5  1334.6  1396.5
warp_8x8t_16bpc_neon:     3973.9  2137.1  2299.6  2413.2  1282.8  1369.6
After:
warp_8x8_8bpc_neon:       2920.8  1269.8  1410.3  1767.3   860.2  1004.8
warp_8x8t_8bpc_neon:      2904.9  1283.9  1397.5  1743.7   863.6  1024.7
warp_8x8_16bpc_neon:      3895.5  2060.7  2339.8  2376.6  1331.1  1394.0
warp_8x8t_16bpc_neon:     3822.7  2026.7  2298.7  2325.4  1278.1  1360.8
0477fcf1 -
Change the order of the multiply-accumulates to allow in-order cores to forward the results.
4e869495 -
Make them operate in a more cache friendly manner, interleaving horizontal and vertical filtering (reducing the amount of stack used from 51 KB to 4 KB), similar to what was done for x86 in 78d27b7d. This also adds separate 5tap versions of the filters and unrolls the vertical filter a bit more (which maybe could have been done without doing the rewrite). This does, however, increase the compiled code size by around 3.5 KB.
Before:                     Cortex A53       A72       A73
wiener_5tap_8bpc_neon:        136855.6   91446.2   87363.6
wiener_7tap_8bpc_neon:        136861.6   91454.9   87374.5
wiener_5tap_10bpc_neon:       167685.3  114720.3  116522.1
wiener_5tap_12bpc_neon:       167677.5  114724.7  116511.9
wiener_7tap_10bpc_neon:       167681.6  114738.5  116567.0
wiener_7tap_12bpc_neon:       167673.8  114720.8  116515.4
After:
wiener_5tap_8bpc_neon:         87102.1   60460.6   66803.8
wiener_7tap_8bpc_neon:        110831.7   78489.0   82015.9
wiener_5tap_10bpc_neon:       109999.2   90259.0   89238.0
wiener_5tap_12bpc_neon:       109978.3   90255.7   89220.7
wiener_7tap_10bpc_neon:       137877.6  107578.5  103435.6
wiener_7tap_12bpc_neon:       137868.8  107568.9  103390.4
2e73051c -
Martin Storsjö authored
Before:                                     Cortex A53     A72     A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:            40.7    23.0    24.0
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:           116.0    71.5    78.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:            85.7    50.7    53.8
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:           287.0   203.5   215.2
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:         255.7   129.1   140.4
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:        1401.4  1026.7  1039.2
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:        1913.2  1407.3  1479.6
After:
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:            38.7    21.5    22.2
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:           116.0    71.3    77.2
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:            76.7    44.7    43.5
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:           278.0   203.0   203.9
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:         236.9   106.2   116.2
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:        1368.7   999.7  1008.4
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:        1880.5  1381.2  1459.4
6f9f3391 -
Emmanuel Gil Peyrot authored
Neither --buildtype=plain nor --buildtype=debug set -ffast-math, so llround() is kept as a function call and isn’t optimised out into cvttsd2siq (on amd64), thus requiring the math lib to be linked. Note that even with -ffast-math, it isn’t guaranteed that a call to llround() will always be omitted (I have reproduced this on PowerPC), so this fix is correct even if we ever decide to enable -ffast-math in other build types.
58cb4cf0 -
Henrik Gramner authored
Large stack allocations on Windows need to use stack probing in order to guarantee that all stack memory is committed before accessing it. This is done by ensuring that the guard page(s) at the end of the currently committed pages are touched prior to any pages beyond that.
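The underlying idea, expressed as a C sketch rather than the actual assembly; PAGE_SIZE and the helper are illustrative:

    #include <stddef.h>

    #define PAGE_SIZE 4096

    /* Touch the pages of a large stack buffer starting from the end nearest
     * the already-committed stack and moving toward lower addresses, so each
     * guard page is hit (and committed) before any access beyond it. */
    static void probe_stack_buffer(volatile unsigned char *buf, size_t size) {
        for (ptrdiff_t off = (ptrdiff_t)size - 1; off >= 0; off -= PAGE_SIZE)
            buf[off] = 0;
    }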
c36b191a -
Henrik Gramner authored
Split the 5x5, 3x3, and mix cases into separate functions. Shrink some tables. Move some scalar calculations out of the DSP function. Make Wiener and SGR share the same function prototype to eliminate a branch in lr_stripe().
c290c02e -
Henrik Gramner authored
The previous implementation did multiple passes in the horizontal and vertical directions, with the intermediate values being stored in buffers on the stack. This caused bad cache thrashing. By interleaving all the different passes, in combination with a ring buffer that stores only a few rows at a time, performance is improved by a significant amount. Also slightly speed up the neighbor calculations by packing the a and b values into a single 32-bit unsigned integer, which allows calculations on both values simultaneously.
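The a/b packing trick as a small sketch; variable names are illustrative, and it relies on both halves staying well within 16 bits so no carries cross between them:

    #include <stdint.h>

    /* Pack a into the high half and b into the low half; one 32-bit addition
     * then sums the a's and the b's of several neighbors simultaneously. */
    static inline uint32_t pack_ab(unsigned a, unsigned b) {
        return (uint32_t)a << 16 | b;
    }

    static inline void sum_ab3(uint32_t n0, uint32_t n1, uint32_t n2,
                               unsigned *a_sum, unsigned *b_sum) {
        const uint32_t s = n0 + n1 + n2;
        *a_sum = s >> 16;
        *b_sum = s & 0xffff;
    }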
fe2bb774 -
b44ec453
-
ecf153b1
-
Martin Storsjö authored
On Darwin, 32-bit parameters that aren't passed in registers but on the stack are packed tightly, instead of each of them occupying an 8-byte slot.
5cf45058 -
Marvin Scholz authored
b768fdbd
-
Henrik Gramner authored
Looprestoration SIMD code may overread the input buffers by a small amount. Pad the buffer to make sure this memory is valid to access.
f2967b05 -
Cache indices of non-zero coefficients during the AC token decoding loop in order to speed up the sign decoding/dequant loop later.
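The caching idea in sketch form; decode_ac_token() and dequant_with_sign() stand in for the real decoder routines:

    #include <stdint.h>

    int decode_ac_token(int i);               /* placeholder */
    int dequant_with_sign(int level, int i);  /* placeholder */

    static void decode_block(int16_t *coefs, int n_coefs) {
        uint16_t nz_idx[1024];
        int nz_count = 0;
        /* AC token loop: remember which coefficients are non-zero. */
        for (int i = 0; i < n_coefs; i++) {
            const int level = decode_ac_token(i);
            coefs[i] = (int16_t)level;
            if (level)
                nz_idx[nz_count++] = (uint16_t)i;
        }
        /* Sign/dequant loop: only visit the cached non-zero positions. */
        for (int j = 0; j < nz_count; j++) {
            const int i = nz_idx[j];
            coefs[i] = (int16_t)dequant_with_sign(coefs[i], i);
        }
    }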
a92e307f -
Not having a quantizer matrix is the most common case, so it's worth having a separate code path for it that eliminates some calculations and table lookups. Without a qm, not only can we skip calculating dq * qm, but only Exp-Golomb-coded coefficients will have the potential to overflow, so we can also skip clipping for the vast majority of coefficients.
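A rough sketch of the split; the qm scaling follows AV1's convention where 32 is unity, but the real dav1d dequant code is more involved (e.g. Exp-Golomb-coded tokens still need clipping in the no-qm path):

    #include <stdint.h>

    static inline int dequant(int tok, int dq, const uint8_t *qm, int i,
                              int cf_max) {
        if (qm) {
            const int dqs = (dq * qm[i] + 16) >> 5;  /* per-coefficient scaling */
            const int v   = tok * dqs;
            return v < -cf_max ? -cf_max : v > cf_max ? cf_max : v;
        }
        return tok * dq;  /* no qm: no scaling, ordinary tokens can't overflow */
    }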
989057fb -
5faff383
-
It's supposed to warn about const-correctness issues, but it doesn't handle arrays of pointers correctly and will cause false positive warnings when using memset() to zero such arrays for example.
d7d125f1 -
Jean-Baptiste Kempf authored
54e43f90
-
f1aa1b0e
-
1d6aae47
-
ec95ea52
-
Relative speed-ups over C code (compared with gcc-9.3.0):
                                     C     ASM
cdef_dir_16bpc_avx2:             534.2    72.5   7.36x
cdef_dir_16bpc_ssse3:            534.2   104.8   5.10x
cdef_dir_16bpc_ssse3 (x86-32):   854.1   116.2   7.35x
bfbee860 -
Jean-Baptiste Kempf authored
baa92371
-
Nathan E. Egge authored
a3c1c676
-
Nathan E. Egge authored
Relative speed-ups over C code (compared with gcc-9.3.0):
                             C      AVX2
wiener_5tap_10bpc:    194892.0   14831.9  13.14x
wiener_5tap_12bpc:    194295.4   14828.9  13.10x
wiener_7tap_10bpc:    194391.7   19461.4   9.99x
wiener_7tap_12bpc:    194136.1   19418.7  10.00x
9ca341fe -
Martin Storsjö authored
84555a44
-
Martin Storsjö authored
In these cases, the function wrote a 64 pixel wide output, regardless of the actual width.
be5200c4 -
Martin Storsjö authored
0940cb34
-
Martin Storsjö authored
This makes these instances consistent with the rest of similar cases.
bf60da6c -
Martin Storsjö authored
27cb9dad
-
Martin Storsjö authored
While these might not be needed in practice, add them for consistency.
7f5b334b -
Martin Storsjö authored
Relative speedup vs C for a few functions:
                                          Cortex A7     A8     A9    A53    A72    A73
inv_txfm_add_4x4_dct_dct_0_10bpc_neon:         2.79   5.08   2.99   2.83   3.49   4.44
inv_txfm_add_4x4_dct_dct_1_10bpc_neon:         5.74   9.43   5.72   7.19   6.73   6.92
inv_txfm_add_8x8_dct_dct_0_10bpc_neon:         3.13   3.68   2.79   3.25   3.21   3.33
inv_txfm_add_8x8_dct_dct_1_10bpc_neon:         7.09  10.41   7.00  10.55   8.06   9.02
inv_txfm_add_16x16_dct_dct_0_10bpc_neon:       5.01   6.76   4.56   5.58   5.52   2.97
inv_txfm_add_16x16_dct_dct_1_10bpc_neon:       8.62  12.48  13.71  11.75  15.94  16.86
inv_txfm_add_16x16_dct_dct_2_10bpc_neon:       6.05   8.81   6.13   8.18   7.90  12.27
inv_txfm_add_32x32_dct_dct_0_10bpc_neon:       2.90   3.90   2.16   2.63   3.56   2.74
inv_txfm_add_32x32_dct_dct_1_10bpc_neon:      13.57  17.00  13.30  13.76  14.54  17.08
inv_txfm_add_32x32_dct_dct_2_10bpc_neon:       8.29  10.54   8.05  10.68  12.75  14.36
inv_txfm_add_32x32_dct_dct_3_10bpc_neon:       6.78   8.40   7.60  10.12   8.97  12.96
inv_txfm_add_32x32_dct_dct_4_10bpc_neon:       6.48   6.74   6.00   7.38   7.67   9.70
inv_txfm_add_64x64_dct_dct_0_10bpc_neon:       3.02   4.59   2.21   2.65   3.36   2.47
inv_txfm_add_64x64_dct_dct_1_10bpc_neon:       9.86  11.30   9.14  13.80  12.46  14.83
inv_txfm_add_64x64_dct_dct_2_10bpc_neon:       8.65   9.76   7.60  12.05  10.55  12.62
inv_txfm_add_64x64_dct_dct_3_10bpc_neon:       7.78   8.65   6.98  10.63   9.15  11.73
inv_txfm_add_64x64_dct_dct_4_10bpc_neon:       6.61   7.01   5.52   8.41   8.33   9.69
b4b225d8 -
Jean-Baptiste Kempf authored