Small optimizations in prep_neon function

Why this change here? This looks suboptimal for in-order cores like A53.

For this particular change, while it might be good to move the second load earlier, the first instruction updates x1 which has a bit of latency, so for the A53, it's usually good to not do consecutive loads after updating the source register. In this case, moving it after the subs w4, w4, #2 probably is better.

While changing from ldp to ld1 can be beneficial for the reasons mentioned in the commit message, is the scheduling change intentional or unintentional here? I believe it in general would be more efficient to keep the loads where they were, unless you have numbers to the contrary.

If possible, can you get checkasm bench numbers for the change? (It does require activating userspace access to the cycle counter registers.) That usually allows measuring the impact of the scheduling changes, in particular on in-order cores.

The scheduling change is indeed intentional. The reason was to increase distance between 2nd load and instructions that use the result of that load, in original code it is about 3-4 instructions, while arm docs for A75 state that loading of 2 128 bit registers have 6 cycles latency, and since we have big pipeline bubble in begin of function (between first load and use of it result), I thought that we can move load there to increase distance. However the no difference on A53 in avg fps, so I just left it as is.

Is there any guide for checkasm bench? I've compiled and loaded module for enabling pmu, but can't make the bench itself to work. There is no time or anything like that in output, and also no prep_8tap_neon function in the list, while objdump shows proper symbols for that. It would be nice to have more precise metric, since there is no proper public info on A53 and I have to run different versions to see if its actually faster.

You can first run e.g. checkasm --test=mc_8bpc --list-functions and see which functions it prints. In this case, the mapping to checkasm function names is rather nonobvious though.

One way to easily see which one it is, is to introduce a bug into the function you want to measure, running checkasm --test=mc_8bpc will print this:

NEON:
 - mc_8bpc.mc         [OK]
   mct_8tap_regular_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_regular_smooth_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_regular_sharp_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_sharp_regular_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_sharp_smooth_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_sharp_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_smooth_regular_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_smooth_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_8tap_smooth_sharp_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
   mct_bilinear_w128_0_8bpc_neon (../tests/checkasm/mc.c:155)
 - mc_8bpc.mct        [FAILED]

So in this case, all of the mct_8tap_*_0_8bpc cases will be running the same code. For benchmarking, it actually skips most of them and only benchmarks the regular case. So you can do e.g. this to benchmark it: checkasm --test=mc_8bpc --bench --function="mct_8tap_regular_w*_0_8bpc" | grep neon

requested review from @mstorsjo

Also,

Changed ldp to ld1, since they have 2x throughput in equal conditions according to arm docs

According to which docs, for which core? The timings vary a lot between cores. (Not questioning the validity of the statement, just wanting to know which one it relates to.)

It is from A75 software optimization guide. I've tested on A53 and A73 (Odroid N2+), seems to be no impact on A53, and small, but above deviation improve on A73. I understand that on different hardware (like Apple M1) it could be opposite, but unfortunately I only have A73 and bunch of A53's.

An A73 and A53 is generally all you'll need for benchmarking here. The Apple cores have so deep out of order execution and lots of oter stuff, so that most of minor tuning like this make little to no difference at all there - and it's hard to do microbenchmarks on them.

The essential bit is primarily to see how the big cores behave, together with a little in-order core, so those cores that you have are perfect for that.

I've got a checkasm benchmarking setup where I run the benchmarks across a largeish array of various cores, to see whether tunings like this helps or makes things worse. I tested this patch on it.

Here are the numbers before:

                                 Cortex A53      A55      A72      A73      A76   Apple M1
mct_8tap_regular_w4_0_8bpc_neon:       62.8     56.4     42.7     33.1     27.8   0.1
mct_8tap_regular_w8_0_8bpc_neon:      102.8     94.7     56.8     58.3     32.9   0.1
mct_8tap_regular_w16_0_8bpc_neon:     232.4    238.7    154.4    152.9     81.1   0.3
mct_8tap_regular_w32_0_8bpc_neon:     914.7    878.7    424.2    566.6    289.2   1.0
mct_8tap_regular_w64_0_8bpc_neon:    1670.3   1688.5   1020.6   1466.5    568.5   2.4
mct_8tap_regular_w128_0_8bpc_neon:   4489.6   4320.8   3616.1   3940.3   1470.0   6.1

And after your patch:

                                 Cortex A53      A55      A72      A73      A76   Apple M1
mct_8tap_regular_w4_0_8bpc_neon:       62.8     56.8     41.8     33.5     27.7   0.1
mct_8tap_regular_w8_0_8bpc_neon:      102.3     94.6     61.6     59.8     32.8   0.0
mct_8tap_regular_w16_0_8bpc_neon:     225.7    240.5    159.3    152.1     80.7   0.3
mct_8tap_regular_w32_0_8bpc_neon:     949.9    892.2    422.9    521.6    280.3   1.1
mct_8tap_regular_w64_0_8bpc_neon:    2156.6   2101.7   1014.2   1365.1    677.9   2.5
mct_8tap_regular_w128_0_8bpc_neon:   5503.7   5493.3   3617.3   3314.6   1710.0   6.0

Those numbers (consistently, if repeated) show a large slowdown on A53 and A55, no big difference on A72. The A73 numbers seem to be too noisy across runs to make any big conclusion (due to the HW/OS setup on that one, probably). The A76 also consistently seem to show slower numbers here. And the timer we use on M1 is way too coarse to say anything conclusively for these functions.

I've managed to run benchmark, thanks for the help. Strange, but I've got quite opposite results with A73 showing very stable numbers, and A53 having significant jitter with and without patch. Unfortunately investigating into that is beyond the scope of our current tasks for now. We will keep this and other planned patches in our local repository then, since it is working for us and I guess you can close this ticket. Thanks again for help.

The jitteryness of the various cores isn't usually so dependent on the core type, but is more up to the OS and kernel and other things.

While jittery, did you see roughly the same sort of slowdown on A53 with this patch as I mentioned, or was it too jittery to say for sure?

Here are my numbers:

A53
baseline Avg: 7891.27 std dev: 420.88
patch    Avg: 7885.44 std dev: 329.14
A73
baseline Avg: 8388.78 std dev: 2.77
patch    Avg: 8388.15 std dev: 2.30

Numbers based on 10 times of mct_8tap_regular_w64_v_8bpc_neon binded to one core (w32 numbers behave in similar manner). And for A53 I actually picked a bit nicer run, since sometimes jitter even worse. Also, in the same time the _c function having way less jitter than it _neon counterpart, althought its still in same manner worse than on A73.

Regarding the initial claim about peformance boost, all measures was done on a video file (about 3 min +-) using default dav1d metric of avg fps. According to gmon this video is heavily stuck in prep_8tap_neon so this is why I did look into it. And there wasn't any jitter like that in this case.

I just noticed, that my function is different from yours I re-measured on mct_8tap_regular_w64_0_8bpc_neon

A53
baseline Avg: 1707.40 std dev: 81.86
patch    Avg: 2135.44 std dev: 82.91
A73
baseline Avg: 1422.10 std dev: 13.55
patch    Avg: 1418.12 std dev: 8.69

Althougth jitter is still here, now at least I can see slowdown.

Ok, good - yes, the w64_v function is a different codepath. Good that you see the slowdown; as you understand, such a regression on one core isn't quite acceptable unless the gains are even bigger on another core.

Since the patch does show some improvement on some cores, I could also try to see if I can find a compromise that doesn't regress the in-order cores - probably by keeping the loads where they were from the start.

The problem seems to be in different implementation of ldp and ld1 on A53 and other cores. I've just tested change of default placed ld1 in w32 to ldp and got boost from 873 -> 709

As the improvements here were inconclusive (big slowdown on A53, maybe improvement on A73 but hard to say for sure, the checkasm benchmarks didn't show much change at all), the MR as it is isn't beneficial. I propose we close this MR, and we can reopen it if we have other changes which show more of an overall gain.

Agreed.

closed

mentioned in merge request !1632 (merged)

@v.cvetkov FYI, see !1660 (merged) and !1661 (merged) for other attempts at optimizing these functions, which probably will be merged soon.

         add             x8,  x0,  w3, uxtw
 :
         ld1             {v0.16b, v1.16b},  [x1], x2
         ld1             {v2.16b, v3.16b},  [x1], x2

 :
         AARCH64_VALID_JUMP_TARGET
         add             x8,  x0,  #32
         add             x9,  x1,  #32
         mov             x6,  #64
 :
         ldp             q0,  q1,  [x1]
         ld1             {v0.16b, v1.16b},  [x1], x2
         ld1             {v2.16b, v3.16b},  [x9], x2

Small optimizations in prep_neon function

Closed by Jean-Baptiste Kempf 1 year ago (Feb 4, 2024 1:10pm UTC) 1 year ago

Activity

Small optimizations in prep_neon function

Merge request reports

Closed by Jean-Baptiste Kempf 1 year ago (Feb 4, 2024 1:10pm UTC) 1 year ago

Activity