mc_tmpl: Reduce register usage in w_mask

Sungjoon Moon requested to merge OctopusET/dav1d:w_mask_c into master

Use the distributive law to reduce register usage. This gives a performance gain of around 7% on average (about 9% at most).

Why not have fancy LaTeX?

$$
\begin{aligned}
& tmp1 \cdot m + tmp2 \cdot (64 - m) \\
&= tmp1 \cdot m + tmp2 \cdot 64 - tmp2 \cdot m \\
&= tmp1 \cdot m - tmp2 \cdot m + 64 \cdot tmp2 \\
&= (tmp1 - tmp2) \cdot m + 64 \cdot tmp2
\end{aligned}
$$
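The identity can also be sanity-checked in plain C. This is only a minimal sketch, not the code from this merge request: the helper names `blend_two_mul`, `blend_one_mul` and `count_mismatches` are made up for illustration, and it assumes the intermediates fit in 16 bits and that `m` lies in `[0, 64]`. Rounding and the final shift are left out because they would be applied identically to both forms.

```c
#include <assert.h>
#include <stdint.h>

/* Original form: two multiplies, and (64 - m) has to be kept around. */
static int blend_two_mul(const int tmp1, const int tmp2, const int m) {
    return tmp1 * m + tmp2 * (64 - m);
}

/* Rewritten form: one multiply by m; tmp2 * 64 is just a shift by 6. */
static int blend_one_mul(const int tmp1, const int tmp2, const int m) {
    return (tmp1 - tmp2) * m + tmp2 * 64;
}

/* Sweep a sample of 16-bit inputs and every mask weight 0..64,
 * returning how many inputs produce different results. */
static long count_mismatches(void) {
    long bad = 0;
    for (int m = 0; m <= 64; m++)
        for (int a = INT16_MIN; a <= INT16_MAX; a += 257)
            for (int b = INT16_MIN; b <= INT16_MAX; b += 263)
                bad += blend_two_mul(a, b, m) != blend_one_mul(a, b, m);
    return bad;
}

/* Hand-worked case: 100*16 + 20*48 = 1600 + 960 = 2560,
 * and (100 - 20)*16 + 20*64 = 1280 + 1280 = 2560. */
static void self_check(void) {
    assert(blend_two_mul(100, 20, 16) == 2560);
    assert(blend_one_mul(100, 20, 16) == 2560);
    assert(count_mismatches() == 0);
}
```

Calling `self_check()` from a test harness should pass silently. With 16-bit inputs, `(tmp1 - tmp2) * m` stays below 2^23 in magnitude, so plain `int` arithmetic does not overflow in either form.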

Tested on AMD HX370

I think we don't actually need the last commit.

Function                  |       Before |        After |    Gain % |
---------------------------------------------------------------------
w_mask_420_w4_8bpc_c      |        335.3 |        312.6 |      6.78 |
w_mask_420_w4_16bpc_c     |        354.5 |        326.4 |      7.94 |
w_mask_420_w8_8bpc_c      |       1056.4 |        979.3 |      7.30 |
w_mask_420_w8_16bpc_c     |       1068.2 |        996.4 |      6.73 |
w_mask_420_w16_8bpc_c     |       3416.1 |       3169.6 |      7.22 |
w_mask_420_w16_16bpc_c    |       3435.4 |       3218.0 |      6.34 |
w_mask_420_w32_8bpc_c     |      13479.7 |      12550.0 |      6.91 |
w_mask_420_w32_16bpc_c    |      13833.3 |      12632.7 |      8.68 |
w_mask_420_w64_8bpc_c     |      32557.6 |      30166.7 |      7.35 |
w_mask_420_w64_16bpc_c    |      32529.8 |      30407.0 |      6.54 |
w_mask_420_w128_8bpc_c    |      81802.8 |      75856.5 |      7.27 |
w_mask_420_w128_16bpc_c   |      81187.8 |      76133.9 |      6.23 |
w_mask_422_w4_8bpc_c      |        331.3 |        327.1 |      1.27 |
w_mask_422_w4_16bpc_c     |        365.1 |        341.2 |      6.53 |
w_mask_422_w8_8bpc_c      |       1052.7 |       1003.5 |      4.68 |
w_mask_422_w8_16bpc_c     |       1095.9 |       1022.6 |      6.69 |
w_mask_422_w16_8bpc_c     |       3479.8 |       3248.8 |      6.67 |
w_mask_422_w16_16bpc_c    |       3504.2 |       3279.5 |      6.41 |
w_mask_422_w32_8bpc_c     |      13702.5 |      12801.4 |      6.58 |
w_mask_422_w32_16bpc_c    |      13738.9 |      12830.5 |      6.61 |
w_mask_422_w64_8bpc_c     |      32517.9 |      30818.0 |      5.23 |
w_mask_422_w64_16bpc_c    |      33199.4 |      30865.3 |      7.03 |
w_mask_422_w128_8bpc_c    |      82867.1 |      77978.7 |      5.90 |
w_mask_422_w128_16bpc_c   |      84937.9 |      77629.8 |      8.60 |
w_mask_444_w4_8bpc_c      |        340.4 |        315.6 |      7.28 |
w_mask_444_w4_16bpc_c     |        361.6 |        335.0 |      7.35 |
w_mask_444_w8_8bpc_c      |       1057.6 |        988.9 |      6.50 |
w_mask_444_w8_16bpc_c     |       1104.3 |       1030.8 |      6.67 |
w_mask_444_w16_8bpc_c     |       3414.4 |       3180.7 |      6.85 |
w_mask_444_w16_16bpc_c    |       3477.4 |       3182.4 |      8.48 |
w_mask_444_w32_8bpc_c     |      13455.8 |      12469.4 |      7.33 |
w_mask_444_w32_16bpc_c    |      13666.9 |      12378.8 |      9.42 |
w_mask_444_w64_8bpc_c     |      33587.2 |      31239.7 |      7.00 |
w_mask_444_w64_16bpc_c    |      34283.3 |      30969.5 |      9.67 |
w_mask_444_w128_8bpc_c    |      82084.2 |      76206.3 |      7.16 |
w_mask_444_w128_16bpc_c   |      82649.4 |      75166.4 |      8.91 |
---------------------------------------------------------------------
avg                       |            - |            - |      6.95 |
Edited by Sungjoon Moon