-
Remove half of the masks, since they are only used for cdef at an 8x8 level of granularity. Load the mask and combine its 16-bit sections into 32-bit sections outside of the inner cdef loop, which should save some registers. This results in mild performance improvements.
0bd57c6b
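
As a rough illustration of the change described above, the sketch below merges two 16-bit mask sections into one 32-bit value once, before the inner loop, so the loop body only tests bits of a single value. The names `combine_mask_halves` and `count_set_units`, the two-element array layout, and the one-bit-per-unit interpretation are all assumptions made for this example, not the actual code from the commit.

```c
#include <stdint.h>

/* Hypothetical helper: merge the two 16-bit sections of one mask row
 * into a single 32-bit value (low section in bits 0-15, high section
 * in bits 16-31). The storage layout is an assumption. */
static uint32_t combine_mask_halves(const uint16_t halves[2])
{
    return ((uint32_t)halves[1] << 16) | halves[0];
}

/* Hypothetical inner loop: the combined mask is computed once, outside
 * the loop, instead of reloading and merging the 16-bit sections on
 * every iteration. Counting set bits stands in for the per-block
 * filtering work. n_units must be at most 32. */
static int count_set_units(const uint16_t halves[2], int n_units)
{
    const uint32_t mask = combine_mask_halves(halves); /* hoisted */
    int n = 0;

    for (int bx = 0; bx < n_units; bx++) {
        if (mask & (UINT32_C(1) << bx))
            n++; /* a real loop would filter this unit here */
    }
    return n;
}
```

Keeping a single 32-bit value live across the loop, rather than two 16-bit values plus a temporary, is the kind of saving the message attributes to the change.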