This is using a slightly adapted version of my GPU-based algorithm. The
major difference to the algorithm suggested by the spec (and implemented
in libaom) is that instead of using a line buffer to hold the previous
row's film grain blocks, we compute each row/block fully independently.
This opens the door to exploiting parallelism in the future, since we
don't have any left->right or top->bottom dependency except for the PRNG
state. (Which we could pre-compute for a massively parallel / GPU
implementation.)

That being said, it's probably somewhat slower than using a line buffer
for the serial / single CPU case, although most likely not by much
(since the areas with the most redundant work get progressively smaller,
down to a single 2x2 square for the worst case).
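
To illustrate the PRNG point above, here is a minimal sketch of how the
per-block PRNG state could be pre-computed so that blocks no longer
depend on each other. It is not the code from this commit: the function
names are hypothetical, and the 16-bit LFSR taps and the single 8-bit
draw per block are assumptions made for the example.

```c
#include <stdint.h>
#include <stddef.h>

/* Advance a 16-bit LFSR and return its top `bits` bits. The tap positions
 * are an assumption for this sketch, not taken from the text above. */
static unsigned lfsr_next(uint16_t *state, int bits)
{
    const unsigned r = *state;
    const unsigned bit = ((r >> 0) ^ (r >> 1) ^ (r >> 3) ^ (r >> 12)) & 1;
    *state = (uint16_t)((r >> 1) | (bit << 15));
    return (*state >> (16 - bits)) & ((1u << bits) - 1);
}

/* Serial reference: one draw per block, each depending on the previous. */
static void offsets_serial(uint16_t seed, unsigned *out, size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks; i++)
        out[i] = lfsr_next(&seed, 8);
}

/* Cheap serial pre-pass: record the PRNG state each block starts from. */
static void precompute_states(uint16_t seed, uint16_t *states,
                              size_t n_blocks)
{
    for (size_t i = 0; i < n_blocks; i++) {
        states[i] = seed;
        lfsr_next(&seed, 8);
    }
}

/* Independent per-block work: needs only the block's own saved state, so
 * blocks may be processed in any order or concurrently. */
static unsigned block_offset(const uint16_t *states, size_t i)
{
    uint16_t s = states[i];
    return lfsr_next(&s, 8);
}
```

The pre-pass is still serial, but it only steps the generator; the
expensive per-block work can then be dispatched in any order, which is
what a massively parallel implementation would need.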