This uses a slightly adapted version of my GPU-based algorithm. The major difference from the algorithm suggested by the spec (and implemented in libaom) is that instead of using a line buffer to hold the previous row's film grain blocks, we compute each row/block fully independently. This opens the door to exploiting parallelism in the future, since there is no left->right or top->bottom dependency except for the PRNG state (which could be pre-computed for a massively parallel / GPU implementation). That being said, it's probably somewhat slower than using a line buffer in the serial / single-CPU case, although most likely not by much, since the areas with the most redundant work get progressively smaller, down to a single 2x2 square in the worst case.
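
As a rough illustration of the PRNG pre-computation idea mentioned above (not the code from this commit), the sketch below records the PRNG state at the start of every block in one cheap serial pass, after which each block can be generated in any order. The generic 16-bit LFSR, the names `lfsr_next`, `precompute_states` and `block_offsets`, and the constants `NUM_BLOCKS` and `DRAWS_PER_BLOCK` are all assumptions made for the example.

```c
#include <stdint.h>

#define NUM_BLOCKS      8 /* hypothetical number of blocks in a row */
#define DRAWS_PER_BLOCK 2 /* hypothetical PRNG draws consumed per block */

/* Advance a generic 16-bit LFSR (stand-in for the real PRNG) and
 * return its top `bits` bits. */
static unsigned lfsr_next(uint16_t *state, int bits)
{
    const unsigned r = *state;
    const unsigned bit = (r ^ (r >> 1) ^ (r >> 3) ^ (r >> 12)) & 1;
    *state = (uint16_t)((r >> 1) | (bit << 15));
    return (*state >> (16 - bits)) & ((1u << bits) - 1);
}

/* Serial pass: record the state each block starts from. This is the
 * only step that carries a left->right dependency. */
static void precompute_states(uint16_t seed, uint16_t states[NUM_BLOCKS])
{
    uint16_t s = seed;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        states[i] = s;
        for (int d = 0; d < DRAWS_PER_BLOCK; d++)
            lfsr_next(&s, 8);
    }
}

/* Per-block work: depends only on the block's own starting state, so
 * blocks may run in any order (or in parallel). */
static void block_offsets(uint16_t state, unsigned out[DRAWS_PER_BLOCK])
{
    for (int d = 0; d < DRAWS_PER_BLOCK; d++)
        out[d] = lfsr_next(&state, 8);
}
```

Calling `block_offsets(states[i], out)` for the blocks in reverse order yields the same values as drawing from the PRNG strictly left to right, which is the property a parallel implementation would rely on.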
cfa986fe