WIP: Implement prefetch of mvs
32 fps -> 33.5 fps (about 4.7% faster) on Cortex-A55
I don't expect this to help on out-of-order cores. I would like to see benchmark results from them to check whether it hurts performance too much.
TODO:
- Investigate doing multiple rows
- Optimize for screen edges?
- Benchmark on other cores
- Shorten filter length for 4-tap
- Make Arm-specific
- Limit prefetch for large motion vectors
- Prefetch for warped motion and rescale
- More???
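
For discussion, here is a rough sketch of the kind of per-block prefetch this is about. It is not the code in this patch. The names `PrefetchRegion`, `mv_prefetch_region`, `prefetch_region` and `CACHE_LINE` are hypothetical, and it assumes 8-bit pixels, motion vectors in 1/8-pel units, a 64-byte cache line, and the GCC/Clang `__builtin_prefetch` builtin. The region is clamped to the frame, and the `taps` parameter shows where a shorter filter would shrink the prefetched area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical helper type: area of the reference frame to prefetch. */
typedef struct {
    int x0, y0; /* top-left corner, in pixels */
    int w, h;   /* size, in pixels (0 if fully outside the frame) */
} PrefetchRegion;

static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Floor division by 8 that does not rely on right-shifting negative values. */
static int floor_div8(int v) {
    return v >= 0 ? v / 8 : -((-v + 7) / 8);
}

/* Compute the reference area a block of bw x bh pixels at (bx, by) reads
 * for motion vector (mv_x, mv_y), assumed to be in 1/8-pel units.
 * A filter with `taps` taps needs taps/2 - 1 pixels before and taps/2
 * pixels after the block in each direction. */
static PrefetchRegion mv_prefetch_region(int bx, int by, int bw, int bh,
                                         int mv_x, int mv_y, int taps,
                                         int frame_w, int frame_h)
{
    assert(taps >= 2 && taps % 2 == 0);
    const int before = taps / 2 - 1;
    const int after = taps / 2;
    const int rx = bx + floor_div8(mv_x);
    const int ry = by + floor_div8(mv_y);

    /* Clamp to the frame so no address outside the picture is touched. */
    const int x0 = clamp_int(rx - before, 0, frame_w);
    const int y0 = clamp_int(ry - before, 0, frame_h);
    const int x1 = clamp_int(rx + bw + after, 0, frame_w); /* exclusive */
    const int y1 = clamp_int(ry + bh + after, 0, frame_h); /* exclusive */

    PrefetchRegion r = { x0, y0, x1 - x0, y1 - y0 };
    return r;
}

/* Issue read prefetches for every row of the region.
 * Returns the number of prefetch hints issued. */
static int prefetch_region(const uint8_t *ref, ptrdiff_t stride,
                           PrefetchRegion r)
{
    int issued = 0;
    if (r.w <= 0 || r.h <= 0)
        return 0;
    for (int y = 0; y < r.h; y++) {
        const uint8_t *row = ref + (ptrdiff_t)(r.y0 + y) * stride + r.x0;
        for (int x = 0; x < r.w; x += CACHE_LINE) {
            __builtin_prefetch(row + x, 0, 3); /* read, keep in cache */
            issued++;
        }
        /* The row may straddle one more line when it is not aligned. */
        __builtin_prefetch(row + r.w - 1, 0, 3);
        issued++;
    }
    return issued;
}
```

A 16x16 block with an 8-tap filter gives a 23x23 region here (16 + 7 in each direction), so the cost per block is a few dozen prefetch hints. Limiting the prefetch for large motion vectors would be an extra check before calling `prefetch_region`.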