arm64: ipred: Add NEON implementation of ipred for 16 bpc
This also contains a number of minor fixups for the existing 8 bpc code.
Only the FILTER_PRED function is special-cased separately for 10 and 12 bit; this avoids adding a bpc parameter to the init function, which would essentially treat all functions as potentially different.
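
As a rough illustration of the FILTER_PRED special-casing (a sketch, not the code from this patch), a single 16 bpc entry point could choose between a 10 bit and a 12 bit variant at call time. All function names, the signatures, and the use of a bitdepth_max argument for the selection are assumptions made for this example; the placeholder bodies stand in for the real assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-bitdepth assembly entry points.
 * The names and the trivial bodies are illustrative only. */
static int filter_pred_10bpc(uint16_t *dst, int w) {
    for (int i = 0; i < w; i++)
        dst[i] = 0x3ff; /* placeholder: largest 10 bit pixel value */
    return 10;
}

static int filter_pred_12bpc(uint16_t *dst, int w) {
    for (int i = 0; i < w; i++)
        dst[i] = 0xfff; /* placeholder: largest 12 bit pixel value */
    return 12;
}

/* Single 16 bpc entry point. It selects the variant from bitdepth_max
 * (an assumed argument) on each call, so the init function can register
 * this one pointer without taking a bpc parameter. */
static int filter_pred_16bpc(uint16_t *dst, int w, int bitdepth_max) {
    assert(bitdepth_max == 0x3ff || bitdepth_max == 0xfff);
    if (bitdepth_max == 0x3ff)
        return filter_pred_10bpc(dst, w);
    return filter_pred_12bpc(dst, w);
}
```

With this shape, every other ipred function keeps one 16 bpc implementation, and only this wrapper knows about the 10/12 bit split.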