Phenom CPU optimizations
Faster hpel_filter by using unaligned loads instead of emulated PALIGNR Faster hpel_filter on 64-bit by using the 32-bit version (the cost of emulated PALIGNR is high enough that the savings from caching intermediate values is not worth it). Add support for misaligned_mask on Phenom: ~2% faster hpel_filter, ~4% faster width16 multisad, 7% faster width20 get_ref. Replace width12 mmx with width16 sse on Phenom and Nehalem: 32% faster width12 get_ref on Phenom. Merge cpu-32.asm and cpu-64.asm Thanks to Easy123 for contributing a Phenom box for a weekend so I could write these optimizations.
Showing with 189 additions and 68 deletions