Faster bidir_rd plus some bugfixes
Cache chroma MC during refine_bidir_rd and use both the luma and chroma caches to skip MC in macroblock_encode. Fix incorrect call to rd_cost_part; refine_bidir_rd output was incorrect for i8>0. Remove some redundant clips. ~12% faster refine_bidir_rd.
Showing with 58 additions and 30 deletions