1. 19 Nov, 2010 2 commits
    • Convert X264_HIGH_BIT_DEPTH to HIGH_BIT_DEPTH · 1382552b
      Oskar Arvidsson authored
      Less verbose.
    • x86 asm for high-bit-depth pixel metrics · abde94f6
      Oskar Arvidsson authored
      Overall speed change from these 6 asm patches: ~4.4x.
      But there's still tons more asm to do -- patches welcome!
      
      Breakdown from this patch:
      ~13x faster SAD than C.
      ~11.5x faster SATD than C (only MMX done).
      ~18.5x faster SA8D than C.
      ~19.2x faster hadamard_ac than C.
      ~8.3x faster SSD than C.
      ~12.4x faster VAR than C.
      ~3-4.2x faster intra SAD than C.
      ~7.9x faster intra SATD than C.
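For reference, the kind of C routine these asm versions are measured against can be sketched as follows. This is a minimal illustration of SAD (sum of absolute differences) on 16-bit samples, not the actual x264 source; the `pixel` typedef and the name `sad_wxh` are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: in a high-bit-depth build each sample is stored in 16 bits. */
typedef uint16_t pixel;

/* Hypothetical reference SAD over a w x h block; strides are in pixels. */
static int sad_wxh(const pixel *a, ptrdiff_t stride_a,
                   const pixel *b, ptrdiff_t stride_b, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            sum += abs(a[x] - b[x]); /* operands promote to int before subtracting */
        a += stride_a;
        b += stride_b;
    }
    return sum;
}
```

A SIMD version computes the same result but processes several samples per instruction, which is where speedups of the order listed above would come from.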
  2. 31 Oct, 2010 2 commits
  3. 18 Sep, 2010 1 commit
    • Update source file headers · 213a99d0
      Fiona Glaser authored
      Update dates, improve file descriptions, make things more consistent.
      Also add information about commercial licensing.
  4. 15 Jul, 2010 1 commit
    • Convert x264 to use NV12 pixel format internally · 387828ed
      Loren Merritt authored
      ~1% faster overall on Conroe, mostly due to improved cache locality.
      Also allows improved SIMD on some chroma functions (e.g. deblock).
      This change also extends the API to allow direct NV12 input, which should be a bit faster than YV12.
      This isn't currently used in the x264cli, as swscale does not have fast NV12 conversion routines, but it might be useful for other applications.
      
      Note this patch disables the chroma SIMD code for PPC and ARM until new versions are written.
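To make the layout difference concrete: YV12/I420 keep U and V in two separate planes, while NV12 stores them interleaved (U, V, U, V, ...) in a single plane. The sketch below shows that conversion in plain C; the function name and signature are hypothetical and are not taken from x264.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: interleave planar U and V into one NV12 chroma plane.
 * chroma_w and chroma_h are the dimensions of each source chroma plane;
 * dst_stride is in bytes and must be at least 2 * chroma_w. */
static void interleave_chroma(uint8_t *dst_uv, ptrdiff_t dst_stride,
                              const uint8_t *src_u, const uint8_t *src_v,
                              ptrdiff_t src_stride, int chroma_w, int chroma_h)
{
    for (int y = 0; y < chroma_h; y++) {
        for (int x = 0; x < chroma_w; x++) {
            dst_uv[2 * x]     = src_u[x]; /* NV12 puts U first */
            dst_uv[2 * x + 1] = src_v[x]; /* then V */
        }
        dst_uv += dst_stride;
        src_u  += src_stride;
        src_v  += src_stride;
    }
}
```

With both chroma components adjacent in memory, a function that touches U and V together reads one contiguous region instead of two, which is consistent with the cache-locality gain described above.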
  5. 04 Jul, 2010 1 commit
    • Support for 9 and 10-bit encoding · c91f43a4
      Oskar Arvidsson authored
      Output bit depth is specified at compile time via --bit-depth.
      There is currently almost no assembly code available for high-bit-depth modes, so encoding will be very slow.
      Input is still 8-bit only; this will change in the future.
      
      Note that very few H.264 decoders support >8 bit depth currently.
      Also note that the quantizer scale differs for higher bit depth.  For example, for 10-bit, the quantizer (and crf) ranges from 0 to 63 instead of 0 to 51.
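The quantizer ranges quoted above follow from the H.264 rule that each extra bit of sample depth extends the QP range by 6. A minimal sketch (the function name is hypothetical):

```c
/* Hypothetical helper: maximum quantizer for a given output bit depth.
 * Each bit above 8 adds 6 to the 8-bit maximum of 51. */
static int qp_max_for_bit_depth(int bit_depth)
{
    return 51 + 6 * (bit_depth - 8);
}
```

This gives 51 for 8-bit, 57 for 9-bit, and 63 for 10-bit, matching the 0 to 63 range mentioned in the commit message.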
  6. 25 Jun, 2010 1 commit
  7. 09 Jun, 2010 3 commits
  8. 02 Jun, 2010 1 commit
  9. 26 May, 2010 1 commit
    • Detect Atom CPU, enable appropriate asm functions · 57729402
      Fiona Glaser authored
      I'm not going to actually optimize for this pile of garbage unless someone pays me.
      But it can't hurt to at least enable the correct functions based on benchmarks.
      
      Also save some cache on Intel CPUs that don't need the decimate LUT due to having fast bsr/bsf.
  10. 05 Apr, 2010 1 commit
    • Massive cosmetic and syntax cleanup · 58d2349d
      Fiona Glaser authored
      Convert all applicable loops to use C99 loop index syntax.
      Clean up most inconsistent syntax in ratecontrol.c, visualize, ppc, etc.
      Replace log(x)/log(2) constructs with log2, and similar with log10.
      Fix all -Wshadow violations.
      Fix visualize support.
  11. 25 Feb, 2010 1 commit
  12. 17 Nov, 2009 1 commit
    • Faster weightp analysis · 63f71477
      Fiona Glaser authored
      Modify pixel_var slightly to return the necessary information and use it for weight analysis instead of sad/ssd.
      Various minor cosmetics.
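One way a variance routine can "return the necessary information" is to hand back the raw sum and sum of squares rather than only the final variance, so a caller such as weight analysis can derive mean and variance itself. The sketch below packs both values into one 64-bit result; the packing layout and the names are assumptions for illustration, not a statement of what this commit does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16x16 variance helper: returns the pixel sum in the low
 * 32 bits and the sum of squares in the high 32 bits. */
static uint64_t var_16x16(const uint8_t *pix, ptrdiff_t stride)
{
    uint32_t sum = 0, sqr = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            sum += pix[x];
            sqr += pix[x] * pix[x];
        }
        pix += stride;
    }
    return sum + ((uint64_t)sqr << 32);
}

/* Unpack and compute sqr - sum*sum/N, where N = 1 << shift pixels
 * (shift = 8 for a 16x16 block). The result is N times the variance. */
static uint32_t scaled_variance(uint64_t packed, int shift)
{
    uint32_t sum = (uint32_t)packed;
    uint32_t sqr = (uint32_t)(packed >> 32);
    return sqr - (uint32_t)(((uint64_t)sum * sum) >> shift);
}
```

A caller that needs the block mean reads the low half and divides by the pixel count; no second pass over the pixels is required.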
  13. 24 Aug, 2009 1 commit
  14. 03 Jul, 2009 1 commit
    • Early termination for chroma encoding · 205a032c
      Fiona Glaser authored
      Faster chroma encoding by terminating early if heuristics indicate that the block will be DC-only.
      This works because the vast majority of inter chroma blocks have no coefficients at all, and those that do are almost always DC-only.
      Add two new helper DSP functions for this: dct_dc_8x8 and var2_8x8.  mmx/sse2/ssse3 versions of each.
      Early termination is disabled at very low QPs due to it not being useful there.
      Performance increase is ~1-2% without trellis, up to 5-6% with trellis=2.
      Increase is greater with lower bitrates.
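The idea behind a variance-based early-out can be sketched as follows: if the residual (source minus prediction) of an 8x8 chroma block is nearly constant, only its DC term can survive quantization, so the full transform can be skipped. The code below is an illustrative guess at what a helper like var2_8x8 might compute; its signature and the returned quantities are assumptions, and the actual decision threshold is not specified in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: variance-like measure of the residual of an 8x8 block.
 * Stores the residual SSD in *ssd and returns sqr - sum*sum/64, which is
 * zero when the residual is perfectly flat (DC-only). */
static int var2_8x8(const uint8_t *src, ptrdiff_t src_stride,
                    const uint8_t *pred, ptrdiff_t pred_stride, int *ssd)
{
    int sum = 0, sqr = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int d = src[x] - pred[x];
            sum += d;
            sqr += d * d;
        }
        src  += src_stride;
        pred += pred_stride;
    }
    *ssd = sqr;
    return sqr - ((sum * sum) >> 6); /* 64 pixels, so divide by 64 */
}
```

An encoder could compare the returned value against a QP-dependent threshold (a hypothetical parameter here) and, when it is small enough, code only the DC coefficient.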
  15. 31 Mar, 2009 1 commit
  16. 30 Mar, 2009 2 commits
  17. 17 Mar, 2009 1 commit
    • SSE2 zigzag_interleave · d25d50c9
      Fiona Glaser authored
      Replace PHADD with FastShuffle (more accurate naming).
      This flag represents asm functions that rely on fast SSE2 shuffle units, and thus are only faster on Phenom, Nehalem, and Penryn CPUs.
  18. 07 Mar, 2009 1 commit
    • Vastly faster SATD/SA8D/Hadamard_AC/SSD/DCT/IDCT · 54e38917
      Holger Lubitz authored
      Heavily optimized for Core 2 and Nehalem, but performance should improve on all modern x86 CPUs.
      16x16 SATD: +18% speed on K8(64bit), +22% on K10(32bit), +42% on Penryn(64bit), +44% on Nehalem(64bit), +50% on P4(32bit), +98% on Conroe(64bit)
      Similar performance boosts in SATD-like functions (SA8D, hadamard_ac) and somewhat less in DCT/IDCT/SSD.
      Overall performance boost is up to ~15% on 64-bit Conroe.
  19. 04 Mar, 2009 1 commit
    • Slightly faster 8x16 SAD on Penryn Core 2 · b77ea4db
      Fiona Glaser authored
      Same as MMX 8x16 cacheline SAD, but calls SSE2 8x16 SAD in non-cacheline case.
      Only Nehalem benefits from sizes smaller than 8x16, and Nehalem doesn't use cacheline functions, so no smaller versions are included.
  20. 11 Feb, 2009 1 commit
  21. 28 Jan, 2009 1 commit
  22. 31 Dec, 2008 1 commit
  23. 24 Dec, 2008 1 commit
    • Optimize variance asm + minor changes · 9fe6e5e6
      Fiona Glaser authored
      Remove SAD argument from var, not needed anymore.
      Speed up var asm a bit by eliminating psadbw and instead HADDWing at end.
      Eliminate all remaining warnings on gcc 3.4 on cygwin
      Port another minor optimization from lavc (pskip)
  24. 25 Nov, 2008 1 commit
    • Faster width4 SSD+SATD, SSE4 optimizations · 69e69197
      Fiona Glaser authored
      Do satd 4x8 by transposing the two blocks' positions and running satd 8x4.
      Use pinsrd (SSE4) for faster width4 SSD
      Globally replace movlhps with punpcklqdq (it seems to be faster on Conroe)
      Move mask_misalign declaration to cpu.h to avoid warning in encoder.c.
      These optimizations help on Nehalem, Phenom, and Penryn CPUs.
  25. 23 Nov, 2008 1 commit
    • Phenom CPU optimizations · 80ea99c0
      Fiona Glaser authored
      Faster hpel_filter by using unaligned loads instead of emulated PALIGNR
      Faster hpel_filter on 64-bit by using the 32-bit version (the cost of emulated PALIGNR is high enough that the savings from caching intermediate values is not worth it).
      Add support for misaligned_mask on Phenom: ~2% faster hpel_filter, ~4% faster width16 multisad, 7% faster width20 get_ref.
      Replace width12 mmx with width16 sse on Phenom and Nehalem: 32% faster width12 get_ref on Phenom.
      Merge cpu-32.asm and cpu-64.asm
      Thanks to Easy123 for contributing a Phenom box for a weekend so I could write these optimizations.
  26. 13 Nov, 2008 1 commit
  27. 10 Nov, 2008 1 commit
    • Various cosmetics and minor fixes · ae51235d
      Fiona Glaser authored
      Disable hadamard_ac sse2/ssse3 under stack_mod4
      Fix one MSVC compilation warning
      Fix compilation in debug mode in certain cases on x64
      Remove eval.c from MSVC project
      Fix crash when VBV is used in CQP mode
      Patches by MasterNobody
  28. 15 Sep, 2008 1 commit
  29. 05 Sep, 2008 2 commits
  30. 16 Aug, 2008 1 commit
  31. 04 Jul, 2008 1 commit
    • Update file headers throughout x264 · bdbd4fe7
      Fiona Glaser authored
      Update "Authors" lists based on actual authorship; highest is most important
      Update copyright notices and remove old CVS tags from file headers
      Add file headers to GTK and other sections missing them
      Update FSF address
      Other header-related cosmetics
  32. 18 Jun, 2008 1 commit
  33. 08 Jun, 2008 2 commits
    • many changes to which asm functions are enabled on which cpus. · c0c0e1f4
      Loren Merritt authored
      with Phenom, 3dnow is no longer equivalent to "sse2 is slow", so make a new flag for that.
      some sse2 functions are useful only on Core2 and Phenom, so make a "sse2 is fast" flag for that.
      some ssse3 instructions didn't become useful until Penryn, so yet another flag.
      disable sse2 completely on Pentium M and Core1, because it's uniformly slower than mmx.
      enable some sse2 functions on Athlon64 that always were faster and we just didn't notice.
      remove mc_luma_sse3, because the only cpu that has lddqu (namely Pentium 4D) doesn't have "sse2 is fast".
      don't print mmx1, sse1, nor 3dnow in the detected cpuflags, since we don't really have any such functions. likewise don't print sse3 unless it's used (Pentium 4D).
    • enable ssse3 phadd satd on Penryn. · f9ad5ee2
      Loren Merritt authored