1. 23 Apr, 2013 12 commits
    • x86-64: cabac_block_residual assembly · a3f5c732
      Fiona Glaser authored
      RDO: ~20% faster than C
      Bitstream: ~50% faster than C
      1-2% faster overall, highest on preset superfast/fast/medium.
    • OpenCL lookahead · f49a1b2e
      Steve Borho authored
      OpenCL support is compiled in by default, but must be enabled at runtime by an
      --opencl command line flag. Compiling OpenCL support requires perl. To avoid
      the perl requirement use: configure --disable-opencl.
      When enabled, the lookahead thread is mostly off-loaded to an OpenCL capable GPU
      device.  Lowres intra cost prediction, lowres motion search (including subpel)
      and bidir cost predictions are all done on the GPU.  MB-tree and final slice
      decisions are still done by the CPU.  Presets which do not use a threaded
      lookahead will not use OpenCL at all (superfast, ultrafast).
      Because of data dependencies, the GPU must use an iterative motion search which
      performs more total work than the CPU would do, so this is not work efficient
      or power efficient. But if there are GPU cycles to spare, it can often
      speed up the encode. Output quality with OpenCL lookahead enabled is often
      very slightly worse than with the CPU lookahead (because of the same data
      dependencies).
      x264 must compile its OpenCL kernels for your device before running them, and in
      order to avoid doing this every run it caches the compiled kernel binary in a
      file named x264_lookahead.clbin (--opencl-clbin FNAME to override).  The cache
      file will be ignored if the device, driver, or OpenCL source are changed.
      x264 will use the first GPU device which supports the cl_image
      features required by its kernels. Most modern discrete GPUs and all AMD
      integrated GPUs will work.  Intel integrated GPUs (up to IvyBridge) do not
      support those necessary features. Use --opencl-device N to specify a number of
      capable GPUs to skip during device detection.
      Switchable graphics environments (e.g. AMD Enduro) are currently not supported,
      as some have bugs in their OpenCL drivers that cause output to be silently
      corrupted.
      Developed by MulticoreWare with support from AMD and Telestream.
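The device selection rule described above (take the first capable GPU, optionally skipping N capable ones via --opencl-device) can be sketched roughly as follows. This is a minimal illustration, not x264's code: the `devices` list and its `is_gpu` / `has_image_support` fields are hypothetical stand-ins for whatever the OpenCL runtime reports.

```python
def pick_device(devices, skip=0):
    """Return the first capable GPU after skipping `skip` capable ones, else None.

    `devices` is a hypothetical list of dicts with boolean fields
    "is_gpu" and "has_image_support" (assumed names, for illustration only).
    """
    for dev in devices:
        # Devices that are not GPUs or lack image support are never candidates.
        if not (dev["is_gpu"] and dev["has_image_support"]):
            continue
        # Only capable devices count toward the skip budget.
        if skip > 0:
            skip -= 1
            continue
        return dev
    return None


example_devices = [
    {"name": "cpu", "is_gpu": False, "has_image_support": True},
    {"name": "igpu", "is_gpu": True, "has_image_support": False},
    {"name": "gpu0", "is_gpu": True, "has_image_support": True},
    {"name": "gpu1", "is_gpu": True, "has_image_support": True},
]
```

With these example devices, `pick_device(example_devices)` selects `gpu0`, and `pick_device(example_devices, skip=1)` selects `gpu1`; incapable devices do not count as skipped.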
    • weightp: improve scale/offset search, chroma · 2d0c47a5
      Fiona Glaser authored
      Rescale the scale factor if the offset clips. This makes weightp more effective
      in fades to/from white (and any other situation that requires big offsets).
      Search more than 1 scale factor and more than 1 offset, depending on --subme.
      Try to find the optimal chroma denominator instead of hardcoding it.
      Overall improvement: a few percent in fade-heavy clips, such as a sample from
      Avatar: TLA.
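The idea of rescaling the scale factor when the offset clips can be sketched as below. This is an illustration under stated assumptions, not the commit's code: the function name, the `scale_ratio` input, the 0..127 scale range, and the -128..127 offset range (8-bit) are assumptions made for the example, and `ref_mean` is assumed to be positive.

```python
def fit_weight(fenc_mean, ref_mean, scale_ratio, denom=6):
    """Pick integer (scale, offset) so ref * scale / 2**denom + offset ~ fenc.

    Assumed ranges for this sketch: scale in 0..127, offset in -128..127.
    """
    one = 1 << denom
    scale = min(max(round(scale_ratio * one), 0), 127)
    offset = round(fenc_mean - ref_mean * scale / one)
    if not -128 <= offset <= 127:
        # The offset clips: keep the clamped offset and re-derive the scale
        # so that the weighted reference mean still matches the target mean.
        offset = min(max(offset, -128), 127)
        scale = min(max(round(one * (fenc_mean - offset) / ref_mean), 0), 127)
    return scale, offset
```

For a fade toward white with a target mean of 200 and a reference mean of 40, an unscaled weight would need an offset of 160, which does not fit. The sketch clamps the offset to 127 and raises the scale from 64 to 117, so the weighted mean stays close to 200.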
    • Add slices-max feature · 732e4f7e
      Fiona Glaser authored
      The H.264 spec technically has limits on the number of slices per frame. x264
      normally ignores this, since most use-cases that require large numbers of
      slices prefer it to. However, certain decoders may break with extremely large
      numbers of slices, as can occur with some slice-max-size/mbs settings.
      When set, x264 will refuse to create any slices beyond the maximum number,
      even if slice-max-size/mbs requires otherwise.
    • Add slice-min-mbs feature · fdfffa30
      Fiona Glaser authored
      Works in conjunction with slice-max-mbs and/or slice-max-size to avoid overly
      small slices.
      Useful with certain decoders that barf on extremely small slices.
      If slice-min-mbs would be violated as a result of slice-max-size, x264 will
      exceed slice-max-size and print a warning.
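How slice-max-mbs, slice-min-mbs and slices-max might interact can be sketched for the macroblock-count case as follows. This is a simplified model, not x264's slicing code: the function and parameter names are hypothetical, slice-max-size (a byte limit) is not modelled, and the exact tie-breaking in the encoder may differ.

```python
def split_into_slices(total_mbs, max_mbs, min_mbs=0, slices_max=0):
    """Return (first_mb, last_mb) pairs covering macroblocks 0..total_mbs-1.

    Assumes 1 <= max_mbs and min_mbs <= max_mbs; 0 disables min_mbs/slices_max.
    """
    if max_mbs < 1 or min_mbs > max_mbs:
        raise ValueError("need 1 <= max_mbs and min_mbs <= max_mbs")
    slices = []
    first = 0
    last_frame_mb = total_mbs - 1
    while first <= last_frame_mb:
        last = min(first + max_mbs - 1, last_frame_mb)
        # If the leftover after this slice would be too small, end this
        # slice early so the next one gets at least min_mbs macroblocks.
        remaining = last_frame_mb - last
        if 0 < remaining < min_mbs:
            last = last_frame_mb - min_mbs
        # Once the slice budget is reached, the final slice takes everything
        # that is left, even if that exceeds max_mbs.
        if slices_max and len(slices) + 1 == slices_max:
            last = last_frame_mb
        slices.append((first, last))
        first = last + 1
    return slices
```

With 10 macroblocks and a maximum of 4 per slice, the plain split ends in a 2-macroblock slice. Setting `min_mbs=3` shortens the middle slice instead, and setting `slices_max=2` lets the second slice absorb the remainder.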
    • Disable mbtree asm with cpu-independent option · 8a3a41de
      Anton Mitrofanov authored
      Results vary between asm versions because of rounding differences.
    • lavf input: don't use deprecated AVStream fields · e74287e9
      Rodeo authored
      Fixes building against newer libavcodecs from the Libav project.
    • Fix y4m input with C420paldv colorspace · aa73459b
      Anton Mitrofanov authored
    • x86: correctly check stack alignment for Atom hadamard_ac · 42c500af
      Fiona Glaser authored
      Regression in r2265 (only affected compilers with broken stack alignment,
      like ICL on win32).
    • x86inc: fix some corner cases of SWAP · bed18d0e
      Loren Merritt authored
      The following cases used to produce the wrong permutation:
      SWAP with >=3 named (rather than numbered) args
      PERMUTE followed by SWAP with 2 named args
  2. 13 Apr, 2013 1 commit
  3. 01 Mar, 2013 1 commit
  4. 26 Feb, 2013 10 commits
    • ARM: update NEON mc_chroma to work with NV12 and re-enable it · 3a8baa0e
      Stefan Groenroos authored
      Up to 10-15% faster overall.
    • quant_4x4x4: quant one 8x8 block at a time · 993c81e9
      Fiona Glaser authored
      This reduces overhead and lets us use less branchy code for zigzag, dequant,
      decimate, and so on.
      Reorganize and optimize a lot of macroblock_encode using this new function.
      ~1-2% faster overall.
      Includes NEON and x86 versions of the new function.
      Using larger merged functions like this will also make wider SIMD, like
      AVX2, more effective.
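The shape of a merged quantization call, where four 4x4 blocks (one 8x8 area) are quantized together and a per-block nonzero bitmask is returned, can be sketched as below. This is a plain Python illustration of the idea, not the C or assembly from the commit: the 16-bit fixed-point multiplier convention and the argument layout are assumptions for the example.

```python
def quant_4x4(coefs, mf, bias):
    """Quantize one 16-coefficient block in place; return 1 if any result is nonzero."""
    nonzero = 0
    for i, c in enumerate(coefs):
        # Quantize the magnitude with a rounding bias, then restore the sign.
        mag = ((abs(c) + bias[i]) * mf[i]) >> 16
        coefs[i] = mag if c > 0 else -mag
        nonzero |= int(mag != 0)
    return nonzero


def quant_4x4x4(blocks, mf, bias):
    """Quantize four 4x4 blocks; bit j of the result is set if block j is nonzero."""
    mask = 0
    for j, block in enumerate(blocks):
        mask |= quant_4x4(block, mf, bias) << j
    return mask
```

A caller can then branch once on the returned mask, for example skipping zigzag and dequant for blocks whose bit is clear, instead of checking each 4x4 block after a separate call.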
    • Add AvxSynth support to the AviSynth input module. · 5ee1d03a
      Stephen Hutchinson authored
      Uses dlopen to load AvxSynth on Linux and OS X.
      Allows the use of --demuxer avs for AvxSynth, though the only source filter it
      can currently use is FFMS2.
      Add a local copy of avxsynth_c.h and its dependent headers in extras/ so that
      users don't need to actually have AvxSynth development headers installed to
      enable support for it (mirroring the AviSynth behavior).
      Based on a patch by 0x09 (tab@lavabit.com)
    • Eliminate some branchiness in ME/analysis · 7b1301e9
      Fiona Glaser authored
      Faster, fewer branch mispredictions.
    • Fix some store forwarding stalls · 7de9a9aa
      Fiona Glaser authored
      There are quite a few others, but for most of them either fixing them
      doesn't help or there's no easy way to avoid them.
    • x86: faster AVX satd/sa8d/sa8d_satd/hadamard_ac · 68a6268b
      Fiona Glaser authored
      Use Conroe-style movddup in AVX transforms; both Sandy Bridge and Bulldozer
      do movddup in the load unit, so it's totally free this way.
      On Sandy Bridge:
      ~6% faster sa8d_satd
      ~5% faster hadamard_ac
      ~9% faster 32-bit satd
      ~2% faster sa8d
    • x86: detect Bobcat, improve Atom optimizations, reorganize flags · 5d60b9c9
      Fiona Glaser authored
      The Bobcat has a 64-bit SIMD unit reminiscent of the Athlon 64; detect this
      and apply the appropriate flags.
      It also has an extremely slow palignr instruction; create a flag for this to
      avoid massive penalties on palignr-heavy functions.
      Improve Atom function selection and document exactly what the SLOW_ATOM flag
      covers.
      Add Atom-optimized SATD/SA8D/hadamard_ac functions: simply combine the ssse3
      optimizations with the sse2 algorithm to avoid pmaddubsw, which is slow on
      Atom along with other SIMD multiplies.
      Drop TBM detection; it'll probably never be useful for x264.
      Invert FastShuffle to SlowShuffle; it only ever applied to one CPU (Conroe).
      Detect CMOV, to fail more gracefully when run on a chip with MMX2 but no CMOV.
    • x86: combined SA8D/SATD dsp function · 75d92705
      Oskar Arvidsson authored
      Speedup is most apparent for 8-bit (~30%), but gives some improvements
      for 10-bit too (~12%).
      64-bit only for now.
    • x86: port SSE2+ SATD functions to high bit depth · 790c648d
      Oskar Arvidsson authored
      Makes SATD 20-50% faster across all partition sizes but 4x4.
  5. 25 Feb, 2013 16 commits