1. 17 Jan, 2018 1 commit
  2. 24 Dec, 2017 3 commits
    • Shrink the i4x4_mode cost_table array · 06c8f6ba
      Henrik Gramner authored and Anton Mitrofanov committed
      Only 17 elements are actually used. It was originally padded to 64 bytes to
      avoid cache line splits in the x86 assembly, but those haven't really been
      an issue on x86 CPUs made in the past decade or so.
      
      Benchmarking shows no performance impact from dropping the padding, so
      might as well remove it and save some cache.
    • Make ref and i4x4_mode costs global instead of static · bdf27e78
      Anton Mitrofanov authored
      Fixes some thread safety doubts and makes code cleaner.
      Downside: slightly higher memory usage when calling multiple encoders from the same application.
    • Unify 8-bit and 10-bit CLI and libraries · 71ed44c7
      Vittorio Giovara authored and Anton Mitrofanov committed
      Add 'i_bitdepth' to x264_param_t with the corresponding '--output-depth' CLI
      option to set the bit depth at runtime.
      
      Drop the 'x264_bit_depth' global variable. Rather than hardcoding it to an
      incorrect value, it's preferable to induce a linking failure. If an application
      relies on this symbol, this will make it more obvious where the problem is.
      
      Add Makefile rules that compile modules with different bit depths. Assembly
      on x86 is prefixed with the 'private_prefix' define, while all other archs
      modify their function prefix internally.
      
      Templatize the main C library, x86/x86_64 assembly, ARM assembly, AARCH64
      assembly, PowerPC assembly, and MIPS assembly.
      
      The depth and cache CLI filters heavily depend on bit depth size, so they
      need to be duplicated for each value. This means having to rename these
      filters, and adjust the callers to use the right version.
      
      Unfortunately the threaded input CLI module inherits a common.h dependency
      (input/frame -> common/threadpool -> common/frame -> common/common) which
      is extremely complicated to address in a sensible way. Instead duplicate
      the module and select the appropriate one at run time.
      
      Each bitdepth needs different checkasm compilation rules, so split the main
      checkasm target into two executables.
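The templating idea in the commit above can be illustrated with a small, self-contained sketch. This is not the x264 source: the commit describes building the same modules once per bit depth via Makefile rules and prefixing the symbols, whereas this sketch stands in for that with a macro that stamps out one copy of a function per depth inside a single file. Every name here (`DEFINE_CLIP_PIXEL`, `clip_pixel_8`, `select_clip`, `clip_for_depth`) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for compiling one source file twice with different bit depths:
 * the macro emits one copy of the function per depth, with the depth in the
 * name so both copies can coexist in one binary. Names are hypothetical. */
#define DEFINE_CLIP_PIXEL(depth)                              \
    static int clip_pixel_##depth(int v)                      \
    {                                                         \
        const int pixel_max = (1 << (depth)) - 1;             \
        return v < 0 ? 0 : v > pixel_max ? pixel_max : v;     \
    }

DEFINE_CLIP_PIXEL(8)   /* defines clip_pixel_8()  */
DEFINE_CLIP_PIXEL(10)  /* defines clip_pixel_10() */

typedef int (*clip_fn)(int);

/* Pick the implementation at run time, keyed on a value that plays the
 * role of the new 'i_bitdepth' parameter. Returns NULL if unsupported. */
static clip_fn select_clip(int bit_depth)
{
    switch (bit_depth) {
    case 8:  return clip_pixel_8;
    case 10: return clip_pixel_10;
    default: return NULL;
    }
}

/* Convenience wrapper: clamp v to the valid pixel range for bit_depth. */
static int clip_for_depth(int bit_depth, int v)
{
    clip_fn fn = select_clip(bit_depth);
    assert(fn != NULL);
    return fn(v);
}
```

With this sketch, `clip_for_depth(8, 300)` clamps to the 8-bit maximum while `clip_for_depth(10, 300)` leaves the value unchanged, which is the kind of depth-dependent behavior that forces per-depth copies of such code.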
  3. 24 Jun, 2017 1 commit
  4. 14 Jun, 2017 2 commits
  5. 21 May, 2017 6 commits
  6. 19 May, 2017 1 commit
    • osdep: Rework alignment macros · d13b4c3a
      Henrik Gramner authored
      Drop ALIGNED_N and ALIGNED_ARRAY_N in favor of using explicit alignment.
      
      This will allow us to increase the native alignment without unnecessarily
      increasing the alignment of everything that's currently 32-byte aligned.
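As a rough illustration of what "explicit alignment" means in the commit above, the following C11 sketch gives each buffer a macro that names its own alignment, so a separate "native" alignment constant can change without affecting buffers that only need 32 bytes. The macro definitions, `NATIVE_ALIGN`, and the buffer names are assumptions made for this example and are not taken from the x264 headers.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative macros: each one states its alignment explicitly instead of
 * deferring to a single "native" value. */
#define ALIGNED_16(decl) alignas(16) decl
#define ALIGNED_32(decl) alignas(32) decl
#define ALIGNED_64(decl) alignas(64) decl

/* Hypothetical native alignment; raising it only affects objects that
 * ask for it by name. */
#define NATIVE_ALIGN 64

static ALIGNED_32(int16_t dct_coeffs[16]);          /* stays 32-byte aligned */
static alignas(NATIVE_ALIGN) uint8_t scratch[128];  /* tracks NATIVE_ALIGN   */

/* Return nonzero if p is aligned to n bytes; n must be a power of two. */
static int is_aligned(const void *p, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return ((uintptr_t)p & (n - 1)) == 0;
}
```

Here `is_aligned(dct_coeffs, 32)` and `is_aligned(scratch, NATIVE_ALIGN)` both hold, and changing `NATIVE_ALIGN` leaves the declaration of `dct_coeffs` untouched.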
  7. 21 Jan, 2017 3 commits
  8. 01 Dec, 2016 1 commit
    • Cosmetics · b2b39dae
      Anton Mitrofanov authored
      Also make x264_weighted_reference_duplicate() static.
  9. 20 Apr, 2016 1 commit
  10. 16 Jan, 2016 1 commit
  11. 18 Aug, 2015 1 commit
  12. 23 Feb, 2015 1 commit
  13. 20 Jul, 2014 1 commit
  14. 24 Feb, 2014 1 commit
  15. 21 Jan, 2014 2 commits
  16. 08 Jan, 2014 1 commit
  17. 30 Oct, 2013 2 commits
  18. 23 Aug, 2013 2 commits
    • AVC-Intra support · 9b94896b
      Kieran Kunhya authored
      This format has been reverse engineered and x264's output has almost exactly
      the same bitstream as Panasonic cameras and encoders produce. It therefore does
      not comply with SMPTE RP2027 since Panasonic themselves do not comply with
      their own specification. It has been tested in Avid, Premiere, Edius and
      Quantel.
      
      Parts of this patch were written by Fiona Glaser and some reverse
      engineering was done by Joseph Artsimovich.
    • Transparent hugepage support · fa1e2b74
      Henrik Gramner authored
      Combine frame and mb data mallocs into a single large malloc.
      Additionally, on Linux systems with hugepage support, ask for hugepages on
      large mallocs.
      
      This gives a small performance improvement (~0.2-0.9%) on systems without
      hugepage support, as well as a small memory footprint reduction.
      
      On recent Linux kernels with hugepage support enabled (set to madvise or
      always), it improves performance by up to 4% at the cost of about 7-12% more
      memory usage on typical settings.
      
      It may help even more on Haswell and other recent CPUs with improved 2MB page
      support in hardware.
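The "ask for hugepages on large mallocs" step described above can be sketched roughly as follows. This is a minimal illustration, assuming Linux with glibc and 2 MiB huge pages, and it is not the x264 allocator. The function name `alloc_maybe_huge`, the threshold, and the 64-byte fallback alignment are hypothetical. `posix_memalign` and `madvise` with `MADV_HUGEPAGE` are standard Linux interfaces; the hint is advisory and has no effect when transparent hugepages are disabled.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* assumption: glibc; exposes MADV_HUGEPAGE */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#ifdef __linux__
#include <sys/mman.h>
#endif

#define HUGE_PAGE_SIZE      ((size_t)2 << 20)          /* 2 MiB            */
#define HUGE_PAGE_THRESHOLD (HUGE_PAGE_SIZE * 7 / 8)   /* hypothetical cut */
#define SMALL_ALIGN         ((size_t)64)               /* hypothetical     */

/* Allocate size bytes. Large requests are aligned to the huge page size and
 * flagged with MADV_HUGEPAGE; small ones get an ordinary aligned allocation.
 * Release the result with free(). Returns NULL on failure. */
static void *alloc_maybe_huge(size_t size)
{
    void *p = NULL;
    assert(size > 0);
#if defined(__linux__) && defined(MADV_HUGEPAGE)
    if (size >= HUGE_PAGE_THRESHOLD) {
        /* Round up so the region covers whole huge pages. */
        size_t padded = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
        if (posix_memalign(&p, HUGE_PAGE_SIZE, padded) != 0)
            return NULL;
        /* Advisory hint only; a failure here is not treated as an error. */
        (void)madvise(p, padded, MADV_HUGEPAGE);
        return p;
    }
#endif
    if (posix_memalign(&p, SMALL_ALIGN, size) != 0)
        return NULL;
    return p;
}
```

Rounding the request up to whole huge pages is one plausible source of the extra memory usage the commit message mentions, and combining many small allocations into one large one is what lets a request cross the threshold in the first place.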
  19. 03 Jul, 2013 1 commit
  20. 20 May, 2013 1 commit
  21. 23 Apr, 2013 6 commits
    • x86: AVX memzero_aligned · 547a6573
      Henrik Gramner authored
    • x86: AVX2 high bit-depth predict_16x16_h · 7908dc63
      Henrik Gramner authored
    • x86: more AVX2 framework, AVX2 functions, plus some existing asm tweaks · 0ea5be85
      Fiona Glaser authored
      AVX2 functions:
      mc_chroma
      intra_sad_x3_16x16
      last64
      ads
      hpel
      dct4
      idct4
      sub16x16_dct8
      quant_4x4x4
      quant_4x4
      quant_4x4_dc
      quant_8x8
      SAD_X3/X4
      SATD
      var
      var2
      SSD
      zigzag interleave
      weightp
      weightb
      intra_sad_8x8_x9
      decimate
      integral
      hadamard_ac
      sa8d_satd
      sa8d
      lowres_init
      denoise
    • x86-64: cabac_block_residual assembly · a3f5c732
      Fiona Glaser authored
      RDO: ~20% faster than C
      Bitstream: ~50% faster than C
      1-2% faster overall, highest on preset superfast/fast/medium.
    • OpenCL lookahead · f49a1b2e
      Steve Borho authored
      OpenCL support is compiled in by default, but must be enabled at runtime by an
      --opencl command line flag. Compiling OpenCL support requires perl. To avoid
      the perl requirement use: configure --disable-opencl.
      
      When enabled, the lookahead thread is mostly off-loaded to an OpenCL capable GPU
      device.  Lowres intra cost prediction, lowres motion search (including subpel)
      and bidir cost predictions are all done on the GPU.  MB-tree and final slice
      decisions are still done by the CPU.  Presets which do not use a threaded
      lookahead will not use OpenCL at all (superfast, ultrafast).
      
      Because of data dependencies, the GPU must use an iterative motion search which
      performs more total work than the CPU would do, so this is not work efficient
      or power efficient. But if there are GPU cycles to spare, it can often
      speed up the encode. Output quality when OpenCL lookahead is enabled is often
      very slightly worse than with the CPU lookahead (because of the same data
      dependencies).
      
      x264 must co...
    • Fiona Glaser's avatar
      3cdaca1a
  22. 25 Feb, 2013 1 commit
    • x86: optimize and clean up predictor checking · 6371c3a5
      Fiona Glaser authored
      Branchlessly handle elimination of candidates in MMX roundclip asm.
      Add a new asm function, similar to roundclip, except without the round part.
      Optimize and organize the C code, and make both subme>=3 and subme<3 consistent.
      Add lots of explanatory comments and try to make things a little more understandable.
      ~5-10% faster with subme>=3, ~15-20% faster with subme<3.