1. 16 Jan, 2016 10 commits
    • Henrik Gramner's avatar
      Bump dates to 2016 · d23d1865
      Henrik Gramner authored
      d23d1865
    • Henrik Gramner's avatar
      Simplify threadpool_wait · 24f7705f
      Henrik Gramner authored
      24f7705f
    • Henrik Gramner's avatar
      x86: Avoid some bypass delays and false dependencies · 1637239a
      Henrik Gramner authored
      A bypass delay of 1-3 clock cycles may occur on some CPUs when transitioning
      between int and float domains, so try to avoid that if possible.
      1637239a
    • Henrik Gramner's avatar
      x86: Enable high bit-depth x264_coeff_last64_avx2_lzcnt · 7688814a
      Henrik Gramner authored
      The function existed but was never enabled.
      7688814a
    • Geza Lore's avatar
      x86inc: Add debug symbols indicating sizes of compiled functions · 366fa858
      Geza Lore authored
      Some debuggers/profilers use this metadata to determine which function a
      given instruction is in; without it they can get confused by local labels
      (if you haven't stripped those). On the other hand, some tools are still
      confused even with this metadata. e.g. this fixes `gdb`, but not `perf`.
      
      Currently only implemented for ELF.
      366fa858
    • Henrik Gramner's avatar
      x86inc: Avoid creating unnecessary local labels · 70c3ba42
      Henrik Gramner authored
      The REP_RET workaround is only needed on old AMD cpus, and the labels clutter
      up the symbol table and confuse debugging/profiling tools, so use EQU to
      create SHN_ABS symbols instead of creating local labels. Furthermore, skip
      the workaround completely in functions that definitely won't run on such cpus.
      
      This patch doesn't modify any emitted instructions, and doesn't actually affect
      x264 at all. It's only for other projects that use x86inc.asm without an
      appropriate `strip` command in their buildsystem.
      
      Note that EQU is just creating a local label when using nasm instead of yasm.
      This is probably a bug, but at least it doesn't break anything.
      70c3ba42
    • Henrik Gramner's avatar
      x86inc: Simplify AUTO_REP_RET · 5c3d473a
      Henrik Gramner authored
      cpuflags is never undefined any more, it's set to 0 instead.
      
      Also fix an incorrect comment.
      5c3d473a
    • Henrik Gramner's avatar
      x86inc: Use more consistent indentation · 28d68f09
      Henrik Gramner authored
      28d68f09
    • Henrik Gramner's avatar
      x86inc: Preserve arguments when allocating stack space · 963b99ef
      Henrik Gramner authored
      When allocating stack space with a larger alignment than the known stack
      alignment, a temporary register is used for storing the stack pointer.
      Ensure that this isn't one of the registers used for passing arguments.
      963b99ef
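The over-aligned allocation described in this commit can be illustrated with plain integer arithmetic. This is only a sketch under stated assumptions: `frame` and `alloc_aligned` are hypothetical names, the stack pointer is modelled as an integer, and the exact instruction sequence used by x86inc may differ. The point is that the original stack pointer has to be kept somewhere (the "temporary register") because rounding down discards the information needed to restore it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a stack frame: the new (aligned) stack pointer and
 * the original value that must be preserved to undo the allocation. */
typedef struct {
    uintptr_t sp;       /* aligned stack pointer after allocation */
    uintptr_t saved_sp; /* original stack pointer (the "temporary register") */
} frame;

/* Reserve `size` bytes below `sp` and round down to a multiple of `align`.
 * `align` must be a power of two. The stack is assumed to grow downwards. */
static frame alloc_aligned(uintptr_t sp, uintptr_t size, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    frame f;
    f.saved_sp = sp;                    /* keep the old value for the epilogue */
    f.sp = (sp - size) & ~(align - 1);  /* allocate, then align downwards */
    return f;
}
```

In the real prologue the saved value lives in a register, which is why the commit makes sure that register is not one that still holds an incoming argument.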
    • Henrik Gramner's avatar
      x86inc: Improve FMA instruction handling · 6e503341
      Henrik Gramner authored
       * Correctly handle FMA instructions with memory operands.
       * Print a warning if FMA instructions are used without the correct cpuflag.
       * Simplify the instantiation code.
       * Clarify documentation.
      
      Only the last operand in FMA3 instructions can be a memory operand. When
      converting FMA4 instructions to FMA3 instructions we can utilize the fact
      that multiply is a commutative operation and reorder operands if necessary
      to ensure that a memory operand is used only as the last operand.
      6e503341
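The operand reordering described in this commit can be modelled in a few lines of C. This is a simplified sketch, not the macro code in x86inc.asm: the types `operand` and `fma3_insn` and the functions `fma4_to_fma3` and `fma3_exec` are hypothetical, and only the multiply-add case `dst = a*b + c` is covered. It relies on the documented FMA3 forms (132: `op1 = op1*op3 + op2`, 213: `op1 = op2*op1 + op3`, 231: `op1 = op2*op3 + op1`), where only `op3` may be a memory operand.

```c
#include <assert.h>
#include <stdbool.h>

/* An operand is either register number `id` or memory slot `id`. */
typedef struct { int id; bool is_mem; } operand;

/* FMA3 instruction: form is 132, 213 or 231; op1 is also the destination.
 * `ok` is false when no single FMA3 instruction matches the request. */
typedef struct { int form; operand op1, op2, op3; bool ok; } fma3_insn;

static bool same(operand x, operand y)
{
    return x.id == y.id && x.is_mem == y.is_mem;
}

/* Translate the FMA4-style request dst = a*b + c into one FMA3 instruction. */
static fma3_insn fma4_to_fma3(operand dst, operand a, operand b, operand c)
{
    fma3_insn r = { 0, dst, dst, dst, false };
    if (dst.is_mem)
        return r;
    if (same(dst, c)) {
        /* 231: op1 = op2*op3 + op1. Multiplication is commutative, so swap
         * the factors if needed to move a memory factor into op3. */
        if (a.is_mem) { operand t = a; a = b; b = t; }
        r = (fma3_insn){ 231, dst, a, b, !a.is_mem };
        return r;
    }
    /* Otherwise dst must be one of the factors; make it `a`. */
    if (same(dst, b)) { operand t = a; a = b; b = t; }
    if (!same(dst, a))
        return r; /* FMA3 is destructive: dst has to be a source too */
    if (b.is_mem)
        /* 132: op1 = op1*op3 + op2, the memory factor becomes op3. */
        r = (fma3_insn){ 132, dst, c, b, !c.is_mem };
    else
        /* 213: op1 = op2*op1 + op3, the addend may be memory. */
        r = (fma3_insn){ 213, dst, b, c, true };
    return r;
}

/* Evaluate an FMA3 instruction on toy register and memory files. */
static void fma3_exec(fma3_insn in, double *regs, const double *mem)
{
    double v1 = regs[in.op1.id];
    double v2 = in.op2.is_mem ? mem[in.op2.id] : regs[in.op2.id];
    double v3 = in.op3.is_mem ? mem[in.op3.id] : regs[in.op3.id];
    switch (in.form) {
    case 132: regs[in.op1.id] = v1 * v3 + v2; break;
    case 213: regs[in.op1.id] = v2 * v1 + v3; break;
    case 231: regs[in.op1.id] = v2 * v3 + v1; break;
    }
}
```

For example, a request with the memory operand as the first factor and the destination as the addend comes out as a 231 form with the two factors swapped, so the memory operand lands in the last position.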
  2. 03 Jan, 2016 2 commits
  3. 20 Dec, 2015 2 commits
    • Janne Grunau's avatar
      arm: do not fill mc_weight*_neon tabs for HIGH_BIT_DEPTH · 42453453
      Janne Grunau authored
      The asm is only for 8-bit and function prototypes reflect that. Avoids
      numerous warnings with --bit-depth=9/10.
      42453453
    • Janne Grunau's avatar
      arm: Eliminate text relocations in asm · df51d8ef
      Janne Grunau authored
      Android 6 does not link shared libraries with text relocations.
      
      Make the movrel macro position independent and add movrelx for indirect
      loads of external symbols.
      
      Move the function pointer table for the aligned memcpy variants to the
      data.rel.ro section on Linux/Android.
      df51d8ef
  4. 17 Oct, 2015 1 commit
  5. 11 Oct, 2015 25 commits
    • Anton Mitrofanov's avatar
      ppc: Add detection of AltiVec support for FreeBSD · 75992107
      Anton Mitrofanov authored
      Patch from FreeBSD ports.
      75992107
    • Martin Storsjö's avatar
      arm: Implement x264_mbtree_propagate_{cost, list}_neon · 6f04b146
      Martin Storsjö authored
      The cost function could be simplified to avoid having to clobber
      q4/q5, but this requires reordering instructions, which increases
      the total runtime.
      
      checkasm timing              Cortex-A7   A8       A9
      mbtree_propagate_cost_c      63702       155835   62829
      mbtree_propagate_cost_neon   17199       10454    11106

      mbtree_propagate_list_c      104203      108949   84532
      mbtree_propagate_list_neon   82035       78348    60410
      6f04b146
    • Martin Storsjö's avatar
      x86: Share the mbtree_propagate_list macro with aarch64 · 3e25eab0
      Martin Storsjö authored
      This avoids having to duplicate the same code for all architectures
      that implement only the internal part of this function in assembler.
      3e25eab0
    • Martin Storsjö's avatar
      arm: Implement luma intra deblocking · 654901df
      Martin Storsjö authored
      checkasm timing              Cortex-A7   A8     A9
      deblock_luma_intra[0]_c      5988        4653   4316
      deblock_luma_intra[0]_neon   3103        2170   2128
      deblock_luma_intra[1]_c      7119        5905   5347
      deblock_luma_intra[1]_neon   2068        1381   1412
      
      This includes extra optimizations by Janne Grunau.
      
      Timings from a separate build, on Exynos 5422:
      
                                   Cortex-A7   A15
      deblock_luma_intra[0]_c      6627        3300
      deblock_luma_intra[0]_neon   3059        1128
      deblock_luma_intra[1]_c      7314        4128
      deblock_luma_intra[1]_neon   2038        720
      654901df
    • Martin Storsjö's avatar
      arm: Implement some neon 8x16c intra predict functions · e2696a60
      Martin Storsjö authored
      checkasm timing                Cortex-A7   A8     A9
      intra_predict_8x16c_dct_c      862         540    590
      intra_predict_8x16c_dct_neon   608         511    657
      intra_predict_8x16c_h_c        972         707    719
      intra_predict_8x16c_h_neon     722         656    672
      intra_predict_8x16c_p_c        10183       9819   8655
      intra_predict_8x16c_p_neon     2622        1972   1983
      e2696a60
    • Martin Storsjö's avatar
      arm: Implement x264_plane_copy_neon · 5db8b6b9
      Martin Storsjö authored
      checkasm timing   Cortex-A7   A8      A9
      plane_copy_c      13124       10925   9106
      plane_copy_neon   7349        5103    8945
      5db8b6b9
    • Jerome Duval's avatar
      Haiku support · 39af8c72
      Jerome Duval authored
      Add Haiku as supported platform in configure.
      Haiku has no nice() function, use the platform specific substitute instead.
      39af8c72
    • Martin Storsjö's avatar
      arm: Implement x264_decimate_score15/16/64_neon · 5c13589b
      Martin Storsjö authored
      checkasm timing         Cortex-A7   A8     A9
      decimate_score15_c      764         736    535
      decimate_score15_neon   487         494    453
      decimate_score16_c      782         727    553
      decimate_score16_neon   487         494    521
      decimate_score64_c      2361        2597   2011
      decimate_score64_neon   1017        802    785
      5c13589b
    • Martin Storsjö's avatar
      arm: Implement chroma intra deblock · 3902ae02
      Martin Storsjö authored
      checkasm timing                       Cortex-A7   A8     A9
      deblock_chroma_420_intra_mbaff_c      1469        1276   1181
      deblock_chroma_420_intra_mbaff_neon   981         717    644
      deblock_chroma_intra[1]_c             2954        2402   2321
      deblock_chroma_intra[1]_neon          947         581    575
      deblock_h_chroma_420_intra_c          2859        2509   2264
      deblock_h_chroma_420_intra_neon       1480        1119   1028
      deblock_h_chroma_422_intra_c          6211        5030   4792
      deblock_h_chroma_422_intra_neon       2894        1990   2077
      3902ae02
    • Martin Storsjö's avatar
      arm: Implement x264_pixel_sa8d_satd_16x16_neon · e8b95e92
      Martin Storsjö authored
      This requires spilling some registers to the stack,
      contrary to the aarch64 version.
      
      checkasm timing                 Cortex-A7   A8     A9
      sa8d_satd_16x16_neon            12936       6365   7492
      sa8d_satd_16x16_separate_neon   14841       6605   8324
      e8b95e92
    • Martin Storsjö's avatar
      arm: Implement x264_deblock_h_chroma_mbaff_neon · 6bbaa275
      Martin Storsjö authored
      checkasm timing                 Cortex-A7   A8     A9
      deblock_chroma_420_mbaff_c      1944        1706   1526
      deblock_chroma_420_mbaff_neon   1210        873    865
      6bbaa275
    • Martin Storsjö's avatar
      arm: Implement x264_deblock_h_chroma_422_neon · 3c66591e
      Martin Storsjö authored
      checkasm timing             Cortex-A7   A8     A9
      deblock_h_chroma_422_c      6953        6269   5145
      deblock_h_chroma_422_neon   3905        2569   2551
      3c66591e
    • Martin Storsjö's avatar
      arm: Implement integral_init4/8h/v_neon · 5265b927
      Martin Storsjö authored
      checkasm timing        Cortex-A7   A8      A9
      integral_init4h_c      10466       8590    6161
      integral_init4h_neon   3021        1494    1800
      integral_init4v_c      16250       13590   13628
      integral_init4v_neon   3473        2073    3291
      integral_init8h_c      10100       8275    5705
      integral_init8h_neon   4403        2344    2751
      integral_init8v_c      6403        4632    4999
      integral_init8v_neon   1184        783     1306
      5265b927
    • Martin Storsjö's avatar
      arm: Implement x264_denoise_dct_neon · b08403b5
      Martin Storsjö authored
      checkasm timing    Cortex-A7   A8     A9
      denoise_dct_c      6604        5510   5858
      denoise_dct_neon   1774        1139   1614
      b08403b5
    • Martin Storsjö's avatar
      arm: Add x264_nal_escape_neon · ceee976b
      Martin Storsjö authored
      checkasm timing   Cortex-A7   A8       A9
      nal_escape_c      852758      879566   655497
      nal_escape_neon   376831      450678   371673
      ceee976b
    • Martin Storsjö's avatar
      arm: Add neon versions of vsad, asd8 and ssd_nv12_core · 8feb733e
      Martin Storsjö authored
      These are straight translations of the aarch64 versions.
      
      checkasm timing   Cortex-A7   A8       A9
      vsad_c            16234       10984    9850
      vsad_neon         2132        1020     789

      asd8_c            5859        3561     3543
      asd8_neon         1407        1279     1250

      ssd_nv12_c        608096      591072   426285
      ssd_nv12_neon     72752       33549    41347
      8feb733e
    • Janne Grunau's avatar
      aarch64: Skip deblocking in x264_deblock_h_chroma_422_neon · 3d86abab
      Janne Grunau authored
      If the parameters (alpha, beta, tc0[]) indicated that the deblocking
      should have been skipped, every 2nd chroma line would have been
      deblocked anyway.
      
      deblock_h_chroma_422_neon: 2259 (before)
      deblock_h_chroma_422_neon: 2192 (after)
      3d86abab
    • Janne Grunau's avatar
      aarch64: Optimize various intra_predict asm functions · aec81efd
      Janne Grunau authored
      Make them at least as fast as the compiled C version (tested on
      cortex-a53 vs. gcc 4.9.2).
      
                              C     NEON (before)   NEON (after)
      intra_predict_4x4_dc:   260   335             260
      intra_predict_4x4_dct:  210   265             200
      intra_predict_8x8c_dc:  497   548             493
      intra_predict_8x8c_v:   232   309             179 (arm64)
      intra_predict_8x16c_dc: 795   830             790
      aec81efd
    • Janne Grunau's avatar
      aarch64: Faster intra_predict_4x4_h · b16268ac
      Janne Grunau authored
      Use multiplication with 0x01010101 for splats.
      
      On a cortex-a53:
                           gcc 4.9.2   llvm 3.6   neon (before)   neon (after)
      intra_predict_4x4_h: 162         147        160/155         139/135
      b16268ac
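The splat-by-multiplication trick mentioned in this commit can be shown in scalar C. This is a minimal sketch, assuming 8-bit pixels; `splat_u8x4` and `predict_4x4_h_sketch` are hypothetical names chosen for illustration and are not x264's API. Horizontal prediction fills each row of the block with the pixel immediately to its left, so every row needs one byte replicated four times.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Replicate one 8-bit pixel into all four bytes of a 32-bit word.
 * v * 0x01010101 == v + (v << 8) + (v << 16) + (v << 24), and since
 * v < 256 the four shifted copies never overlap or carry. */
static uint32_t splat_u8x4(uint8_t v)
{
    return (uint32_t)v * 0x01010101u;
}

/* Hypothetical helper: horizontal 4x4 prediction. `stride` is the distance
 * in bytes between rows; blk[-1] of every row must be readable. */
static void predict_4x4_h_sketch(uint8_t *blk, int stride)
{
    assert(blk != NULL && stride >= 4);
    for (int y = 0; y < 4; y++) {
        uint8_t *row = blk + y * stride;
        uint32_t word = splat_u8x4(row[-1]);
        memcpy(row, &word, sizeof word); /* one 32-bit store per row */
    }
}
```

Because all four bytes of the word are identical, the result does not depend on the byte order of the machine.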
    • Janne Grunau's avatar
      aarch64: Fix coeff_level_run* macros with LLVM's assembler · f2a6be92
      Janne Grunau authored
      LLVM's integrated assembler does not treat symbols as integer constants.
      f2a6be92
    • Janne Grunau's avatar
      592e92e9
    • Martin Storsjö's avatar
      arm: Implement x264_sub8x16_dct_dc_neon · 6efb57ad
      Martin Storsjö authored
      checkasm timing       Cortex-A7   A8     A9
      sub8x16_dct_dc_c      6386        3901   4080
      sub8x16_dct_dc_neon   1491        698    917
      6efb57ad
    • Martin Storsjö's avatar
      arm: Optimize x264_deblock_h_chroma_neon · 89439b2c
      Martin Storsjö authored
      Shuffle both chroma components together as a 16 bit unit, and
      don't write the unchanged columns (like in x264_deblock_h_luma_neon
      and in the aarch64 version of the function).
      
      This causes a minor slowdown for x264_deblock_v_chroma_neon, but
      it is negligible compared to the speedup.
      
      checkasm timing             Cortex-A7   A8     A9
      deblock_chroma[1]_c         4817        4057   3601
      deblock_chroma[1]_neon      1249        716    817    (before)
      deblock_chroma[1]_neon      1249        766    845    (after)

      deblock_h_chroma_420_c      3699        3275   2830
      deblock_h_chroma_420_neon   2068        1414   1400   (before)
      deblock_h_chroma_420_neon   1838        1355   1291   (after)
      89439b2c
    • Martin Storsjö's avatar
      ff71457d
    • Martin Storsjö's avatar
      aarch64: Simplify the decimate_score functions · ef603481
      Martin Storsjö authored
      After doing a left shift by the number of bits returned by clz,
      only bits set to zero can be shifted out, so if the register
      was nonzero to start with (which is checked), it can't become
      zero here.
      ef603481
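The argument in this commit message can be checked with a small C model. This is a sketch, not the aarch64 code: `clz32` and `normalize` are hypothetical helpers, and a portable loop stands in for the hardware count-leading-zeros instruction. Shifting a nonzero value left by its number of leading zeros only discards zero bits and moves the highest set bit to the top position, so the result is never zero and a second zero test is redundant.

```c
#include <assert.h>
#include <stdint.h>

/* Count the leading zero bits of a nonzero 32-bit value. */
static int clz32(uint32_t x)
{
    assert(x != 0); /* the caller has already checked for zero */
    int n = 0;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Shift out exactly the leading zeros. For nonzero input the top bit of the
 * result is always set, so the result cannot be zero. */
static uint32_t normalize(uint32_t x)
{
    return x << clz32(x);
}
```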