  1. May 14, 2024
    • Improve ELF PIC support for external function calls · b6ba1e30
      Henrik Gramner authored
      PLT/GOT indirections are required in some cases, most commonly when
      calling functions from other shared libraries, but also in some
      scenarios when calling functions with default symbol visibility,
      even within the same component, on certain elf64 platforms.
      
      On elf64 we can simply use PLT relocations for all calls to external
      functions. Since the linker can eliminate unnecessary PLT
      indirections, producing a final binary identical to one built with
      non-PLT relocations, there isn't really any downside to doing so. This
      mimics what regular compilers normally do for calls to external functions.
      
      On elf32 with PIC we can use a function pointer from the GOT when
      calling external functions, similar to what regular compilers do when
      using -fno-plt. Since this both introduces overhead and clobbers one
      register, which could potentially have been used for custom calling
      conventions when calling other asm functions within the same library,
      it's only performed for functions declared using 'cextern_naked'.
  2. Mar 15, 2024
    • Improve XMM-spilling functionality on 64-bit Windows · 04f14f43
      Henrik Gramner authored
      Prior to this change, handling the scenario where the number of
      XMM registers spilled depends on whether a branch is taken was
      complicated. There were essentially three options:
      
      1) Always spill the largest number of XMM registers. Results in
         unnecessary spills.
      
      2) Do the spilling after the branch. Results in code duplication
         for the shared subset of spills.
      
      3) Do the spilling manually. Optimal, but overly complex and vexing.
      
      This adds optional arguments to the WIN64_SPILL_XMM and
      WIN64_PUSH_XMM macros to make it possible to allocate space for a
      certain number of registers but initially only push a subset of
      those, with the option of pushing additional registers later.
    • Restore the stack state between stack allocations · 8494a52b
      Henrik Gramner authored
      Allows the use of multiple independent stack allocations within
      a function without having to manually fiddle with stack offsets.
  3. Feb 22, 2024
  4. Feb 20, 2024
  5. Feb 21, 2022
    • Add REPX macro to repeat instructions/operations · 2c087c14
      Henrik Gramner authored
      When operating on large blocks of data it's common to repeatedly use
      an instruction on multiple registers. Using the REPX macro makes it
      easy to quickly write dense code to achieve this without having to
      explicitly duplicate the same instruction over and over.
      
      For example,
      
          REPX {paddw x, m4}, m0, m1, m2, m3
          REPX {mova [r0+16*x], m5}, 0, 1, 2, 3
      
      will expand to
      
          paddw       m0, m4
          paddw       m1, m4
          paddw       m2, m4
          paddw       m3, m4
          mova [r0+16*0], m5
          mova [r0+16*1], m5
          mova [r0+16*2], m5
          mova [r0+16*3], m5
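
The expansion above can be modelled in a few lines of Python. This is only a sketch of the substitution behaviour described in the commit message, not the macro's actual NASM implementation; the function name `repx` is hypothetical, and the model does not reproduce the column alignment of the assembler output.

```python
import re

def repx(operation, *args):
    """Expand `operation` once per argument, replacing the placeholder token x."""
    # \b keeps an x inside a longer word (e.g. "pxor", "xmm0") from being replaced.
    return [re.sub(r"\bx\b", str(arg), operation) for arg in args]

for line in repx("paddw x, m4", "m0", "m1", "m2", "m3"):
    print(line)
for line in repx("mova [r0+16*x], m5", 0, 1, 2, 3):
    print(line)
```

Each argument yields one copy of the operation, so four arguments produce four instructions per REPX line.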
    • Fix edge case in forced VEX-encoding · d66fddf5
      Henrik Gramner authored
      Correctly handle emulation of 4-operand instructions (e.g. 'shufps')
      where src1 is a memory operand.
    • Enable 4-operand emulation for variable blend instructions · 67efddc8
      Henrik Gramner authored
      With legacy encoding the last operand (the index) must be xmm0,
      but aside from that emulating non-destructive forms works the
      same as any other instruction.
  6. Aug 31, 2021
  7. Jun 15, 2021
    • Support memory operands in src1 in 3-operand instructions · 3c738118
      Henrik Gramner authored
      Particularly in code that makes heavy use of macros it's possible
      to end up with 3-operand instructions with a memory operand in src1.
      In the case of SSE this works fine due to automatic move insertions,
      but in AVX that fails since memory operands are only allowed in src2.
      
      The main purpose of this feature is to minimize the amount of code
      changes required to facilitate conversion of existing SSE code to AVX.
  8. Feb 11, 2021
    • Add stack probing on Windows · e69f24cc
      Henrik Gramner authored
      Large stack allocations on Windows need to use stack probing in order
      to guarantee that all stack memory is committed before accessing it.
      This is done by ensuring that the guard page(s) at the end of the
      currently committed pages are touched prior to any pages beyond that.
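
The guard-page rule described above can be illustrated with a toy model. This is a simplified sketch, not how x86inc or Windows implements probing: the `Stack` class, `alloc_with_probing`, a single guard page, and the 4096-byte page size are all assumptions made for illustration. Offsets are measured downward from the top of the stack.

```python
PAGE = 4096  # assumed page size

class Stack:
    """Toy model: pages are numbered downward from the top of the stack."""

    def __init__(self, committed_pages=1):
        # Index of the guard page, which sits just below the committed pages.
        self.guard = committed_pages

    def touch(self, offset):
        page = offset // PAGE
        if page > self.guard:
            raise MemoryError(f"page {page} is beyond guard page {self.guard}")
        if page == self.guard:
            # Touching the guard page commits it and moves the guard one page down.
            self.guard += 1

def alloc_with_probing(stack, size):
    # Touch one address per page, in order, before the allocation is used.
    for offset in range(stack.guard * PAGE, size, PAGE):
        stack.touch(offset)

unprobed = Stack()
try:
    unprobed.touch(5 * PAGE - 1)      # jumps straight past the guard page
except MemoryError as err:
    print("without probing:", err)

probed = Stack()
alloc_with_probing(probed, 5 * PAGE)
probed.touch(5 * PAGE - 1)            # every page is now committed
print("with probing: guard page is now", probed.guard)
```

In the model, touching the far end of a five-page allocation fails unless each intermediate page has been touched first, which is the ordering that probing guarantees.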
  9. Jan 28, 2021
  10. Aug 21, 2020
  11. Jun 24, 2020
  12. Jun 09, 2020
  13. Oct 21, 2019
    • Fix LOAD_MM_PERMUTATION for AVX-512 · 822745ae
      Victorien Le Couviour--Tuffet authored and Henrik Gramner committed
      Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION)
      is redundant. It causes the register mapping to be the same as without
      the initial AVX512_MM_PERMUTATION, with the user SWAPs applied.
      
      For example...
      
          INIT_YMM avx512
          SWAP m0, m16
          SAVE_MM_PERMUTATION
          ; do whatever
          LOAD_MM_PERMUTATION
      
      ... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1
      instead of ymm17.
  14. Oct 19, 2019
  15. Mar 06, 2019
  16. Aug 06, 2018
    • Improve SAVE/LOAD_MM_PERMUTATION macros · f7d8b77e
      Henrik Gramner authored
          
      Use register numbers instead of copying the full register names.
      This makes it possible to change register widths in the middle of
      a function and keep the mmreg permutations intact which can be
      useful for code that only needs larger vectors for parts of the
      function in combination with macros etc.
      
      Also change the LOAD_MM_PERMUTATION macro to use the same default
      name as the SAVE macro. This simplifies swapping from ymm to xmm
      registers or vice versa:
      
          SAVE_MM_PERMUTATION
          INIT_XMM <cpuflags>
          LOAD_MM_PERMUTATION
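
The benefit of saving register numbers rather than full register names can be shown with a toy model. This sketch is not the macros' implementation; the `MMRegs` class and its method names are hypothetical, and it assumes that an INIT_* macro resets the mapping to the identity permutation.

```python
class MMRegs:
    """Toy model of the m# -> physical register mapping."""

    def __init__(self, prefix, count=16):
        self.prefix = prefix
        self.perm = list(range(count))   # m<i> maps to <prefix><perm[i]>

    def swap(self, a, b):                # models SWAP
        self.perm[a], self.perm[b] = self.perm[b], self.perm[a]

    def name(self, i):
        return f"{self.prefix}{self.perm[i]}"

    def init(self, prefix):              # models INIT_XMM / INIT_YMM
        self.prefix = prefix
        self.perm = list(range(len(self.perm)))

    def save(self):                      # models SAVE_MM_PERMUTATION
        return list(self.perm)           # numbers only, no register width

    def load(self, saved):               # models LOAD_MM_PERMUTATION
        self.perm = list(saved)

regs = MMRegs("ymm")
regs.swap(0, 3)
saved = regs.save()
regs.init("xmm")                         # switch register width mid-function
regs.load(saved)
print(regs.name(0), regs.name(3))        # xmm3 xmm0
```

Had the saved state been the names `ymm3` and `ymm0`, restoring it after the width change would have brought the ymm registers back; storing only the numbers lets the permutation follow whichever width is currently active.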
    • Optimize VEX instruction encoding · 6e54d400
      Henrik Gramner authored
      Most VEX-encoded instructions require an additional byte to encode
      when src2 is a high register (e.g. x|ymm8..15). If the instruction
      is commutative we can swap src1 and src2 when doing so reduces the
      instruction length, e.g.
      
          vpaddw xmm0, xmm0, xmm8 -> vpaddw xmm0, xmm8, xmm0
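
The size difference comes from the VEX prefix format: the 2-byte prefix (C5) has no B bit, so a high register in src2 (encoded in ModRM.rm) forces the 3-byte prefix (C4), while a high register in src1 fits in the vvvv field of either form. The sketch below encodes only the register-register form of this one instruction to make that visible; the function name is hypothetical and the code is an illustration, not the logic x86inc uses.

```python
def encode_vpaddw_xmm(dst, src1, src2):
    """Encode 'vpaddw xmm<dst>, xmm<src1>, xmm<src2>' (VEX.128.66.0F FD /r), regs 0..15."""
    r = (~dst >> 3) & 1      # inverted high bit of dst (extends ModRM.reg)
    b = (~src2 >> 3) & 1     # inverted high bit of src2 (extends ModRM.rm)
    vvvv = ~src1 & 0xF       # src1, stored inverted
    pp = 0b01                # implied 66 prefix
    modrm = 0xC0 | ((dst & 7) << 3) | (src2 & 7)
    if b:
        # src2 is a low register: the 2-byte prefix is sufficient.
        prefix = [0xC5, (r << 7) | (vvvv << 3) | pp]
    else:
        # VEX.B exists only in the 3-byte prefix (X is unused here, map is 0F).
        prefix = [0xC4, (r << 7) | (1 << 6) | (b << 5) | 0b00001, (vvvv << 3) | pp]
    return bytes(prefix + [0xFD, modrm])

print(encode_vpaddw_xmm(0, 0, 8).hex())  # c4c179fdc0 (5 bytes)
print(encode_vpaddw_xmm(0, 8, 0).hex())  # c5b9fdc0   (4 bytes)
```

Because paddw is commutative, both encodings compute the same result, and the swapped form saves one byte.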
    • Fix VEX -> EVEX instruction conversion · d3200075
      Henrik Gramner authored
      There's an edge case that wasn't properly handled.
  17. Jan 17, 2018
  18. Dec 24, 2017
  19. Jun 24, 2017