- May 14, 2024
-
-
Henrik Gramner authored
PLT/GOT indirections are required in some cases, most commonly when calling functions from other shared libraries, but also in some scenarios when calling functions with default symbol visibility, even within the same component, on certain elf64 platforms.

On elf64 we can simply use PLT relocations for all calls to external functions. Since the linker is able to eliminate unnecessary PLT indirections, with the final output binary being identical to one built with non-PLT relocations, there isn't really any downside to doing so. This mimics what regular compilers normally do for calls to external functions.

On elf32 with PIC we can use a function pointer from the GOT when calling external functions, similar to what regular compilers do when using -fno-plt. Since this both introduces overhead and clobbers one register, which could potentially have been used for custom calling conventions when calling other asm functions within the same library, it's only performed for functions declared using 'cextern_naked'.
-
- Mar 15, 2024
-
-
Henrik Gramner authored
Prior to this change, the scenario where the number of XMM registers spilled depends on whether a branch is taken or not was complicated to handle well. There were essentially three options:

1) Always spill the largest number of XMM registers. Results in unnecessary spills.
2) Do the spilling after the branch. Results in code duplication for the shared subset of spills.
3) Do the spilling manually. Optimal, but overly complex and vexing.

This adds additional optional arguments to the WIN64_SPILL_XMM and WIN64_PUSH_XMM macros to make it possible to allocate space for a certain number of registers but initially only push a subset of those, with the option of pushing additional registers later.
-
Henrik Gramner authored
Allows the use of multiple independent stack allocations within a function without having to manually fiddle with stack offsets.
-
- Feb 22, 2024
-
-
Henrik Gramner authored
-
- Feb 20, 2024
-
-
Henrik Gramner authored
Automatically flag x86-64 asm object files as SHSTK-compatible. Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology (CET), a feature aimed at defending against ROP attacks by verifying that 'call' and 'ret' instructions are correctly matched. For well-written code this works transparently without any code changes, since return addresses popped from the shadow stack should match those popped from the normal stack for performance reasons anyway.
-
Henrik Gramner authored
-
Henrik Gramner authored
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
-
Henrik Gramner authored
-
- Feb 21, 2022
-
-
Henrik Gramner authored
When operating on large blocks of data it's common to repeatedly use an instruction on multiple registers. Using the REPX macro makes it easy to quickly write dense code to achieve this without having to explicitly duplicate the same instruction over and over.

For example,

    REPX {paddw x, m4}, m0, m1, m2, m3
    REPX {mova [r0+16*x], m5}, 0, 1, 2, 3

will expand to

    paddw m0, m4
    paddw m1, m4
    paddw m2, m4
    paddw m3, m4
    mova [r0+16*0], m5
    mova [r0+16*1], m5
    mova [r0+16*2], m5
    mova [r0+16*3], m5
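The expansion described above can be modelled outside the assembler. The following is a minimal Python sketch, not the macro itself: the function name `repx` and the regex-based substitution are assumptions made for illustration, whereas the real macro relies on the assembler's own macro parameter substitution.

```python
import re

def repx(template, *args):
    """Return one copy of `template` per argument, with the standalone
    placeholder token `x` replaced by that argument."""
    # \b restricts the match to a whole-word `x`, so register names that
    # merely contain the letter (e.g. "xmm0") are left untouched.
    return [re.sub(r"\bx\b", lambda _m, a=arg: str(a), template) for arg in args]

if __name__ == "__main__":
    for line in repx("paddw x, m4", "m0", "m1", "m2", "m3"):
        print(line)
    for line in repx("mova [r0+16*x], m5", 0, 1, 2, 3):
        print(line)
```

Running the script prints the same eight instructions listed in the commit message.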
-
Henrik Gramner authored
Correctly handle emulation of 4-operand instructions (e.g. 'shufps') where src1 is a memory operand.
-
Henrik Gramner authored
With legacy encoding the last operand (the index) must be xmm0, but aside from that emulating non-destructive forms works the same as any other instruction.
-
- Aug 31, 2021
-
-
Henrik Gramner authored
-
- Jun 15, 2021
-
-
Henrik Gramner authored
Particularly in code that makes heavy use of macros, it's possible to end up with 3-operand instructions with a memory operand in src1. In the case of SSE this works fine due to automatic move insertions, but in AVX it fails since memory operands are only allowed in src2. The main purpose of this feature is to minimize the amount of code changes required to facilitate conversion of existing SSE code to AVX.
-
- Feb 11, 2021
-
-
Henrik Gramner authored
Large stack allocations on Windows need to use stack probing in order to guarantee that all stack memory is committed before accessing it. This is done by ensuring that the guard page(s) at the end of the currently committed pages are touched prior to any pages beyond that.
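To make the probing order concrete, here is a small Python model of which addresses would need to be touched. It is only a sketch under stated assumptions: a 4096-byte page size, one touch per page, and a hypothetical helper name `probe_offsets`. It does not reproduce the instructions the macro actually emits.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def probe_offsets(alloc_size, page_size=PAGE_SIZE):
    """Offsets relative to the current stack pointer that should be touched,
    in order, before `alloc_size` bytes of new stack space are used.

    The stack grows downwards, so the offsets are negative and are visited
    from the highest address to the lowest. That way the guard page is always
    hit before any page beyond it."""
    return [-off for off in range(page_size, alloc_size + 1, page_size)]

if __name__ == "__main__":
    print(probe_offsets(10000))
    print(probe_offsets(100))
```

In this model an allocation smaller than one page yields no probes, while a 10000-byte allocation yields one probe at each of the two full page boundaries it crosses.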
-
- Jan 28, 2021
-
-
- Aug 21, 2020
-
-
Henrik Gramner authored
-
- Jun 24, 2020
-
-
Henrik Gramner authored
-
Henrik Gramner authored
Broadcasting a memory operand is a binary flag: you either broadcast or you don't, and there's only a single possible element size for any given instruction. The instruction syntax, however, requires the broadcast semantics to be explicitly defined, which is an issue when using macros to template code for multiple register widths. Add some helper defines to alleviate the issue.
-
- Jun 09, 2020
-
-
Henrik Gramner authored
-
This allows for AVX-512 code to issue vzeroupper automatically in RET when the number of vector registers used is specified through WIN64_SPILL_XMM instead of through cglobal.
-
-
- Oct 21, 2019
-
-
Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION) is redundant. It causes the register mapping to be the same as without the initial AVX512_MM_PERMUTATION, with the user SWAPs applied. For example...

    INIT_YMM avx512
    SWAP m0, m16
    SAVE_MM_PERMUTATION
    ; do whatever
    LOAD_MM_PERMUTATION

... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1 instead of ymm17.
-
- Oct 19, 2019
-
-
Henrik Gramner authored
-
- Mar 06, 2019
-
-
-
Warn when the following are used without the appropriate cpuflag:

  * YMM and ZMM registers
  * 'pextrw' with a memory operand
  * GPR instruction set extensions
-
Allows for marking symbols as having limited global scope, similar to using 'hidden' symbol visibility on ELF.
-
- Aug 06, 2018
-
-
Henrik Gramner authored
Use register numbers instead of copying the full register names. This makes it possible to change register widths in the middle of a function and keep the mmreg permutations intact, which can be useful for code that only needs larger vectors for parts of the function, in combination with macros etc.

Also change the LOAD_MM_PERMUTATION macro to use the same default name as the SAVE macro. This simplifies swapping from ymm to xmm registers or vice versa:

    SAVE_MM_PERMUTATION
    INIT_XMM <cpuflags>
    LOAD_MM_PERMUTATION
-
Henrik Gramner authored
Most VEX-encoded instructions require an additional byte to encode when src2 is a high register (e.g. x|ymm8..15). If the instruction is commutative we can swap src1 and src2 when doing so reduces the instruction length, e.g.

    vpaddw xmm0, xmm0, xmm8 -> vpaddw xmm0, xmm8, xmm0
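The size difference comes from the VEX prefix itself: the 2-byte form has no B bit, so a high register in the r/m slot (src2) forces the 3-byte form, while src1 lives in the 4-bit vvvv field that both forms provide. Below is a rough Python model of that rule and of the swap decision. It only covers register-register operands, the function names are hypothetical, and it is not the macro logic from the commit.

```python
def vex_prefix_len(src2, opcode_map_0f=True, vex_w=0):
    """Length in bytes of the VEX prefix for a register-register instruction.

    The 2-byte prefix can only be used when the instruction is in the 0F
    opcode map, VEX.W is 0 and the r/m register (src2) is below 8."""
    if src2 >= 8 or not opcode_map_0f or vex_w:
        return 3
    return 2

def maybe_swap(dst, src1, src2, commutative):
    """Swap the two sources when that moves a high register out of src2."""
    if commutative and src2 >= 8 and src1 < 8:
        return dst, src2, src1
    return dst, src1, src2

if __name__ == "__main__":
    before = (0, 0, 8)                       # vpaddw xmm0, xmm0, xmm8
    after = maybe_swap(*before, commutative=True)
    print(before, vex_prefix_len(before[2]))
    print(after, vex_prefix_len(after[2]))
```

If both sources are high registers the swap gains nothing, which is why the sketch leaves that case alone.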
-
Henrik Gramner authored
There's an edge case that wasn't properly handled.
-
- Jan 17, 2018
-
-
Henrik Gramner authored
-
Henrik Gramner authored
-
- Dec 24, 2017
-
-
On ELF platforms such symbols need to be flagged as functions with the correct visibility to please certain linkers in some scenarios.
-
The standard section for read-only data on Windows is .rdata. NASM will flag non-standard sections as executable by default, which isn't ideal.
-
-
There are 32 pseudo-instructions for each floating-point comparison instruction, but only 8 of them are actually valid in legacy-encoded mode. The remaining 24 require the use of VEX-encoded (v-prefixed) instructions and can therefore be disregarded for this purpose.
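For reference, the 8 legacy-encodable predicates correspond to immediate values 0 through 7 of the underlying compare instruction. The Python sketch below maps a pseudo-instruction mnemonic to its base instruction and immediate. The helper name `split_cmp_pseudo_op` is made up for illustration, and this is not how the macros in the commit are written.

```python
# Comparison predicates that fit in the legacy (non-VEX) encoding, indexed by
# their imm8 value.
LEGACY_PREDICATES = ["eq", "lt", "le", "unord", "neq", "nlt", "nle", "ord"]

def split_cmp_pseudo_op(mnemonic):
    """Map e.g. 'cmpltps' to ('cmpps', 1).

    Returns None for anything that is not a legacy-encodable comparison
    pseudo-instruction, including the predicates that need VEX encoding."""
    for suffix in ("ps", "pd", "ss", "sd"):
        if mnemonic.startswith("cmp") and mnemonic.endswith(suffix):
            predicate = mnemonic[3:-len(suffix)]
            if predicate in LEGACY_PREDICATES:
                return "cmp" + suffix, LEGACY_PREDICATES.index(predicate)
    return None

if __name__ == "__main__":
    print(split_cmp_pseudo_op("cmpltps"))
    print(split_cmp_pseudo_op("cmpunordsd"))
    print(split_cmp_pseudo_op("cmpeq_uqps"))
```

A predicate such as `eq_uq` is only reachable through the v-prefixed forms, so the sketch reports it as not applicable.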
-
- Jun 24, 2017
-
-