Commits · master · VideoLAN / x86inc.asm

May 14, 2024

Improve ELF PIC support for external function calls · b6ba1e30

Henrik Gramner authored 8 months ago

PLT/GOT indirections are required in some cases, most commonly when
calling functions from other shared libraries, but also in some
scenarios when calling functions with default symbol visibility
even within the same component on certain elf64 platforms.

On elf64 we can simply use PLT relocations for all calls to external
functions. Since the linker is able to eliminate unnecessary PLT
indirections, making the final output binary identical to one built
with non-PLT relocations, there isn't really any downside to doing so.
This mimics what regular compilers normally do for calls to external
functions.

On elf32 with PIC we can use a function pointer from the GOT when
calling external functions, similar to what regular compilers do when
using -fno-plt. Since this both introduces overhead and clobbers one
register, which could potentially have been used for custom calling
conventions when calling other asm functions within the same library,
it's only performed for functions declared using 'cextern_naked'.
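The lowering rules stated in this commit can be modelled as a small decision function. This is only a sketch of the rules as written in the message: `external_call_strategy` and its result strings are hypothetical labels rather than x86inc macros, and the case the commit does not describe is reported as "unchanged".

```python
def external_call_strategy(elf64: bool, pic: bool, naked: bool) -> str:
    """Classify how a call to an external function is lowered.

    Hypothetical model of the rules in the commit message, not x86inc code.
    """
    if elf64:
        # The linker removes needless PLT indirections, so PLT is always safe.
        return "PLT relocation"
    if pic and naked:
        # elf32 PIC: call through a GOT function pointer, like -fno-plt.
        # This costs overhead and clobbers a register, hence opt-in only.
        return "GOT function pointer"
    # Not covered by this commit: behaviour stays as it was before.
    return "unchanged"


print(external_call_strategy(elf64=False, pic=True, naked=False))
```

An elf32 PIC call to a function declared with plain `cextern` therefore falls through to "unchanged", since the GOT path is reserved for `cextern_naked`.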

b6ba1e30

Mar 15, 2024

Improve XMM-spilling functionality on 64-bit Windows · 04f14f43

Henrik Gramner authored 10 months ago

Prior to this change, the scenario where the number of spilled XMM
registers depends on whether or not a branch is taken was difficult
to handle well. There were essentially three options:

1) Always spill the largest number of XMM registers. Results in
   unnecessary spills.

2) Do the spilling after the branch. Results in code duplication
   for the shared subset of spills.

3) Do the spilling manually. Optimal, but overly complex and vexing.

This adds additional optional arguments to the WIN64_SPILL_XMM and
WIN64_PUSH_XMM macros to make it possible to allocate space for a
certain number of registers but initially only push a subset of
those, with the option of pushing additional registers later.

04f14f43
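The reserve-once, push-a-subset idea can be sketched as a toy model. Most of it is assumed for illustration: `XmmSpillPlan` and `push_up_to` are hypothetical names, and the flat `[rsp+16*slot]` layout ignores the frame x86inc actually builds. Only the Win64 rule that xmm6-xmm15 are callee-saved (and xmm16-xmm31 are not) is taken as fact.

```python
# Win64 treats xmm6-xmm15 as callee-saved; xmm0-xmm5 and xmm16-xmm31 are volatile.
FIRST_NONVOLATILE_XMM = 6
LAST_NONVOLATILE_XMM = 15


def nonvolatile_count(num_xmm_used: int) -> int:
    """How many callee-saved XMM registers a function using xmm0..xmm(n-1) must preserve."""
    capped = min(num_xmm_used, LAST_NONVOLATILE_XMM + 1)
    return max(0, capped - FIRST_NONVOLATILE_XMM)


class XmmSpillPlan:
    """Hypothetical model: reserve space for the worst case once, store lazily."""

    def __init__(self, reserve_xmm: int):
        self.capacity = nonvolatile_count(reserve_xmm)
        self.frame_bytes = 16 * self.capacity  # one 16-byte slot per register
        self.pushed = 0
        self.code: list[str] = []

    def push_up_to(self, num_xmm_used: int) -> None:
        """Emit stores only for registers not already saved."""
        target = nonvolatile_count(num_xmm_used)
        if target > self.capacity:
            raise ValueError("not enough stack space was reserved")
        for slot in range(self.pushed, target):
            reg = FIRST_NONVOLATILE_XMM + slot
            self.code.append(f"movaps [rsp+{16 * slot}], xmm{reg}")
        self.pushed = max(self.pushed, target)


plan = XmmSpillPlan(reserve_xmm=12)  # worst case uses xmm0-xmm11
plan.push_up_to(8)                   # shared path: stores xmm6 and xmm7
plan.push_up_to(12)                  # heavier branch: stores xmm8-xmm11 only
print("\n".join(plan.code))
```

Because space is reserved once for the worst case, the second call only emits the stores that the shared path skipped. This avoids both the unnecessary spills of option 1) and the duplicated code of option 2).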

Restore the stack state between stack allocations · 8494a52b

Henrik Gramner authored 10 months ago

Allows the use of multiple independent stack allocations within
a function without having to manually fiddle with stack offsets.

8494a52b

Feb 22, 2024
- Fix warnings with old nasm versions · 520b9681
  Henrik Gramner authored 10 months ago
  
  520b9681
Feb 20, 2024

Add support for ELF CET properties · c7cf926c

Henrik Gramner authored 11 months ago

Automatically flag x86-64 asm object files as SHSTK-compatible.

Shadow Stack (SHSTK) is a part of Control-flow Enforcement Technology
(CET), a feature aimed at defending against ROP attacks by verifying
that 'call' and 'ret' instructions are correctly matched.

For well-written code this works transparently without any code changes,
since return addresses popped from the shadow stack should already match
those popped from the normal stack: mismatched call/ret pairs defeat the
CPU's return address prediction, so they are avoided for performance
reasons anyway.

c7cf926c
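A toy Python model makes the matching rule concrete. It is an illustration only. The class and exception names are hypothetical, and real hardware keeps the shadow stack in protected memory and raises a #CP (control protection) fault rather than a Python exception.

```python
class ControlProtectionError(Exception):
    """Stands in for the #CP fault raised on a shadow-stack mismatch."""


class ShadowStackCpu:
    """Toy model: call pushes to both stacks, ret checks that they agree."""

    def __init__(self) -> None:
        self.stack: list[int] = []   # normal stack, writable by the program
        self.shadow: list[int] = []  # shadow stack, not writable by normal stores

    def call(self, return_address: int) -> None:
        self.stack.append(return_address)
        self.shadow.append(return_address)

    def ret(self) -> int:
        addr = self.stack.pop()
        expected = self.shadow.pop()
        if addr != expected:
            raise ControlProtectionError(f"{addr:#x} != {expected:#x}")
        return addr


cpu = ShadowStackCpu()
cpu.call(0x1000)
print(hex(cpu.ret()))    # matched call/ret passes the check
cpu.call(0x2000)
cpu.stack[-1] = 0x4141   # simulate a ROP-style overwrite of the return address
try:
    cpu.ret()
except ControlProtectionError as err:
    print("blocked:", err)
```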

Add the crc32 SSE4.2 GPR instruction · 2b05747e
Henrik Gramner authored 11 months ago

2b05747e
Add CLMUL cpu flag · ae734e50
Henrik Gramner authored 11 months ago
Also make the GFNI cpu flag imply the presence of both AESNI and CLMUL.
ae734e50
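Making one cpuflag imply others amounts to taking the closure over an implication table. x86inc expresses this through its own cpuflag definitions. The set-based `IMPLIES` table and `effective_flags` helper below are a hypothetical illustration, not its mechanism.

```python
# Hypothetical implication table: each key implies every flag in its value.
IMPLIES = {
    "gfni": {"aesni", "clmul"},
}


def effective_flags(requested: set[str]) -> set[str]:
    """Close a set of cpuflags under the IMPLIES relation."""
    result = set(requested)
    pending = list(requested)
    while pending:
        for implied in IMPLIES.get(pending.pop(), set()):
            if implied not in result:
                result.add(implied)
                pending.append(implied)
    return result


print(sorted(effective_flags({"gfni"})))  # ['aesni', 'clmul', 'gfni']
```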
Add template defines for AVX512-FP16 broadcasts · 141d7d33
Henrik Gramner authored 11 months ago

141d7d33

Feb 21, 2022

Add REPX macro to repeat instructions/operations · 2c087c14

Henrik Gramner authored 2 years ago

When operating on large blocks of data it's common to repeatedly use
an instruction on multiple registers. Using the REPX macro makes it
easy to quickly write dense code to achieve this without having to
explicitly duplicate the same instruction over and over.

For example,

    REPX {paddw x, m4}, m0, m1, m2, m3
    REPX {mova [r0+16*x], m5}, 0, 1, 2, 3

will expand to

    paddw       m0, m4
    paddw       m1, m4
    paddw       m2, m4
    paddw       m3, m4
    mova [r0+16*0], m5
    mova [r0+16*1], m5
    mova [r0+16*2], m5
    mova [r0+16*3], m5
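The expansion is plain textual substitution of the placeholder `x`, repeated once per remaining argument. NASM performs it at preprocessing time. The Python `repx` function below only models that behaviour, and it assumes `x` needs replacing only where it appears as a standalone token.

```python
import re


def repx(template: str, *args: str) -> list[str]:
    """Repeat template once per argument, replacing the standalone token x."""
    # \bx\b leaves names such as 'xmm0' or 'pxor' untouched.
    return [re.sub(r"\bx\b", lambda _m, a=arg: a, template) for arg in args]


for line in repx("paddw x, m4", "m0", "m1", "m2", "m3"):
    print(line)
for line in repx("mova [r0+16*x], m5", "0", "1", "2", "3"):
    print(line)
```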

2c087c14

Fix edge case in forced VEX-encoding · d66fddf5

Henrik Gramner authored 2 years ago

Correctly handle emulation of 4-operand instructions (e.g. 'shufps')
where src1 is a memory operand.

d66fddf5

Enable 4-operand emulation for variable blend instructions · 67efddc8

Henrik Gramner authored 2 years ago

With legacy encoding the last operand (the mask) must be xmm0,
but aside from that, emulating the non-destructive forms works the
same as for any other instruction.

67efddc8
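The sketch below shows how a non-destructive 4-operand blend can be lowered to the legacy form, assuming register operands only. The helper name `emulate_blendv` is hypothetical, and x86inc's macro may pick instructions differently. The fixed part is that legacy SSE4.1 `blendvps`/`blendvpd`/`pblendvb` read their mask implicitly from xmm0.

```python
def emulate_blendv(op: str, dst: str, src1: str, src2: str, mask: str) -> list[str]:
    """Lower 'op dst, src1, src2, mask' to the legacy SSE4.1 form.

    Register operands only. The legacy encoding takes the mask from xmm0.
    """
    if mask != "xmm0":
        raise ValueError("legacy encoding requires the mask in xmm0")
    code = []
    if dst != src1:
        if dst in (src2, mask):
            # Copying src1 into dst would destroy an input that is still needed.
            raise ValueError("dst aliases src2 or the mask")
        code.append(f"movaps {dst}, {src1}")  # whole-register copy
    code.append(f"{op} {dst}, {src2}, xmm0")
    return code


print(emulate_blendv("blendvps", "xmm1", "xmm2", "xmm3", "xmm0"))
```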

Aug 31, 2021
- Add an option for forcing VEX-encoding in non-AVX functions · 15cc2291
  Henrik Gramner authored 3 years ago
  
  15cc2291
Jun 15, 2021

Support memory operands in src1 in 3-operand instructions · 3c738118

Henrik Gramner authored 3 years ago

Particularly in code that makes heavy use of macros, it's possible
to end up with 3-operand instructions with a memory operand in src1.
With SSE this works fine due to automatic move insertion, but with
AVX it fails since memory operands are only allowed in src2.

The main purpose of this feature is to minimize the amount of code
changes required to facilitate conversion of existing SSE code to AVX.

3c738118
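One plausible way to legalize such an instruction for AVX is to swap the sources when the operation is commutative, and otherwise to load src1 into the destination first. This is an assumption about the strategy, not a transcription of x86inc. `lower_avx_3op`, the small `COMMUTATIVE` set and the use of `vmovdqu` as a generic load are all illustrative.

```python
# Hypothetical set of integer ops whose two sources can be swapped freely.
COMMUTATIVE = {"paddw", "paddd", "pmullw", "pand", "por", "pxor"}


def is_mem(operand: str) -> bool:
    return operand.startswith("[")


def lower_avx_3op(op: str, dst: str, src1: str, src2: str) -> list[str]:
    """Rewrite 'op dst, src1, src2' so that a memory operand ends up in src2."""
    if not is_mem(src1):
        return [f"v{op} {dst}, {src1}, {src2}"]
    if op in COMMUTATIVE and not is_mem(src2):
        # Operand order doesn't affect the result, so swap the sources.
        return [f"v{op} {dst}, {src2}, {src1}"]
    if dst == src2:
        raise ValueError("loading src1 into dst would clobber src2")
    # Non-commutative: load src1 into dst, then operate in place.
    return [f"vmovdqu {dst}, {src1}", f"v{op} {dst}, {dst}, {src2}"]


print(lower_avx_3op("paddw", "xmm0", "[r0]", "xmm1"))
print(lower_avx_3op("psubw", "xmm0", "[r0]", "xmm1"))
```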

Feb 11, 2021

Add stack probing on Windows · e69f24cc

Henrik Gramner authored 3 years ago

Large stack allocations on Windows need to use stack probing in order
to guarantee that all stack memory is committed before accessing it.
This is done by ensuring that the guard page(s) at the end of the
currently committed pages are touched prior to any pages beyond that.
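The probe arithmetic can be sketched as a list of offsets below the stack pointer, touched in order. The function name `probe_offsets` is hypothetical, and the sketch assumes 4 KiB pages and that the page just below the one holding the stack pointer is either committed or the guard page.

```python
PAGE_SIZE = 4096  # page size on x86 and x86-64 Windows


def probe_offsets(alloc_size: int, page_size: int = PAGE_SIZE) -> list[int]:
    """Offsets below the stack pointer to touch, in order, before allocating.

    Touching one byte per page while moving away from the stack pointer hits
    each guard page before anything beyond it. After the last probe, the rest
    of the allocation lies at most one page further away, so it can only
    reach the new guard page.
    """
    return list(range(page_size, alloc_size, page_size))


print(probe_offsets(10000))  # [4096, 8192]
```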

e69f24cc

Jan 28, 2021
- Properly fix LOAD_MM_PERMUTATION for AVX-512 · eb816b59
  Anton Mitrofanov authored 3 years ago and Henrik Gramner committed 3 years ago
  
  eb816b59
Aug 21, 2020
- Fix additional warnings when using nasm 2.15 · b318d571
  Henrik Gramner authored 4 years ago
  
  b318d571
Jun 24, 2020

Properly sort instructions in alphabetical order · 58eb6436
Henrik Gramner authored 4 years ago

58eb6436

Add template defines for EVEX broadcasts · 4cb18a06

Henrik Gramner authored 4 years ago

Broadcasting a memory operand is a binary flag: you either broadcast
or you don't, and there's only a single possible element size for
any given instruction.

The instruction syntax, however, requires the broadcast to be spelled
out explicitly as {1toN}, where N depends on the register width, which
is an issue when using macros to template code for multiple register
widths.

Add some helper defines to alleviate the issue.
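The rule those helpers encode is simple: the decorator is `{1toN}`, with N equal to the vector width divided by the element size. The `broadcast_decorator` function below is a hypothetical illustration of that rule. The defines actually added by this commit may be named and structured differently, and 16-bit elements apply only to AVX512-FP16 instructions.

```python
def broadcast_decorator(vector_bits: int, element_bits: int) -> str:
    """Return the EVEX embedded-broadcast decorator, e.g. '{1to16}'."""
    if vector_bits not in (128, 256, 512):
        raise ValueError("EVEX vectors are 128, 256 or 512 bits wide")
    if element_bits not in (16, 32, 64):
        raise ValueError("unsupported broadcast element size")
    return f"{{1to{vector_bits // element_bits}}}"


print(f"vpaddd zmm0, zmm1, [r0]{broadcast_decorator(512, 32)}")
```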

4cb18a06

Jun 09, 2020
- Fix warnings when using nasm 2.15 · 7ba26e7e
  Henrik Gramner authored 4 years ago
  
  7ba26e7e
- Save xmm_regs_used in WIN64_SPILL_XMM on non-Win64 · eb466624
  Victorien Le Couviour--Tuffet authored 5 years ago and Henrik Gramner committed 4 years ago
  This allows for AVX-512 code to issue vzeroupper automatically in
  RET when the number of vector registers used is specified through
  WIN64_SPILL_XMM instead of through cglobal.
  eb466624
- Add support for Ice Lake AVX-512 subset · 26c974d7
  Victorien Le Couviour--Tuffet authored 5 years ago and Henrik Gramner committed 4 years ago
  
  26c974d7
Oct 21, 2019

Fix LOAD_MM_PERMUTATION for AVX-512 · 822745ae

Victorien Le Couviour--Tuffet authored 5 years ago and Henrik Gramner committed 5 years ago

Pre-permuting the registers in INIT_*MM avx512 (AVX512_MM_PERMUTATION)
is redundant. It causes the register mapping to be the same as without
the initial AVX512_MM_PERMUTATION, with the user SWAPs applied.

For example...

    INIT_YMM avx512
    SWAP m0, m16
    SAVE_MM_PERMUTATION
    ; do whatever
    LOAD_MM_PERMUTATION

... would result in m0 mapping to ymm16 instead of ymm0 and m1 to ymm1
instead of ymm17.

822745ae

Oct 19, 2019
- Reword some x264-specific comments · dcac538c
  Henrik Gramner authored 5 years ago
  
  dcac538c
Mar 06, 2019
- Add support for GFNI instructions · 6a73fe1e
  Henrik Gramner authored 5 years ago and Anton Mitrofanov committed 5 years ago
  
  6a73fe1e
- Improve warnings for use of unsupported instructions · 454c7de1
  Henrik Gramner authored 5 years ago and Anton Mitrofanov committed 5 years ago
  Warn when the following are used without the appropriate cpuflag:
  * YMM and ZMM registers
  * 'pextrw' with a memory operand
  * GPR instruction set extensions
  454c7de1
- Support N_PEXT bit on Mach-O · 331f62f7
  Henrik Gramner authored 5 years ago and Anton Mitrofanov committed 5 years ago
  Allows for marking symbols as having limited global scope, similar to
  using 'hidden' symbol visibility on ELF.
  331f62f7
- Make 'non-adjacent' default in the TAIL_CALL macro · 699d171e
  Henrik Gramner authored 5 years ago and Anton Mitrofanov committed 5 years ago
  
  699d171e
- Add x86-32 PIC support macros · f285b0ef
  Henrik Gramner authored 5 years ago and Anton Mitrofanov committed 5 years ago
  
  f285b0ef
- Turn 'movsxd' into 'movifnidn' on x86-32 · 67092e97
  Henrik Gramner authored 5 years ago and Anton Mitrofanov committed 5 years ago
  
  67092e97
- Bump copyright date to 2019 · 2f699b95
  Henrik Gramner authored 5 years ago and Anton Mitrofanov committed 5 years ago
  
  2f699b95
Aug 06, 2018

Improve SAVE/LOAD_MM_PERMUTATION macros · f7d8b77e

Henrik Gramner authored 6 years ago

Use register numbers instead of copying the full register names.
This makes it possible to change register widths in the middle of
a function and keep the mmreg permutations intact, which can be
useful for code that only needs larger vectors for parts of the
function in combination with macros etc.

Also change the LOAD_MM_PERMUTATION macro to use the same default
name as the SAVE macro. This simplifies swapping from ymm to xmm
registers or vice versa:

    SAVE_MM_PERMUTATION
    INIT_XMM <cpuflags>
    LOAD_MM_PERMUTATION
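A dictionary model of the m0..mN mapping shows why storing numbers helps. The `save_permutation`/`load_permutation` functions are hypothetical Python stand-ins; in x86inc the mapping lives in preprocessor defines.

```python
def save_permutation(mapping: dict[int, str]) -> dict[int, int]:
    """Store m<i> -> register number, discarding the xmm/ymm/zmm prefix."""
    return {m: int(reg.lstrip("xyzm")) for m, reg in mapping.items()}


def load_permutation(saved: dict[int, int], prefix: str) -> dict[int, str]:
    """Rebuild register names for the current width, e.g. prefix='xmm'."""
    return {m: f"{prefix}{num}" for m, num in saved.items()}


# After INIT_YMM and a SWAP of m0 and m1:
ymm_mapping = {0: "ymm1", 1: "ymm0", 2: "ymm2"}
saved = save_permutation(ymm_mapping)   # {0: 1, 1: 0, 2: 2}
print(load_permutation(saved, "xmm"))   # {0: 'xmm1', 1: 'xmm0', 2: 'xmm2'}
```

Had the full names been saved instead, reloading after the switch to xmm would have reintroduced ymm registers.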

f7d8b77e

Optimize VEX instruction encoding · 6e54d400

Henrik Gramner authored 6 years ago

Most VEX-encoded instructions require an additional byte to encode
when src2 is a high register (e.g. x|ymm8..15). If the instruction
is commutative we can swap src1 and src2 when doing so reduces the
instruction length, e.g.

    vpaddw xmm0, xmm0, xmm8 -> vpaddw xmm0, xmm8, xmm0
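The reason the swap helps: src1 is encoded in VEX.vvvv, which reaches all 16 registers, while src2 sits in ModRM.rm, whose extension bit VEX.B exists only in the 3-byte prefix. The sketch below assumes register operands only, and `shorten_vex` is a hypothetical name. It also ignores the other requirements of the 2-byte form (opcode map 0F, VEX.W = 0).

```python
import re


def reg_index(reg: str) -> int:
    """Register number of an operand such as 'xmm8' or 'ymm3'."""
    match = re.fullmatch(r"[xy]mm(\d+)", reg)
    if match is None:
        raise ValueError(f"expected an xmm/ymm register, got {reg!r}")
    return int(match.group(1))


def shorten_vex(op: str, dst: str, src1: str, src2: str, commutative: bool) -> str:
    """Swap the sources of a commutative op when that avoids needing VEX.B.

    dst uses ModRM.reg (extended by VEX.R, present in both VEX forms),
    src1 uses VEX.vvvv, and src2 uses ModRM.rm (extended only by VEX.B).
    """
    if commutative and reg_index(src2) >= 8 and reg_index(src1) < 8:
        src1, src2 = src2, src1
    return f"{op} {dst}, {src1}, {src2}"


print(shorten_vex("vpaddw", "xmm0", "xmm0", "xmm8", commutative=True))
```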

6e54d400

Fix VEX -> EVEX instruction conversion · d3200075
Henrik Gramner authored 6 years ago
There's an edge case that wasn't properly handled.
d3200075

Jan 17, 2018
- Correctly set mmreg variables · 16db6298
  Henrik Gramner authored 7 years ago
  
  16db6298
- Bump copyright date to 2018 · eac6e507
  Henrik Gramner authored 7 years ago
  
  eac6e507
Dec 24, 2017
- Support creating global symbols from local labels · 5cc636ef
  Henrik Gramner authored 7 years ago and Anton Mitrofanov committed 7 years ago
  On ELF platforms such symbols need to be flagged as functions with the
  correct visibility to please certain linkers in some scenarios.
  5cc636ef
- Use .rdata instead of .rodata on Windows · e88d8eb5
  Henrik Gramner authored 7 years ago and Anton Mitrofanov committed 7 years ago
  The standard section for read-only data on Windows is .rdata. Nasm will
  flag non-standard sections as executable by default, which isn't ideal.
  e88d8eb5
- Set the correct cpuflag for AES-NI instructions · cd192734
  Henrik Gramner authored 7 years ago and Anton Mitrofanov committed 7 years ago
  
  cd192734
- Enable AVX emulation for floating-point pseudo-instructions · 4cd684f6
  Henrik Gramner authored 7 years ago and Anton Mitrofanov committed 7 years ago
  There are 32 pseudo-instructions for each floating-point comparison
  instruction, but only 8 of them are actually valid in legacy-encoded
  mode. The remaining 24 require the use of VEX-encoded (v-prefixed)
  instructions and can therefore be disregarded for this purpose.
  4cd684f6
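For the eight legacy predicates in the AVX-emulation commit above, emulation amounts to two steps. First, map the pseudo-instruction to its base opcode plus an immediate. Second, insert a move when the destination differs from src1. The helpers below are a hypothetical Python sketch of that lowering, with predicate numbers following the SSE encoding (0 = eq through 7 = ord).

```python
# Predicate immediates 0-7: the only ones legacy cmpps/cmppd/cmpss/cmpsd accept.
LEGACY_PREDICATES = ["eq", "lt", "le", "unord", "neq", "nlt", "nle", "ord"]


def split_cmp_pseudo(name: str) -> tuple[str, int]:
    """Map e.g. 'cmpltps' to ('cmpps', 1); reject VEX-only predicates."""
    for suffix in ("ps", "pd", "ss", "sd"):
        if name.startswith("cmp") and name.endswith(suffix):
            predicate = name[3:-len(suffix)]
            if predicate in LEGACY_PREDICATES:
                return "cmp" + suffix, LEGACY_PREDICATES.index(predicate)
    raise ValueError(f"{name} needs VEX encoding or is not a cmp pseudo-op")


def emulate_cmp_3op(name: str, dst: str, src1: str, src2: str) -> list[str]:
    """Lower a 3-operand pseudo-instruction to legacy two-operand SSE code."""
    op, imm = split_cmp_pseudo(name)
    code = []
    if dst != src1:
        if dst == src2:
            raise ValueError("dst aliases src2")
        code.append(f"movaps {dst}, {src1}")  # whole-register copy
    code.append(f"{op} {dst}, {src2}, {imm}")
    return code


print(emulate_cmp_3op("cmpltps", "xmm0", "xmm1", "xmm2"))
```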
Jun 24, 2017
- Add aesni cpuflag · baa06223
  James Darnley authored 7 years ago and Henrik Gramner committed 7 years ago
  
  baa06223