- 21 Feb, 2022 2 commits
-
-
Henrik Gramner authored
When operating on large blocks of data it's common to repeatedly use an instruction on multiple registers. Using the REPX macro makes it easy to quickly write dense code to achieve this without having to explicitly duplicate the same instruction over and over. For example,

    REPX {paddw x, m4}, m0, m1, m2, m3
    REPX {mova [r0+16*x], m5}, 0, 1, 2, 3

will expand to

    paddw m0, m4
    paddw m1, m4
    paddw m2, m4
    paddw m3, m4
    mova [r0+16*0], m5
    mova [r0+16*1], m5
    mova [r0+16*2], m5
    mova [r0+16*3], m5
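The expansion described above can be modeled outside the assembler. The following is a minimal Python sketch of the substitution only; the function name `repx` is hypothetical, and the real macro is implemented with the NASM preprocessor rather than with regular expressions.

```python
import re

def repx(template, *args):
    """Return one copy of `template` per argument, with the token `x` replaced."""
    lines = []
    for arg in args:
        # \b limits the substitution to the standalone token `x`, so an `x`
        # inside a longer name such as `xmm0` is left alone.
        lines.append(re.sub(r"\bx\b", lambda _m, a=arg: str(a), template))
    return lines

if __name__ == "__main__":
    for line in repx("paddw x, m4", "m0", "m1", "m2", "m3"):
        print(line)
    for line in repx("mova [r0+16*x], m5", 0, 1, 2, 3):
        print(line)
```

Running the script prints the same eight instructions listed in the commit message.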
-
Henrik Gramner authored
Correctly handle emulation of 4-operand instructions (e.g. 'shufps') where src1 is a memory operand.
-
- 19 Feb, 2022 1 commit
-
-
Henrik Gramner authored
With legacy encoding the last operand (the index) must be xmm0, but aside from that emulating non-destructive forms works the same as any other instruction.
-
- 24 Jan, 2022 1 commit
-
-
Anton Mitrofanov authored
-
- 25 Aug, 2021 1 commit
-
-
Henrik Gramner authored
-
- 14 Jun, 2021 1 commit
-
-
Henrik Gramner authored
Particularly in code that makes heavy use of macros it's possible to end up with 3-operand instructions with a memory operand in src1. In the case of SSE this works fine due to automatic move insertions, but in AVX that fails since memory operands are only allowed in src2. The main purpose of this feature is to minimize the amount of code changes required to facilitate conversion of existing SSE code to AVX.
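One way such an emulation could work is sketched below in Python. This is an assumption about the general strategy, not a copy of the logic in x86inc: the function name `emit_avx_3op`, the `commutative` parameter, and the use of `mova` as the load are all hypothetical. Output lines use x86inc-style mnemonics.

```python
def is_memory(operand):
    """Treat any bracketed operand as a memory reference."""
    return operand.startswith("[")

def emit_avx_3op(op, dst, src1, src2, commutative=False, mov="mova"):
    """Return the instruction(s) for `op dst, src1, src2` in AVX mode."""
    if not is_memory(src1):
        # Nothing to emulate: a memory operand, if any, is already in src2.
        return [f"{op} {dst}, {src1}, {src2}"]
    if is_memory(src2):
        raise ValueError("only one memory operand is allowed")
    if commutative:
        # Swapping the sources moves the memory operand into src2.
        return [f"{op} {dst}, {src2}, {src1}"]
    if dst == src2:
        raise ValueError("loading src1 into dst would clobber src2")
    # Non-commutative: load src1 into dst first, then operate in place.
    return [f"{mov} {dst}, {src1}", f"{op} {dst}, {dst}, {src2}"]

if __name__ == "__main__":
    print(emit_avx_3op("paddw", "m0", "[r0]", "m1", commutative=True))
    print(emit_avx_3op("psubw", "m0", "[r0]", "m1"))
```

The commutative case costs nothing extra, while the non-commutative case needs an explicit load, which mirrors the move insertion that SSE emulation already performs.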
-
- 11 Feb, 2021 1 commit
-
-
Large stack allocations on Windows need to use stack probing in order to guarantee that all stack memory is committed before accessing it. This is done by ensuring that the guard page(s) at the end of the currently committed pages are touched prior to any pages beyond that.
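The reason the order of accesses matters can be shown with a toy model of a guarded stack. The Python sketch below is a simplification under stated assumptions: the class `GuardedStack`, the page indexing, and the use of `MemoryError` are hypothetical, and real Windows stack growth involves details that are not modeled here.

```python
class GuardedStack:
    """Pages are numbered by distance from the top of the stack.

    Pages 0 .. guard_page-1 are committed; guard_page is the guard page.
    """

    def __init__(self, committed_pages):
        self.guard_page = committed_pages

    def touch(self, page):
        if page > self.guard_page:
            # Skipping past the guard page hits uncommitted memory.
            raise MemoryError(f"page {page} is beyond the guard page")
        if page == self.guard_page:
            # Touching the guard page commits it and moves the guard down.
            self.guard_page += 1

def probe_then_access(stack, deepest_page):
    """Touch every page in order so the guard page is never skipped."""
    for page in range(stack.guard_page, deepest_page + 1):
        stack.touch(page)

if __name__ == "__main__":
    stack = GuardedStack(committed_pages=2)
    probe_then_access(stack, deepest_page=5)
    print("guard page is now", stack.guard_page)
```

Accessing page 5 directly on a fresh `GuardedStack(2)` raises in this model, which corresponds to the failure that probing is meant to prevent.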
-
- 27 Jan, 2021 1 commit
-
-
Anton Mitrofanov authored
-
- 24 Jan, 2021 1 commit
-
-
Anton Mitrofanov authored
-
- 25 Oct, 2020 1 commit
-
-
- 02 Jul, 2020 1 commit
-
-
Henrik Gramner authored
-
- 10 Jun, 2020 1 commit
-
-
- 29 Feb, 2020 1 commit
-
-
Anton Mitrofanov authored
-
- 06 Mar, 2019 7 commits
-
-
-
Warn when the following are used without the appropriate cpuflag:
* YMM and ZMM registers
* 'pextrw' with a memory operand
* GPR instruction set extensions
-
Allows for marking symbols as having limited global scope, similar to using 'hidden' symbol visibility on ELF.
-
-
-
-
-
- 06 Aug, 2018 3 commits
-
-
Henrik Gramner authored
Use register numbers instead of copying the full register names. This makes it possible to change register widths in the middle of a function and keep the mmreg permutations intact which can be useful for code that only needs larger vectors for parts of the function in combination with macros etc.

Also change the LOAD_MM_PERMUTATION macro to use the same default name as the SAVE macro. This simplifies swapping from ymm to xmm registers or vice versa:

    SAVE_MM_PERMUTATION
    INIT_XMM <cpuflags>
    LOAD_MM_PERMUTATION
-
Henrik Gramner authored
Most VEX-encoded instructions require an additional byte to encode when src2 is a high register (e.g. x|ymm8..15). If the instruction is commutative we can swap src1 and src2 when doing so reduces the instruction length, e.g.

    vpaddw xmm0, xmm0, xmm8 -> vpaddw xmm0, xmm8, xmm0
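The swap rule follows from how the VEX prefix is laid out: the 2-byte form has no B bit, so a high register in src2 (encoded in ModRM.rm) forces the 3-byte form, whereas src1 is encoded in the 4-bit vvvv field and can hold any of the 16 registers. Below is a minimal Python sketch of the decision for register-only operands; the function name `order_sources` and the use of plain register numbers are assumptions made for illustration.

```python
def order_sources(src1, src2, commutative):
    """Return (src1, src2) register numbers in the order that encodes shortest.

    Only the register-register case is modeled; memory operands and
    instructions that need VEX.W or a non-0F opcode map are out of scope.
    """
    if commutative and src2 >= 8 and src1 < 8:
        # A high register in src2 needs VEX.B, which only the 3-byte prefix
        # has. In src1 it costs nothing, so swap the two sources.
        return src2, src1
    # Either swapping is not allowed, or it would not shorten the encoding.
    return src1, src2

if __name__ == "__main__":
    s1, s2 = order_sources(0, 8, commutative=True)
    print(f"vpaddw xmm0, xmm{s1}, xmm{s2}")
```

When both sources are high registers the swap gains nothing, so the sketch leaves the order unchanged.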
-
Henrik Gramner authored
There's an edge case that wasn't properly handled.
-
- 17 Jan, 2018 2 commits
-
-
Henrik Gramner authored
-
Henrik Gramner authored
-
- 24 Dec, 2017 4 commits
-
-
On ELF platforms such symbols need to be flagged as functions with the correct visibility to please certain linkers in some scenarios.
-
The standard section for read-only data on Windows is .rdata. Nasm flags non-standard sections as executable by default, which isn't ideal.
-
-
There are 32 pseudo-instructions for each floating-point comparison instruction, but only 8 of them are actually valid in legacy-encoded mode. The remaining 24 require the use of VEX-encoded (v-prefixed) instructions and can therefore be disregarded for this purpose.
-
- 24 Jun, 2017 1 commit
-
-
James Darnley authored
Upstreaming this from FFmpeg. Unused in x264.
-
- 21 May, 2017 4 commits
-
-
Henrik Gramner authored
AVX-512 consists of a plethora of different extensions, but in order to keep things a bit more manageable we group together the following extensions under a single baseline cpu flag which should cover SKL-X and future CPUs:
* AVX-512 Foundation (F)
* AVX-512 Conflict Detection Instructions (CD)
* AVX-512 Byte and Word Instructions (BW)
* AVX-512 Doubleword and Quadword Instructions (DQ)
* AVX-512 Vector Length Extensions (VL)

On x86-64 AVX-512 provides 16 additional vector registers. Prefer using those over existing ones since it allows us to avoid using `vzeroupper` unless more than 16 vector registers are required. They also happen to be volatile on Windows which means that we don't need to save and restore existing xmm register contents unless more than 22 vector registers are required.

Also take the opportunity to drop X264_CPU_CMOV and X264_CPU_SLOW_CTZ while we're breaking API by messing with the cpu flags since they weren't really used for anything.

Big thanks to Intel for their support.
-
Henrik Gramner authored
This is required to support AVX-512. Drop `-Worphan-labels` from ASFLAGS since it's enabled by default in nasm. Also change alignmode from `k8` to `p6` since it's more similar to `amdnop` in yasm, e.g. use long nops without excessive prefixes.
-
Henrik Gramner authored
Simplifies writing assembly code that depends on available instructions.

    LZCNT implies SSE2
    BMI1 implies AVX+LZCNT
    AVX2 implies BMI2

Skip printing LZCNT under CPU capabilities when BMI1 or BMI2 is available, and don't print FMA4 when FMA3 is available.
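The implication rules and the printing rule can be modeled as a small closure over a table. The Python sketch below encodes only the three implications and two printing exceptions named in the commit message; the table `IMPLIES`, the lowercase flag names, and both function names are hypothetical and do not reflect how the flags are actually represented in the code base.

```python
# Only the implications stated in the commit message are listed here.
IMPLIES = {
    "lzcnt": {"sse2"},
    "bmi1": {"avx", "lzcnt"},
    "avx2": {"bmi2"},
}

def expand(flags):
    """Return `flags` plus every flag they imply, directly or transitively."""
    result = set(flags)
    pending = list(flags)
    while pending:
        flag = pending.pop()
        for implied in IMPLIES.get(flag, ()):
            if implied not in result:
                result.add(implied)
                pending.append(implied)
    return result

def printable(flags):
    """Drop flags that are redundant when listing CPU capabilities."""
    flags = set(flags)
    hidden = set()
    if flags & {"bmi1", "bmi2"}:
        hidden.add("lzcnt")
    if "fma3" in flags:
        hidden.add("fma4")
    return flags - hidden

if __name__ == "__main__":
    print(sorted(expand({"bmi1"})))
    print(sorted(printable({"sse2", "lzcnt", "bmi1"})))
```

With this table, requesting `bmi1` alone also yields `avx`, `lzcnt`, and (through `lzcnt`) `sse2`.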
-
Anton Mitrofanov authored
The use of rsp was pretty much hardcoded there and probably didn't work otherwise with stack_size > 0.
-
- 19 May, 2017 3 commits
-
-
Henrik Gramner authored
Due to a peculiarity in the ModR/M addressing encoding, the r12 and r13 registers sometimes require an additional byte when used as a base register. r14 and r15 don't have that issue, so prefer using them.
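The peculiarity comes from the low three bits of the base register number: the value 100b (rsp, r12) selects a SIB byte, and the value 101b with no displacement (rbp, r13) selects a different addressing form, so a zero displacement byte has to be emitted instead. A minimal Python sketch of the resulting overhead follows; the function name `extra_bytes` is hypothetical and index registers are not modeled.

```python
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]
# ModR/M only sees the low three bits of the register number.
LOW3 = {name: number & 7 for number, name in enumerate(REGS)}

def extra_bytes(base, disp=0):
    """Bytes needed for [base+disp] beyond what e.g. [rax+disp] would need."""
    extra = 0
    if LOW3[base] == 4:
        # rm=100b means "SIB byte follows", so rsp/r12 always need one.
        extra += 1
    if LOW3[base] == 5 and disp == 0:
        # mod=00b with rm=101b is not [base], so [rbp]/[r13] must be
        # encoded with an explicit zero displacement.
        extra += 1
    return extra

if __name__ == "__main__":
    for reg in ("r12", "r13", "r14", "r15"):
        print(reg, extra_bytes(reg))
```

In this model r13 only pays the extra byte when there is no displacement, which matches the "sometimes" in the commit message.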
-
Henrik Gramner authored
There's no point in emitting a rep prefix before ret on modern CPUs.
-
Henrik Gramner authored
We overload the `call` instruction with a macro, but it would misbehave when the macro argument wasn't a valid identifier. Fix it by explicitly checking if the argument is an identifier.
-
- 21 Jan, 2017 1 commit
-
-
Henrik Gramner authored
-
- 01 Dec, 2016 1 commit
-
-
Henrik Gramner authored
When allocating stack space with an alignment requirement that is larger than the current stack alignment we need to store a copy of the original stack pointer in order to be able to restore it later. If we choose to use another register for this purpose we should not pick eax/rax since it can be overwritten as a return value.
-
- 12 Apr, 2016 1 commit
-
-
Anton Mitrofanov authored
Allows emulation to work when dst is equal to src2 as long as the instruction is commutative, e.g. `addps m0, m1, m0`.
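The case being handled can be illustrated with a small model of how a 3-operand form might be lowered to 2-operand SSE instructions. This Python sketch is an assumption about the general approach rather than the actual macro logic; the function name `emulate_3op_sse` and the default `mova` move are hypothetical.

```python
def emulate_3op_sse(op, dst, src1, src2, commutative=False, mov="mova"):
    """Lower `op dst, src1, src2` to destructive 2-operand instructions."""
    if dst == src1:
        # Already in 2-operand shape.
        return [f"{op} {dst}, {src2}"]
    if dst == src2:
        if not commutative:
            # Copying src1 into dst would destroy src2 before it is read.
            raise ValueError("dst equals src2 and the operation is not commutative")
        # a op b == b op a, so operate on dst with src1 as the other source.
        return [f"{op} {dst}, {src1}"]
    # General case: copy src1 into dst, then apply the operation.
    return [f"{mov} {dst}, {src1}", f"{op} {dst}, {src2}"]

if __name__ == "__main__":
    print(emulate_3op_sse("addps", "m0", "m1", "m0", commutative=True))
    print(emulate_3op_sse("addps", "m0", "m1", "m2", commutative=True))
```

For `addps m0, m1, m0` the sketch emits a single `addps m0, m1`, with no move required.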
-