- Jun 10, 2020
Anton Mitrofanov authored
checkasm10 with seed=511142008 failed on win32 gcc builds.
- Apr 09, 2020
Anton Mitrofanov authored
Anton Mitrofanov authored
- Feb 29, 2020
Anton Mitrofanov authored
- Nov 05, 2019
Anton Mitrofanov authored
- Jul 17, 2019
Simplifies a lot of code and avoids having to export public asm functions. Note that the force_align_arg_pointer function attribute is broken in clang versions prior to 6.0.1, which may result in crashes, so make sure to use either a newer clang version or a different compiler.
Anton Mitrofanov authored
- Mar 06, 2019
Allows for automatic command line completion for both options and values. Options such as --input-csp and --input-fmt will dynamically retrieve supported values from libavformat when compiled with lavf support. Execute 'source tools/bash-autocomplete.sh' in bash to enable.
- Mar 03, 2019
Henrik Gramner authored
- Aug 06, 2018
- Jun 02, 2018
Henrik Gramner authored
Clang emits aligned AVX stores for things like zeroing stack-allocated variables when using -mavx, even with -fno-tree-vectorize set, which can result in crashes if this occurs before we've realigned the stack. Previously we only ensured that the stack was realigned before calling assembly functions that access stack-allocated buffers, but this is not sufficient. Fix the issue by changing the stack realignment to instead occur immediately in all CLI, API and thread entry points.
- Jan 18, 2018
- Jan 17, 2018
Henrik Gramner authored
- Dec 24, 2017
This version supports converting aarch64 assembly for MS armasm64.exe.
Takes advantage of opmasks to avoid having to use scalar code for the tail. Also make some slight improvements to the checkasm test.
Add 'i_bitdepth' to x264_param_t with the corresponding '--output-depth' CLI option to set the bit depth at runtime.

Drop the 'x264_bit_depth' global variable. Rather than hardcoding it to an incorrect value, it's preferable to induce a linking failure: if an application relies on this symbol, this will make it more obvious where the problem is.

Add Makefile rules that compile modules with different bit depths. Assembly on x86 is prefixed with the 'private_prefix' define, while all other archs modify their function prefix internally.

Templatize the main C library, x86/x86_64 assembly, ARM assembly, AARCH64 assembly, PowerPC assembly, and MIPS assembly.

The depth and cache CLI filters heavily depend on the bit depth, so they need to be duplicated for each value. This means having to rename these filters and adjust the callers to use the right version.

Unfortunately the threaded input CLI module inherits a common.h dependency (input/frame -> common/threadpool -> common/frame -> common/common) which is extremely complicated to address in a sensible way. Instead, duplicate the module and select the appropriate one at run time.

Each bit depth needs different checkasm compilation rules, so split the main checkasm target into two executables.
- Jun 24, 2017
Henrik Gramner authored
Henrik Gramner authored
Uses gathers and scatters in combination with conflict detection to vectorize the scalar part. Also improve the checkasm test to try different mb_y values and check for out-of-bounds writes.
- Jun 14, 2017
These levels were added in the 2016-10 revision of the H.264 specification and improve support for content with high resolutions and/or high frame rates. Level 6.2 supports 8K resolution at 120 fps. Also shrink the x264_levels array by using smaller data types.
- May 23, 2017
Prior to this, the loop never ran at all; the condition had been the same since it was introduced in 5b0cb86f. This issue was pointed out by a clang warning.
- May 21, 2017
Henrik Gramner authored
Covers all variants: 4x4, 4x8, 4x16, 8x4, 8x8, 8x16, 16x8, and 16x16.
Henrik Gramner authored
The functions are only ever called with pointers to fenc and fdec and the strides are always constant so there's no point in having them as parameters. Cover both the U and V planes in a single function call. This is more efficient with SIMD, especially with the wider vectors provided by AVX2 and AVX-512, even when accounting for losing the possibility of early termination. Drop the MMX and XOP implementations, update the rest of the x86 assembly to match the new behavior. Also enable high bit-depth in the AVX2 version. Comment out the ARM, AARCH64, and MIPS MSA assembly for now.
Henrik Gramner authored
Also drop the MMX version and make some slight improvements to the SSE2, SSSE3, AVX, and AVX2 versions.
Henrik Gramner authored
Henrik Gramner authored
Reorder some elements in the x264_t.mb.pic struct to reduce the amount of padding required. Also drop the MMX implementation in favor of SSE.
Henrik Gramner authored
Reorder some elements in the x264_mb_analysis_list_t struct to reduce the amount of padding required. Also drop the MMX implementation in favor of SSE.
Henrik Gramner authored
Henrik Gramner authored
Also make the AVX and AVX2 implementations slightly faster.
Henrik Gramner authored
Henrik Gramner authored
The vperm* instructions ignore unused index bits, so we can pack the permutation indices together to save cache and just use a shift to get the right values.
Henrik Gramner authored
Henrik Gramner authored
YMM and ZMM registers on x86 are turned off to save power when they haven't been used for some period of time. When they are used again there is a "warmup" period during which performance is reduced and inconsistent, which is problematic when trying to benchmark individual functions. Periodically issue "dummy" instructions that use those registers to prevent them from being powered down. The end result is more consistent benchmark results.
Henrik Gramner authored
AVX-512 consists of a plethora of different extensions, but in order to keep things a bit more manageable we group together the following extensions under a single baseline cpu flag which should cover SKL-X and future CPUs:

* AVX-512 Foundation (F)
* AVX-512 Conflict Detection Instructions (CD)
* AVX-512 Byte and Word Instructions (BW)
* AVX-512 Doubleword and Quadword Instructions (DQ)
* AVX-512 Vector Length Extensions (VL)

On x86-64, AVX-512 provides 16 additional vector registers; prefer using those over existing ones, since it allows us to avoid using `vzeroupper` unless more than 16 vector registers are required. They also happen to be volatile on Windows, which means that we don't need to save and restore existing xmm register contents unless more than 22 vector registers are required.

Also take the opportunity to drop X264_CPU_CMOV and X264_CPU_SLOW_CTZ while we're breaking API by messing with the cpu flags, since they weren't really used for anything.

Big thanks to Intel for their support.
Henrik Gramner authored
Simplifies writing assembly code that depends on available instructions.

* LZCNT implies SSE2
* BMI1 implies AVX+LZCNT
* AVX2 implies BMI2

Skip printing LZCNT under CPU capabilities when BMI1 or BMI2 is available, and don't print FMA4 when FMA3 is available.
Henrik Gramner authored
Packed YUV is arguably more common than planar YUV when dealing with raw 4:2:2 content. We can utilize the existing plane_copy_deinterleave() functions with some additional minor constraints (we cannot assume any particular alignment or overread the input buffer). Enables assembly optimizations on x86.
Set up the right gas-preprocessor as the assembler frontend in these cases, using armasm as the actual assembler. Don't try to add the -mcpu/-mfpu options in this case. Check whether the compiler actually supports inline assembly, and check for the ARMv7 features in a different way for the MSVC compiler.