AVX-512, Intel SIMD Instructions for AI and Multimedia

AVX instructions were first implemented in Intel processors as the successor to the older SSE instructions. Since then, their 128-bit and 256-bit variants have become the standard SIMD instructions for x86 processors and have also been adopted by AMD. The AVX-512 instructions are a different story: at the time of writing, they are only found in Intel processors.

What is a SIMD unit?

A SIMD unit is a type of execution unit designed to execute the same instruction on multiple data at the same time. Consequently, its registers are wider than those of a traditional scalar unit, because each one must pack together the different data elements that the instruction operates on.

SIMD units have traditionally been used to speed up so-called multimedia workloads, in which the same instructions must be applied to many data elements. They make it possible to parallelize those parts of a program and reduce its execution time.

In each processor, SIMD instructions are kept apart from the traditional ones in their own instruction subset, which normally mirrors the scalar instructions, that is, those that work on a single operand at a time. There are, however, some operations that have no scalar equivalent and are exclusive to SIMD units.
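
To make the difference concrete, here is a minimal Python sketch that emulates both styles. It is only an illustration: the 4-lane width and the function names (simd_add, add_scalar, add_simd) are assumptions made for this example, and real hardware performs each SIMD addition as one instruction instead of a loop.

```python
# Emulation only: Python has no SIMD registers, so each "SIMD instruction"
# is modelled as a function that touches every lane in one call.
LANES = 4  # hypothetical register width, in elements


def simd_add(va, vb):
    """Model of one SIMD add: the same operation applied to every lane."""
    if len(va) != LANES or len(vb) != LANES:
        raise ValueError("each operand must fill exactly one register")
    return [x + y for x, y in zip(va, vb)]


def add_scalar(a, b):
    """Add two arrays one element per instruction; returns (result, instructions)."""
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def add_simd(a, b):
    """Add two arrays LANES elements per instruction; returns (result, instructions).

    Assumes len(a) is a multiple of LANES to keep the sketch short.
    """
    out, issued = [], 0
    for i in range(0, len(a), LANES):
        out.extend(simd_add(a[i:i + LANES], b[i:i + LANES]))
        issued += 1
    return out, issued


a = list(range(8))
b = [10] * 8
print(add_scalar(a, b))
print(add_simd(a, b))
```

Both functions produce the same sums, but the SIMD version issues one instruction per group of four elements, which is where the speedup in data-parallel code comes from.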

The history of the AVX-512

AVX instructions, short for Advanced Vector eXtensions, have been in Intel processors for years, but the origin of the AVX-512 instructions is different from the rest. The reason? They come from Intel’s Larrabee project, an attempt by Intel in the late 2000s to create a GPU, which eventually became the Xeon Phi accelerators, a series of processors for high performance computing that Intel released a few years ago.

The Xeon Phi / Larrabee architecture included a special set of vector instructions, the forerunner of AVX-512, with 512-bit registers, which means they can work with up to 16 32-bit values at once. The reason for this width is that the typical ratio of operations per texel on a GPU is usually 16:1. Let’s not forget that the AVX-512 instructions come from the failed Larrabee project and were carried over from there to the Xeon Phi.
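
The figure of 16 values follows directly from dividing the register width by the element width. The short Python sketch below works that out for the three x86 vector register sizes; the helper name lanes is an assumption made for this example.

```python
def lanes(register_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# XMM, YMM and ZMM are the 128-, 256- and 512-bit x86 vector registers.
for name, bits in (("XMM", 128), ("YMM", 256), ("ZMM", 512)):
    per_width = {element: lanes(bits, element) for element in (8, 16, 32, 64)}
    print(name, per_width)
```

For a 512-bit register and 32-bit elements this gives 512 / 32 = 16 lanes, the number mentioned above.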

As of today, the Xeon Phi no longer exists, because the same work can be done with a traditional GPU used for computing. This led Intel to move these instructions to its main line of processors.

The jumble that is the AVX-512 instruction set

The AVX-512 instructions are not a single homogeneous block that is implemented in full, but a base set plus various extensions that have been added or left out depending on the type of processor. Every AVX-512 processor supports the base set, known as AVX512F (Foundation), but there are additional instructions outside that original set that Intel has added over time.

The AVX512 extensions are as follows:

AVX-512-CD: Conflict detection, which allows loops with potential memory address conflicts to be vectorized. It first appeared in the Knights Landing Xeon Phi and reached the main processor line with Skylake-X and Skylake-SP.
AVX-512-ER: Exponential and reciprocal instructions, designed for the implementation of transcendental operations. They were added in the Xeon Phi generation called Knights Landing.
AVX-512-PF: Another addition in Knights Landing, this time new prefetch instructions.
AVX-512-BW: Byte-level (8-bit) and word-level (16-bit) instructions. This extension allows you to work with 8-bit and 16-bit data.
AVX-512-DQ: Adds new instructions for 32-bit (doubleword) and 64-bit (quadword) data.
AVX-512-VL: Vector length, which allows AVX-512 instructions to operate on the XMM (128-bit) and YMM (256-bit) registers.
AVX-512-IFMA: Integer Fused Multiply Add, which is colloquially an A * B + C instruction, with 52-bit integer precision.
AVX-512-VBMI: Byte-level vector manipulation instructions, an extension of AVX-512-BW.
AVX-512-VNNI: The Vector Neural Network Instructions, a series of instructions added to speed up deep learning algorithms, used in applications related to artificial intelligence.
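
Because support varies from one processor to another, software has to check which extensions are present before using them. The following Python sketch shows one way to do that on Linux, assuming the flag names that the kernel prints in /proc/cpuinfo. The function names and the AVX512_FLAGS table are made up for this example, and on other operating systems the lookup would have to be done differently.

```python
# Flag names as they appear in the "flags" line of /proc/cpuinfo on Linux.
AVX512_FLAGS = {
    "AVX-512-F": "avx512f",
    "AVX-512-CD": "avx512cd",
    "AVX-512-ER": "avx512er",
    "AVX-512-PF": "avx512pf",
    "AVX-512-BW": "avx512bw",
    "AVX-512-DQ": "avx512dq",
    "AVX-512-VL": "avx512vl",
    "AVX-512-IFMA": "avx512ifma",
    "AVX-512-VBMI": "avx512vbmi",
    "AVX-512-VNNI": "avx512_vnni",
}


def avx512_support(flags_line):
    """Map each extension name to True or False for a space-separated flag string."""
    present = set(flags_line.split())
    return {name: flag in present for name, flag in AVX512_FLAGS.items()}


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the flags of the first CPU, or "" if they cannot be read."""
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith("flags"):
                    return line.split(":", 1)[1]
    except OSError:
        pass  # not Linux, or the file is not readable
    return ""


for name, supported in avx512_support(read_cpu_flags()).items():
    print(f"{name}: {'yes' if supported else 'no'}")
```

Splitting the line into a set of whole words avoids false matches between flags that share a prefix, such as avx512vbmi and the longer names of later extensions.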

Why hasn’t AMD implemented it on their processors yet?

The reason is very simple: AMD is committed to using its CPU and GPU together when accelerating certain types of applications. Let’s not forget that AVX-512 originated in a failed Intel GPU, and AMD, thanks to its Radeon GPUs, does not need the AVX-512 instructions.

This is why the AVX-512 instructions are exclusive to Intel processors: not because of any enforced exclusivity, but because AMD has no interest in using this type of instruction in its processors, since its intention is to sell its GPUs, especially the all-new AMD Instinct for high performance computing with the CDNA architecture.

Do AVX-512 instructions have a future?

Well, we don’t know. It depends on the success of the Intel Xe, in particular the Xe-HPC, which will provide Intel with a GPU architecture at the level of AMD and NVIDIA. That would put the Intel Xe and the AVX-512 instructions in conflict, since both aim to solve the same problems.

The problem with AVX-512 is that turning on the part of the processor that executes it ends up affecting the processor clock speed, reducing it by about 25% in a program that uses these instructions at specific times. Moreover, its instructions are intended for high performance computing and AI applications, which matter little for a home processor, and the appearance of specialized units makes it a waste of transistors and space.
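
A rough, Amdahl-style calculation shows why a clock drop of this size matters. The Python sketch below is a simplified model with assumed inputs, not a measurement: it supposes that the whole program runs at the reduced clock, and that the vectorized part runs vector_gain times faster than before (2x is a hypothetical figure chosen for illustration). The function name net_speedup is also made up for this example.

```python
def net_speedup(vector_fraction, vector_gain, clock_drop=0.25):
    """Overall speedup from using wider vectors once the clock penalty is included.

    vector_fraction: share of the original runtime that can use the wider vectors
    vector_gain:     assumed speedup of that share at equal clock speed
    clock_drop:      relative clock reduction, applied here to the whole program
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_gain <= 0 or not 0.0 <= clock_drop < 1.0:
        raise ValueError("vector_gain must be positive and clock_drop in [0, 1)")
    # Runtime at the original clock, with the vector part accelerated.
    work = (1.0 - vector_fraction) + vector_fraction / vector_gain
    # A slower clock stretches the whole runtime by 1 / (1 - clock_drop).
    new_time = work / (1.0 - clock_drop)
    return 1.0 / new_time


for fraction in (0.2, 0.5, 0.9):
    print(f"{fraction:.0%} vectorizable: {net_speedup(fraction, 2.0):.2f}x")
```

Under these assumptions the program only breaks even when half of its runtime can use the wider vectors. Below that point it ends up slower than before, which is the situation described above for programs that use these instructions only occasionally.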

In reality, domain-specific accelerators or processors are slowly replacing SIMD units in processors, as they can do the same work while taking up less space and consuming far less power.