VLIW processors, CPU architecture and functionality

VLIW processors have a number of advantages and disadvantages compared with other designs. They have been used not only as CPUs, but also as shader units in GPUs and in DSPs.

Today, VLIW designs seem to have disappeared from PC hardware. Although they are obsolete there, they remain a valid option when designing new processors for other areas of the hardware market.

How does a VLIW processor work?

In a conventional superscalar (ILP) processor, instructions are fetched and processed individually during each instruction cycle, whether execution is in order or out of order. A VLIW processor instead groups several instructions into a single long instruction and sends them together to the different execution units available in the processor.

To achieve this, VLIW processors rely heavily on the compiler that generates the binary code. The compiler groups the different instructions into a single instruction, always taking into account how busy each execution unit is at each point of the execution, which in turn depends on the number of clock cycles each instruction requires.

Since instructions can take different numbers of clock cycles, this creates a performance problem: for several clock cycles some execution units have nothing to do and execute a NOP instruction, meaning that the unit performs no operation during that cycle. This is why VLIW processors depend so heavily on the compiler for maximum efficiency.

Advantages and Disadvantages of a VLIW Design

The main advantages it brings are as follows:

The hardware in charge of decoding instructions is much simpler than in an ILP or TLP CPU. This leaves more space on the chip for execution units, so more instructions can be executed at the same time.
Having more space also allows a larger number of registers, which helps with the speculative execution typical of out-of-order processors without the need for a reorder buffer.

As for the drawbacks, the first is that a much more complex compiler is required. The second is the one mentioned earlier: execution units are wasted to a greater degree, since many of them spend a good part of the time idle.

To better understand this, imagine that a VLIW instruction groups three instructions: the first requires 4 cycles, the second 7 cycles and the third 10 cycles. The execution unit responsible for the first instruction will spend 6 clock cycles doing nothing and the second will spend 3, all because the third needs 10 cycles to complete.
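The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the idle-cycle calculation, using the three latencies from the paragraph above; the variable names are illustrative.

```python
# Latencies, in clock cycles, of the three grouped instructions (from the text).
latencies = [4, 7, 10]

# The whole VLIW instruction finishes only when its slowest member does.
bundle_cycles = max(latencies)

# Cycles each execution unit spends idle while waiting for the slowest one.
idle_cycles = [bundle_cycles - cycles for cycles in latencies]

print(bundle_cycles)  # 10
print(idle_cycles)    # [6, 3, 0]
```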

On top of this, although the binary encoding of an instruction does not change, the number of cycles an existing instruction takes may increase or decrease during the development of a new CPU. This makes a different compiler necessary even for new iterations of the same processor, which makes it difficult to release more advanced versions and in many cases requires a binary-to-binary compiler that rearranges the instructions for the new processor.

Generation of instructions by the compiler

To make this easier to understand, we have prepared two lists: the first runs on a superscalar (ILP) processor, the second on a VLIW processor.

Starting from an ILP type processor, a list of its instructions would be as follows:

Load A1
Load B1
Load A2
Load B2
Multiply A1 by B1
Add A2 and B2
Add A1 and A2
Load A3
Load B3
Multiply A3 by B3
Add B1 and B2

On the other hand, a VLIW processor will group several instructions into one:

Load A1 and B1 simultaneously.
Load A2 and B2, multiply A1 and B1, add A2 and B2.
Load A3, B3, multiply A3 by B3 and add B1 and B2.
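As a rough illustration of what the compiler does when it forms such groups, here is a minimal sketch of a greedy packer. Everything in it is an assumption made for illustration: the names `PROGRAM`, `pack` and `NOP`, the width of four slots (the size of the largest group above), a latency of one cycle per operation, and the rule that an operation may not share an instruction with an operation whose result it needs. It does not model unit types or real latencies.

```python
NOP = "nop"

# (operation, operations whose results it needs). The dependency lists are
# an assumption inferred from the operand names in the instruction list.
PROGRAM = [
    ("load A1", []),
    ("load B1", []),
    ("load A2", []),
    ("load B2", []),
    ("mul A1,B1", ["load A1", "load B1"]),
    ("add A2,B2", ["load A2", "load B2"]),
    ("add A1,A2", ["load A1", "load A2"]),
    ("load A3", []),
    ("load B3", []),
    ("mul A3,B3", ["load A3", "load B3"]),
    ("add B1,B2", ["load B1", "load B2"]),
]


def pack(program, width):
    """Place each operation in the earliest bundle that has a free slot and
    comes after every bundle holding an operation it depends on."""
    placed = {}   # operation -> index of the bundle that holds it
    bundles = []
    for op, deps in program:
        index = max((placed[d] + 1 for d in deps), default=0)
        # Skip bundles that are already full.
        while index < len(bundles) and len(bundles[index]) >= width:
            index += 1
        while len(bundles) <= index:
            bundles.append([])
        bundles[index].append(op)
        placed[op] = index
    # Fixed-size instructions: fill the unused slots with NOPs.
    return [b + [NOP] * (width - len(b)) for b in bundles]


bundles = pack(PROGRAM, width=4)
for number, ops in enumerate(bundles, start=1):
    print(number, ops)
```

Under these assumptions the sketch produces four instructions with five NOP slots. The count differs from the simplified grouping above, which places some operations in the same instruction as the loads that feed them.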

Having grouped the 11 instructions into just 3 very long instructions, the time required by each VLIW instruction is determined by the slowest instruction in its group.

Memory access for this type of processors

As we saw earlier, VLIW processors depend on the compiler, which often adds NOP instructions to the code during compilation. The reason is that building a VLIW processor with variable-size instructions is extremely complex, so a fixed instruction size in bits is chosen instead, and the processor retrieves that amount of data from memory each time it fetches an instruction.

This means that VLIW processors require much wider data buses than conventional CPUs, because they fetch a large number of bits each time they capture new instructions to execute. This is their big Achilles heel: the ILP processors common in PCs use narrower data widths and therefore simpler memory controllers.
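To put rough numbers on this, the following sketch computes the fetch width of a fixed-size VLIW instruction and the share of fetched bits taken up by NOPs. All figures are assumptions chosen for illustration (32-bit operation slots, four slots per instruction, and a hypothetical run of four instructions holding 4, 4, 2 and 1 useful operations); they do not describe any real processor.

```python
# Illustrative numbers only; they do not describe any real processor.
OP_BITS = 32   # assumed size of one operation slot, in bits
SLOTS = 4      # assumed number of slots per VLIW instruction

# Bits that must be fetched for every VLIW instruction.
word_bits = OP_BITS * SLOTS


def wasted_fraction(useful_ops_per_word):
    """Share of the fetched slots that hold NOPs across a run of instructions."""
    total_slots = SLOTS * len(useful_ops_per_word)
    return (total_slots - sum(useful_ops_per_word)) / total_slots


print(word_bits)                      # 128
print(wasted_fraction([4, 4, 2, 1]))  # 0.3125
```

With these assumed figures each fetch moves 128 bits, four times the width of a single 32-bit operation, and almost a third of the fetched slots carry no useful work.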

VLIW processors normally fetch the next instructions to be executed while the current VLIW instruction is still executing. Grouping several instructions into one also reduces the time that would be spent fetching each of them separately.