The main concern in processor design is often not raw power, but the best possible performance when executing instructions. By performance we mean how closely a processor approaches its theoretical ideal: the most powerful processor on paper is of little use if, due to internal limitations, it only ever has the potential to reach that performance and never does.
Two ways to manage parallelism
There are two ways to exploit parallelism in program code: instruction-level parallelism, or ILP, and thread-level parallelism, or TLP.
In TLP, the code is divided into several subroutines, each independent of the others and running asynchronously, meaning that none of them depends on the code of the rest. The key in a TLP processor is that if one thread of execution stalls for some reason, the processor puts the idle thread on hold and switches to another.
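The idea of independent subroutines running concurrently can be illustrated with a minimal sketch using Python's standard threading module; the worker function and its inputs are hypothetical, chosen only to show that neither thread waits on the other's intermediate results:

```python
import threading

results = {}

def worker(name, data):
    # Each thread computes its own result independently:
    # no thread reads another thread's intermediate state.
    results[name] = sum(data)

t1 = threading.Thread(target=worker, args=("a", [1, 2, 3]))
t2 = threading.Thread(target=worker, args=("b", [4, 5, 6]))
t1.start()
t2.start()
t1.join()  # wait for both threads to finish
t2.join()

print(results["a"], results["b"])  # 6 15
```

Because the two workers share no data, either one can stall or finish first without affecting the other, which is exactly what lets a TLP design swap a halted thread for a ready one.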
ILP processors are different: their parallelism is at the instruction level and therefore within a single thread of execution, so they cannot cheat by putting a stalled thread on hold and switching to another. Today's processors combine both types of execution, but ILP remains exclusive to CPUs, and this is where they gain a big advantage over fully parallel designs when running serial code.
We cannot forget that, according to Amdahl's law, a program is composed of serial parts, which can only be executed by one processor, and parallel parts, which can be executed by several processors. Not everything can be parallelized, and the serial parts gain nothing from adding more processors.
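Amdahl's law can be written as a one-line function; the numbers below are illustrative, not from the article, but they show how the serial fraction caps the benefit of adding processors:

```python
def amdahl_speedup(parallel_fraction, n_processors):
    # Amdahl's law: overall speedup = 1 / (serial + parallel / N).
    # The serial fraction is a hard ceiling no processor count can break.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even if 90% of the code is parallel, 1000 processors
# give less than a 10x overall speedup.
print(round(amdahl_speedup(0.9, 1000), 2))  # 9.91
```

With a 10% serial fraction the speedup can never exceed 10x, no matter how many cores execute the parallel part, which is why serial performance still matters.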
Over the past 15 years, this has led to a model in which parallel algorithms are executed on GPUs, whose cores are TLP designs, while serial code is executed on ILP-oriented CPUs.
In-order execution of instructions
In-order execution is the classic way of executing instructions: its name comes from the fact that instructions are executed in the order in which they appear in the code, and the next instruction cannot start until the previous one has been resolved.
The biggest difficulty with in-order execution lies in conditional and jump instructions, since they are only resolved when the condition is evaluated, which significantly slows down code execution. This becomes a huge problem when the number of pipeline stages is very high, which is the case when a processor runs at high clock speeds.
The trick for achieving high clock speeds is to segment instruction resolution into a large number of pipeline substeps. When a jump or a mispredicted condition occurs, a considerable number of clock cycles are lost.
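The cost of those lost cycles can be sketched with a simple back-of-the-envelope model; the branch frequency, misprediction rate, and flush penalties below are hypothetical numbers chosen only to show how a deeper pipeline magnifies the same misprediction rate:

```python
def average_cpi(base_cpi, branch_freq, mispredict_rate, flush_penalty):
    # Each mispredicted jump flushes the pipeline, wasting
    # 'flush_penalty' cycles. The deeper the pipeline (higher
    # clock speeds), the larger that penalty becomes.
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

# Hypothetical workload: 20% branches, 10% of them mispredicted.
shallow = average_cpi(1.0, 0.20, 0.10, 5)   # short pipeline
deep = average_cpi(1.0, 0.20, 0.10, 20)     # long pipeline
print(round(shallow, 2), round(deep, 2))  # 1.1 1.4
```

Quadrupling the pipeline depth quadruples the cycles burned per misprediction, which is why very deep, high-clock designs suffer so much from jumps.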
Out of order: accelerating ILP
Out-of-order execution is how most advanced processors execute code, and its purpose is to prevent execution from stalling. As its name suggests, it consists of executing a processor's instructions in an order different from the one indicated in the code.
The reason this is done is that each type of instruction is assigned to a type of execution unit. Depending on the instruction, the CPU uses one kind of execution unit or another, but these are limited in number, which can cause execution to stall. What is done instead is to advance the following instructions in their execution, recording the actual program order of the instructions in an internal memory or register; once they have been executed, the results are returned in the original order they had in the code.
Using out-of-order execution increases the average number of instructions resolved per cycle, bringing it closer to the performance ideal. For example, the first Intel Pentium used in-order execution and could work on two instructions at once, against the 486, which could only handle one; yet, because of stalls, its performance advantage was only around 40%.
Additional steps in the instruction cycle
Implementing out-of-order execution adds additional steps to the instruction cycle, which we have already discussed in the article titled This is how your CPU executes the instructions that the software gives it, which you can find in HardZone.
In fact, compared with in-order execution, only the central part of the instruction cycle changes. The first two stages, fetch and decode, are not affected, but two new steps are added, one before and one after the instructions are executed.
The first step is the reservation stations, in which instructions wait for the execution units to become free. Their implementation is complex, as they rely on a mechanism that not only monitors when an execution unit is free, but also counts the average time in clock cycles of each running instruction in order to know how to rearrange the instructions.
The second step is the reorder buffer, which sorts the instructions back into program order on exit. Keep in mind that, in order to speed up instruction throughput in out-of-order execution, all speculative branches of the code are executed. A speculative instruction is one that follows a conditional jump and is executed whether or not the condition turns out to be met. It is therefore at this stage that unconfirmed execution branches are discarded.
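The interplay between out-of-order completion and in-order retirement can be sketched with a toy simulation; the three-instruction program, latencies, and unit names are invented for illustration, and real hardware is far more complex (register renaming, limited units, speculation), but the skeleton shows the reorder buffer's job:

```python
from collections import deque

# Toy model: each instruction needs an execution-unit type and takes
# some number of cycles. Instructions may *finish* out of order, but
# the reorder buffer (ROB) only retires them in program order.
program = [
    ("i0", "mul", 3),  # slow multiply
    ("i1", "alu", 1),  # independent add, finishes before i0
    ("i2", "alu", 1),
]

to_issue = deque(program)
in_flight = {}                        # name -> cycles remaining
rob = deque(p[0] for p in program)    # original program order
finished = set()
completion_order, retire_order = [], []

while rob:
    # Issue every waiting instruction (assume enough free units).
    while to_issue:
        name, unit, latency = to_issue.popleft()
        in_flight[name] = latency
    # Execute one cycle on everything in flight.
    for name in list(in_flight):
        in_flight[name] -= 1
        if in_flight[name] == 0:
            del in_flight[name]
            finished.add(name)
            completion_order.append(name)
    # Retire strictly in program order: i1 and i2 finished first,
    # but they must wait behind i0 at the head of the ROB.
    while rob and rob[0] in finished:
        retire_order.append(rob.popleft())

print(completion_order)  # ['i1', 'i2', 'i0'] -- finished out of order
print(retire_order)      # ['i0', 'i1', 'i2'] -- original order restored
```

Discarding a mispredicted speculative branch in this model would simply mean dropping its entries from the ROB before they retire, so their results never become visible in program order.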