Load-Store units: what they are and what they do in CPUs and GPUs

Communication between the CPU and memory is important. Here at HardZone we have published several articles explaining the different elements involved, and now it is the turn of the Load-Store units, which are indispensable in any architecture, both CPU and GPU.

What are Load-Store units?

A Load-Store unit is a type of execution unit in a CPU. Execution units are the ones that resolve an instruction once it has been decoded. Recall in passing that there are other types of execution units:

ALU: the different types of units responsible for performing arithmetic operations. They can work on a single number, a vector of numbers, or even a matrix.
Jump unit: these units handle the jump instructions in the code, that is, those that move execution to another part of memory.

The load/store units, for their part, are responsible for executing the instructions that access the system’s RAM, whether for reading or writing. There is no single L/S unit as such; rather, there are two types of units that work in parallel and manage data access.

The simplest description of how it works is this: a load unit is responsible for copying information from RAM into the registers, and a store unit does the same in the opposite direction. To do their job, these units have their own small memory, where they hold the memory requests for each instruction.
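The description above can be sketched as a toy model. This is only an illustration under simplifying assumptions: RAM and registers are Python dictionaries, the class name `ToyLoadStoreUnits` and its methods are hypothetical, and requests are resolved strictly in order, which real hardware does not have to do.

```python
from collections import deque


class ToyLoadStoreUnits:
    """Toy model: RAM and registers are dicts, requests wait in a queue."""

    def __init__(self, ram):
        self.ram = ram              # address -> value
        self.registers = {}         # register name -> value
        self.requests = deque()     # the units' own memory for pending requests

    def load(self, reg, addr):
        # Load unit: queue a copy from RAM[addr] into a register.
        self.requests.append(("load", reg, addr))

    def store(self, reg, addr):
        # Store unit: queue a copy from a register into RAM[addr].
        self.requests.append(("store", reg, addr))

    def drain(self):
        # Resolve the queued requests in the order they were issued.
        while self.requests:
            kind, reg, addr = self.requests.popleft()
            if kind == "load":
                self.registers[reg] = self.ram[addr]
            else:
                self.ram[addr] = self.registers[reg]


units = ToyLoadStoreUnits(ram={0x10: 7, 0x14: 0})
units.load("r1", 0x10)    # RAM -> register
units.store("r1", 0x14)   # register -> RAM
units.drain()
print(units.registers["r1"], units.ram[0x14])  # prints "7 7"
```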

Where are the Load-Store units located?

At first we might think that the load/store units sit as close to the memory as possible. However, although their job is to move data between RAM and the registers, they do not have direct access to RAM. Another mechanism is in charge of that, which we have already discussed in “This is how the CPU accesses RAM so quickly”, where we cover how the CPU’s memory interface communicates with the RAM.

In its simplest design, the load/store units talk to the interface that connects the processor to RAM, in particular to the MAR (memory address register) and the MDR (memory data register). They are the only units authorized to manipulate these registers and to transfer the data to the various registers needed for the execution of certain instructions.
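A minimal sketch of that simplest design follows, assuming RAM is a plain Python list and that one read or write completes in a single step. The class `ToyMemoryInterface` and the helper functions `load` and `store` are hypothetical names used only for illustration.

```python
class ToyMemoryInterface:
    """Toy memory interface exposing only the MAR and MDR registers."""

    def __init__(self, ram):
        self.ram = ram   # list indexed by address
        self.mar = 0     # memory address register
        self.mdr = 0     # memory data register

    def read(self):
        # Bring the contents of RAM[MAR] into MDR.
        self.mdr = self.ram[self.mar]

    def write(self):
        # Send the contents of MDR to RAM[MAR].
        self.ram[self.mar] = self.mdr


def load(iface, regs, reg, addr):
    # Load unit: set the address, read, then copy MDR into the register.
    iface.mar = addr
    iface.read()
    regs[reg] = iface.mdr


def store(iface, regs, reg, addr):
    # Store unit: set the address and data, then write.
    iface.mar = addr
    iface.mdr = regs[reg]
    iface.write()


ram = [0] * 8
ram[3] = 42
regs = {}
iface = ToyMemoryInterface(ram)
load(iface, regs, "r0", 3)
regs["r0"] += 1              # stands in for work done by an ALU
store(iface, regs, "r0", 5)
print(ram[5])  # prints 43
```

Note that only `load` and `store` touch `mar` and `mdr`; the rest of the sketch works purely on `regs`.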

Therefore, the load/store units are not located in the part closest to memory, but halfway between the registers of the different execution units and the memory interface, which in each processor sits at the perimeter.

Adding a cache hierarchy

The cache is nothing more than memory internal to the processor that holds a copy of the data closest to where code execution is taking place at that moment. Each level further from the core has more storage capacity, but it is also slower and has higher latency. Conversely, each level closer to the core holds only part of the contents of the level beyond it, but it is faster and has lower latency.

In current processors, every level holds instructions and data in the same memory except one, the lowest-level cache, where there is one cache for instructions and another for data. Load/store units never interact with the instruction cache, only with the data cache.

When the load units in each core need data, the first thing they do is “ask” the data cache whether it holds the contents of a given memory address. The operation is read-only, so if they find the data, they copy it from the cache to the corresponding register. If it is not found at one cache level, the search goes down level by level. Think of it as someone looking for a document in a pyramid-shaped office building, where each floor further down has more files to search.
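The level-by-level search can be sketched as follows. This is a simplified illustration: each cache is a dictionary with no capacity limit or eviction, the function name `lookup` is hypothetical, and copying the value into the levels that missed is an assumption of this sketch rather than something the article specifies.

```python
def lookup(addr, levels, ram):
    """Search the caches fastest-first; fall back to RAM on a full miss.

    levels is a list of (name, cache_dict) pairs ordered from the level
    closest to the core to the one furthest away.
    Returns (value, name_of_the_level_that_had_it).
    """
    missed = []
    for name, cache in levels:
        if addr in cache:
            value, source = cache[addr], name
            break
        missed.append(cache)
    else:
        value, source = ram[addr], "RAM"
    # Assumption: levels that missed keep a copy for next time.
    for cache in missed:
        cache[addr] = value
    return value, source


l1, l2 = {}, {0x20: 5}
levels = [("L1", l1), ("L2", l2)]
ram = {0x20: 5, 0x24: 9}

first = lookup(0x20, levels, ram)    # misses L1, hits L2
second = lookup(0x20, levels, ram)   # now hits L1
third = lookup(0x24, levels, ram)    # misses every cache, served by RAM
print(first, second, third)
```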

Store units, on the other hand, are a little more complex. They also look for a memory address in the cache, but because they modify the data held there, a coherence mechanism must propagate the change for that memory address throughout the cache hierarchy and to RAM itself.
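One simple way to picture that requirement is a write-through policy, sketched below. This is an assumption chosen for clarity, not a description of any particular processor: real designs commonly defer the RAM update and rely on dedicated coherence protocols. The function name `store_write_through` is hypothetical.

```python
def store_write_through(addr, value, levels, ram):
    """Update every cache level that holds addr, then RAM itself."""
    for cache in levels:
        if addr in cache:
            cache[addr] = value   # no stale copy is left behind
    ram[addr] = value


l1 = {0x30: 1}
l2 = {0x30: 1, 0x34: 2}
ram = {0x30: 1, 0x34: 2}

store_write_through(0x30, 99, [l1, l2], ram)
print(l1[0x30], l2[0x30], ram[0x30])  # prints "99 99 99"
```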

RISC = Load-Store?

Now that we know what load/store units do, we need to give them some historical context: they are not the only way a processor can access system RAM to load and store data.

The Load-Store concept is tied to RISC-type instruction sets, where the set of instructions is reduced. One way to achieve this is to split the memory access out of the different instructions into separate instructions of its own. Since many instructions share a similar memory access process, that part is handed to the load/store units.

The consequences are well known: programs for CISC instruction sets end up with a smaller, more compact binary, while those for RISC instruction sets are larger. Keep in mind that in the early days of computing RAM was very expensive and scarce, so it was important to keep binary code as small as possible. Today, all x86 processors are post-RISC: when decoding x86 instructions, they break them down into a series of micro-instructions that allow the processor to work internally as if it were a RISC processor.
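The difference can be illustrated with a toy decoder. Everything here is hypothetical: the instruction `add_mem` stands in for a CISC-style instruction that adds a register to a value in memory, `tmp` is an invented scratch register, and the tuple encoding bears no relation to real x86 or RISC encodings.

```python
def decode(instr):
    """Split a toy memory-operand instruction into load/add/store steps."""
    if instr[0] == "add_mem":
        _, addr, reg = instr
        return [
            ("load", "tmp", addr),    # handled by a load unit
            ("add", "tmp", reg),      # handled by an ALU
            ("store", "tmp", addr),   # handled by a store unit
        ]
    return [instr]


def run(program, ram, regs):
    for instr in program:
        for op, a, b in decode(instr):
            if op == "load":
                regs[a] = ram[b]
            elif op == "add":
                regs[a] += regs[b]
            elif op == "store":
                ram[b] = regs[a]


ram = {0x40: 10}
regs = {"r1": 5}
run([("add_mem", 0x40, "r1")], ram, regs)
print(len(decode(("add_mem", 0x40, "r1"))), ram[0x40])  # prints "3 15"
```

One instruction in the program becomes three internal steps, which is why the same program written directly in load/store style takes more instructions.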

LSU on GPU

Yes, GPUs also have load/store units. They sit in the compute units and are responsible for fetching the data that the ALUs need in order to run. Remember that AMD’s Compute Units, Intel’s sub-slices, and NVIDIA’s Streaming Multiprocessors are different names for the same thing: the GPU cores that run its programs, known colloquially as shaders.

The ALUs in a compute unit tend to operate at register level most of the time, meaning the instruction already has its data available to work on directly. However, some instructions refer to data that is not in the registers, so it has to be looked up in the caches.

The data retrieval process is the same as in CPUs: it first checks the data cache of the compute unit and then works its way down until it reaches the end of the memory hierarchy that the GPU can access. This is essential when accessing large data such as textures.

Fixed-function units on GPUs and Load-Store units

Some of the units located in the compute units use the load/store units to communicate with the rest of the GPU. These units are not ALUs, but independent fixed-function units or accelerators. Today, two types of units use the load/store units in a GPU:

Texture filter units
The unit in charge of calculating the intersection of rays in Ray Tracing

These units need to access the data cache to fetch the input parameters for their function. The number of load/store units in a compute unit varies, but it is usually equal to or greater than 16, because there are 4 texture units and each one needs 4 values to perform a bilinear filter.
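The arithmetic behind that figure, along with the four reads a bilinear filter makes, can be sketched as follows. The sketch assumes a texture stored as a list of rows, coordinates given in texel units inside the texture, and clamping at the edges; the function name `bilinear` and the constants are hypothetical.

```python
TEXTURE_UNITS = 4          # per compute unit, as in the text
TEXELS_PER_BILINEAR = 4    # a bilinear filter reads a 2x2 block of texels
LOADS_NEEDED = TEXTURE_UNITS * TEXELS_PER_BILINEAR


def bilinear(texture, u, v):
    """Blend the four texels surrounding the point (u, v)."""
    height, width = len(texture), len(texture[0])
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    # The four loads: one per corner of the 2x2 block.
    t00, t10 = texture[y0][x0], texture[y0][x1]
    t01, t11 = texture[y1][x0], texture[y1][x1]
    top = t00 * (1 - fx) + t10 * fx
    bottom = t01 * (1 - fx) + t11 * fx
    return top * (1 - fy) + bottom * fy


texture = [[0.0, 10.0], [20.0, 30.0]]
print(LOADS_NEEDED)                 # prints 16
print(bilinear(texture, 0.5, 0.5))  # prints 15.0
```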

In the same way, the data of the nodes of the BVH trees is stored in the different cache levels. In some specific cases, such as NVIDIA GPUs, Ray Tracing units have an internal LSU that reads from the RT Core’s L0 cache.