Command processors on GPUs and how they affect performance

A GPU is actually an extremely complex type of processor: a heterogeneous system made up of several different types of units that must be coordinated to produce a cohesive result. In this article, we will describe the command processors, the part of the GPU in charge of that coordination.

Every GPU, whatever its architecture or brand, has a central part in common: the command processors, the units in charge of automatically managing the operation of the dozens of different units on the GPU.

What is a command processor?

The command processor of a GPU is a microcontroller responsible for reading the display list generated by the CPU. To do this, it uses the GPU's own DMA unit to access not VRAM but the system's main RAM, where the command list is stored. Once it has located the list in RAM, it copies it into the microcontroller's internal memory.
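The fetch step can be modelled in a few lines of Python. This is a toy sketch, not real driver or firmware code: system RAM is a plain Python list, and `DmaEngine`, `CommandProcessor` and command names such as `DRAW` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class DmaEngine:
    """Hypothetical DMA unit that copies blocks out of system RAM."""
    system_ram: list

    def copy(self, base: int, length: int) -> list:
        # Refuse accesses outside system RAM instead of reading garbage.
        if base < 0 or length < 0 or base + length > len(self.system_ram):
            raise ValueError("DMA access outside system RAM")
        return self.system_ram[base:base + length]


@dataclass
class CommandProcessor:
    """Hypothetical command processor with its own small internal memory."""
    dma: DmaEngine
    internal_memory: list = field(default_factory=list)

    def fetch_command_list(self, base: int, length: int) -> None:
        # The CP asks the DMA unit to copy the whole list from system RAM
        # (not VRAM) into the microcontroller's internal memory.
        self.internal_memory = self.dma.copy(base, length)


# The CPU has written a three-command list at address 16 of system RAM.
ram = [0] * 64
commands = ["SET_STATE", "DRAW", "DISPATCH"]
ram[16:16 + len(commands)] = commands

cp = CommandProcessor(DmaEngine(ram))
cp.fetch_command_list(16, len(commands))
print(cp.internal_memory)  # ['SET_STATE', 'DRAW', 'DISPATCH']
```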

The command list includes all the instructions that the GPU's different units must execute to render an image, whether in 2D or 3D. Since the arrival of DirectX 11 on PC, however, it can also contain Compute Shaders: shader programs that are not tied to the graphics pipeline and that let the GPU solve algorithms the CPU handles less efficiently.

Nowadays, a GPU is not only used to render impressive graphics for video games; it has many other uses across several different markets, and the spread of graphics cards into these markets has gone hand in hand with the evolution of the command processor and its capabilities.

What does asynchronous compute mean?

First of all, it should be noted that Compute Shaders are also used within the graphics pipeline, in particular for pre-processing and post-processing of the image; for example, they are used to calculate lighting in deferred rendering. Because these Compute Shaders depend on the rest of the graphics pipeline having run, they are said to run synchronously. Some tasks, however, benefit from the GPU without being part of scene rendering, so they run asynchronously.

To visualize this better, consider two different situations:

In the first, we are baking bread but run out of flour, so we ask someone to go and get more. This means we cannot do anything while we wait for the flour to arrive.
The second follows from the first: since we cannot bake bread, we decide to wash the dishes, something unrelated that we can do at any time.

GPU designers realized that every GPU had bubbles in its execution: short periods in which some of its units did nothing. That is why, a few years ago, they decided to implement asynchronous compute and to collaborate on the development of APIs that use it, such as DirectX 12 and Vulkan.
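The benefit of filling those bubbles can be seen with a toy timeline. This Python sketch assumes one time slot per job and compute jobs that are completely independent of the graphics work (the dishes, not the flour); the slot model and the job names are illustrative, not how any real GPU schedules work.

```python
# Toy timeline of one shader array: each entry is one time slot.
# None marks a bubble where the graphics work leaves the units idle.
graphics = ["G0", None, "G1", None, None, "G2"]
compute = ["C0", "C1", "C2", "C3"]  # independent of the graphics work


def fill_bubbles(timeline, compute_jobs):
    """Put independent compute jobs into idle slots of the graphics timeline."""
    pending = list(compute_jobs)
    result = []
    for slot in timeline:
        if slot is None and pending:
            result.append(pending.pop(0))  # the bubble is reused for compute
        else:
            result.append(slot)
    # Compute jobs that did not fit into a bubble run afterwards.
    return result + pending


serial_slots = len(graphics) + len(compute)  # compute only after graphics
async_timeline = fill_bubbles(graphics, compute)
print(async_timeline)  # ['G0', 'C0', 'G1', 'C1', 'C2', 'G2', 'C3']
print(len(async_timeline), "slots instead of", serial_slots)  # 7 slots instead of 10
```

The saving only exists because the compute jobs do not wait on the graphics ones; a synchronous Compute Shader, such as deferred lighting, could not be moved into an arbitrary bubble like this.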

What are command lists?

Today, the CPU itself is in charge of building the different command lists, either on a single core or on several cores in parallel. In video games, one core is usually assigned to building the graphics list, which is much more complex than the others and usually comes from a single ring buffer. Compute command lists are much simpler: they have the shader units solve a specific problem and return the result.
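Recording lists on several CPU threads can be sketched with Python's standard thread pool. The command strings and the `record_*` functions are hypothetical, and Python threads only stand in for CPU cores here; in a real engine this would be done through an API such as DirectX 12 or Vulkan.

```python
from concurrent.futures import ThreadPoolExecutor


def record_graphics_list(frame):
    # Hypothetical commands: the graphics list is longer and carries state.
    return [f"SET_PIPELINE frame{frame}", "BIND_VERTEX_BUFFER",
            "DRAW", "DRAW", "PRESENT"]


def record_compute_list(job):
    # A compute list is short: bind a shader, dispatch it, done.
    return [f"BIND_COMPUTE_SHADER {job}", "DISPATCH 64 1 1"]


# Each list is recorded on its own worker thread.
with ThreadPoolExecutor(max_workers=3) as pool:
    graphics_future = pool.submit(record_graphics_list, 0)
    compute_futures = [pool.submit(record_compute_list, job)
                       for job in ("blur", "particles")]
    command_lists = {
        "graphics": graphics_future.result(),
        "compute": [f.result() for f in compute_futures],
    }

print(len(command_lists["graphics"]), [len(c) for c in command_lists["compute"]])
# 5 [2, 2]
```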

Compute work generally consists of several separate command lists, which can be executed simultaneously with one another and with the display list. Because they are asynchronous, they do not depend on each other to run; this makes them completely independent and lets them use parts of the GPU that would otherwise be wasted sitting idle.
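Because the lists are independent, any interleaving of their commands is valid as long as each list keeps its own order. A minimal round-robin sketch in Python, with hypothetical command names, shows one such interleaving:

```python
from collections import deque


def interleave(queues):
    """Round-robin over independent command queues.

    Any interleaving is valid for independent queues, as long as the
    commands of each queue keep their own relative order.
    """
    active = deque(deque(q) for q in queues if q)
    issued = []
    while active:
        q = active.popleft()
        issued.append(q.popleft())  # issue one command from this queue
        if q:
            active.append(q)        # requeue it if it has work left
    return issued


queues = [["A0", "A1", "A2"], ["B0"], ["C0", "C1"]]
issued = interleave(queues)
print(issued)  # ['A0', 'B0', 'C0', 'A1', 'C1', 'A2']
```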

The other type of command concerns access to the system's RAM or VRAM, and these commands are executed in both compute and graphics. In graphics, memory operations are done only and exclusively in VRAM, while in compute mode data can be read from or written to both RAM and VRAM, because in some cases the GPU is answering a computation request from the CPU.

Graphics APIs and command processors

Originally, the graphics list and the compute list were managed together, which was highly inefficient. Only with the arrival of GPUs with separate command processors for graphics and compute, able to work synchronously or asynchronously with each other, did GPUs become able to manage several different command lists in parallel.

Command lists are also called ring buffers. The reason is that each command processor is assigned a range of memory addresses, and when it reaches the last address it can access, it wraps back to the first one. Since it goes around in circles, it is called a ring buffer. That is why we have represented them as small rings in the diagram above.
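The wrap-around can be sketched as a fixed-size array with a read pointer for the command processor and a write pointer for the CPU. This Python model is illustrative only: it keeps an explicit `count` for clarity, and the names `RingBuffer`, `rptr` and `wptr` are assumptions rather than any specific driver's registers.

```python
class RingBuffer:
    """Toy command ring: the CPU writes at wptr, the CP reads at rptr."""

    def __init__(self, size):
        self.slots = [None] * size
        self.size = size
        self.rptr = 0   # next slot the command processor will read
        self.wptr = 0   # next slot the CPU will write
        self.count = 0  # commands written but not yet read

    def push(self, cmd):
        """Producer side (CPU): fails if the CP has not caught up."""
        if self.count == self.size:
            raise BufferError("ring full: producer must wait for the CP")
        self.slots[self.wptr] = cmd
        self.wptr = (self.wptr + 1) % self.size  # wrap back to slot 0
        self.count += 1

    def pop(self):
        """Consumer side (command processor): None means nothing to do."""
        if self.count == 0:
            return None
        cmd = self.slots[self.rptr]
        self.rptr = (self.rptr + 1) % self.size  # wrap back to slot 0
        self.count -= 1
        return cmd


ring = RingBuffer(4)
for cmd in ("a", "b", "c"):
    ring.push(cmd)
ring.pop()  # the CP consumes "a"
ring.pop()  # and then "b"
for cmd in ("d", "e", "f"):
    ring.push(cmd)  # "e" and "f" wrap around to slots 0 and 1
print([ring.pop() for _ in range(4)])  # ['c', 'd', 'e', 'f']
```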

Types of command processors

There are different types of command processors, each with its own uses; which type a graphics card uses depends on the market it is aimed at:

Graphics only: completely obsolete today. In the past there was only one command processor, and it handled graphics exclusively.
With a smart scheduler: when several command lists are handled in parallel, particularly compute lists, it is normally the system's CPU that coordinates their execution. A command processor with a smart scheduler can reorder the command lists in real time without CPU intervention.
Compute only: used in scientific and high-performance computing. These GPUs cannot generate graphics because they either have no graphics command processor or it is disabled. This is the case with AMD's CDNA GPUs for Instinct, various NVIDIA Tesla cards and other compute-oriented graphics cards.
Virtualized: used in data centers, especially for cloud computing. They can manage several independent graphics command lists at the same time, each belonging to a virtual machine that runs a different operating system for a different remote user.
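Reordering command lists in real time can be illustrated with a priority queue. The article does not say which policy real schedulers use, so this Python sketch simply assumes each command list carries a numeric priority and that lists of equal priority run in arrival order; `SmartScheduler` and the list names are hypothetical.

```python
import heapq
import itertools


class SmartScheduler:
    """Hypothetical on-GPU scheduler that decides which pending command
    list runs next, without a round trip to the CPU."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # FIFO tie-break for equal priority

    def submit(self, name, priority):
        # heapq is a min-heap, so store -priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def next_list(self):
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name


sched = SmartScheduler()
sched.submit("post_process", priority=1)
sched.submit("physics", priority=3)
sched.submit("audio_fft", priority=1)
sched.submit("ssao", priority=2)
order = [sched.next_list() for _ in range(4)]
print(order)  # ['physics', 'ssao', 'post_process', 'audio_fft']
```

The arrival counter matters: without it, two lists with the same priority would be compared by name, which would silently change the execution order.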

Interaction of the command processor with the rest of the GPU

The command processor does not execute any programs itself; it is an excellent organizer, responsible for distributing tasks among the various units available at all times. The graphics command processor has access not only to the GPU's shader units but also to its fixed-function units. In compute, on the other hand, the command processor only has access to the shader units, and the compute command processors operate differently.

How do the different units coordinate with each other? Well, every fixed function unit and shader unit has some kind of mailbox that can send and receive messages in two different directions:

When exporting data, the shader unit can export it to a lower cache level, to a fixed-function unit, to another shader unit or even to the memory assigned to it, whether that is system RAM or VRAM.
When importing data, the command processor and the dispatch unit are responsible for sending the data to the shader unit. In some cases, the command processor itself fills each shader unit's data and instruction caches with the tasks it will need to perform, since shader units cannot fetch instructions the way a CPU does.
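The two directions of the mailbox can be sketched with one queue per unit. This Python model is a deliberate simplification: the `Unit` class, the message format and the `ADD` instruction are hypothetical, and real GPUs move this traffic through dedicated hardware paths rather than software queues.

```python
from collections import deque


class Unit:
    """Hypothetical GPU block (CP, shader unit or fixed-function unit)
    with a mailbox that can both send and receive messages."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()

    def send(self, target, payload):
        # Messages carry the sender's name so the receiver knows the origin.
        target.inbox.append((self.name, payload))

    def receive(self):
        return self.inbox.popleft() if self.inbox else None


cp = Unit("command_processor")
shader = Unit("shader_unit_0")
rop = Unit("rop")  # a fixed-function unit

# Import: the CP pushes both the instructions and the data to the shader unit.
cp.send(shader, {"instructions": ["ADD"], "data": [2, 3]})
_, job = shader.receive()
result = sum(job["data"])  # stand-in for executing the ADD instruction

# Export: the shader unit forwards its result to a fixed-function unit.
shader.send(rop, result)
message = rop.receive()
print(message)  # ('shader_unit_0', 5)
```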

It goes without saying that the list of instructions and data the command processor sends to each unit ends with a command telling it where to export the results once it has finished calculating them. Which units receive which data and instruction lists, and where the results are sent, is decided by the command processor, which handles the task without us having to worry about it.
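That final export command can be sketched as a destination field attached to each work packet. In this Python sketch the `Job` type, the two operations and the destination names (`VRAM`, `L2`) are hypothetical; the point is only that the unit doing the work never chooses the destination, the packet from the command processor does.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical work packet: data, an operation and a final export
    target chosen by the command processor."""
    data: list
    op: str
    export_to: str  # e.g. "VRAM", "L2", "rop", "shader_unit_1"


def run_job(job, destinations):
    if job.op == "square":
        out = [x * x for x in job.data]
    elif job.op == "negate":
        out = [-x for x in job.data]
    else:
        raise ValueError(f"unknown op {job.op!r}")
    # Final command: send the results where the packet says, not where
    # the executing unit would like.
    destinations.setdefault(job.export_to, []).extend(out)
    return job.export_to


dest = {}
run_job(Job([1, 2, 3], "square", "VRAM"), dest)
run_job(Job([4], "negate", "L2"), dest)
print(dest)  # {'VRAM': [1, 4, 9], 'L2': [-4]}
```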