Dual GPUs have disappeared from home PCs, and graphics cards compatible with NVIDIA SLI or AMD Crossfire are no longer released, for the same reason: the applications we use on our PCs are programmed to use a single GPU.
In PC video games, a dual-GPU setup relies on techniques such as alternate frame rendering, where each GPU renders every other frame, or split frame rendering, where the pair of GPUs divide the work of a single frame.
In GPU computing this problem does not occur, which is why systems that do not use graphics cards to render graphics often run several of them in parallel without problems. Applications that use GPUs as parallel data processors are already designed to take advantage of multiple GPUs in this way.
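As a rough illustration of the two dual-GPU rendering schemes just described, the following minimal Python sketch shows how work could be assigned in each case. The function names and the even split of scanlines are assumptions made for this example, not how any particular driver does it.

```python
def alternate_frame_assignment(num_frames, num_gpus=2):
    """Alternate rendering: frame i is rendered by GPU (i mod num_gpus)."""
    return [frame % num_gpus for frame in range(num_frames)]


def split_frame_assignment(frame_height, num_gpus=2):
    """Split rendering: each GPU gets a contiguous band of scanlines
    of the same frame, as (first_row, end_row_exclusive)."""
    band = frame_height // num_gpus
    bands = []
    for gpu in range(num_gpus):
        start = gpu * band
        # The last GPU absorbs any rows left over by the integer division.
        end = frame_height if gpu == num_gpus - 1 else start + band
        bands.append((start, end))
    return bands


print(alternate_frame_assignment(6))   # [0, 1, 0, 1, 0, 1]
print(split_frame_assignment(1080))    # [(0, 540), (540, 1080)]
```

In both cases the game or the driver has to know that two GPUs exist and decide how to divide the work, which is exactly the burden that single-GPU applications do not take on.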
The increase in the size of GPUs in recent years
If we look at the evolution of GPUs in recent years, we will see that there has been considerable growth in the field of high-end GPUs from one generation to the next.
The worst part of the current scenario? No GPU yet offers ideal performance for 4K gaming. A native 4K image has four times as many pixels as a 1080p image, so we are talking about moving roughly four times the data needed for Full HD.
As for the current state of VRAM, take GDDR6: this memory uses a 32-bit interface per chip, divided into two 16-bit channels, but its clock speed makes energy consumption skyrocket, so we need to look for other solutions to expand bandwidth.
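The fourfold figure can be checked with simple arithmetic. The short Python sketch below assumes the usual consumer resolutions of 3840x2160 for 4K and 1920x1080 for Full HD.

```python
UHD_4K = (3840, 2160)    # assumed consumer "4K" resolution
FULL_HD = (1920, 1080)   # 1080p


def pixel_count(resolution):
    """Number of pixels in one frame at the given (width, height)."""
    width, height = resolution
    return width * height


print(pixel_count(FULL_HD))                        # 2073600
print(pixel_count(UHD_4K))                         # 8294400
print(pixel_count(UHD_4K) / pixel_count(FULL_HD))  # 4.0
```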
Extend the bandwidth of VRAM
If we want to expand the bandwidth, there are two options:
- The first is to increase the clock speed of the memory, but higher clock speeds require higher voltage, and power consumption grows with the square of the voltage.
- The second is to increase the number of pins, which would be like going from 32-bit to 64-bit.
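The trade-off between the two options can be sketched numerically. The Python example below uses the common approximations that per-chip bandwidth is bus width times per-pin data rate, and that dynamic power scales with voltage squared times frequency. The 16 Gbps per-pin rate and the 20% voltage increase are illustrative assumptions, not figures from any specification, and the extra I/O power of additional pins is ignored.

```python
def chip_bandwidth_gb_s(bus_width_bits, data_rate_gbps_per_pin):
    """Peak bandwidth of one memory chip in GB/s."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


def relative_dynamic_power(voltage_scale, frequency_scale):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


ASSUMED_RATE = 16.0  # Gbps per pin, illustrative value only

print(chip_bandwidth_gb_s(32, ASSUMED_RATE))      # 64.0  (baseline, 32-bit chip)
print(chip_bandwidth_gb_s(32, 2 * ASSUMED_RATE))  # 128.0 (option 1: double the rate)
print(chip_bandwidth_gb_s(64, ASSUMED_RATE))      # 128.0 (option 2: double the pins)

# Option 1 doubles frequency and, by assumption here, needs 20% more voltage.
print(relative_dynamic_power(voltage_scale=1.2, frequency_scale=2.0))
```

Under these assumptions both options reach the same bandwidth, but doubling the data rate nearly triples the dynamic power of the interface, while the wider bus pays instead in pins and chip perimeter.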
We also can’t forget techniques such as the PAM-4 signaling used in GDDR6X, but that was a Micron decision to avoid reaching very high clock speeds. We should therefore expect a 64-bit bus per VRAM chip for a possible GDDR7.
We don’t know what VRAM makers are going to do, but increasing clock speed isn’t the option we think they’ll end up adopting on a limited power budget.
However, the interfaces between the GPU and VRAM are located on the perimeter of the GPU itself. Increasing their width therefore means extending the periphery of the GPU, and so enlarging the chip.
This is a serious additional issue because a larger chip costs more. It will force graphics card manufacturers to use multiple chips instead of just one, and this is where the so-called chiplets come in.
Chiplet-based GPU types
There are two ways to divide a GPU into chiplets:
- Divide a single, massive GPU into multiple chips. The trade-off is that communication between the different parts requires massive bandwidth, which may not be possible without special interconnects.
- Use multiple GPUs in the same space that work together as one.
In the HardZone article titled “This is how Chiplet-based GPUs will look like in the future”, you can read about the configuration of the first type, while the AMD patent for its chiplet-based GPU refers to the second type.
Explore the AMD chiplet patent
The first point that appears in every patent is the usefulness of the invention, which is always stated in its background section. The part that concerns us is the following:
Conventional monolithic designs are increasingly expensive to manufacture. Chiplets have been used successfully in CPU architectures to reduce manufacturing costs and improve efficiencies, since the heterogeneous computing nature of CPUs adapts more naturally to separating processor cores into different units that don’t require much communication between them.
The mention of CPUs clearly refers to AMD Ryzen: a good part of the Zen architecture design team has moved to the Radeon Technologies Group to work on improving the RDNA architecture. Chiplets are not the first concept inherited from Zen; the other is the Infinity Cache, which takes the “victim cache” concept from Zen.
Second, the interconnect problem the patent refers to is the enormous bandwidth that the elements of a GPU need to communicate with each other. This is what holds back building GPUs out of chiplets, because of the energy consumed in transferring data.
The work of a GPU is parallel in nature. However, the geometry processed by a GPU includes not only sections of parallel work, but also work that must be synchronized in a specific order between the different sections.
The consequence? A GPU programming model that distributes work across different threads is often inefficient, because parallelism is hard to distribute across multiple workgroups and chiplets, and because synchronizing the memory contents of shared resources across the system is difficult and expensive.
This passage explains, from a software development point of view, why we haven’t yet seen chiplet-based GPUs. It is not just a hardware problem but also a software problem, so it needs to be simplified.
Also, from a logical point of view, applications are written with the idea that the system has only one GPU. In other words, although a conventional GPU includes many GPU cores, applications are programmed to target a single device. Therefore, it has always been difficult to bring chiplet design methodology to GPU architectures.
This part is essential to understanding the patent. AMD is not talking about dividing a single GPU into chiplets, as it does in its processors, but about using multiple GPUs, each of which is a chiplet. The difference matters, because AMD’s solution seems focused on creating a Crossfire in which programmers don’t need to tailor their programs to multiple GPUs.
Once the problem has been defined, the next point is to talk about the solution offered by the patent.
Explore the AMD Chiplet patent: the solution
The solution AMD proposes to the problem described is as follows:
To improve system performance using GPU chiplets while maintaining the current programming model, the patent illustrates systems and methods that use high-bandwidth passive crosslinks to connect GPU chiplets to each other.
The important part of the patent is these crosslinks, which we will discuss later in this article. They are the communication interface between the different chiplets, that is, the way information is transmitted between them.
In various implementations, a system includes a central processing unit (CPU) connected to the first GPU chiplet in the chain, which is connected to a second chiplet via a passive crosslink. In some implementations, the passive crosslink is a passive interposer that handles communication between the chiplets.
Basically, it comes down to a dual GPU functioning as one: two chips interconnected via an interposer located beneath them.
Passive high bandwidth crosslinks
How do the chiplets communicate through the interposer? They use a type of interface that connects the SDF (Scalable Data Fabric) of each chiplet to the others. In AMD GPUs, the SDF normally sits between the GPU’s top-level cache and the memory interface, but in this case there is an L3 cache in front of the SDF of each GPU chiplet and, before that, an interface that connects the chiplets to each other.
In this diagram you can see the example with 4 GPU chiplets; the number of HBX interfaces is always 22 where n is the number of chiplets in the interposer. In the cache hierarchy, L0 (not described in the patent) is local to each compute unit, L1 to each Shader Array and L2 to each GPU chiplet, while the L3 cache is new: it is described as the last-level cache, or LLC, of the entire GPU.
Currently, various architectures have at least one level of cache that is coherent across the entire GPU. In a chiplet-based GPU architecture, these physical resources are placed on separate chips and connected in such a way that this top-level cache remains coherent across all GPU chiplets. Thus, despite operating in a massively parallel environment, the L3 cache must be coherent.
During an operation, a memory address request from the CPU to the GPU is passed to a single GPU chiplet, which communicates over the high-bandwidth passive crosslink to locate the data. From the CPU’s perspective, it looks like it is addressing a single-chip, monolithic GPU. This makes it possible to use a high-capacity GPU made up of several chips as if it were a single GPU for the application.
This is why AMD’s solution is not to split a GPU into several different chips, but to use several GPUs as if they were one. This solves one of the problems of AMD Crossfire and allows any software to use multiple GPUs at the same time as if they were one, without having to adapt the code.
The other key to passive crosslinks is that, contrary to what many of us assumed, they don’t communicate with the GPU using through-silicon vias (TSVs). Instead, AMD has created a proprietary interconnect for building SoCs, CPUs and GPUs, in both 2.5DIC and 3DIC, which leads us to wonder whether this is the X3D interface that is to replace its Infinity Fabric.
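The request flow described above can be pictured with a toy model. The Python sketch below is only an illustration: the class names, the 64-byte line size and the rule that interleaves cache lines across chiplets are all hypothetical, since the text does not say how the patent maps addresses to chiplets.

```python
class Chiplet:
    """One GPU chiplet holding a slice of the shared L3 cache."""

    def __init__(self, index):
        self.index = index
        self.l3_slice = {}  # address -> data held in this chiplet's L3 slice


class ChipletGPU:
    """Toy model: several chiplets presented to the CPU as one device."""

    LINE_SIZE = 64  # bytes per cache line (assumed for the example)

    def __init__(self, num_chiplets):
        self.chiplets = [Chiplet(i) for i in range(num_chiplets)]

    def owner_of(self, address):
        # Hypothetical mapping: interleave cache lines across the chiplets.
        return (address // self.LINE_SIZE) % len(self.chiplets)

    def write(self, address, value):
        self.chiplets[self.owner_of(address)].l3_slice[address] = value

    def read(self, address):
        """The CPU only ever talks to chiplet 0; that chiplet forwards the
        request over the crosslink when another chiplet owns the address.
        Returns (data, number_of_crosslink_hops)."""
        entry = self.chiplets[0]
        owner = self.chiplets[self.owner_of(address)]
        hops = 0 if owner is entry else 1
        return owner.l3_slice.get(address), hops


gpu = ChipletGPU(num_chiplets=4)
gpu.write(0x40, "vertex data")
gpu.write(0x100, "texture data")
print(gpu.read(0x40))    # ('vertex data', 1): served by chiplet 1 via the crosslink
print(gpu.read(0x100))   # ('texture data', 0): served by the entry chiplet itself
```

The caller of `read` never names a chiplet, which is the point of the design: the application keeps seeing a single device while the routing happens underneath.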
AMD chiplets are for RDNA 3 and above
Applications designed for GPU computing have no problem using multiple GPUs, which shows very clearly that the solution proposed in AMD’s patent is aimed at the consumer market, and in particular at GPUs of the RDNA architectures. There are several clues to this:
- In the chiplet diagrams of the patent, the term WGP appears, which is typical of the RDNA architecture and not of CDNA and/or GCN.
- Part of the patent mentions the use of GDDR memory, typical of consumer GPUs.
The patent doesn’t tell us about a specific GPU, but we can assume that AMD will release its first chiplet-based dual GPU when RDNA 3 launches. This would allow AMD to create a single GPU instead of different variants of one architecture in the form of different chips, as is the case today.
AMD’s solution also contrasts with what NVIDIA and Intel are proposing. We know that Hopper will be NVIDIA’s first chiplet-based architecture, but we do not know its target market, so it could just as well be intended for high-performance computing as for gaming.
As for Intel, we know that Intel Xe-HP is also a GPU made up of chiplets, but it does not need a solution like AMD’s, since Intel’s target for that GPU is not the consumer market.