If we look at the GPUs recently launched by AMD, and especially by NVIDIA, we can see that the die area they occupy keeps growing: a few years ago a GPU of more than 400 mm² was perceived as big, and now we have them above 600 mm².
This trend carries the risk of eventually hitting the reticle limit, which is the maximum area a chip can occupy in a given manufacturing node, and things get dangerously complicated if we indiscriminately increase the number of cores that make up a GPU.
An analogy: stations and trains
Suppose we have a rail network with several stations, where each station is a processor and the trains are the data packets being sent.
Obviously, if our rail network includes more and more stations, then we will need more tracks and more complex infrastructure. In a processor it is the same: increasing the number of elements means increasing the number of communication channels between them.
The problem is that these extra tracks also increase energy consumption, so whoever designs the railway network must consider not only how many tracks can be placed in the infrastructure, but also their energy consumption.
Moore’s Law doesn’t help as much as you think
According to Moore’s Law, transistor density per unit area doubles at regular intervals. This was accompanied by Dennard scaling, which described how the power characteristics of transistors scale with each new manufacturing node. The original Dennard scaling stopped holding from the 65 nm node onward.
The problem comes when we increase the number of elements (trains) and communication paths: we can place twice as many elements, but what we cannot do is guarantee the bandwidth needed to communicate all of them at the same time within the same power budget. This limits the number of cores, and in the case of GPUs the number of Compute Units.
The solution that has always been taken? Instead of adding more elements, make the existing ones more and more complex. In the case of GPUs, this is the path NVIDIA took with Turing: instead of increasing the number of SMs compared to Pascal, it added things like RT Cores and Tensor Cores and made deep changes to the units, because increasing the number of cores means increasing the number of interconnects.
We therefore face the problem of the energy cost of data (train) transmission: with each new manufacturing node we can increase the number of elements on a chip, but the transfer speed we need keeps climbing, which increases power consumption, because an ever larger share of the power fed to the processor goes to moving data rather than processing it.
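The data-movement cost described above can be put into rough numbers. The sketch below uses assumed, purely illustrative figures (0.5 TB/s of traffic per element, 2 pJ/bit for an on-package link) to show how the power spent just moving data grows with the number of elements:

```python
# Illustrative sketch: as the number of elements grows, interconnect traffic
# (and the energy spent moving data) grows with it.
# All numbers here are assumptions for illustration, not measured values.

PJ_PER_BIT_LINK = 2.0  # assumed on-package link cost, pJ per bit

def transfer_watts(bytes_per_second: float, pj_per_bit: float) -> float:
    """Power (W) needed to sustain a given link bandwidth at a given pJ/bit cost."""
    return bytes_per_second * 8 * pj_per_bit * 1e-12

# Doubling the elements roughly doubles the total bandwidth they demand
for n_elements in (1, 2, 4, 8):
    bw = n_elements * 0.5e12  # assumed 0.5 TB/s of traffic per element
    print(n_elements, "elements ->", round(transfer_watts(bw, PJ_PER_BIT_LINK), 1),
          "W spent just moving data")
```

At a fixed pJ/bit cost, data-movement power scales linearly with traffic, so doubling the elements doubles the watts lost to the interconnect before a single operation is computed.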
Building a chiplet-based GPU from a known processor
The idea of building a GPU out of chiplets is to build a GPU that could not be made as a monolithic design based on a single chip. The total area of a GPU built with chiplets must therefore be larger than what the reticle limit allows, because otherwise a GPU of this type would not make sense.
This means that GPUs composed of chiplets will be reserved exclusively for the highest ranges, and it is possible that initially we will only see them in the high-performance computing (HPC) GPU market, while at home we will still have simpler, and therefore monolithic, GPU configurations for a few more years.
That said, we decided to take the Navi 10 chip, with the first-generation RDNA architecture, as an example to deconstruct and turn into our chiplet-based GPU, mainly because it is the recent GPU for which we have the most data on the table. The chiplet GPUs that AMD and/or NVIDIA actually build will be far more complex than this example, which is only meant to give you a mental picture of how a GPU of this type would be put together.
The first idea is that each chiplet is a Shader Array — the sets of elements inside the pink boxes, which are connected to the L1 cache — while we leave the L2 cache in a separate central chiplet.
But we still don’t have the full GPU, because we are missing its central part, the command processor. Since it is a single unit, we will not duplicate it; instead, we will place it in the central part of the MCM.
As for the accelerators, we will place them in another chiplet, connected directly to the DMA unit of the central chiplet.
Once the GPU is broken down into several parts, what interests us now is communication with external memory. This will be done through the interposer, which will have the memory controller integrated inside. Since Navi 10 uses a 256-bit GDDR6 interface with 8 chips, we decided to keep this configuration in our example.
Chiplet-based GPUs and power consumption
The interface used to communicate the elements of the different chiplets in AMD’s MCM designs is the IFOP interface, which has a power cost of 2 pJ/bit. If we look at the technical specifications, we will see that the L2 cache has a bandwidth of 1.95 TB/s at a speed of 1905 MHz, or approximately 1024 bytes per cycle, which corresponds to 16 interfaces of 64 bytes/cycle — 32 B/cycle in each direction.
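The figures quoted above can be cross-checked in a couple of lines: dividing the L2 bandwidth by the clock speed gives the bytes transferred per cycle, which in turn gives the number of 64 B/cycle interfaces:

```python
# Cross-check of the quoted figures: 1.95 TB/s at 1905 MHz works out to
# ~1024 bytes per cycle, i.e. 16 interfaces of 64 bytes/cycle.

L2_BANDWIDTH = 1.95e12  # bytes per second
CLOCK_HZ = 1905e6       # 1905 MHz

bytes_per_cycle = L2_BANDWIDTH / CLOCK_HZ
interfaces = round(bytes_per_cycle) // 64

print(round(bytes_per_cycle))  # -> 1024
print(interfaces)              # -> 16
```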
The first version of the Infinity Fabric used 32 B/cycle interfaces with a consumption of 2 pJ/bit; however, AMD has since improved this figure by 27%.
The improved IFOP interface has a power cost of 1.47 pJ/bit at a speed of 1333 MHz. If the interface ran at 1905 MHz, the power consumption would be much higher, because not only the clock speed but also the voltage would increase, so let’s assume our chiplet version of Navi 10 runs at 1333 MHz.
(1.33 × 10^12 bytes/s) × (8 bits/byte) × (1.47 pJ/bit) ≈ 1.56 × 10^13 pJ/s = 15.6 W
Although 15.6 W may seem like a low number, keep in mind that this is only the power consumed transmitting data between the peripheral chiplets and the central chiplet at 1333 MHz, and that dynamic power consumption grows quadratically with voltage, which in turn must rise along with the clock speed.
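The 15.6 W estimate above can be reproduced directly, and we can also sketch what happens at 1905 MHz. The 15% voltage bump in the second step is an assumption purely for illustration, not a figure from AMD:

```python
# Reproducing the estimate above: IFOP at 1.47 pJ/bit moving ~1.33 TB/s
# (1024 B/cycle at 1333 MHz) between peripheral chiplets and the central one.

PJ_PER_BIT = 1.47      # improved IFOP energy per bit
BANDWIDTH = 1.33e12    # bytes per second at 1333 MHz

watts = BANDWIDTH * 8 * PJ_PER_BIT * 1e-12
print(round(watts, 1))  # -> 15.6

# At 1905 MHz, bandwidth scales with frequency and dynamic power scales
# roughly with f * V^2; assuming the voltage also had to rise ~15%
# (an illustrative assumption, not a datasheet figure):
f_scale = 1905 / 1333
v_scale = 1.15
print(round(watts * f_scale * v_scale**2, 1))  # -> ~29.6
```

Even with a modest assumed voltage increase, running the links at the full 1905 MHz would nearly double the interconnect power, which is why the example sticks to 1333 MHz.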
This means that a good chunk of the power budget goes directly to chiplet-to-chiplet communication, which is something AMD and NVIDIA need to solve before they deploy their chiplet-based GPUs.
AMD’s EHP as an example of a chiplet-based GPU
A few years ago, AMD published a paper describing a chiplet-based processor with an extremely complex GPU, with configurations of, for example, 320 Compute Units across 8 chiplets — that is, 40 Compute Units per chiplet, the equivalent of a complete Navi 10.
In other words, we are talking about a configuration 8 times more complex: imagine a setup with 8 chiplets, each one being like a Navi 10 / RDNA, running at speeds above 2 GHz with enormous power consumption.
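A back-of-the-envelope scaling of the earlier 15.6 W figure shows the size of the problem for an EHP-like part. This assumes, purely for illustration, that each Navi-10-class chiplet generates roughly the same link traffic as the whole example above:

```python
# Rough scaling of the earlier estimate to an EHP-like configuration:
# 8 chiplets, each comparable to a full Navi 10, each assumed to need
# a Navi-10-sized share of link bandwidth (illustrative assumption).

PER_CHIPLET_LINK_WATTS = 15.6  # from the earlier 1333 MHz estimate
N_CHIPLETS = 8

total = PER_CHIPLET_LINK_WATTS * N_CHIPLETS
print(round(total, 1))       # -> 124.8 W just for chiplet links

# With a link 10x more efficient per bit, as claimed for
# X3D / GRS-class interfaces:
print(round(total / 10, 2))  # -> 12.48 W
```

Over a hundred watts spent purely on interconnect, before any computation, is what makes a 10x more efficient link a prerequisite rather than a nice-to-have.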
This is the reason why AMD and NVIDIA have developed technologies such as X3D and GRS, communication interfaces with a power cost per transmitted bit 10 times lower than that of the current Infinity Fabric or NVLink, because without this type of communication interface the future of chiplet-based GPUs is not possible.