AMD Zen 3 architecture, technical characteristics and specifications

Whether the processor is chiplet-based or monolithic, the changes described here apply to all AMD Zen 3-based processors.

The cores of the AMD Zen 3 architecture

To understand the performance difference between Zen 2 and Zen 3, note first that the 19% increase in IPC (instructions per cycle) is an average: it comes from running a set of performance tests on processors of both architectures and averaging the results.
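As a rough illustration of how such an averaged figure can be derived, the sketch below computes an uplift from per-test scores. The source does not say which kind of average is used, so the geometric mean is an assumption here, and the function name `ipc_uplift` and all scores are hypothetical, not measurements.

```python
from math import prod

def ipc_uplift(old_scores, new_scores):
    """Return the average speedup minus 1, using a geometric mean.

    Assumes both score lists come from the same tests run at the same
    clock speed, so that score ratios reflect instructions per cycle.
    """
    if len(old_scores) != len(new_scores) or not old_scores:
        raise ValueError("need two non-empty score lists of equal length")
    ratios = [new / old for old, new in zip(old_scores, new_scores)]
    return prod(ratios) ** (1 / len(ratios)) - 1

# Hypothetical scores for three tests (illustrative only).
zen2_scores = [100.0, 200.0, 50.0]
zen3_scores = [120.0, 230.0, 60.0]
print(f"average uplift: {ipc_uplift(zen2_scores, zen3_scores):.1%}")
```

Because the result is an average, individual workloads can land well above or below the headline number.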

At first glance, Zen 3 may seem like a slightly improved version of Zen 2, because the changes affect parts of the processor that are rarely mentioned when discussing performance. Most of them were made in the control unit, or front end, of the processor.

Front-end improvements for Zen 3 cores

Among the novelties in the front end, the component that stands out the most is the new branch predictor, which has been improved to predict more branches per clock cycle. This comes with a series of changes, such as a redesign of the Branch Target Buffer (BTB): the L1 BTB doubled from 512 to 1024 entries, while the L2 BTB went from 7K entries in Zen 2 to 6.5K.

The new branch predictor also reduces the number of cycles lost when a prediction fails, so cores spend less time stalled after a misprediction.
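The effect of branch mispredictions on throughput can be approximated with a common first-order model: average cycles per instruction equal a base value plus the stall cycles contributed by mispredicted branches. The sketch below is only that model; the function name and every number in it are hypothetical and are not Zen 2 or Zen 3 figures.

```python
def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order estimate of average cycles per instruction.

    base_cpi        -- cycles per instruction with perfect prediction
    branch_fraction -- share of instructions that are branches
    mispredict_rate -- share of branches that are predicted wrongly
    penalty_cycles  -- cycles lost on each misprediction
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: 20% branches, 5% of them mispredicted.
slow_recovery = cycles_per_instruction(0.5, 0.20, 0.05, penalty_cycles=18)
fast_recovery = cycles_per_instruction(0.5, 0.20, 0.05, penalty_cycles=14)
print(slow_recovery, fast_recovery)
```

In this model, both a lower misprediction rate and a shorter recovery penalty reduce the cycles spent per instruction.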

On the other hand, we cannot forget that Zen 3 processors use the x86-64 register and instruction set, which means we are dealing with a CISC-type ISA. One advantage of CISC instruction sets is that they consolidate several simple operations into one complex instruction, which saves code space and energy when reading from RAM.

Unfortunately, complex instructions are very difficult to split into pipeline stages. This is why they must be decoded into simpler internal instructions called uops, or micro-operations: the x86-64 instructions are translated into a simpler form that is easier to pipeline.

In RISC processors the number of bytes per instruction is typically fixed, but in an ISA such as x86-64 instructions vary in size, which makes decoding complex and, for some instructions, makes the number of cycles needed to decode them too high. This is why a dedicated cache is needed to speed up this work: the so-called uop cache, which stores already decoded instructions and also reduces the energy consumed by the decode stage.
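The idea behind a uop cache can be sketched as a lookup table keyed by instruction address: the expensive decode runs only on a miss, and repeated code such as a loop body is served from the cache afterwards. This is a toy model, not AMD's design. The `UopCache` class, the `toy_decode` function, and the addresses and bytes are all hypothetical, and a real uop cache has limited capacity and a different organization.

```python
class UopCache:
    """Toy uop cache: remembers decoded uops by instruction address."""

    def __init__(self, decoder):
        self._decoder = decoder
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, address, raw_bytes):
        # Hit: skip the decoder entirely and reuse the stored uops.
        if address in self._entries:
            self.hits += 1
            return self._entries[address]
        # Miss: pay for the decode once, then remember the result.
        self.misses += 1
        uops = self._decoder(raw_bytes)
        self._entries[address] = uops
        return uops


def toy_decode(raw_bytes):
    """Hypothetical decoder: emits one made-up uop name per byte."""
    return tuple(f"uop_{b:02x}" for b in raw_bytes)


# A three-instruction loop body with variable-length encodings (made-up data).
loop_body = {0x1000: b"\x48\x01\xd8", 0x1003: b"\xff\xc1", 0x1005: b"\x75\xf9"}

cache = UopCache(toy_decode)
for _ in range(100):
    for address, raw in loop_body.items():
        cache.fetch(address, raw)

print(cache.misses, cache.hits)  # 3 297
```

Over 100 iterations of the loop, the decoder runs only three times, which is where both the time and the energy savings come from.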

We don’t know how much AMD has improved the uop cache and its communication, but it is one of the key levers for the performance of x86-64 processors and one of the things AMD says it has improved compared to Zen 2.

Execution units in Zen 3

When it comes to sending instructions, Zen 3 still has a dispatch unit that sends 6 macro-ops per cycle to the execution units, so the maximum IPC per core remains at 6 and has not, in theory, increased. Curiously, the front end is equipped to send 8 instructions simultaneously, which suggests that a future iteration of the Zen architecture may increase performance by increasing the number of execution units, but in Zen 3 it is used for

But increasing the number of execution units is not the only way to increase the number of instructions a processor can execute per cycle. Because architects have a limited budget of resources to build the whole architecture, different instructions often end up sharing the same execution path.

If two instructions that run in parallel use the same path, a conflict occurs: one instruction slows down the other because part of the pipeline is shared. This reduces the number of instructions per cycle and, with it, the performance of the processor.

Adding new paths so that certain instructions no longer conflict is therefore important for increasing the real IPC of an architecture. This is the case with the integer unit in Zen 3, where the number of paths available to instructions has been increased from 7 to 10 in order to avoid conflicts between them.
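A toy scheduler can illustrate why extra paths help even when the instruction mix stays the same. In the sketch below, each path (port) issues at most one operation per cycle and all operations are independent. The port names, the layouts, and the greedy assignment policy are hypothetical simplifications and do not describe Zen 3's actual integer unit.

```python
def cycles_to_issue(ops, ports):
    """Count cycles needed to issue independent ops with a greedy policy.

    ops   -- list of op kinds, e.g. ["alu", "branch"]
    ports -- dict mapping a port name to the set of op kinds it accepts
    Each port issues at most one op per cycle.
    """
    pending = list(ops)
    cycles = 0
    while pending:
        cycles += 1
        free = set(ports)
        remaining = []
        for op in pending:
            # Take the first free port (in name order) that accepts this op.
            port = next((p for p in sorted(free) if op in ports[p]), None)
            if port is None:
                remaining.append(op)  # conflict: wait for a later cycle
            else:
                free.remove(port)
        if len(remaining) == len(pending):
            raise ValueError("some op has no port that can execute it")
        pending = remaining
    return cycles


ops = ["alu", "alu", "branch"] * 2

# Hypothetical layouts: branches share the ALU ports, or get their own port.
shared = {"p0": {"alu", "branch"}, "p1": {"alu", "branch"}}
dedicated = {"p0": {"alu", "branch"}, "p1": {"alu", "branch"}, "p2": {"branch"}}

print(cycles_to_issue(ops, shared))     # 3
print(cycles_to_issue(ops, dedicated))  # 2
```

The same six operations finish a cycle earlier with the dedicated branch port, because branches no longer compete with ALU operations for a path.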

Another possible change is to rework the internals of some instructions so that they take fewer cycles to execute. This is the case with the FMAC instruction in the floating-point units, whose latency went from 5 clock cycles to just 4.
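The latency reduction matters most when each operation has to wait for the result of the previous one. The short calculation below assumes a fully dependent chain of FMAC operations, which is a best case for this change; code limited by throughput rather than latency would gain less. The function name and the chain length are illustrative.

```python
def dependent_chain_cycles(num_ops, latency_cycles):
    """Cycles for a chain where each op needs the previous op's result."""
    return num_ops * latency_cycles

chain_length = 1000  # illustrative chain length
zen2_cycles = dependent_chain_cycles(chain_length, 5)
zen3_cycles = dependent_chain_cycles(chain_length, 4)
print(zen2_cycles, zen3_cycles, f"{1 - zen3_cycles / zen2_cycles:.0%}")  # 5000 4000 20%
```

Under that assumption, going from 5 to 4 cycles cuts the time spent in the chain by 20%.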

The other type of unit that AMD has improved in Zen 3 is the Load/Store unit, which is responsible for moving data into and out of the processor. It can now perform up to 3 loads simultaneously, a 50% increase over Zen 2, while stores have gone from 1 to 2, with a 64-entry store queue.

The new 8-core CCX

For the first time in the Zen architectures, AMD has completely revamped the CCX: instead of being built around 4 cores, it is now built around 8. This implies changes, notably in the L3 cache, which is now unified and shared by 8 cores instead of 4.

The L3 cache in Zen architectures is a victim cache: it holds the cache lines evicted from the L2 caches, which means the L3 does not take part in requesting and fetching data and instructions from memory. The advantage of the 8-core CCX is therefore in communication between cores that previously sat in different CCXs, which has improved.
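The victim-cache behaviour can be sketched with two small LRU structures: lines fetched from memory go straight into L2, and the L3 only receives lines that L2 evicts. This is a simplified, strictly exclusive model with a single L2; the `VictimHierarchy` class and the capacities are hypothetical and far smaller than any real cache.

```python
from collections import OrderedDict

class VictimHierarchy:
    """Toy L2 plus victim L3, both with LRU replacement."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, line):
        """Return where the line was found: 'L2', 'L3' or 'memory'."""
        if line in self.l2:
            self.l2.move_to_end(line)  # mark as most recently used
            return "L2"
        if line in self.l3:
            del self.l3[line]  # the line moves back up into L2
            source = "L3"
        else:
            source = "memory"  # memory fills bypass L3 entirely
        self._fill_l2(line)
        return source

    def _fill_l2(self, line):
        self.l2[line] = True
        if len(self.l2) > self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict the LRU line
            self.l3[victim] = True                   # L3 adopts the victim
            if len(self.l3) > self.l3_lines:
                self.l3.popitem(last=False)


hierarchy = VictimHierarchy(l2_lines=2, l3_lines=4)
print([hierarchy.access(line) for line in "ABCA"])
# ['memory', 'memory', 'memory', 'L3']
```

Line A is pushed out of L2 when C arrives, so the second access to A is served by the L3 instead of going back to memory.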

Keep in mind that multiple execution threads can interact with each other. If the communication distance between the cores running them increases, so does the execution time. It doesn’t matter if one core is faster than the other, because the slower one will hold back the faster one. This is why AMD unified the 8 cores into a single CCX: to avoid latencies caused by communication between cores.

Some changes to the processor Northbridge

Compared to Zen 2, the changes to the Northbridge, or Scalable Data Fabric (SDF) in AMD jargon, are few.

In the case of the chiplet-based Ryzen 5000 for desktop, the changes compared to the Ryzen 3000 IOD are almost zero, apart from support for faster DDR4 memory than Zen 2 supported and better optimization of energy consumption.
The same can be said for monolithic SoCs based on Zen 3: except for the new CCX, the rest of the processor is the same as in those based on Zen 2.

In fact, we should see a new SDF or IOD with the arrival of DDR5 memory and new I/O standards like PCIe 5.0. This should happen when AMD launches the Zen 4 architecture, or we might end up with a Ryzen 6000 series that pairs a new IOD with Zen 3 cores, but only AMD can say.