At this point, all of Intel’s leading processor lineups have moved away from the Skylake core architecture, adopting Sunny Cove in one form or another. The 11th Gen mobile lineup leverages Willow Cove, but that’s essentially the same design on a more mature node, paired with a lot more cache memory. Both the 11th Gen Rocket Lake-S and Ice Lake-SP CPUs leverage Sunny Cove, albeit on different nodes, but the fundamentals are more or less the same. In this post, we look at Intel’s Sunny Cove core architecture and AMD’s latest Zen 3 core, which powers the Ryzen 5000 lineup, side by side to see how they stack up against one another.
Intel Sunny Cove vs AMD Zen 3 Frontend: Fetch, Decode, and Branch
Compared to their predecessors, both Zen 3 and Sunny Cove have undergone several major changes to improve throughput. The various instruction queues and pipelines are wider, registers are denser and the re-order buffer has been amped up to improve OoO execution, thanks to the use of a smaller node.
On Intel’s side, the front-end is largely unchanged since the previous generation. The Instruction Fetch & Pre-decode (deciphering instruction length) is still 16 bytes wide, sending down six macro-ops per cycle. The instruction queue is also pegged at 50 entries (2×25), with the capability to fuse similar instructions. The decoder is a four-way 1:1 design that can translate simple macro-ops that only decompose into one micro-op. An additional 1:4 complex decoder is present that can decode a complex macro-op into one to four micro-ops.
Finally, the Microcode sequencer ROM (MS ROM) is used for translations involving the most complex instructions that can decompose into more than four micro-ops, potentially involving multiple branches and loops, taking up several decode cycles.
The micro-op cache, which holds decoded instructions, has grown from 1.5K entries (on Skylake) to 2K on Sunny Cove, although it’s still an 8-way cache with a 64B window. Overall, the front-end still dispatches four or five instructions per cycle from the decoders, or six from the u-op cache if a translation is already available.
AMD’s Zen 2 and Zen 3 are a bit different in this department, primarily with respect to the branch prediction and decoders. While Intel doesn’t share much of the details regarding its Core-class predictor, it supposedly improved the accuracy with Sunny Cove. AMD, on the other hand, uses a TAGE predictor which has been updated with Zen 3 to include “Zero Bubble” prediction, with faster op-cache sequencing, lower latencies, and quicker recovery in case of a misprediction.
Furthermore, while the decoder is just a 4-way x86 decoder, the Instruction Fetch & Pre-decode is pretty much twice as wide as on Intel’s Skylake and Sunny Cove cores: 32 bytes vs 16 bytes. In contrast, Intel’s L2 bandwidth is 64 bytes/cycle while Zen 3 is limited to 32 bytes/cycle. In line with this, Sunny Cove has a much larger L2 TLB holding up to 2K entries, while AMD’s Zen 3 is limited to just 512. The L1 TLB is most likely the same size on both core architectures at 64 entries.
Similar to Sunny Cove, both Zen 2 and Zen 3 have an 8-way 32KB L1I cache, with the L1 Branch Target Buffer being doubled to 1,024 entries (512 on Zen 2) on Zen 3, L0 being unchanged at 16, and L2 BTB slightly wider at 6.5K (6K on Zen 2). AMD uses a standard x86 decoder capable of decoding four macro-instructions per cycle, with the micro-op cache delivering up to eight cached instructions, making it much more effective than the standard decode pipeline. It’s for this reason that u-op caches of most modern processors have a high hit rate of over 75% in most cases. Furthermore, while Intel handles instruction fusion in the prefetch stage (one per cycle with logic/arithmetic instructions and following conditional branches), AMD does it in the op-cache (mostly with branch instructions), further improving its bandwidth.
Overall, the front-end delivers 4 or 8 instructions per cycle to the u-op queue, depending on whether cached instructions are available or the full-fledged decode process is required. AMD’s Zen 2 and Zen 3 cores feature 1:2 decoders, while the MS ROM is used for complex instructions that decompose into more than two micro-ops.
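To get a rough feel for what the two delivery paths mean in practice, the per-cycle widths quoted above can be blended by the op-cache hit rate. This is only a back-of-the-envelope sketch: the function name `expected_ops_per_cycle` is made up for illustration, the 75% hit rate is the lower bound mentioned earlier, and the model ignores queue occupancy, fusion, and any limit further down the pipeline.

```python
def expected_ops_per_cycle(hit_rate, cached_width, decode_width):
    """Weighted average of the two front-end delivery paths (simplified model)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_width + (1.0 - hit_rate) * decode_width


# Zen 3 figures from the text: 8 per cycle from the op cache, 4 from the decoders.
zen3 = expected_ops_per_cycle(0.75, cached_width=8, decode_width=4)

# Sunny Cove figures from the text: 6 from the u-op cache, up to 5 from the decoders.
sunny_cove = expected_ops_per_cycle(0.75, cached_width=6, decode_width=5)

print(f"Zen 3: {zen3:.2f} ops/cycle, Sunny Cove: {sunny_cove:.2f} ops/cycle")
```

These are upper bounds on front-end delivery, not measured IPC; the point is simply that the higher the hit rate, the closer each core gets to its cached-path width.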
In the case of Intel’s cores, the macro-op queue, Instruction Decode Queue, and the micro-op cache remain partitioned across the SMT threads running on the same physical core, while AMD allows the micro-op cache, decoders, IDQ, and MSROM to be competitively shared amongst the co-located SMT threads.
Sunny Cove vs Zen 3 Backend: ReOrder, Rename and Execution (and Registers)
The use of the 10nm process (larger die in RKL) allowed Intel to significantly increase the size of the OoO buffer, growing from 224 entries in Skylake to 352 in Sunny Cove, making it the largest in any x86 processor and significantly boosting Sunny Cove’s OoO execution. Furthermore, the scheduler and registers have also been notably beefed up compared to Skylake. The Instruction Scheduler or Reservation Station in Sunny Cove is more than 50% larger at 160 entries (97 in Skylake), while the Integer register file has also gone up by a hundred entries to 280. The FP register file also sees a modest boost, going from 168 to 224.
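The relative growth of these structures is easy to work out from the entry counts quoted above. The snippet below is a small sketch using only those figures; the Skylake integer register count of 180 is inferred from "gone up by a hundred to 280", and the helper name `growth_percent` is just for illustration.

```python
def growth_percent(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100.0


# (Skylake, Sunny Cove) entry counts as quoted in the text.
skylake_to_sunny_cove = {
    "scheduler entries": (97, 160),
    "integer registers": (180, 280),  # 180 inferred from "up by a hundred to 280"
    "FP registers": (168, 224),
}

growth = {
    name: growth_percent(old, new)
    for name, (old, new) in skylake_to_sunny_cove.items()
}

for name, pct in growth.items():
    print(f"{name}: +{pct:.1f}%")
```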
AMD’s Zen core separates its Integer and Floating-Point pipelines early on, similar to the older Bulldozer designs. As such, the scheduler queues for the two pipelines are separate, much like the hardware registers. The Integer side got a fair bit of attention with Zen 3, increasing its scheduler to 96 entries (from 92 on Zen 2) despite being on the same node. The ALU and AGU schedulers are now shared, split into four queues of 24 entries each. Earlier, there were four ALU schedulers with a combined 64 entries, while the AGUs had a single 28-entry scheduler.
Although the ROB has seen a marginal boost from 224 to 256 entries, it’s still a far cry from Intel’s 352-entry design. The same is the case with the registers, with the Integer register file pegged at 192 entries while Intel has room for 280. The number of execution units at the end of the pipeline hasn’t increased, but they have undergone a makeover. There are three dedicated INT ALUs, three AGUs, one dedicated Branch Unit, and one unit shared between the ALU and Branch functions. Overall, AMD has still managed to increase its Integer throughput with Zen 3 by roughly 43% to 10 (from 7 in Zen 2): 4 from the ALUs, 3 from the AGUs, one from the branch unit, and two from the Store unit.
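As a quick sanity check, the per-unit breakdown quoted above can be tallied directly. This is just arithmetic on the article’s own numbers; the dictionary keys are illustrative labels, and the Zen 2 baseline of 7 is the figure given in the text.

```python
# Zen 3 integer-side issue slots per cycle, as broken down in the text.
zen3_int_issue = {"ALU": 4, "AGU": 3, "branch": 1, "store": 2}

zen3_total = sum(zen3_int_issue.values())
zen2_total = 7  # Zen 2 figure quoted in the text

# Relative increase over Zen 2, in percent.
relative_increase = (zen3_total - zen2_total) / zen2_total * 100.0

print(f"Zen 3 total: {zen3_total}, increase over Zen 2: {relative_increase:.1f}%")
```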
In an OoO design, upon execution, an instruction’s result is stored in the ROB, and when the instruction is committed, the value is moved to the physical register file. The ROB holds the results of instructions after execution but before commit, along with bookkeeping information (instruction type, flags, name of the result register), and the architectural register file stores the latest committed value for every architectural register. The register alias table (RAT), also called the “renaming table”, maps logical to physical registers, indicating whether the latest definition of each architectural register currently lives in the ROB or in the physical register file.
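The interplay between the RAT and the ROB is easier to see in a toy model. The sketch below is a deliberately simplified illustration, not how either Sunny Cove or Zen 3 is actually built: it tracks only the logical-to-physical mapping (no values), uses a free list of physical registers, and retires strictly in order. The class name `RenameSketch`, the register names, and the sizes are all hypothetical.

```python
from collections import deque


class RenameSketch:
    """Toy register renamer: RAT + free list + in-order ROB (hypothetical model)."""

    def __init__(self, logical_regs, num_physical, rob_size):
        if num_physical <= len(logical_regs):
            raise ValueError("need more physical than logical registers")
        # Initially, logical register i is held in physical register i.
        self.rat = {name: i for i, name in enumerate(logical_regs)}
        self.free = deque(range(len(logical_regs), num_physical))
        self.rob = deque()
        self.rob_size = rob_size

    def rename(self, dest, srcs):
        """Allocate a ROB entry and a fresh physical register for dest.

        Returns (source physical regs, new dest physical reg), or None when
        the ROB is full or no physical register is free (a rename stall).
        """
        if len(self.rob) >= self.rob_size or not self.free:
            return None
        # Sources must be looked up before the destination is remapped.
        src_phys = [self.rat[s] for s in srcs]
        new_phys = self.free.popleft()
        old_phys = self.rat[dest]
        self.rat[dest] = new_phys
        self.rob.append((dest, new_phys, old_phys))
        return src_phys, new_phys

    def commit(self):
        """Retire the oldest instruction and recycle the register it superseded."""
        dest, new_phys, old_phys = self.rob.popleft()
        self.free.append(old_phys)
        return dest, new_phys


r = RenameSketch(["rax", "rbx"], num_physical=4, rob_size=2)
first = r.rename("rax", ["rax", "rbx"])  # e.g. add rax, rbx
second = r.rename("rax", ["rax"])        # e.g. inc rax, reads the first result
third = r.rename("rbx", ["rax"])         # stalls: the 2-entry ROB is full
print(first, second, third)
```

Even in this tiny model, the size of the ROB and the number of spare physical registers decide how far ahead the core can rename before it stalls, which is why both vendors keep growing these structures.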
On the Floating-Point side, the pipeline has once again been widened, going from four to six ports in Zen 3. The scheduler queue has been nearly doubled, growing from 36 to 64 while the non-scheduling queue is unchanged at 64. Overall, Zen 3 has an impressive 128-wide FP queue. In contrast, Intel’s RS is limited to 160 which is shared between the integer and floating-point instructions. The larger ROB with a size of 352 makes up for this.
The ADD and MUL/MAC units have been separated, and two ports have been dedicated to F2I (FP to INT registers), with one also handling the stores. The FP register file has a total of 160 entries, each 256 bits wide. Interestingly, the load-store unit (which handles, well, loads and stores) is also 256 bits wide, transferring 32 bytes per cycle. In comparison, Sunny Cove can do a 128-bit load and a 64-bit load on the same cycle, thanks to the higher L1D bandwidth of 64 bytes.
Zen 3 can perform three loads or two stores per cycle, while Zen 2 was capable of two loads and one store. As you can see, the bandwidth is the same, but it’s more flexible. Intel’s Sunny Cove core can do two loads and two stores at the same time thanks to its dedicated ports and larger buffers. It has much larger load and store buffers of 128 and 72 entries, respectively. Skylake, on the other hand, has 72 entries in the load buffer and 56 in the store buffer. Zen 3 is limited to just 44 load and 48 store entries.
Execution Units (Sunny Cove)
Sunny Cove is the first core to support native AVX-512 execution (without division into micro-ops). It can perform one 512-bit FMA (fused multiply-add) or two 256-bit FMAs per cycle. The INT throughput is pegged at four instructions per cycle. The server core, unlike the mobile one, can do one FMA512 plus two 256-bit FMAs per cycle. In comparison, AMD’s Zen 3 can perform two 256-bit MUL/MACs and two 256-bit ADDs. FMAC takes four cycles, one cycle less than on Zen 2. It also has an impressive INT throughput of up to 10 instructions per cycle (including loads and BR).
Another tidbit to remember is that Sunny Cove has a larger L1D cache of 48KB, which facilitates the higher number of loads/stores, while Zen 3 (like Zen 2) is limited to 32KB. The L2 cache size on the two is identical at 512KB.
AnandTech has a neat comparison of Skylake and Sunny Cove:
Diagram Credits go to Intel, AMD, WikiChip and Hiroshige Goto