AMD has begun winding down active support for its GCN-based Vega and Polaris GPUs. The most recent sign was the end of Vulkan driver support on Linux for these long-lived graphics processors. While Polaris and Vega-based Radeon cards will continue to receive bug fixes and security updates, they'll no longer get the latest features and content updates released for Navi.
This retirement process began a while back with the Adrenalin 23.9 driver package. In September, AMD split the driver into separate branches for GCN and RDNA, with only the latter receiving game optimizations and GPUOpen technologies.
AMD GCN vs RDNA GPU Architectures
Let’s revisit the GCN graphics architecture, which served AMD for nearly a decade before being replaced by RDNA and CDNA. One of the primary reasons GCN was abandoned is that it is a compute-oriented design, better suited to number crunching than to gaming. It also didn’t scale well beyond a certain point, no matter how many transistors were thrown at it.
Consequently, AMD had to go back to the drawing board and start afresh. That said, not everything about GCN was discarded: some of its features live on in the Navi GPUs, while CDNA is essentially a beefed-up GCN.
Bigger Isn’t Always Better
GPUs thrive on parallel processing, which helps hide the high latency inherent in SIMD designs. The higher the hardware utilization, the more efficient the architecture, and this is usually achieved by keeping several thousand warps/waves in flight. With GCN (and Bulldozer on the CPU side), AMD created a compute monster with a long work queue and long execution times.
Each of AMD’s GCN Compute Units consisted of 64 shaders (stream processors) divided into four SIMD units of 16 SPs each. A SIMD unit executes a single operation across 16 items of its work queue simultaneously (vector execution). The four SIMD units can also work on separate wavefronts, each of which contains four sets of 16 work items.
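As a quick back-of-the-envelope check, here is that layout expressed in Python, using only the figures quoted above (nothing here queries real hardware):

```python
# GCN Compute Unit layout, using the figures quoted above.
SIMDS_PER_CU = 4       # four SIMD16 units per Compute Unit
LANES_PER_SIMD = 16    # each SIMD executes 16 work items per cycle
WAVEFRONT_SIZE = 64    # a GCN wavefront holds 64 work items

shaders_per_cu = SIMDS_PER_CU * LANES_PER_SIMD
sets_per_wave = WAVEFRONT_SIZE // LANES_PER_SIMD

print(shaders_per_cu)  # 64 stream processors per CU
print(sets_per_wave)   # a wave64 splits into 4 sets of 16 items
```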
Slow and Steady…
Overall, each SIMD can execute an entire wave (64 items), but over four clock cycles instead of one or two. This implies that each CU can process four waves over four clock cycles. The scheduler can issue one instruction from any of the four waves on the CU to one of its SIMDs. This adds up to four instruction issues over four cycles. Additionally, a GCN SIMD can keep up to 10 waves in flight for a total of 40 per Compute Unit.
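The occupancy math from that paragraph works out as follows; this is just the quoted arithmetic expressed in Python:

```python
# GCN issue and occupancy arithmetic (figures from the paragraph above).
WAVE_SIZE = 64
SIMD_WIDTH = 16
SIMDS_PER_CU = 4
MAX_WAVES_PER_SIMD = 10

cycles_per_wave = WAVE_SIZE // SIMD_WIDTH         # 4 cycles to push a wave64 through a SIMD16
waves_in_flight_per_cu = MAX_WAVES_PER_SIMD * SIMDS_PER_CU

print(cycles_per_wave)          # 4
print(waves_in_flight_per_cu)   # 40 waves resident per Compute Unit
```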
Unfortunately for Team Radeon, most games issued short work queues, leaving the SIMDs/CUs underutilized, as few or no instructions would be issued after the initial dispatch. Competing GeForce cards, by contrast, had shorter execution times of one or two cycles, resulting in more instruction issues over the same period.
GCN to RDNA: A Change in Wave
One of the most notable changes with RDNA was the introduction of wider SIMD units capable of executing 32-item wavefronts every cycle. The number of shaders per Compute Unit is still 64, but they are distributed across two 32-wide SIMD32 units. Additionally, each CU is paired with another to form a Work Group Processor (WGP), with the pair sharing the scalar and instruction caches, among other resources.
This arrangement allows an entire wavefront to execute in a single clock cycle, reducing bottlenecks and boosting IPC by 4x. Because a wavefront completes 4x faster, registers and cache are freed up sooner, allowing more instructions to be scheduled overall. Furthermore, wave32 uses half as many registers as wave64, reducing circuit complexity and costs.
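A minimal sketch of where that 4x figure comes from, assuming nothing beyond the wave and SIMD widths already mentioned:

```python
# Cycles for one SIMD to retire one full wavefront.
def cycles_to_retire(wave_size: int, simd_width: int) -> int:
    return wave_size // simd_width

gcn = cycles_to_retire(wave_size=64, simd_width=16)    # wave64 on SIMD16 -> 4 cycles
rdna = cycles_to_retire(wave_size=32, simd_width=32)   # wave32 on SIMD32 -> 1 cycle

print(gcn, rdna, gcn // rdna)   # 4 1 4 -> the 4x improvement cited above
```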
The vector register files have also been optimized for the narrower wavefronts. Each vector general-purpose register (vGPR) now contains 32 lanes that are 32 bits wide (for FP32), and a SIMD contains a total of 1,024 vGPRs, 4x the number of registers in GCN. RDNA also doubles the number of scalar units per CU (1>2).
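Taking those register figures at face value, the vector register file per SIMD works out to 128 KiB; a quick check:

```python
# Vector register file capacity per RDNA SIMD, derived from the figures above.
VGPRS_PER_SIMD = 1024   # vector general-purpose registers per SIMD
LANES_PER_VGPR = 32     # 32 lanes per register
BITS_PER_LANE = 32      # each lane is 32 bits wide (FP32)

total_bytes = VGPRS_PER_SIMD * LANES_PER_VGPR * BITS_PER_LANE // 8
print(total_bytes // 1024, "KiB of vGPR storage per SIMD")   # 128 KiB
```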
A Compute Unit in RDNA 1 can dispatch four instructions per cycle: two scalar and two vector. Consequently, an RDNA 1 WGP has a throughput of 128 vector and four scalar operations per clock. Furthermore, each SIMD can keep up to 20 wavefronts in flight, courtesy of its 10KB scalar register file.
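Putting the WGP numbers together (again, only the figures quoted above):

```python
# Per-clock throughput of an RDNA 1 WGP, using the figures quoted above.
CUS_PER_WGP = 2
SIMDS_PER_CU = 2            # two SIMD32 units per RDNA CU
LANES_PER_SIMD = 32
SCALAR_UNITS_PER_CU = 2     # doubled from GCN's one per CU

vector_lanes_per_clock = CUS_PER_WGP * SIMDS_PER_CU * LANES_PER_SIMD
scalar_ops_per_clock = CUS_PER_WGP * SCALAR_UNITS_PER_CU

print(vector_lanes_per_clock, scalar_ops_per_clock)   # 128 vector and 4 scalar ops per clock
```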
More Cache and Shared Cache
The GCN graphics architecture relied on two cache levels, but RDNA adds a third “L1” layer between the L0 and L2 caches. The L0 cache is private to each Dual Compute Unit (DCU), while the L1 cache is shared across a group of DCUs, reducing the load on the L2 cache. In GCN, all misses from the per-CU L1 cache were handled directly by the L2; in RDNA, the new L1 cache centralizes caching within each shader array.
The L1 graphics cache is shared between four WGPs, i.e., a shader array. The L1 cache controller coordinates memory requests and forwards four per clock cycle, one to each L1 bank. As in any other cache hierarchy, L1 misses are serviced by the L2 cache.
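To make that request flow concrete, here is a toy Python model of the L0 -> L1 -> L2 path described above; the class and method names are invented for illustration and ignore banking, coherence, and replacement policy entirely:

```python
# Toy model of RDNA's cache hierarchy; illustrative only, not a hardware description.
from typing import Optional

class CacheLevel:
    def __init__(self, name: str, parent: Optional["CacheLevel"] = None):
        self.name = name
        self.parent = parent           # next level that services our misses
        self.lines: set[int] = set()   # addresses currently resident

    def read(self, addr: int) -> str:
        if addr in self.lines:
            return f"hit in {self.name}"
        self.lines.add(addr)           # fill on the way back
        upstream = self.parent.read(addr) if self.parent else "memory"
        return f"miss in {self.name} -> {upstream}"

l2 = CacheLevel("L2 (chip-wide)")
l1 = CacheLevel("L1 (per shader array)", parent=l2)
l0 = CacheLevel("L0 (per dual compute unit)", parent=l1)

print(l0.read(0x100))   # cold read misses every level and falls through to memory
print(l0.read(0x100))   # warm read hits in L0
```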
On the Polaris GPUs, only the Compute Units were clients of the L2 cache; the RBs, Copy Engine, and Command Processor (CP) wrote directly to memory, resulting in frequent L2 flushes. Vega refined this design by making the RBs L2 clients as well, reducing the number of flushes. RDNA goes a step further than the GCN derivatives by making the Copy Engine an L2 client too, which should cut flushes even more.
Backend and Texture Units
The final fixed-function graphics stage in a modern GPU is the Render Backend (RB). It performs depth, stencil, and alpha tests and blends pixels for anti-aliasing. Each RB in the shader array can test, sample, and blend pixels at a rate of four output pixels per clock.
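As a rough illustration of what that per-RB rate means for pixel fill rate, here is a small sketch; the RB count and clock speed are placeholder values, not figures from this article:

```python
# Peak pixel fill rate from the per-RB rate quoted above.
# num_rbs and clock_ghz are hypothetical example values.
def peak_fill_rate_gpix(num_rbs: int, clock_ghz: float, pixels_per_rb: int = 4) -> float:
    return num_rbs * pixels_per_rb * clock_ghz

print(peak_fill_rate_gpix(num_rbs=16, clock_ghz=1.8))   # 115.2 Gpixels/s for a hypothetical 16-RB GPU
```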
The primary change in RDNA is that the RBs access data through the graphics L1 cache, which reduces pressure on the L2 cache and saves power by moving less data. In GCN, the RBs wrote data directly to memory; Vega changed this to route it via the L2 cache.
The Texture Units on Navi are also significantly faster than Vega’s. Load and store processing is several times faster than on GCN, making it easier for the GPU to reach its maximum bandwidth through both loads and stores.
RDNA vs GCN: ALU Utilization Comparison
It’s much easier to saturate the SIMDs, and thereby the WGPs, on Navi (RDNA) than on GCN. One WGP (2 CUs) requires just 128 threads (4 SIMDs × 32 items) to reach 100% ALU utilization. GCN, on the other hand, needed 512 threads (2 CUs × 4 SIMDs × 64 items) to reach 100% utilization.
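The same arithmetic in Python, using the counts from this section:

```python
# Threads needed for 100% ALU utilization, using the counts above.
def threads_to_saturate(cus: int, simds_per_cu: int, wave_items: int) -> int:
    # one full wavefront resident on every SIMD
    return cus * simds_per_cu * wave_items

rdna_wgp = threads_to_saturate(cus=2, simds_per_cu=2, wave_items=32)   # 128 threads per WGP
gcn_pair = threads_to_saturate(cus=2, simds_per_cu=4, wave_items=64)   # 512 threads per pair of CUs

print(rdna_wgp, gcn_pair)
```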