One of the most prominent changes in NVIDIA’s Ada Lovelace microarchitecture is in the memory subsystem. The midrange RTX 40 series GPUs come with a narrower memory bus, compensated for by a larger L2 cache. The GeForce RTX 4060 Ti has a paltry 128-bit bus but delivers an effective memory bandwidth of 554GB/s, higher than the 448GB/s offered by the RTX 3060 Ti, even though the older card has a wider 256-bit bus.
In a blog post, NVIDIA explains how its RTX 40 series GPUs deliver a higher effective memory bandwidth (the bandwidth between the shaders and game assets) than their predecessors despite featuring narrower buses. We’re not going to talk about the size of the VRAM buffer for the time being.
NVIDIA tested the GeForce RTX 4060 Ti, with its 32MB L2 cache, against a special variant with just 2MB of L2 cache (comparable to the L2 cache of previous-gen RTX 30 series GPUs) and got the following results:
The larger L2 cache significantly improves the cache hit rate and cuts memory bus traffic by over 50% compared to the 2MB variant. The cache hits and VRAM accesses for both configurations can be seen in the accompanying images:
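To see why the hit rate matters so much, here is a minimal sketch of the relationship: only requests that miss in L2 have to travel over the memory bus, so bus traffic scales with the miss rate. The function names and the hit rates in the example are hypothetical placeholders chosen for illustration, not figures from NVIDIA.

```python
def dram_traffic(requested_gb: float, l2_hit_rate: float) -> float:
    """Data (in GB) that must cross the memory bus: only L2 misses reach VRAM."""
    if not 0.0 <= l2_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return requested_gb * (1.0 - l2_hit_rate)


def traffic_reduction(old_hit_rate: float, new_hit_rate: float) -> float:
    """Fractional drop in bus traffic when the hit rate goes from old to new."""
    old = dram_traffic(1.0, old_hit_rate)
    new = dram_traffic(1.0, new_hit_rate)
    if old == 0.0:
        raise ValueError("old configuration generates no bus traffic")
    return 1.0 - new / old


# Hypothetical hit rates (assumed, not measured): 40% with a small L2, 70% with a large one.
# The miss rate falls from 60% to 30%, so bus traffic is roughly halved.
print(f"{traffic_reduction(0.40, 0.70):.0%}")
```

Note that a 30-point gain in hit rate halves the traffic here because what counts is the ratio of the miss rates, not the difference in hit rates.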
While the larger L2 cache has little impact on VRAM utilization, the RTX 4060 Ti’s memory subsystem performs better than the wider 256-bit bus and smaller L2 cache of the RTX 3060 Ti/3070. We’re looking at a GPU with a 128-bit bus and a theoretical bandwidth of 288GB/s, effectively delivering up to 554GB/s of peak bandwidth. The RTX 3060 Ti and 3070 top out at 448GB/s, below the RTX 4060 Ti’s effective figure.
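The arithmetic behind these figures can be sketched as follows. Theoretical bandwidth is the bus width multiplied by the per-pin data rate, divided by 8 to convert bits to bytes. The data rates of 18Gbps and 14Gbps are assumptions on our part, picked because they reproduce the 288GB/s and 448GB/s figures quoted above. The "effective bandwidth" formula is a simplified model (raw bandwidth divided by the fraction of traffic that still reaches VRAM), not NVIDIA's published methodology.

```python
def raw_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical bandwidth in GB/s: bits per transfer * Gbps per pin / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


def effective_bandwidth_gbs(raw_gbs: float, traffic_reduction: float) -> float:
    """Simplified model: if the cache removes a fraction of bus traffic,
    the same bus serves proportionally more requests."""
    if not 0.0 <= traffic_reduction < 1.0:
        raise ValueError("traffic reduction must be in [0, 1)")
    return raw_gbs / (1.0 - traffic_reduction)


# Assumed per-pin data rates (not stated in the article).
rtx_4060_ti_raw = raw_bandwidth_gbs(128, 18.0)  # 288.0 GB/s
rtx_3060_ti_raw = raw_bandwidth_gbs(256, 14.0)  # 448.0 GB/s

# Traffic reduction implied by the quoted 554GB/s effective figure under this model.
implied_reduction = 1.0 - rtx_4060_ti_raw / 554.0

print(rtx_4060_ti_raw, rtx_3060_ti_raw)
print(f"implied traffic reduction: {implied_reduction:.0%}")
```

Under this model the quoted 554GB/s corresponds to a traffic reduction of roughly 48%, in the same ballpark as the roughly 50% NVIDIA cites. An exact 50% reduction would give 576GB/s.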
NVIDIA also took a shot at AMD’s Infinity Cache, explaining how a large L2 cache is faster and more efficient than a small L2 plus large L3 cache combo. This makes sense, as each cache level is slower than the one before it (the one closer to the shaders). A more complex cache hierarchy (more levels) results in a higher number of reads and writes, increasing latency and power consumption.
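NVIDIA's argument can be illustrated with a simple average-access-time model, in which a request pays the lookup latency of every cache level it reaches and falls through to VRAM only when all levels miss. This is a rough sketch: every hit rate and latency below is a made-up placeholder, not a measurement of any NVIDIA or AMD GPU, and the outcome depends entirely on those inputs.

```python
def average_access_time(levels, memory_latency_ns):
    """Average latency of one request.

    levels: list of (hit_rate, latency_ns) tuples, ordered from the cache
    closest to the shaders to the one farthest away.
    """
    total = 0.0
    reach = 1.0  # fraction of requests that get as far as the current level
    for hit_rate, latency_ns in levels:
        total += reach * latency_ns   # everyone who reaches this level pays its lookup
        reach *= 1.0 - hit_rate       # only the misses continue to the next level
    return total + reach * memory_latency_ns


# Hypothetical numbers, chosen so both designs send the same 30% of requests to VRAM.
large_l2 = average_access_time([(0.70, 100.0)], memory_latency_ns=300.0)
small_l2_plus_l3 = average_access_time([(0.40, 90.0), (0.50, 150.0)], memory_latency_ns=300.0)

print(f"large L2: {large_l2:.0f} ns, small L2 + L3: {small_l2_plus_l3:.0f} ns")
```

With these placeholder values the single large L2 averages about 190 ns against about 270 ns for the two-level setup, because requests that miss the small L2 pay for an extra lookup. Different latencies or hit rates could narrow or reverse that gap, so treat this as an illustration of the reasoning rather than a verdict on either design.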