AMD's Infinity Cache May Solve Big Navi's Rumored Mediocre Memory Bandwidth

AMD (via @momomo_us) has trademarked the term "AMD Infinity Cache." The filing, which is on the Justia Trademarks website, applies to both the chipmaker's processors and graphics cards. In fact, the description of the trademark is so broad that it encompasses just about every type of silicon that AMD manufactures.

The common consensus, though, is that the trademark is tied to AMD's pending Big Navi launch. Memory bandwidth is one of the major talking points about Nvidia's Ampere. The GeForce RTX 3090 flaunts an impressive memory bandwidth of up to 936.2 GBps. The GeForce RTX 3080 and GeForce RTX 3070 aren't too shabby either, with theoretical peaks of 760.3 GBps and 448 GBps, respectively.

In contrast, early leaked specifications (which should be taken with a grain of salt) for the Radeon RX 6000 series suggest that the Radeon RX 6900 might be limited to a 256-bit memory interface. The news caused a bit of distress within hardware circles, as Big Navi might land with disappointing memory bandwidth. However, the rumors also mentioned the existence of a special cache that could be a game-changer.
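To see why a 256-bit interface raises eyebrows, here is a rough sketch of how theoretical peak bandwidth follows from bus width and per-pin data rate. The Ampere bus widths and nominal data rates below come from Nvidia's published specifications, not from the leak, and the nominal rates land a hair under the 936.2 GBps and 760.3 GBps figures quoted above. The 16 Gbps rate for a 256-bit card is purely our assumption for illustration; nothing about Big Navi's memory speed is confirmed.

```python
def peak_bandwidth_gbps(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak bandwidth in GBps: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


# (bus width in bits, nominal per-pin data rate in Gbps)
cards = {
    "GeForce RTX 3090": (384, 19.5),
    "GeForce RTX 3080": (320, 19.0),
    "GeForce RTX 3070": (256, 14.0),
    # Hypothetical: the 16 Gbps memory speed is an assumption, not a leaked spec.
    "256-bit card with 16 Gbps memory (assumed)": (256, 16.0),
}

for name, (bus_width, data_rate) in cards.items():
    print(f"{name}: {peak_bandwidth_gbps(bus_width, data_rate):.1f} GBps")
```

Under that assumption, the 256-bit card would top out at 512 GBps, well short of the two faster Ampere cards.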

Other than the folks at AMD, we doubt anyone has any idea of what the Infinity Cache is truly all about. It might be a new feature, or it could just be a fancy term for an existing concept. For example, AMD branded the L3 cache on its Zen 2 processors as GameCache. It sounds great for marketing, but at the end of the day, it's still just the L3 cache that we've all come to know from most modern CPUs.

When it comes to processors, the cache serves as temporary data storage that allows data to be retrieved quickly. However, the cache is very small, so you can't expect the processor to find all the data it wants inside the cache. A 'cache hit' refers to what happens when the requested data is present in the cache, and a 'cache miss' happens when the data is not readily available.
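A toy model can make hits and misses concrete. The sketch below is a tiny cache with a least-recently-used (LRU) eviction policy. The class name, the two-entry capacity, and the LRU policy itself are our assumptions for illustration; real hardware caches are organized very differently.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> placeholder, ordered oldest to newest
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Return True on a cache hit, False on a cache miss."""
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.lines[address] = True  # fetch from slower memory and store
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used entry
        return False


cache = LRUCache(capacity=2)
results = [cache.access(address) for address in [1, 2, 1, 3, 2]]
print(results)                   # [False, False, True, False, False]
print(cache.hits, cache.misses)  # 1 4
```

Only the second request for address 1 is a hit. By the time address 2 is requested again, it has already been evicted to make room for address 3.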

The same concept applies to graphics cards. For comparison, the Radeon RX 5700 XT is equipped with 4MB of L2 cache. All else being equal, a bigger cache means fewer cache misses. If Big Navi were to have a 128MB cache, the graphics card could fetch more of what it needs from the cache and make fewer trips to the card's video memory (VRAM).
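As a back-of-the-envelope illustration, the sketch below estimates how much traffic still has to reach video memory once a cache absorbs a share of the requests. It is a deliberately simplistic model that ignores write traffic, latency, and the cache's own bandwidth limits. The 900 GBps demand figure and the hit rates are hypothetical, since nothing is known about how a real Infinity Cache would behave.

```python
def vram_traffic_gbps(requested_gbps: float, hit_rate: float) -> float:
    """Bandwidth that still has to come from video memory after cache hits are served."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requested_gbps * (1.0 - hit_rate)


# Hypothetical workload asking for 900 GBps of data in total.
for hit_rate in (0.0, 0.25, 0.5, 0.75):
    traffic = vram_traffic_gbps(900.0, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {traffic:.0f} GBps from video memory")
```

In this model, a 50% hit rate halves the load on video memory, which is the kind of effect that could let a narrower bus keep up.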

It remains a mystery whether the Infinity Cache actually refers to the L2 cache, a new L3 cache, or something else entirely. Graphics cards commonly stop at L1 and L2 caches because bigger caches are slower and add latency.

There's a possibility that the Infinity Cache may be related to a patent that AMD filed last year on Adaptive Cache Reconfiguration Via Clustering. Subsequently, the authors published a paper on the topic. It talks about the possibility of sharing the L1 caches between GPU cores.

Traditionally, each GPU core has its own individual L1 cache, while the L2 cache is shared among all the cores. The suggested model proposes that each GPU core be allowed to access the other cores' L1 caches. The objective is to optimize the caches' use by eliminating the replicated data in each slice of the cache. The results are impressive. Across a suite of 28 GPGPU applications, the new model improved performance by 22% (up to 52%) and energy efficiency by 49%.
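To illustrate what "replicated data" means here, the toy sketch below counts how many L1 slots end up holding duplicate copies when every core keeps a private L1. It is not the method from the patent or the paper. The four cores, the four-line L1 size, and the access patterns are all made up for illustration.

```python
def private_l1_contents(working_sets, lines_per_l1):
    """Toy model: each core's private L1 simply holds the first lines it touches."""
    return [list(lines[:lines_per_l1]) for lines in working_sets]


def replication_stats(l1_contents):
    """Return (slots used, distinct lines cached, slots holding duplicate copies)."""
    total = sum(len(lines) for lines in l1_contents)
    distinct = len(set().union(*l1_contents))
    return total, distinct, total - distinct


# Hypothetical workload: four cores all read lines 0-2, plus one line of their own.
working_sets = [[0, 1, 2, 10 + core] for core in range(4)]
contents = private_l1_contents(working_sets, lines_per_l1=4)
print(replication_stats(contents))  # (16, 7, 9)
```

With private caches, 9 of the 16 slots hold copies of data another core already has. If the cores could read each other's L1 caches, those same 7 distinct lines would leave 9 slots free for other data.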

AMD is likely finalizing the preparations to announce the much-awaited Radeon RX 6000 series of graphics cards on October 28. Nvidia's Ampere is a tough nut to crack, so AMD needs to bring its A-game. Perhaps that A-game comes in the form of the Infinity Cache, but we won't know for sure until the company's announcement.

Zhiye Liu is a news editor, memory reviewer, and SSD tester at Tom’s Hardware. Although he loves everything that’s hardware, he has a soft spot for CPUs, GPUs, and RAM.

13 Comments from the forums

awolfe63

You can't patent a term, name, or phrase in the U.S. They have a trademark.
nofanneeded

A GPU deals with huge amounts of data. I don't know how caching would solve the memory bandwidth problem here ... it would add latency if it needs to fetch from memory most of the time.
Conahl

https://www.youtube.com/watch?v=PBUn4RmBRDc
Chung Leong

nofanneeded said:
A GPU deals with huge amounts of data. I don't know how caching would solve the memory bandwidth problem here ... it would add latency if it needs to fetch from memory most of the time.

It'll help with the ray-tracing mainly. A cache large enough to store the upper layers of the BVH tree should greatly reduce the number of reads from VRAM.
Friesiansam

It wouldn't surprise me if the 'rumours' of poor memory bandwidth originated in Nvidia's marketing department.
JamesSneed

awolfe63 said:
You can't patent a term, name, or phrase in the U.S. They have a trademark.

AMD trademarked the name but they also patented the concept behind the name.

Likely patent: https://www.freepatentsonline.com/20200293445.pdf
digitalgriffin

nofanneeded said:
A GPU deals with huge amounts of data. I don't know how caching would solve the memory bandwidth problem here ... it would add latency if it needs to fetch from memory most of the time.

Pre-emptive scheduling. It's no different from how a CPU prefetches data into a cache when the decode engine starts reading instructions and predicts which sections of memory it will read from.

<prefetch> Okay, I'm going to assign a CU to render a block in the upper left-hand corner. Let's grab all the predicted textures in advance and create a small frame buffer for it.
<CU> Setting up triangles and calculating lighting in advance. Also doing ray hit testing. (200 cycles)
<CU> Okay, I'm ready to apply texture 1. Luckily for me, it's already in the cache and I'm ready to go. (5 cycle penalty)
<CU> Okay, I'm ready to apply texture 2. Luckily for me, it's already in the cache and I'm ready to go. (5 cycle penalty)
<CU> Okay, I'm ready to apply texture 3. Luckily for me, it's already in the cache and I'm ready to go. (5 cycle penalty)

Old way:

<prefetch> Okay, I'm going to assign a CU to render a block in the upper left-hand corner.
<CU> Setting up triangles and calculating lighting in advance. Also doing ray hit testing. (200 cycles)
<CU> Okay I'm ready to apply that texture. Let me retrieve texture 1 from VRAM (20 cycle penalty)
<CU> Okay I need texture 2. Let me retrieve that from VRAM (20 cycle penalty)
<CU> Okay I need texture 3. Let me retrieve that from VRAM (20 cycle penalty)

CUs handle small blocks at a time, so you only need relatively small chunks of cache for them.

Plus, cross-CU cache coherency delays are much lower. This is important when one CU block is reading from or writing to the same memory space as another. (It's also why CrossFire/SLI didn't work well and had glitches.)

See the diff?
Avro Arrow

I know what cache is and how it works but I don't know if 128MB is enough to offset a 128-bit bandwidth disadvantage. However, I also don't know that it isn't enough to offset a 128-bit bandwidth disadvantage. I'm sure that the ATi engineers do know (which is why they're ATi's engineers).

If it works, it's ingenious. If it doesn't, it's moronic. Having said that, it probably will.
Avro Arrow

Friesiansam said:
It wouldn't surprise me if the 'rumours' of poor memory bandwidth originated in Nvidia's marketing department.

I wouldn't be surprised either. They're just shady enough to do it.

digitalgriffin said:
Pre-emptive scheduling. It's no different from how a CPU prefetches data into a cache when the decode engine starts reading instructions and predicts which sections of memory it will read from.

<prefetch> Okay, I'm going to assign a CU to render a block in the upper left-hand corner. Let's grab all the predicted textures in advance and create a small frame buffer for it.
<CU> Setting up triangles and calculating lighting in advance. Also doing ray hit testing. (200 cycles)
<CU> Okay, I'm ready to apply texture 1. Luckily for me, it's already in the cache and I'm ready to go. (5 cycle penalty)
<CU> Okay, I'm ready to apply texture 2. Luckily for me, it's already in the cache and I'm ready to go. (5 cycle penalty)
<CU> Okay, I'm ready to apply texture 3. Luckily for me, it's already in the cache and I'm ready to go. (5 cycle penalty)

Old way:

<prefetch> Okay, I'm going to assign a CU to render a block in the upper left-hand corner.
<CU> Setting up triangles and calculating lighting in advance. Also doing ray hit testing. (200 cycles)
<CU> Okay I'm ready to apply that texture. Let me retrieve texture 1 from VRAM (20 cycle penalty)
<CU> Okay I need texture 2. Let me retrieve that from VRAM (20 cycle penalty)
<CU> Okay I need texture 3. Let me retrieve that from VRAM (20 cycle penalty)

CUs handle small blocks at a time, so you only need relatively small chunks of cache for them.

Plus, cross-CU cache coherency delays are much lower. This is important when one CU block is reading from or writing to the same memory space as another. (It's also why CrossFire/SLI didn't work well and had glitches.)

See the diff?

Yep. As long as 128MB is large enough to make a difference, it should work beautifully.
JayNor

Intel came out with Rambo Cache for their Ponte Vecchio GPU. Perhaps AMD considered Rambo Infinity, but marketing had the last say.