TeamGroup Aims to Push DDR5 Kits Above 9,000 MT/s With Signal-Boosting Tech

TeamGroup image showing CKD.
(Image credit: TeamGroup)

Get ready for your shiny new DDR5 memory kit to feel slow and out of date. Memory and storage mainstay TeamGroup claims to have found a way to kick DDR5 signal speed and reliability up to new levels with the help of a new component from Renesas Electronics.

According to TeamGroup, it has incorporated a client clock driver, or CKD, into upcoming DDR5 kits that will "strengthen, buffer, and steadily output high-frequency signals from the CPU to memory kits," which the company says will lead to both higher and more reliable frequencies. Renesas says this is the first clock driver to be used in consumer DIMMs. Perhaps before long, we'll see one of these CKD kits on our Best RAM for Gaming page.

Technical details of the CKD are extremely slim at the moment, as is any word on when we might see kits equipped with the tech. But according to TeamGroup's press release (and the image released with it), it seems that CKD-equipped kits will arrive first as JEDEC-standard DDR5-6400 options, with faster options to come later this year. The company writes that it hopes the addition of the client clock driver will help TeamGroup achieve "frequencies in OC memory to 9,000MHz or higher, to deliver ultra-fast gaming experiences for gamers, and reliable, high-performance information processing solutions for creators."

Obviously, don't expect these ground-breaking memory kits to arrive anywhere near the realm of general affordability. The fastest 32GB DDR5-7800 kits that appear to be readily available now sell for around $400, and it seems unlikely that adding a new piece of hardware will lower the price of high-end kits.

More and faster top-end options may, though, help push the price of mainstream DDR5 options down while lifting speeds up. That would be good for general consumers looking to move to the latest memory standard – and for AMD specifically. While Intel's 13th Gen CPUs support both DDR5 and the older (and still far cheaper) DDR4, AMD has gone all-in on DDR5 with Ryzen 7000. Along with the higher price of its motherboards, the cost of DDR5 has made AMD's latest platform tough to argue for if you care at all about getting the most performance for your money. Achieving speeds above and beyond what's available via DDR4 would also, of course, make more people inclined to make the switch to DDR5.

Matt Safford

After a rough start with the Mattel Aquarius as a child, Matt built his first PC in the late 1990s and ventured into mild PC modding in the early 2000s. He’s spent the last 15 years covering emerging technology for Smithsonian, Popular Science, and Consumer Reports, while testing components and PCs for Computer Shopper, PCMag and Digital Trends.

  • DavidLejdar
    Nice. The actual latency is a bit of a different matter, though. With DDR4, for example, there are 3600 kits that have lower latency than DDR4-4800. And with the DDR5-6000 I have here, that's already a lot of transfer capability, even with a Gen4 SSD and without DirectStorage v1.1. So at least for me, the final latency figure would be more of a selling point for an eventual upgrade.
    Reply
  • Kamen Rider Blade
    Now if we can only get IBM's OMI to be accepted and move the Memory Controller to be near or directly next to the DIMM Sockets, then you would have fewer traces between the memory controller and the DIMM slot and a nice fast Serial Connection between the Memory Controller and the CPU.
    Reply
  • bit_user
    Kamen Rider Blade said:
    Now if we can only get IBM's OMI to be accepted
    I think CXL.mem has basically killed off any chance of that happening.

    Kamen Rider Blade said:
    move the Memory Controller to be near or directly next to the DIMM Sockets, then you would have fewer traces between the memory controller and the DIMM slot and a nice fast Serial Connection between the Memory Controller and the CPU.
    This will probably come at the expense of latency, power, and bandwidth. The main selling point would be scalability, which you especially get with a switch fabric, but that would come at the expense of yet more latency and power.

    I think the current approach of integrating memory controllers directly into the CPU is pretty much optimal for client-oriented CPUs.
    Reply
  • Kamen Rider Blade
    bit_user said:
    I think CXL.mem has basically killed off any chance of that happening.
    CXL.mem is a separate thing from OMI; they're related, but not quite covering the same Domain.

    bit_user said:
    This will probably come at the expense of latency, power, and bandwidth. The main selling point would be scalability, which you especially get with a switch fabric, but that would come at the expense of yet more latency and power.

    I think the current approach of integrating memory controllers directly into the CPU is pretty much optimal for client-oriented CPUs.
    Microchip already designed a controller for OMI that only adds on 4 ns of Latency to the Serial link.
    So the Latency part isn't that big of a deal; 4 ns compared to existing DIMM latency of 50-100 ns is manageable.

    As for power, it was within the same power envelope as DDR4/5, only a bit better, since you have fewer parallel traces running super long paths and more Serial Traces going from the Memory Controller to the CPU.

    Also Bandwidth went UP, not down. You would get more Channels for fewer traces once you route everything through the serial connection linking the Memory Controller to the CPU. Or you can save on traces from the CPU Package by keeping the same number of Channels and using the extra contacts for more PCIe lanes or other connections.

    That gives your CPU designer more package flexibility.

    Also, since the Memory Controller is detached, you can probably pull the same Infinity Cache trick by stacking SRAM on top of the Memory Controller.
    AMD has already done this on the Graphics Side, but could do it again on the CPU side, which can actually help since it can pre-cache certain Reads/Writes and have MUCH lower latency when referencing certain Cache Lines.

    And given that TSMC & AMD love shoving SRAM on top, it offers a way to truly lower the RAM-to-CPU latency by taking care of the most important problem.

    Once the request is given, if the data is already in the SRAM on top of the Memory Controller, the data gets fed back MUCH faster.

    SRAM latency is in the 9-25 ns range, depending on the size of the cache and how far away it is, compared to the 50-100 ns range for RAM.
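
    As a back-of-the-envelope sketch of why that matters (illustrative numbers only: the 4 ns serial penalty and the latency ranges quoted above, with hit rates picked purely for demonstration), the average access time with an SRAM layer on a detached memory controller would look roughly like this:

    ```python
    # Rough average memory access time (AMAT) with an SRAM "L4" layer stacked
    # on a detached memory controller. All figures are illustrative assumptions.

    SERDES_PENALTY_NS = 4.0   # claimed extra latency of the OMI serial link
    SRAM_LATENCY_NS = 15.0    # mid-range of the 9-25 ns SRAM figure above
    DRAM_LATENCY_NS = 75.0    # mid-range of the 50-100 ns DRAM figure above

    def amat_ns(hit_rate: float) -> float:
        """Average access latency past the CPU caches, for a given L4 hit rate."""
        hit = SERDES_PENALTY_NS + SRAM_LATENCY_NS
        miss = SERDES_PENALTY_NS + DRAM_LATENCY_NS
        return hit_rate * hit + (1.0 - hit_rate) * miss

    for rate in (0.0, 0.25, 0.5, 0.75):
        print(f"L4 hit rate {rate:.0%}: ~{amat_ns(rate):.0f} ns average")
    # Even with zero hits, the serial link only adds ~4 ns on top of ~75 ns;
    # any reasonable hit rate more than pays that penalty back.
    ```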
    Reply
  • bit_user
    Kamen Rider Blade said:
    CXL.mem is a separate thing from OMI; they're related, but not quite covering the same Domain.
    Did you not hear that development of OpenCAPI has been discontinued? Who is going to adopt a dead-end standard?
    https://www.anandtech.com/show/17519/opencapi-to-fold-into-cxl
    Kamen Rider Blade said:
    Microchip already designed a controller for OMI that only adds on 4 ns of Latency to the Serial link.
    Even so, the serial link will also add latency.

    Kamen Rider Blade said:
    As for power, it was within the same power envelope as DDR4/5, only a bit better, since you have fewer parallel traces running super long paths and more Serial Traces going from the Memory Controller to the CPU.
    More traces at a lower frequency is actually more efficient. That's part of HBM's secret.

    Kamen Rider Blade said:
    Also Bandwidth went UP, not down.
    This is incorrect. Why do you think DDR5 uses a parallel bus?

    Kamen Rider Blade said:
    That gives your CPU designer more package flexibility.
    The only reason anyone would ever use OpenCAPI or CXL is for scalability. So, you won't find them in a client CPU, unless it's a high-end client with in-package DRAM and it just uses CXL as a way to add extra capacity.

    Kamen Rider Blade said:
    Also, since the Memory Controller is detached, you can probably pull the same Infinity Cache trick by stacking SRAM on top of the Memory Controller.
    You'd want your cache inside the CPU package, for it really to do much good. The overhead of CXL or OMI could be multiple times the latency you normally get with L3 cache.
    Reply
  • Kamen Rider Blade
    bit_user said:
    Did you not hear that development of OpenCAPI has been discontinued? Who is going to adopt a dead-end standard?
    https://www.anandtech.com/show/17519/opencapi-to-fold-into-cxl
    Yes, I have. That means they're being absorbed into CXL and becoming part of the standard, and all the technology gets moved under the CXL umbrella.
    OMI is part of that.

    https://openmemoryinterface.org/
    A Statement from the OpenCAPI Consortium Leadership

    As announced on August 1, 2022 at the Flash Memory Summit, the OpenCAPI™ Consortium (OCC) and Compute Express Link™ (CXL) Consortium entered an agreement, which if approved and agreed upon by all parties, would transfer the OpenCAPI and Open Memory Interface (OMI) specifications and other OCC assets to the CXL Consortium.


    The members of both consortiums agreed that the transfer would take place September 15, 2022. Upon completion of the asset transfer, OCC will finalize operations and dissolve. OCC member companies in good standing will be contacted with details of their specific membership benefits in CXL.


    The OCC leadership extends its gratitude to its members and supporters for six years of effort producing the specifications for a cache-coherent interconnect for processors, memory expansion, accelerators and for a serial attached near memory interface providing for low latency and high bandwidth connections to main memory (OMI).


    We are excited to witness the industry coming together around one organization to drive open innovation. We expect this will yield the finest business results for the industry and for the members of the consortia.


    Bob Szabo, OpenCAPI Consortium President
    It's not a "Dead End" Standard when it's part of CXL.



    Even so, the serial link will also add latency.
    Yes, I know, that extra SerDes process inherently adds latency because it's another step.
    But I've already factored that in and I know how long it will take and how that compares to what currently is out there.


    More traces at a lower frequency is actually more efficient. That's part of HBM's secret.
    It also makes HBM absurdly expensive; all those extra traces make HBM "SUPER EXPENSIVE" to implement.
    Fewer Traces = Cheaper for the masses.
    And DIMMs have only gone up in Traces over time, not down. I foresee future DDR specs and DIMMs going up in the # of traces, not down.


    This is incorrect. Why do you think DDR5 uses a parallel bus?
    Because it's part of Memory Standards Legacy from the early days of RAM & DIMM.
    Have you noticed that DDR 1-5 have all used the "Exact Same" physical width for the PCB and a similar type of interface?
    But the # of Contact Pins has only gone up.

    Everything else in computing has gone "Serial over Parallel" over time:
    We went from Parallel Ports & various other proprietary Parallel connections -> USB & Thunderbolt.
    We went from ISA & PCI -> PCIe.
    We went from IDE & SCSI -> SATA & SAS.
    Nearly every aspect of modern PC / Computing has kicked the Parallel Connections to the curb and gone with a "Serial Connection" to replace it.

    The only reason anyone would ever use OpenCAPI or CXL is for scalability. So, you won't find them in a client CPU, unless it's a high-end client with in-package DRAM and it just uses CXL as a way to add extra capacity.
    Or DDR6 and its updated DIMM standard calls for even more Contact PINs, which exacerbates the issue.



    You'd want your cache inside the CPU package, for it really to do much good. The overhead of CXL or OMI could be multiple times the latency you normally get with L3 cache.
    L3$ isn't going anywhere. L3$ sits before the Memory Controller Step.
    The SRAM Cache layer I want to add is on top of the Memory Controller, similar to how AMD has done it with RDNA3. That will also help with both Latency & Bandwidth.

    They already tested the OMI Latency.
    https://www.electronicdesign.com/technologies/embedded-revolution/article/21808373/attacking-the-memory-bottleneck
    Using the OMI approach brings advantages with it, such as higher bandwidth and lower pin counts. Normally, load/store operations are queued by the memory controller within the processor. In this case, the memory controller is integrated within the SMC 1000 8x25G. Microchip’s product has innovated in the area of device latency, so that the difference in latency between the older parallel DDR interface and this newer OMI serial interface is under 4 ns when compared to LRDIMM latency
    That "< 4 ns Latency Penalty" is well worth increasing the bandwidth by multiple folds when you plan on keeping the same # of Contacts on the platform.

    IBM engineers wouldn't waste their time if they weren't trying to solve a real problem, and this is the next major step.
    Serializing Main Memory!
    Reply
  • bit_user
    Kamen Rider Blade said:
    Yes, I have. That means they're being absorbed into CXL and becoming part of the standard, and all the technology gets moved under the CXL umbrella.
    OMI is part of that.

    https://openmemoryinterface.org/
    It's not a "Dead End" Standard when it's part of CXL.
    The Anandtech article explains that further development on OpenCAPI & OMI was ended. Administration of the existing standard is being managed by the CXL consortium, and they inherit all of the IP.

    So, yes it's finished. There will be no new versions of OMI. And it's not part of CXL in any sense other than CXL is free to use their IP. It's not as if there will be any CXL.mem devices that support OMI, or anything like that. OpenCAPI saw the writing on the wall and realized that CXL had all of the industry buy-in. So, as the saying goes: "if you can't beat 'em, join 'em."

    Kamen Rider Blade said:
    It also makes HBM absurdly expensive; all those extra traces make HBM "SUPER EXPENSIVE" to implement,
    I dunno. Somehow Radeon VII had 16 GB of it for $700, back in 2019. If it's in-package, then the chiplets get a lot cheaper and easier to connect.

    Anyway, that wasn't my point. Rather, my point was that HBM2 runs at frequencies around 1 GHz. 1024-bits per stack. And yet, when you compare 4096-bit (4 stacks) against GDDR6 @ 384-bit, it's a net power savings for HBM! Granted, some of the power-savings is from being in-package, but a lot of it's due to simply running at a lower frequency.
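
    To put rough numbers on the wide-and-slow vs. narrow-and-fast trade-off (ballpark per-pin rates assumed for illustration, not specs for any particular card):

    ```python
    # Ballpark bandwidth math for a wide/slow bus (HBM2) vs. a narrow/fast one
    # (GDDR6). Per-pin data rates are illustrative assumptions.

    def bus_bandwidth_gb_s(width_bits: int, gbps_per_pin: float) -> float:
        """Aggregate bandwidth in GB/s for a data bus of the given width."""
        return width_bits * gbps_per_pin / 8

    hbm2 = bus_bandwidth_gb_s(4096, 2.0)    # 4 stacks x 1024-bit, ~1 GHz DDR signalling
    gddr6 = bus_bandwidth_gb_s(384, 16.0)   # a common GDDR6 speed grade

    print(f"HBM2  (4096-bit @  2 Gb/s/pin): ~{hbm2:.0f} GB/s")
    print(f"GDDR6 ( 384-bit @ 16 Gb/s/pin): ~{gddr6:.0f} GB/s")
    # The wide bus matches or beats the narrow one while toggling each pin
    # roughly 8x slower, which is where much of the power saving comes from.
    ```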

    Kamen Rider Blade said:
    Fewer Traces = Cheaper for the masses.
    You don't think JEDEC considered costs when they decided all the DIMM standards? They know about serial link technology - it's been around for decades. They also know its downsides and decided the best approach was to continue with parallel.

    Kamen Rider Blade said:
    Because it's part of Memory Standards Legacy from the early days of RAM & DIMM.
    DDR5/LPDDR5 was a fresh chance to rethink these assumptions. They changed quite a bit, yet decided to stick with parallel.

    Kamen Rider Blade said:
    Everything else in computing has gone "Serial over Parallel" over time:
    We went from Parallel Ports & various other proprietary Parallel connections -> USB & Thunderbolt.
    We went from ISA & PCI -> PCIe.
    We went from IDE & SCSI -> SATA & SAS.
    Nearly every aspect of modern PC / Computing has kicked the Parallel Connections to the curb and gone with a "Serial Connection" to replace it.
    Again, you're not hearing the answer. It's scalability. Serial is easier and cheaper to switch. So, when you don't need the bandwidth and don't mind taking a hit on latency or power, then yes - go serial. As for the cabled standards, they went serial for further reasons that don't apply to memory modules.

    Kamen Rider Blade said:
    L3$ isn't going anywhere. L3$ sits before the Memory Controller Step.

    The SRAM Cache layer I want to add is on top of the Memory Controller, similar to how AMD has done it with RDNA3. That will also help with both Latency & Bandwidth.
    You're missing the point, which is that L3 needs to be as close to the CPU as possible. AMD's decision to put it on a separate chiplet was worse, and only done for cost reasons. What you're proposing is way worse, by putting a big serial link in between! The latency between the CPU core and L3 should be as low as possible. Do the math.
    Kamen Rider Blade said:
    They already tested the OMI Latency.
    https://www.electronicdesign.com/technologies/embedded-revolution/article/21808373/attacking-the-memory-bottleneck
    That "< 4 ns Latency Penalty" is well worth increasing the bandwidth by multiple folds when you plan on keeping the same # of Contacts on the platform.
    Ha! They cheated by comparing it with LRDIMMs. Client machines don't use LRDIMMs for good reasons - it adds cost and latency, and is only really needed for the sake of scalability.

    But, as it regards your L3 cache idea, the problem is they're comparing end-to-end latency between LRDIMMs (and probably not a good one) vs. their solution. They claim the round-trip got longer, but they're not saying it's only 4 ns from the CPU package to their controller. They're saying they added 4 ns to that part. So, probably it was already like 6 ns (if we're being generous) and now it's 10. That's doubling the L3 latency, which is a non-starter.
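
    A quick sketch of that arithmetic (the 6 ns baseline is the generous guess above, and ~75 ns stands in for a full DRAM access; neither is a measured figure):

    ```python
    # Why "only 4 ns" matters when it's added to a short hop rather than to a
    # whole DRAM round trip. Baseline numbers are assumptions, per the text above.

    BASELINE_HOP_NS = 6.0      # guessed CPU-package-to-controller latency today
    SERDES_ADDED_NS = 4.0      # Microchip's stated OMI adder
    DRAM_ACCESS_NS = 75.0      # stand-in for a full DRAM access

    print(f"Short hop:   {BASELINE_HOP_NS:.0f} ns -> {BASELINE_HOP_NS + SERDES_ADDED_NS:.0f} ns "
          f"(+{SERDES_ADDED_NS / BASELINE_HOP_NS:.0%})")
    print(f"DRAM access: {DRAM_ACCESS_NS:.0f} ns -> {DRAM_ACCESS_NS + SERDES_ADDED_NS:.0f} ns "
          f"(+{SERDES_ADDED_NS / DRAM_ACCESS_NS:.0%})")
    # ~5% on a full DRAM access is tolerable; ~67% on the hop an external cache
    # would have to cross on every request is not.
    ```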

    And nowhere in that article did they talk about power (other than the ISA), which is the elephant in the room. That article is basically a marketing piece for Microchip's controller IC.

    Kamen Rider Blade said:
    IBM engineers wouldn't waste their time if they weren't trying to solve a real problem, and this is the next major step.
    The problem they're trying to solve is scalability. They only make servers, and scalability is a huge issue there. If this approach were generally applicable, then JEDEC would've adopted it long ago. It's not as if nobody thought about this, 'till now.
    Reply
  • Kamen Rider Blade
    bit_user said:
    The Anandtech article explains that further development on OpenCAPI & OMI was ended. Administration of the existing standard is being managed by the CXL consortium, and they inherit all of the IP.

    So, yes it's finished. There will be no new versions of OMI. And it's not part of CXL in any sense other than CXL is free to use their IP. It's not as if there will be any CXL.mem devices that support OMI, or anything like that. OpenCAPI saw the writing on the wall and realized that CXL had all of the industry buy-in. So, as the saying goes: "if you can't beat 'em, join 'em."
    CXL.mem covers a COMPLETELY different problem/solution.

    OMI isn't dealing with that problem at all. OMI is dealing with the scaling of extra DIMMs in PC/Server architecture.


    I dunno. Somehow Radeon VII had 16 GB of it for $700, back in 2019. If it's in-package, then the chiplets get a lot cheaper and easier to connect.
    And there's a DAMN good reason why HBM has been relegated to Enterprise products; it's too damn expensive, not that it doesn't perform.

    Anyway, that wasn't my point. Rather, my point was that HBM2 runs at frequencies around 1 GHz. 1024-bits per stack. And yet, when you compare 4096-bit (4 stacks) against GDDR6 @ 384-bit, it's a net power savings for HBM! Granted, some of the power-savings is from being in-package, but a lot of it's due to simply running at a lower frequency.
    Yes, it's a net power savings, at the cost of $$$ & a fixed, non-modular implementation.
    A lot of it is due to running traces that are SUPER short, compared to how far normal circuitry for DIMMs has to traverse to reach the Memory Controller.


    You don't think JEDEC considered costs when they decided all the DIMM standards? They know about serial link technology - it's been around for decades. They also know its downsides and decided the best approach was to continue with parallel.
    It was decided A LONG time ago that DIMMs were to be dumb boards that hosted the RAM Packages.
    That the Memory Controller wouldn't be on the DIMM.
    At that time, the parallel connection was the standard due to simplicity and what was around.


    DDR5/LPDDR5 was a fresh chance to rethink these assumptions. They changed quite a bit, yet decided to stick with parallel.
    Because of Sunk Cost reasons, it's easier to use what already exists and ramp it up instead of making major changes like going Serial.


    Again, you're not hearing the answer. It's scalability. Serial is easier and cheaper to switch. So, when you don't need the bandwidth and don't mind taking a hit on latency or power, then yes - go serial. As for the cabled standards, they went serial for further reasons that don't apply to memory modules.
    And IBM's researchers are running into walls for scaling DIMM slots in Server / PC land.
    288 contacts per DDR4/5 DIMM channel. That's only going to grow in the future.

    You're missing the point, which is that L3 needs to be as close to the CPU as possible. AMD's decision to put it on a separate chiplet was worse, and only done for cost reasons. What you're proposing is way worse, by putting a big serial link in between! The latency between the CPU core and L3 should be as low as possible. Do the math.
    You're missing MY POINT. L3$ isn't moving anywhere on the CCX/CCD, it's staying where it is.
    The Memory Controller in the cIOD is getting moved out and replaced with an OMI interface, and L4$ or SRAM is getting shoved on top of the memory controller.

    Ha! They cheated by comparing it with LRDIMMs. Client machines don't use LRDIMMs for good reasons - it adds cost and latency, and is only really needed for the sake of scalability.
    OMI was designed for Servers first; LRDIMMs are quite common in server infrastructure.

    But, as it regards your L3 cache idea, the problem is they're comparing end-to-end latency between LRDIMMs (and probably not a good one) vs. their solution. They claim the round-trip got longer, but they're not saying it's only 4 ns from the CPU package to their controller. They're saying they added 4 ns to that part. So, probably it was already like 6 ns (if we're being generous) and now it's 10. That's doubling the L3 latency, which is a non-starter.
    Again, L3$ isn't being moved or touched in any way/shape/form. You misunderstand what I'm trying to do.
    The Memory Controller is getting moved out of the cIOD and placed MUCH closer to the DIMMs. A layer of SRAM is getting slapped on top of the Memory Controller to function as L4$.

    And nowhere in that article did they talk about power (other than the ISA), which is the elephant in the room. That article is basically a marketing piece for Microchip's controller IC.
    Do you think Microchip are liars?

    The problem they're trying to solve is scalability. They only make servers, and scalability is a huge issue there. If this approach were generally applicable, then JEDEC would've adopted it long ago. It's not as if nobody thought about this, 'till now.
    That's kind of my point: scalability with DIMMs. Especially if my prediction about DDR6 DIMMs comes true, we're going to run into scaling issues.

    And OMI has been submitted to JEDEC, it's only a matter of time before Server side sees the issue and decides on what to do.
    Should they continue business as usual, or how are you going to reliably scale out Memory?

    It would be easier to support more DIMMs & Higher speeds if the Memory Controller was PHYSICALLY closer to the DIMMs than farther away.

    A serial link lowers the number of traces needed to feed data back to the CPU or adds more bandwidth for the same amount of traces.
    Take your pick.
    Reply
  • bit_user
    Kamen Rider Blade said:
    CXL.mem covers a COMPLETELY different problem/solution.
    Perhaps a superset of OMI. You neglected to state the non-overlapping set of use cases.

    Kamen Rider Blade said:
    And there's a DAMN good reason why HBM has been relegated to Enterprise products; it's too damn expensive,
    Sure, it's more expensive, but that's for a variety of reasons. Anyway, my point was about power - and that was the only reason I brought it up - because it's the most extreme example and really drives home that point!

    Kamen Rider Blade said:
    It was decided A LONG time ago that DIMMs were to be dumb boards that hosted the RAM Packages.
    Rambus came along and challenged that, but the industry said "no, thank you" and just took their DDR signalling ideas (which Rambus has been milking the patent royalties from, ever since).

    Kamen Rider Blade said:
    Because of Sunk Cost reasons, it's easier to use what already exists and ramp it up instead of making major changes like going Serial.
    The thing you keep refusing to see is that every new DDR standard represented another opportunity to revisit these decisions and they kept going with parallel for very good reasons. Serial has benefits and drawbacks. If you don't need its benefits, then it's foolish to take on its drawbacks.

    I don't understand where you seem to get the idea that you're smarter than everyone else.

    Kamen Rider Blade said:
    You're missing MY POINT. L3$ isn't moving anywhere on the CCX/CCD, it's staying where it is.
    The Memory Controller in the cIOD is getting moved out and replaced with an OMI interface, and L4$ or SRAM is getting shoved on top of the memory controller.
    Okay, so you're keeping L3 inside the CPU package and putting L4 in the DIMMs? That sounds expensive and would have a lower hit-rate than having unified/centralized L4.

    Kamen Rider Blade said:
    OMI was designed for Servers first; LRDIMMs are quite common in server infrastructure.
    That's my point. You're trying to take a server technology and blindly apply it to client machines. It didn't happen with LRDIMMs and even CXL.mem won't be a direct substitute for existing DDR DIMMs!

    Kamen Rider Blade said:
    Do you think Microchip are liars?
    That's not what I said. I said it was like a marketing piece, because it highlighted all of the selling points and none of the drawbacks. Power being chief among them. Also, they didn't specify the timing of the LRDIMM they were comparing against, which is a bit shady.

    Furthermore, they're comparing against DDR4, which made sense given when it was written. However, we should re-evaluate their speed comparisons against DDR5.

    Kamen Rider Blade said:
    it's only a matter of time before Server side sees the issue and decides on what to do.
    Should they continue business as usual, or how are you going to reliably scale out Memory?
    They've seen the issue and the industry has coalesced around CXL.mem.

    Kamen Rider Blade said:
    It would be easier to support more DIMMs & Higher speeds if the Memory Controller was PHYSICALLY closer to the DIMMs than farther away.
    PCIe has shown that you hit a wall in frequency scaling, hence the need for PAM4. That makes serial interface more complex and expensive. So, even if we look beyond the power issue, it's not as if serial doesn't introduce problems of its own. It's not a magic-bullet solution.

    FWIW, PCIe is addressing these issues and CXL is piggybacking off their work. So, that's the natural solution for the industry to take.

    Kamen Rider Blade said:
    A serial link lowers the number of traces needed to feed data back to the CPU or adds more bandwidth for the same amount of traces.
    Take your pick.
    You act like you're the first person to see that. You should understand that the industry has been dealing with interconnecting devices for a long time. Not just CPUs and DRAM, but PCIe, multi-CPU fabrics, and even GPU fabrics.

    IMO, 12-channel (though actually 24), 768-bit in a 6096-pin socket is getting pretty insane! Do you really think nobody at AMD saw the trend lines of increasing channel counts and package sizes and considered whether it made sense to switch to a serial standard like OMI? Do you think they never heard of OMI? The issue they're facing is a power-efficient way to add not just capacity but also bandwidth. Once they move to having some in-package DRAM and use memory-tiering, their external DRAM bandwidth needs will lessen and then CXL.mem starts to look more appealing. Plus, the cache-coherency it offers with accelerators and other CPUs is attractive as well.
    Reply
  • Kamen Rider Blade
    bit_user said:
    Perhaps a superset of OMI. You neglected to state the non-overlapping set of use cases.
    CXL.mem's problem & solution doesn't even cover the same thing as what OMI is trying to solve.
    They're tackling COMPLETELY different problems and have no relation to each other at this point in time.
    They both sit under the CXL consortium at this point and talk about memory.
    But largely, they don't interact AT ALL with each other.

    Sure, it's more expensive, but that's for a variety of reasons. Anyway, my point was about power - and that was the only reason I brought it up - because it's the most extreme example and really drives home that point!
    And that's factored into my idea.

    Rambus came along and challenged that, but the industry said "no, thank you" and just took their DDR signalling ideas (which Rambus has been milking the patent royalties from, ever since).
    Rambus was also greedy and wanted a royalty with their tech, something JEDEC wasn't going to do.
    That's why we haven't seen ODR (Octal Data Rates) or Micro Threading: Rambus' patents on those technologies haven't expired yet.

    The thing you keep refusing to see is that every new DDR standard represented another opportunity to revisit these decisions and they kept going with parallel for very good reasons. Serial has benefits and drawbacks. If you don't need its benefits, then it's foolish to take on its drawbacks.
    Again, you think that I'm going to copy OMI 1-to-1?
    No, I said move the Memory Controller closer to the DIMM, not ONTO the DIMM.
    That's the No-Go / No-Sale point for most vendors, they don't want to add in a ASIC or specialty chip on their DIMM or change the Parallel interface.
    That has been decided by the industry for "Quite a While" now.
    Even I saw that writing on the wall.
    Doesn't mean I don't want to move the Memory Controller to be "Physically close" to the DIMM slots.
    One of OMI's less-talked-about implementations keeps the Memory Controller on the MoBo, but PHYSICALLY very close to the DIMM slots.
    It could be right underneath them, on the opposite side of the MoBo, or right next to the slots; the location choices are limited only by the engineers' imagination, as long as it's not on the DIMM itself and it stays on the MoBo.
    This way, you're only paying for the Memory Controller once, when you buy the MoBo.

    The DIMM form factor largely remains untouched.

    I don't understand where you seem to get the idea that you're smarter than everyone else.
    Or maybe I see what IBM sees and see the upcoming problem with every iteration of the DIMM form factor: the Pin-Count gets higher each generation when you do a major revision. That's not sustainable long term. We can't just keep upping the # of contacts on our CPU Socket just to account for new, larger DIMMs.

    What if DDR6 has 544-Pins per DIMM Channel?
    What then, what will you do when Dual Channel takes 1088 Pins just for RAM/Main System Memory?

    The whole Point of OMI is to make things manageable, and to reduce pin count to the CPU via Serial connection.

    The Memory Controller doesn't have to be located on the DIMM; IBM just wants it that way because it's their "Preferred Way" to do things.
    It's easiest for IBM, and Microchip gets to sell more Memory Controllers.

    That doesn't mean it's what's best for the industry. Their alternate implementation, where the Memory Controller sits on the MoBo, is the better solution IMO.
    It's the cheaper solution, and the one that is more manageable in the long term, along with letting you power/cool the Memory Controller properly.

    Okay, so you're keeping L3 inside the CPU package and putting L4 in the DIMMs? That sounds expensive and would have a lower hit-rate than having unified/centralized L4.
    Again, that's not what I'm saying.
    The Memory controller isn't on the DIMM like in the main OMI implementation.
    I'm using their other implementation where the Memory Controller sits on the MoBo.
    The L4$ will sit on top of the Memory Controller, similar to how RDNA3 has a Memory Controller with SRAM on top that they market as "Infinity Cache".
    It's realistically going to be L4$ since it'll save you many cycles by caching the data straight onto SRAM locally.

    That's my point. You're trying to take a server technology and blindly apply it to client machines. It didn't happen with LRDIMMs and even CXL.mem won't be a direct substitute for existing DDR DIMMs!
    Again, you don't seem to understand what CXL.mem does, and you imply that it even covers the same domain.
    It doesn't cover the same domain at all.

    IBM is using OMI in their POWER servers, and we should pay attention to what IBM does; they're at the forefront of technological innovation.


    That's not what I said. I said it was like a marketing piece, because it highlighted all of the selling points and none of the drawbacks. Power being chief among them. Also, they didn't specify the timing of the LRDIMM they were comparing against, which is a bit shady.
    I know there's a power cost, but the amount you spend is worth it.


    Furthermore, they're comparing against DDR4, which made sense given when it was written. However, we should re-evaluate their speed comparisons against DDR5.
    Your Speeds are relative to how fast your Memory Controller can run the DIMMs, and it only gets easier if they are "Physically Closer".


    They've seen the issue and the industry has coalesced around CXL.mem.
    Again, CXL.mem is a separate solution to a different problem.
    Go read up on what they're trying to do and understand what problem they're solving vs what OMI is solving.

    PCIe has shown that you hit a wall in frequency scaling, hence the need for PAM4. That makes serial interface more complex and expensive. So, even if we look beyond the power issue, it's not as if serial doesn't introduce problems of its own. It's not a magic-bullet solution.
    But it's a solution to ever increasing # of parallel connections, that's a real problem.

    Adding in more Memory Channels gets harder every generation, especially with the MASSIVE pin-counts per DIMM Channel.

    FWIW, PCIe is addressing these issues and CXL is piggybacking off their work. So, that's the natural solution for the industry to take.
    And OMI, being now part of CXL, is a solution that we can copy and do better within both Intel & AMD.
    Both can move the Memory Controller off of the Die, use a Serial Connection between the CPU Die and the Memory Controller.

    The Memory Controller doesn't have to be on the DIMM.
    The DIMM can remain the cheapo dumb board that it is.
    The way everybody loves it.

    You act like you're the first person to see that. You should understand that the industry has been dealing with interconnecting devices for a long time. Not just CPUs and DRAM, but PCIe, multi-CPU fabrics, and even GPU fabrics.
    And everybody within the industry has eventually gone Serial, after being Parallel for so long.
    Now it's the venerable DIMM slot / Memory Channel's turn.
    Move the frigging memory controller off the CPU die and into its own dedicated die, sitting directly next to the DIMM slot.

    Not on the DIMM, but next to the DIMM slot.
    That's how you maintain a reasonable cost.

    IMO, 12-channel (though actually 24), 768-bit in a 6096-pin socket is getting pretty insane! Do you really think nobody at AMD saw the trend lines of increasing channel counts and package sizes and considered whether it made sense to switch to a serial standard like OMI? Do you think they never heard of OMI? The issue they're facing is a power-efficient way to add not just capacity but also bandwidth. Once they move to having some in-package DRAM and use memory-tiering, their external DRAM bandwidth needs will lessen and then CXL.mem starts to look more appealing. Plus, the cache-coherency it offers with accelerators and other CPUs is attractive as well.
    And what's next after 12-channel, 16-channel, 20-channel?
    Or we can reduce the # of contacts needed and re-use some of those extra contacts for more PCIe lanes.
    Everybody LOVES having more PCIe lanes. Who doesn't love having them?

    OMI wasn't finalized until very recently, and platform development has very long lead times.

    And OMI wasn't part of CXL until VERY Recently.
    Also OMI was using tech implemented by another company.

    AMD and Intel won't be using Microchip's Memory Controller; they'll be doing it themselves.
    It takes time to set up their own internal standards to copy OMI and implement its core functionality.
    But the BluePrint is there. The Memory Controller is the secret sauce. Every company has their own version.

    in-package DRAM isn't nearly as good as in-package SRAM.
    The latency savings you get from SRAM is HUGE and well worth it.

    External Bandwidth needs will always go up.

    & Cache Coherency is great and all, but CXL.mem is all about running the processing locally, where the data sits; if it's in the local accelerator's memory, so be it.

    If the data sits on my servers RAM, and somebody needs it, it just accesses it through CXL.mem and performs any adjustment locally on my side through remoting in via the CXL.mem protocol. This way there's less overall movement of data. Ergo saving on energy.

    That's a separate issue from what OMI is solving.
    Reply