Memory controller integration

August 27, 2011 6:49:39 AM

Most modern CPUs have the memory controller integrated into the die to increase performance while decreasing power usage. A drawback to this is that it pretty much locks a CPU into using a specific memory type such as only DDR2 or only DDR3 (or both in AMD AM3 CPUs).

I have been thinking very hard about ways to improve current technology (I hope to implement them myself some day) and I have come up with a question.

Could the memory controller be integrated into the memory instead? The CPU would talk to the memory over something like QPI or HyperTransport (except faster than either). The controller takes up a fair amount of space on the CPU die, so that space would be freed for, say, more cache, a better integrated GPU, or more cores, and in theory the CPU would be compatible with any memory, as long as motherboards with its socket support that memory. The largest problem I see is that the CPU would need a link to the RAM that can handle the speeds of memory two or three generations later.
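
To make the idea a bit more concrete, here's a rough sketch (hypothetical names, not any real API) of what the CPU-side view could look like if every module carried its own controller and spoke a generic link protocol:

```c
/* Rough sketch (hypothetical names): if each module carries its own
 * controller, the CPU side only needs a generic link interface like this,
 * and never has to know whether DDR2, DDR3, or something else sits behind it. */
#include <stdint.h>
#include <stddef.h>

typedef struct mem_link_ops {
    /* read `len` bytes starting at module-local address `addr` */
    int (*read)(void *ctrl, uint64_t addr, void *buf, size_t len);
    /* write `len` bytes starting at module-local address `addr` */
    int (*write)(void *ctrl, uint64_t addr, const void *buf, size_t len);
    uint64_t capacity_bytes;   /* reported by the module's own controller */
    uint32_t link_rate_mts;    /* negotiated link speed, in MT/s          */
} mem_link_ops;

/* The CPU-side logic would be identical no matter what memory technology is
 * attached; only the module-resident controller filling in these callbacks
 * would differ. */
```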

Something like this would be a significant change in how computers are built, so I can see it being a challenge to implement, but I don't see a reason for it to be impossible or unreasonable.
Does anyone else see a problem with this?
August 27, 2011 8:41:11 AM

From the CPU's point of view, which is a very important point of view, your suggestion seems to be a step backward . . . moving the controller back off the CPU . . . to where we were before integration.
August 27, 2011 11:39:08 PM

The point of moving the controller onto the CPU die was to increase performance and decrease power usage. Before that, when the CPU wanted to talk to the memory it had to go through the northbridge, which then talked to the memory. Integrating the controller means the CPU can talk to the RAM more directly. By moving the controller to the RAM we free up considerable die space without adding an additional chip between the CPU and RAM, which leaves more room on the CPU for cache, cores, connectivity, or more.

We also enable a CPU to be used with any RAM that can talk over the same interface at the same speeds, so we don't have to replace the CPU when we upgrade the motherboard. I admit that AMD has two controllers on some of its chips, which also allows this, but my idea allows more than just that.

By integrating the controller into the RAM you could use multiple generations of RAM at once, or even completely different kinds of RAM (such as both XDR and DDR3) at once. An FB-DIMM with the controller on board might even use fewer pins than FB-DIMMs do now (69 pins), which would enable more channels or more FB-DIMMs per channel with a similar amount of wiring in the motherboard.
August 28, 2011 5:41:42 AM

Sounds a little like Rambus.
August 28, 2011 5:51:58 AM

Rambus currently has XDR and XDR2 RAM. Both are on a 64-bit bus like DDR, but XDR is octal data rate and XDR2 is hexadecimal data rate: instead of 2x, it's 8x and 16x. Part of what I want is for successive generations of the same RAM technology (or even different RAM tech) to use the same socket, which is something XDR and XDR2 do (both work in an XDR or XDR2 slot at up to the highest speed that both the slot and the RAM support).

I've got a massive post coming, so I will put it on a different site and just leave a short comment here with a URL. It will be a few days, but I'll give you guys something to really think about.
August 28, 2011 6:04:03 AM

How will your idea work with more than one memory module?
August 28, 2011 6:06:39 AM

An advantage I see in your idea: since each memory controller will only need to support one specific memory bank organization, it will be simple and fast.
August 28, 2011 6:19:20 AM

For your question: just add an additional link to the CPU. Even with 4 of them I'm sure they would be smaller than an entire controller. For your next post: simplicity is one of the greatest factors in the designs I'm working on.

By moving most complexity off the CPU (most of it goes straight to the software) we are left with a ton of die space... and with extremely simple CPUs some really awesome things can be done. I'm hoping to get a CPU model that runs DDR or QDR for its cores to get some insane work done per second.

The website I plan to use hasn't been set up yet, but I have luciferno.ucoz.net as a fail-safe if I can't get a new one running. I intend to open a site as an inventory for a used/new computer sales and repair business, probably through eBay and Amazon. The same site will serve my new concept posts as well if it's up within the week.
August 28, 2011 6:56:19 AM

So you'll create a bus dedicated to the RAM (Rambus)?
The advantage of having the controller on the CPU is that it can be accessed quickly by the CPU, meaning low latency, and it can be manufactured using the CPU's process technology, which is much more advanced than that of the memory modules themselves; for the CPU's process, the controller is actually a very simple thing. The internal controllers needed for the cache are much more complicated, because access there has to be much faster and, mostly, because the access organization is more complex, involving two-way access and so on. The complexity of the RAM controller comes from the need to multiplex many addresses and support many possible types of RAM organization. But it takes a relatively small piece of the die, and the requirements on it are the lowest of all the CPU's modules.
Using more than one module, each with its own controller, connected through one bus will mean they conflict with each other: the time between the CPU sending an access request and the memory returning data is many cycles, and the number of cycles can vary, so there is no direct correlation between when a request is sent and when the data comes back. How will you deal with the question of what data corresponds to what request?
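
To illustrate the problem being described: with variable completion times, replies can come back out of order, so some kind of transaction tag is typically needed to pair data with requests. A minimal sketch (hypothetical field names, not anything the OP has specified):

```c
/* Minimal sketch of why out-of-order replies need tags (hypothetical names).
 * Each request carries an ID; the reply echoes it back, so the CPU can match
 * data to requests even when completion times vary per module. */
#include <stdint.h>

typedef struct mem_request {
    uint16_t tag;        /* transaction ID chosen by the CPU            */
    uint8_t  module_id;  /* which module/controller should answer       */
    uint8_t  is_write;
    uint64_t addr;
} mem_request;

typedef struct mem_reply {
    uint16_t tag;        /* echoed back, pairs this data with a request */
    uint64_t data;
} mem_reply;
```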
August 28, 2011 7:38:02 AM

Yeah, I need to create a bus for the RAM. I've got a start but am not even halfway done yet. Bear with me; I haven't taken any classes on this, so I'm playing it by what I read online. I don't expect this to be a workable computer system model for quite a while, since I only started two or three weeks ago.

As for the link between the RAM and CPU:
I want the CPU to be more or less unaware of what kind of storage is at the other end of the bus. It should work regardless of whether there's a hard drive, SSD, or memory on the other side. I know it wouldn't matter how fast the CPU is if I didn't use something at least as fast as DRAM, because the "memory" would be a bottleneck.

Every memory module's controller would have to use the same bus to be compatible with the CPU, and they would all look the same to it. The CPU would know the capacity of the device it is using as storage and the bandwidth of the link, but would not know whether you have a hard drive. If you had a hard drive, it would see that it holds about 931 GiB (1 TB) and that it has the performance of a slug in a hamster wheel, but it COULD use it, if for whatever reason someone modded a hard drive for it.
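
Roughly, the descriptor the CPU would see might amount to nothing more than a capacity and a link bandwidth (a minimal sketch with hypothetical names and numbers; the 931 GiB figure is just 10^12 bytes divided by 2^30):

```c
/* Rough sketch of the device descriptor the CPU would see (hypothetical names
 * and example bandwidth). The CPU learns capacity and link bandwidth, but not
 * what the medium actually is. */
#include <stdio.h>
#include <stdint.h>

typedef struct storage_descriptor {
    uint64_t capacity_bytes;   /* a "1 TB" hard drive reports 10^12 bytes  */
    uint64_t bandwidth_bps;    /* negotiated link bandwidth, bytes/second  */
} storage_descriptor;

int main(void) {
    storage_descriptor hdd = { 1000000000000ULL, 150000000ULL };
    /* 10^12 bytes / 2^30 bytes per GiB = ~931.3 GiB, as quoted above */
    printf("capacity: %.1f GiB\n", hdd.capacity_bytes / (1024.0 * 1024 * 1024));
    return 0;
}
```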

I just got my site running and will post the RAM information first to answer any remaining questions ASAP (sorry, I have a lot of work to do with the site and the full post on it). It could be tomorrow afternoon. In the post you will see me address some of the issues you just presented. I swear.

Website is up but doesn't have a post yet.
techenthusiast.ucoz.com

Don't complain about the .ucoz in the URL. It's free for me to use and I can't afford (yet) to get my own domain. I refuse to do anything with donations until I'm really contributing to my readers, and until that happens I'll stick with free site hosting.
August 28, 2011 9:26:43 AM

"I want the CPU to be more or less unaware of what kind of storage is at the other end of the bus. It should work regardless of if it is a hard drive, SSD, or memory on the other side"

That is not a good idea at all, because the purpose of the HDD and the purpose of RAM are completely different. The HDD is storage: non-volatile and large-capacity, and very, very slow compared to RAM. RAM is the working memory, the memory used by software for its operating needs. Your idea means turning RAM into nothing more than a cache for storage. And what will your software use for its working memory - the storage (hard drives)?

And you didn't answer my question about how you'll deal with the lack of an ordered relationship between the requests the CPU sends to the memory and the data coming back when you have more than one memory module, each with its own controller. Or will all the controllers just wait until the right one accesses the necessary memory block and sends the data back? That's not the best thing to do if you have, let's say, 4 controllers, but it will probably work. You only need to synchronize the controllers - you'll need to establish some communication between them, which may mean saying goodbye to simplicity. And you'll probably end up with 4 complex controllers working, at best, at the speed of one.
August 28, 2011 9:48:57 AM

The RAM will still be used for the same thing, just implemented a different way. It's still there for the software's operating needs.

By making the CPU unaware of its memory medium, it is more flexible in what medium it can use and how it works. Let's say I have a $10k PCI-E SSD. If I really wanted to, I could just plug it into a PCI-E port, flip a setting in the BIOS*, and now I've got non-volatile memory (even though it won't match even older DDR speeds).

*(Treat PCI-E port x as a DMI-to-PCI-E bridge), or something like that.

Another benefit is that the CPU isn't limited in the maximum capacity of RAM it can use. The controllers do all that work for it, so any limit on a controller only applies to the module it controls.

The flexibility keeps going. Since each module is on its own connection to the CPU, each module can run at a different speed. My design doesn't use channels (it's point-to-point). You can read/write multiple modules at once (and the T-RAM chips are dual-ported, so each can service two accesses at once as well).

I apologize for not making it clear that the flexibility isn't limited to allowing hard drives to be connected as memory or RAM to be used as a cache for storage.
August 28, 2011 1:20:20 PM

The CPU is not "aware" of anything; it just processes instructions. But your OS should be aware. And because the instructions an OS consists of run on the CPU, the CPU has to address the RAM. If the CPU can't address the RAM, who will? There is no flexibility there: you have the instruction set, and part of that instruction set is addressing memory. No other "awareness". Flexible or not, your CPU must address the memory. Do you have any idea how software operates at a low level, at the instruction level? It's like this: take the value of register B, add to it the value of the memory at address X, and put the result back in B. Don't they teach assembler anymore?
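
Spelled out, the operation described above is just the following (illustrative C, equivalent to an add-from-memory instruction):

```c
/* The operation described above, written out (illustrative only):
 * "add the value of the memory at address X to register B, result back in B". */
#include <stdint.h>

uint64_t add_from_memory(uint64_t b, const uint64_t *x_addr) {
    /* roughly: ADD B, [X] -- a load from a memory address folded into an ALU op */
    b += *x_addr;
    return b;
}
```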
August 28, 2011 6:23:02 PM

I've never been good at explaining things...
The controller can only work with one type of memory. I considered this the CPU "knowing" what type of memory it's using. By moving the controller to the memory, the controller still only works with one type of memory, but now the CPU can work with different controllers.

The link is like this:
CPU -- bus -- controller/buffer -- T-RAM chips
instead of: CPU/controller -- bus -- buffer -- chips

The same bus would replace HyperTransport/QPI and partially PCI/PCI-E in my system. I say partially for PCI-E because I don't want to make all those add-on and video cards incompatible.

Since my RAM chips aren't DRAM but T-RAM, they can clock much higher. I have them clocking internally at 4 times the internal clock of DRAM, so with the bus multiplier the net clock rates get pretty high. DRAM has such high latencies because it doesn't clock very high; it doesn't need to, because of its massive parallel connection to the CPU. I narrow the parallel connections and serialize them, but because I increase the clock rate significantly, latencies should still improve significantly.

For example, my slowest module runs at an 800MHz internal clock. With the bus multiplier that's 3200MHz, and then DDR takes it to 6400MT/s. For comparison, the slowest speed of DDR3 (which isn't in common use) is a 100MHz internal clock, a x4 bus multiplier to 400MHz, and x2 for DDR to get 800MT/s. The modules are still the same size as current ones and have the same 8-17* chip count.
*The 17th chip being the buffer.
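
Spelling out the transfer-rate arithmetic claimed above (figures taken from the post, not measured):

```c
/* Transfer-rate arithmetic quoted above (claimed figures, not measurements):
 * internal clock x bus multiplier x data rate per clock. */
#include <stdio.h>

int main(void) {
    /* proposed module: 800 MHz internal, x4 bus multiplier, double data rate */
    int proposed_mts = 800 * 4 * 2;   /* = 6400 MT/s */
    /* slowest DDR3 (DDR3-800): 100 MHz internal, x4 I/O multiplier, DDR */
    int ddr3_mts = 100 * 4 * 2;       /* = 800 MT/s  */
    printf("proposed: %d MT/s, DDR3-800: %d MT/s\n", proposed_mts, ddr3_mts);
    return 0;
}
```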

The total number of chips between the CPU and RAM is the same; I simply moved the controller from chip A to chip B (chip A = CPU, chip B = buffer). In function it is still between the buffer and CPU, just on the other side of the bus. I didn't add another chip between the two, so I wouldn't expect a drop in performance.

The CPU still addresses memory, but it doesn't get locked into one type of RAM, and it doesn't have the snail-like speed of an on-board controller.

Does this reply clear up the problems you mentioned?
August 28, 2011 7:06:14 PM

I missed one of your posts, so I'll address the controller problem.

Quote:

And you didn't answer my question about how you'll deal with the lack of an ordered relationship between the requests the CPU sends to the memory and the data coming back when you have more than one memory module, each with its own controller. Or will all the controllers just wait until the right one accesses the necessary memory block and sends the data back? That's not the best thing to do if you have, let's say, 4 controllers, but it will probably work. You only need to synchronize the controllers - you'll need to establish some communication between them, which may mean saying goodbye to simplicity. And you'll probably end up with 4 complex controllers working, at best, at the speed of one.


Each module is on a separate bus. The CPU can have up to 8 buses for memory, but most will have 2-4. To use more than 8 modules (servers/workstations) you could have 1 master module per bus with a controller and 3 slave modules behind it, for (8x4) 32 modules, which should be enough for servers. Should that not be enough, there is a technology from Cisco that allows a sub-channel to act as a DIMM; to read more on it go to

https://www.cisco.com/en/US/prod/collateral/ps10265/ps1...

I could implement something like that for 64-module servers if need be.

Being on separate buses (I'm using a point-to-point topology), the controllers shouldn't need to communicate with anything but the CPU. When the CPU wants to access multiple modules at once, it talks to each one independently and all at the same time.

August 29, 2011 6:17:03 AM

OK, I see what you mean by "unaware": unaware of the type, not of the address. But that's also the case when the controller is on the northbridge chip.

"Each module is on a separate bus. The CPU can have up to 8 buses for memory..."
And how will the CPU know where a specific address may be found, so it can send a specific request to a specific bus? Or will you send the request to all buses and have all the controllers wait (meaning you'll make all your buses effectively work as one, only more expensive)?
You are going to create an extremely complex and expensive memory system, not a simpler one. So think about what your goals are, and what possible advantages and disadvantages you will have.
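
For what it's worth, one common answer to "which bus holds address X" is a fixed decode or interleave of the physical address on the CPU side; a minimal sketch with hypothetical parameters (not something the OP has specified):

```c
/* One common way to answer "which bus holds address X": interleave physical
 * addresses across buses on a fixed boundary. Sketch only, hypothetical
 * numbers; this is not part of the OP's design. */
#include <stdint.h>

#define NUM_BUSES        4
#define INTERLEAVE_SHIFT 12   /* interleave on 4 KiB boundaries */

static inline unsigned bus_for_address(uint64_t phys_addr) {
    return (unsigned)((phys_addr >> INTERLEAVE_SHIFT) % NUM_BUSES);
}
```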
August 29, 2011 6:27:27 AM

Like I said, I'm working on it. I don't have an answer for this question just yet. I realize that in practice it's something like the northbridge, but it should be much faster once I've worked it out.

My goal is a high-bandwidth, low-latency memory that the CPU can use multiple generations of. Maybe if I leave the controller on-die and make the memory compatible with older generations, the goal would be met in a simpler way.
August 29, 2011 6:29:08 AM

I want the memory to not be much slower in latency than the slowest on-die cache. I'm hoping for about 1/3 to 1/2 the speed of the L3 cache at the slowest.
August 29, 2011 12:53:47 PM

It's not impossible, but for that you'll probably need static RAM (from what is available now), and it will be extremely expensive.
August 29, 2011 1:55:44 PM

There is a relatively recent RAM technology called T-RAM that has the speed of SRAM and the density of DRAM on the same process. With it I think I could get away with this easily, because it can be built at smaller process nodes like SRAM, so it can be even denser than DRAM.
August 29, 2011 6:18:23 PM

I just had an idea. What if the controller integrated into the CPU didn't know the physical layout of the RAM but still controlled it, and the memory buffer translated between the RAM's signaling and my bus? I should get the compatibility between different RAM technologies and their generations but still have the performance and simplicity of a controller on the CPU die.

For info on T-RAM you can go to the wiki, but the article sucks. Its first source links to the website of the company that makes the stuff, so there is probably much more technical info there. I'm reading it now.
August 29, 2011 6:25:21 PM

I don't care to read all of this, but to address the OP's first post: it would be slower than hell, add latency, increase cost, and cause problems proportional to the number of DIMM slots used; kind of like RAID 0's weakest link, BUT without any of the speed advantages.
August 29, 2011 6:34:49 PM

If you don't read the rest of this, then you have no idea what we're talking about now.
August 29, 2011 6:59:06 PM

Because the whole idea is stupid. Learn some physics and about the inefficiency of both a chipset and distance. Imagine L1/L2/L3 on the ass-end of the MOBO, or worse, behind some off-die chipset. The new LGA 2011, in an attempt to reduce inefficient paths, has moved the DIMMs equidistant on both sides of the CPU.

It's like Intel finally getting rid of chipsets like the X58 PCIe controller, or MOBOs that use an NF200, BOTH of which add enormous/excessive latency.

What you're suggesting is backwards and inefficient.

LGA 2011 DIMM arrangement; note the proximity of the DIMM <-> CPU.
August 29, 2011 7:00:36 PM

I have completely changed my idea. If you like, I will summarize what it is now. I admit that is an awesome-looking mobo, but some coolers would be killed by that DIMM placement.

I'm going to close this thread and open a new one soon, since the topic has changed so much. I'll call it "Designing a new RAM technology and would like input/criticism."
I'll start it after pepe2907's next reply.

Best solution

August 30, 2011 5:19:26 AM

"I just had an idea. What if the controller was integrated into the CPU didn't know the physical layout of the RAM but still controlled it and the memory buffer translated between RAM signaling and my bus?"

It's possible, but it will add latency (actually this seems to me even closer to the Rambus approach, as far as I know it). But the reason the controller on the CPU die was chosen is mostly that it results in the lowest latency.
Latency is one of the most important factors in the resulting performance of the memory system. But you can diminish its importance if you can find a way to preload all the necessary data from main RAM into processor registers and cache, and also serialize access to main memory. Such attempts were made. But that has its own complications, and it's also a matter of a different architecture, architectural approach, and instruction set.

And it just reminded me of VLIW (the Intel Itanium architecture).
August 30, 2011 5:48:59 AM

Considering my RAM has more in common with SRAM than DRAM, the added latency would be slight if it's based on clock speed. My RAM would start at 800MHz per chip, and considering the bus multiplier it's 3200MHz; that gives it 0.3125 nanoseconds per clock, so if it adds a few clocks I'm not too worried.

My RAM would internally be 128-bit, but externally it's on a 4-bit bus for 51.2GB/s. My higher speeds (up to 204.8GB/s for the awesome integrated graphics) would have a wider external bus (up to 16-bit) but internally still be 128-bit, so latencies could increase slightly while the data speed per 1-bit link doesn't.

Since my RAM uses T-RAM instead of DRAM, I can have some very small chips with large capacity. My idea is 4 T-RAM dies on each chip that act like individual chips. Each die has the same word width as a DRAM chip, so with double the dies versus chips on a conventional DRAM module, you get double the bit width of their interface with the buffer. This design may run a little hot, but simple heat spreaders should solve that should it become a problem.
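
A quick check of the clock figures quoted above (claimed numbers from the post, not measurements):

```c
/* Checking the clock arithmetic quoted above: 800 MHz internal x4 bus
 * multiplier = 3.2 GHz, and the period of a 3.2 GHz clock is 1 / 3.2e9 s
 * = 0.3125 ns. Figures are the post's claims, not measurements. */
#include <stdio.h>

int main(void) {
    double internal_hz = 800e6;
    double bus_hz = internal_hz * 4.0;   /* 3.2 GHz */
    double period_ns = 1e9 / bus_hz;     /* 0.3125 ns per clock */
    printf("bus clock: %.1f GHz, period: %.4f ns\n", bus_hz / 1e9, period_ns);
    return 0;
}
```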

I'm not using a serial bus; it would run HOT at those speeds. Serial links to current FB-DIMMs already run too hot, topping out at about 10GB/s, let alone at my RAM's speeds.

Each 1-bit link would still run at 12.8GB/s, but each link has a hexadecimal data rate, so its actual clock rate is relatively low at 800MHz.

My RAM is dual-ported (two accesses at once, but you probably knew that), so it can pretend it has somewhat lower latencies than it actually has.
August 30, 2011 12:51:41 PM

All right, I'll get to work on the new thread now, and when it's open I'll close this one.
September 6, 2011 3:54:28 AM

Best answer selected by blazorthon.