shawn_eary

Distinguished
Sep 3, 2009
16
0
18,510
I've heard that Registered memory is more reliable than Unbuffered memory. Can someone point me to some statistics showing what the concerns are with Unbuffered (Unregistered) memory?

Since I only plan to use between 6 and 12 Gigs of memory with a single Xeon W3503 with 4.8 GT/s bandwidth, I am wondering if registered memory is necessary.

Note: My new system will be used for MPEG rendering of home movies and compiling Linux distributions via Sun VirtualBox.
 

jedimasterben

Distinguished
Sep 22, 2007
1,172
1
19,360
Unless you're planning on buying a server motherboard, regular DDR3 is the stuff for you. Fully Buffered/Registered RAM is much more expensive, and it wouldn't give any sort of performance boost or increased reliability in your case.
 

shawn_eary

Distinguished
Sep 3, 2009
16
0
18,510
Ok, thanks.

So maybe I don't really need Registered RAM, but does everyone still think I need ECC RAM?

I read on Wikipedia that the error rate for RAM is
"roughly one bit error, per month, per gigabyte of memory" (continuous operation)
This figure seems pretty high and scary to me.

Suppose I do a 4-hour MPEG rendering that fully uses 8 Gigs of RAM; then I think my risk of a bit failure will be
(8 bit errors per month) * (1 month / 31 days) * (1 day / 24 hours) * (4 hours / rendering) = 0.04, i.e. about a 4% chance of a bit failure.
Quite frankly, that scares me.
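
For anyone who wants to check my arithmetic, here is a rough Python sketch of that estimate. The 1 bit/GB/month rate is just the Wikipedia figure, and turning an expected error count into a probability this way is only a back-of-the-envelope approximation:

# Rough estimate of the chance of at least one bit flip during a job,
# assuming the oft-quoted "1 bit error per GB per month" soft error rate.
import math

HOURS_PER_MONTH = 31 * 24  # ~744 hours

def bit_flip_probability(gigabytes, hours, errors_per_gb_month=1.0):
    """Expected errors for the job, converted to P(at least one error)
    via the Poisson distribution: P = 1 - exp(-expected)."""
    expected = errors_per_gb_month * gigabytes * hours / HOURS_PER_MONTH
    return 1.0 - math.exp(-expected)

# 8 GB fully used for a 4-hour MPEG render:
print(f"{bit_flip_probability(8, 4):.1%}")   # ~4.2%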

If the above doesn't scare anyone, consider the following program pseudocode:

begin program

/* Below is a 1-bit-wide global boolean variable */
bool userWantsToDeleteAllFilesOnSystem

begin sub CleanFiles
    if (userWantsToDeleteAllFilesOnSystem) then
        system ("rm -r -f /") // Bye bye OS!!!
    end if
end sub CleanFiles

begin main
    userWantsToDeleteAllFilesOnSystem = GetUserAnswer()
    sleep (4 hours)
    // What would happen here if
    // userWantsToDeleteAllFilesOnSystem was flipped during the
    // 4-hour sleep period?
    CleanFiles()
end main

end program

 

BigBurn

Distinguished
Jan 24, 2008
192
0
18,690
I think you are scared for nothing. Even if you really do get an error once every 25 renders (your 4% figure), it doesn't mean your rendering would fail; maybe a pixel will be the wrong color for 1/60 of a second or something.

It's like saying that because Vista uses 2GB of memory at all times (so about a 0.01 chance of a crash in any 4-hour window, based on your calc), it should crash roughly every 400 hours. Which doesn't happen.
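
Quick sanity check on that figure, using the same 1 error/GB/month rate from your own post (just a rough sketch, nothing more):

# How often would a machine with 2 GB in constant use see a flipped bit,
# assuming the quoted rate of 1 bit error per GB per month?
HOURS_PER_MONTH = 31 * 24                  # ~744 hours
errors_per_month = 1.0 * 2                 # 2 GB at 1 error/GB/month
print(HOURS_PER_MONTH / errors_per_month)  # ~372 hours between errors, call it every 400 hours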

And I don't think your code would work, because it would just crash when it tried to delete a system file with an UnauthorizedAccessException/IOException error. So I think you are safe :p
 
system ("rm -r -f /" ) // Bye Bye OS!!!
:lol: It will only work under root. WHY would you even use root for everyday programming, etc.!? Besides, most distros should/will warn you if you use that command.

And BigBurn is right. You are just worried for nothing. My normal DDR2/DDR3 systems are rock stable and they run CFD/CAD programs. CFD programs can run for ~12+ hrs and I have not had any problems. Memtest86+ can also run for days on my systems. You only need ECC, etc. for servers because they need absolute reliability and because they are mission critical.
 
In modern desktop computer systems, RAM is the ONLY component that doesn't have ANY error checking, a situation which I personally find to be rather deplorable. RAM errors can be very finicky to diagnose. People just assume that POST will tell you if there are errors or that you can run Memtest86 and be confident, but in fact you can get intermittent errors that only show up under very particular conditions.

I was the victim of an intermittent RAM problem that the normal testing utilities couldn't diagnose. If it hadn't been for NTBackup giving me verify errors on my backups, I would never have discovered that my data was being corrupted. I ended up writing my own test program and discovered that the error was very sensitive to specific instructions and addressing modes; it simply wasn't revealed by the standard tests.
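
The real program poked at specific instructions and addressing modes, which you can't reproduce from a high-level language like this, but the basic write-a-pattern-then-verify idea was roughly along these lines (a simplified Python sketch for illustration, not the actual code I used):

# Simplified pattern test: fill a large buffer with a known value, read it
# back, and report the first mismatch. The real test also varied the
# instructions and addressing modes used, which this sketch can't do.

def pattern_test(size_mb=256, passes=4):
    size = size_mb * 1024 * 1024
    patterns = [0x00, 0xFF, 0x55, 0xAA]            # checkerboard-style fill values
    for p in range(passes):
        value = patterns[p % len(patterns)]
        buf = bytearray([value]) * size            # write the pattern
        expected = bytes([value]) * size
        if bytes(buf) == expected:                 # read back and compare
            print(f"Pass {p + 1}: pattern {value:#04x} OK")
            continue
        for offset in range(size):                 # locate the first bad byte
            if buf[offset] != value:
                print(f"Mismatch at offset {offset:#x}: "
                      f"expected {value:#04x}, got {buf[offset]:#04x}")
                break

pattern_test()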

As a result of this, I vowed to use ECC memory when I bought my latest system. Once you've copied a file through memory and a bit has been flipped, you're probably never going to get it fixed - particularly if you don't find out about it before your cycle of backups is exhausted.

It's a personal choice, but bear in mind that to use ECC memory you need a system which can handle it. In my case that meant going for a Xeon W3520 instead of a Core i7 920 CPU. The Xeon was about $40 more expensive and is essentially identical to the Core i7 in every way, except that its memory controller contains the logic needed to handle ECC. I also purchased the Asus P6T6 WS Revolution because it supports ECC memory when used with the Xeon CPU.
 

shawn_eary

Distinguished
Sep 3, 2009
16
0
18,510
Mr. Shadow:

I normally wouldn't run such a program under "root"; however, if a bit was somehow flipped while I was compiling part of a UNIX kernel, XServer or standard C Library...

You said that you ran several CAD sessions for 12+ hours without ECC memory? That seems like a pretty good test... How many Gigs of memory were fully used in that test? Of that memory, I wonder:
1) how much of it was used for program flow control?
2) how much of it was used to form the structure of save files?

The results you cite are encouraging, but I am still disturbed.

Lastly, ECC memory itself is not much more expensive than non-ECC memory. The main cost is in the "added value" it brings to the systems that support it. For example, tell Dell or HP that you want an ECC machine and they will automatically call such a machine a workstation or server. They will then proceed to charge you a price that is many times higher than what it would cost to build the system yourself from equivalent Xeon and "server" motherboard parts.

With increasing RAM densities, I think that either:
1) ECC will need to be reintroduced into consumer computing.
or
2) RAM will have to be improved so that this bit flipping happens less often.
[1 Bit per Gig per Month of continuous operation is too high - even if all you do is play Civilization 40 hours a week]

 

shawn_eary

Distinguished
Sep 3, 2009
16
0
18,510


Mr. sminlal:

I appreciate this position. I think I might need ECC, which is one reason why I may steer clear of getting a Mac. Only the ultra-expensive Mac Pro has ECC, and it isn't registered at that.

I guess my next question is whether or not I need registered memory. I understand that registered memory is supposed to help reduce the electrical load on the memory controller when large amounts of memory are accessed in a short period of time. The more balanced load that registered memory helps to provide is supposed to result in increased stability. So the question comes to mind: how much memory does it take to break down the stability of an unbuffered system?

One article I read seems to indicate that anyone who uses more than 4 Gigs of memory should get registered memory: "But for those who need to utilize more than 4GB of memory in a system, registered memory is absolutely a must-have." [1] The only problem with this quote is that it doesn't really say why registered memory is needed. Even though the article throws out a useful statistic by which someone could gauge his/her need for ECC, there is no such value by which one could estimate his/her need for registered memory.

I wish there were somewhere besides the ACM where I could find an article that shows some statistics on what happens to an unbuffered server under varying levels of memory usage. While I wouldn't mind reading an ACM article if I stumbled across one, I can't directly access the ACM Portal because I quit paying my dues quite some time ago...

[1] - Article by NewEgg "Do I Need ECC and Registered Memory"
http://images10.newegg.com/UploadFilesForNewegg/itemintelligence/NI_System-Memory/NIC-Pro-Do_I_Need_ECC_and_Registered_Memory-v1.1e.doc


 
The issue is that the signals sent out from the memory controller (which in the case of the Core i7 / Xeon W35xx parts is the CPU chip itself) can only drive a certain number of chips. The more chips they're connected to, the further the signal is stretched (ie, you end up with a voltage drop because more current is required to drive the additional loads). Registered memory places what amounts to a "repeater" in the signal path to eliminate the problem.

The problem with the article you found is that the need for registered memory depends on the number of chips being driven by the memory controller, not the capacity of the chips. As chip capacity grows, the amount of memory that can be safely handled with unbuffered modules grows as well, since the number of chips can stay constant. I found that article, but there's no date on it, which makes it pretty much useless as a means of judging how much memory to use. It also completely ignores the number of channels the controller has, which is important since each channel has its own signal drivers. So the limit is the number of chips per channel, not the total capacity.
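
To put some rough numbers on the "chips per channel, not gigabytes" point, here's a sketch. The chip counts are just typical examples (x8 DRAM devices, 8 data chips plus 1 ECC chip per rank), not the spec of any particular DIMM:

# Rough illustration: the load on a memory channel scales with the number
# of DRAM chips (ranks) hanging off it, not with how many gigabytes they hold.
# Assumes x8 devices: 8 data chips per rank, plus 1 more if the module has ECC.

def chips_per_channel(dimms_per_channel, ranks_per_dimm, ecc=True):
    chips_per_rank = 8 + (1 if ecc else 0)
    return dimms_per_channel * ranks_per_dimm * chips_per_rank

# One vs. two dual-rank ECC DIMMs on each channel:
print(chips_per_channel(1, 2))   # 18 chips loading the channel
print(chips_per_channel(2, 2))   # 36 chips - double the load, regardless of capacity

Doubling the capacity of each chip changes neither number, which is why a cutoff stated in gigabytes doesn't really tell you anything by itself.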

IMHO, a desktop motherboard with ECC memory shouldn't present any reliability issues. Based on the number of 6-DIMM-socket motherboards out there, it seems to me that the Core i7 / Xeon W35xx processors are engineered for 2 DIMMs per channel. And if there is a glitch, the ECC logic in the CPU's memory controller should detect it.

If high uptime is really, REALLY critical, then you might want to either:

(a) try to find a motherboard that supports registered memory (not an easy task, you'll probably have to go with a server motherboard), or

(b) limit yourself to only one DIMM module per channel (ie, 3 DIMM modules for a Core i7 or W35xx). One module places the minimum amount of load on the memory controller.

 

shawn_eary

Distinguished
Sep 3, 2009
16
0
18,510


Mr. Sminlal:

Your explanation of registered memory is the best that I have found. Now that I know chip density has little to do with memory controller voltage drop, I can make a more informed decision.

I wonder if Intel or anyone else has studied how many chips an i7 / Xeon 3500 or Xeon 5500 can reliably drive. I also wonder whether ECC aggravates the need for registered memory - doesn't ECC add one extra chip per memory module?

BTW:
Your suggestion to mitigate memory controller voltage drop by allowing only one DIMM per channel is very interesting. If I follow it, I should be able to make an unbuffered memory system more stable, at least for now, without sacrificing throughput. That would save me $$$.
 
It does, but since the ECC logic is on the CPU chip, the CPU must have additional signal lines to communicate with the extra chip. I don't see why there would be any more load on those signal lines than on the other ones, so I don't believe ECC makes any difference as to whether registered memory is necessary.
 
That seems like a pretty good test... How many Gigs of memory were fully used in that test? Of that memory, I wonder:
1) how much of it was used for program flow control?
2) how much of it was used to form the structure of save files?
The programs used were FloWorks and CFDesign. RAM usage under XP x64 was ~4.5GB of 6GB. See my i7 build specs in my sig. No idea about the program breakdown as it's closed source.
HOWEVER, these were just some non-mission-critical simulations. Of course, if I were to build a pro-level 2P system for CAD work such as this: http://www.tomshardware.com/forum/forum2.php?config=tomshardwareus.inc&cat=31&post=267080&page=1&p=1&sondage=0&owntopic=3&trash=0&trash_post=0&print=0&numreponse=0&quote_only=0&new=0&nojs=0
I would have to go with ECC anyway. Some 2P boards allow for non-ECC RAM, but in any case I'd go for ECC if it's mission critical.
I normally wouldn't run such a program under "root"; however, if a bit was somehow flipped while I was compiling part of a UNIX kernel, XServer or standard C Library...
Ah... I see what you mean. Under Linux/Windows(?) a change to a file permission bit could cause problems...
 

shawn_eary

Distinguished
Sep 3, 2009
16
0
18,510


Mr. Shadow:

The fact that you can utilize 4.5 GB of Non-ECC RAM for 12 solid hours without any serious errors seems to indicate that the problem might not be as severe as I initially thought, but I still think I am going to get an ECC system.

Without going off the deep end, I wish to say that the reincorporation of ECC into the common PC is inevitable. Higher chip densities [1][2], faster speeds [1][2] and lower voltages [1] are all potential contributors to soft memory errors. With that said, I think that systems using the new i7 and Xeon 3500/5500 processors are pushing the limit.

Some approximations of the soft error rate of RAM that I have seen are:
1) "...one bit error, per month, per gigabyte of memory." [2]
2) "Chances for a single-bit soft error occurring are about once per 1GB of memory per month of uninterrupted operation." [3]
3) "...a system with 1 GByte of RAM can expect an error every two weeks..." [1]

Approximations 1 and 2 are the same, but approximation 3 is about 2x higher than 1 and 2. Because the more conservative approximation of 1 bit error per GB per month of continuous operation was seen in two different sources, we will assume (a bad thing to do) that it is correct.

Now since your CAD session uses about 4.5 GB, we would expect 4.5 bits to be corrupted per month of continuous operation.

Now since your CAD session runs for 12 solid hours, we would expect the following error rate:

(4.5 bit errors per month) * (1 month / 31 days) * (1 day / 24 hours) * (12 hours / session) = 0.07 errors per session

This translates to about a 7% chance of something going wrong in any given CAD session - in other words, roughly one affected session out of every fourteen - yet you report no problems at all.
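
(Same back-of-the-envelope arithmetic as my earlier 4% estimate, just with your numbers plugged in:)

# 4.5 GB in active use for a 12-hour CFD/CAD session, at 1 error/GB/month.
HOURS_PER_MONTH = 31 * 24                                   # ~744 hours
expected = 1.0 * 4.5 * 12 / HOURS_PER_MONTH                 # ~0.073 errors/session
print(f"{expected:.3f} expected errors per session")
print(f"roughly 1 affected session in {1 / expected:.0f}")  # ~1 in 14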

So perhaps my decision to simply go off the 1 bit per GB per month statistic is unwarranted, and I should put more faith in what real-world users like yourself tell me, but something deep down inside me says that allowing a memory chip to spontaneously change its value is a bad thing...

[1] - Soft Errors in Electronic Memory - A White Paper by Tezzaron Semiconductor [2004]
http://www.tezzaron.com/about/papers/soft_errors_1_1_secure.pdf

[2] - Dynamic Random Access Memory by Wikipedia [Last Update 5-SEP-2009?]
http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction

[3] - Do I Need ECC and Registered Memory? by Newegg??? [Unknown]
http://images10.newegg.com/UploadFilesForNewegg/itemintelligence/NI_System-Memory/NIC-Pro-Do_I_Need_ECC_and_Registered_Memory-v1.1e.doc

[BTW: I appreciate your info, your usage statistics are helpful]



I went overboard. If I remember right, a typical Linux kernel and its drivers should be less than 1 Meg in size, so it should be pretty immune to soft errors at run time. A more likely problem would be a bit being flipped during either compilation of the Linux kernel or compilation of the compiler used to compile the Linux kernel [1]. Another problem is bit errors while X Windows is running or being compiled. My understanding is that X Windows has elevated privileges compared to other system programs.

[1] - Compilation of the GCC compiler is normally done by the sponsors of the distribution; however, if you are crazy like me and want to try Linux From Scratch, then you will find yourself compiling GCC more than once...
 
The scary thing isn't that something could "go wrong with the CAD session". The scary thing is that a memory error silently changes an attribute of some drawing object and is then saved back to disk without being detected. Anecdotal evidence is useless for judging those errors precisely because they aren't detected.