Well, clustering would make more sense to me than having hot spares taking up space in a data centre doing nothing.
Bad / faulty RAM does not usually cause a typical user's PC to crash, give application faults, or appear any less stable than the next. It *can* cause all of these in the worst cases, but only-slightly-dodgy RAM tends to corrupt data and rarely (if ever) causes a BSOD.
If the above data is per client, multiply it by the percentage of clients with slightly flaky RAM, or anything in a worse state (for a BSOD to occur, the dodgy bit needs to be at an address the OS is actually using), then multiply that by 100,000 - 1,000,000+ client workstations (users), and average the frequency of 'minor problems' on a per-day/per-week basis.
As the OS will allocate blocks for either instructions or data, never both (any more), the applications don't always throw exceptions; in fact, most of the time they'll parse the bad data as-is without checking it too much (the overheads are considered not worth the gains), so there is rarely an application exception either - they just parse the data 'as is' with only the most basic of checks.
The '6 months on average for a single bit of data being flipped' figure is for machines with 64 MB of SDRAM; for machines with 2 GB of RAM of the same quality, expect it to occur 32 times more often. Consider also that the OS does not check the integrity of data pushed to the OS disk cache before committing it to disk (or to a network drive, via the network client's 'virtual disk' cache, which may be as large as 512 MB, if not larger). This leads to one minor bit being inverted as frequently as (worst-case scenario) every 5 to 14 business (operating) days, although less often in practice, as the more powerful the machine, the lower the load on it will be. And consider this is for PCs with RAM that actually passes memory tests; PCs in worse states than 'one minor failure every 6 months' will only push the stats the other way.
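A quick back-of-the-envelope sketch of that scaling claim (the 64 MB and 2 GB figures are from above; the linear scaling of flip rate with RAM size is my assumption):

```python
# Rough check: if 64 MB sees one flipped bit every ~6 months, how often
# does a 2 GB machine of the same RAM quality see one?
# Assumption: flip rate scales linearly with the amount of RAM fitted.

BASELINE_RAM_MB = 64            # machine the '1 flip per 6 months' figure is for
BASELINE_INTERVAL_DAYS = 182.6  # ~6 months between flips

ram_mb = 2048                               # 2 GB machine, same RAM quality
scale = ram_mb / BASELINE_RAM_MB            # 32x the RAM -> 32x the flips
interval_days = BASELINE_INTERVAL_DAYS / scale

print(f"{ram_mb} MB machine: one flipped bit every ~{interval_days:.1f} days")
# -> one flipped bit every ~5.7 days, which lands in the 5 - 14
#    business-day range above once uptime/load variation is factored in.
```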
Even if the applications do all the checks, when the data is 'committed to disk' (via the network client's write-back cache, which is dynamic in size, and may even be a 2nd buffer 'behind' the OS's core disk cache) it can still become corrupted. (This partially depends on how the application is coded, what it is doing, etc.)
Then factor in the typical number of page operations (even just to other RAM) that occur in a typical application. The virtual address is always the same within each running process, but the physical addresses change quite often.
Most of the 'enterprise quality (cough) applications' I've seen don't check the data very thoroughly at all; they only care about throughput, at both the client and the server end. They are also coded as though the client workstation (user) PCs have near-unlimited resources (add this fact to the last 3 major paragraphs).
I'll agree each individual fault is minor, however:
- One 'normal' PC has a minor fault every 6 months.
- There are 100,000 to 1,000,000 client PCs
- Equates to (approx) between 547 and 5475 (for 1 million) 'minor' problems per day (see the sketch after this list)
- The quantity of minor problems starts to outweigh the impact of a more critical fault that only affects, say, 50 - 250 users
- In a given 6 month period that is around 68,993 - 689,938 'minor' problems (547 - 5475 per day over ~126 business days). Many of these lead to corruption of records in large databases, which are then used by other staff, who have to put up with 'bad data', which impacts customer service, and so on. This snowballs: once even a minority of records are slightly screwed, it can affect a large number of 'users' and 'customers', so service and public image suffer.
- Assuming fields in records are always loaded to the same physical addresses (not unlikely), and if only 3% of PCs are faulty, and each staff member processes 50+ records a day (if not far more), then the problem snowballs at an even more rapid rate, to the point where it is harder to deal with using 'isolated cases' treatment
- By the time management acknowledge there is a problem, 3%+ of all records are already corrupt. That doesn't sound like much, but the impact is huge. (And how do you 'un-corrupt' data? You can't, really, so the best protection is prevention -
anyone with a better idea, let me know, so I can patent it - 8) ).
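For what it's worth, here is the arithmetic behind those bullet points (one fault per PC per ~6 months is the figure from above; counting only business days for the 6-month totals is my reading of the numbers):

```python
# The arithmetic behind the '547 - 5475 minor problems per day' bullets.
# Assumptions: one 'minor' fault per PC every ~182.6 days, and ~126
# business days in a 6-month period for the running totals.

FAULT_INTERVAL_DAYS = 182.6     # one minor fault per PC per ~6 months
BUSINESS_DAYS_PER_6MO = 126     # approx. working days in 6 months

for fleet in (100_000, 1_000_000):
    per_day = fleet / FAULT_INTERVAL_DAYS
    per_6mo = per_day * BUSINESS_DAYS_PER_6MO
    print(f"{fleet:>9,} PCs: ~{per_day:,.0f} minor problems/day, "
          f"~{per_6mo:,.0f} per 6 months")
# ->   100,000 PCs: ~548/day, ~69,000 per 6 months
# -> 1,000,000 PCs: ~5,476/day, ~690,000 per 6 months
# (matching the 547 - 5475 and 68,993 - 689,938 figures, modulo rounding)
```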
I have yet to see applications on client PCs that truly check the data they are sending back to the server, via the system in my last post.
If you can point me in the direction of any 'client workstation PC' applications that do, please let me know (as above, consider the page faults, the virtual vs physical address issues, etc.).
Indicating the problem only affects each user maybe once every 6 months is wrong, as each user is loading records previously stored by other users, so the actual graph is an exponential curve, not a linear one. That is why it snowballs faster than a 'per user, isolated case' view suggests: it affects several users, it is not isolated, and bad data stored by one user (or PC) can, and will, affect other staff with 'perfect' PCs. (You get my gist now.)
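To make the snowball concrete, here is a minimal simulation sketch. All the numbers (3% faulty PCs, 50 records handled per staff member per day, one record silently corrupted per flaky PC per day, corruption propagating whenever a healthy PC re-saves bad data) are illustrative assumptions, not measurements:

```python
# Minimal sketch of corruption spreading through a shared database.
import random

TOTAL_RECORDS = 1_000_000   # records in the shared database
STAFF = 1_000               # staff members saving records each day
FAULTY_FRACTION = 0.03      # 3% of PCs have slightly flaky RAM
RECORDS_PER_DAY = 50        # records each staff member handles daily

random.seed(1)
corrupt = set()

for day in range(1, 181):                       # ~6 months of working days
    for worker in range(STAFF):
        handled = random.sample(range(TOTAL_RECORDS), RECORDS_PER_DAY)
        if worker < STAFF * FAULTY_FRACTION:
            # A flaky PC silently flips a bit in one record it writes back.
            corrupt.add(random.choice(handled))
        if any(r in corrupt for r in handled):
            # A worker on a *healthy* PC who loads a bad record bakes the
            # bad data into something they save - corruption propagates.
            corrupt.add(random.choice(handled))
    if day % 30 == 0:
        pct = 100 * len(corrupt) / TOTAL_RECORDS
        print(f"day {day:3d}: {len(corrupt):>7,} corrupt records ({pct:.2f}%)")
```

Run it and watch the daily growth accelerate: the early days are dominated by the flaky PCs alone (roughly linear), then re-circulated bad records take over and the curve bends upward.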
I mean, both perspectives are valid, and I appreciate everyone's input.
Originally I figured 'meh, isolated case' each time, but after observing the issue at so many places, and not really having a viable solution to 'un-corrupt' data, I thought I'd share my thoughts.
I know there are ways to stop more 'minor issues' from occurring, but that will not undo the damage already done (which is currently fairly manageable, but gets slightly worse with each passing day).
8) - Tabris
DarkPeace
PS: I've seen places that are already well beyond the '547 minor (hardware- and/or poor-software-checking-induced) problems per day per 100,000 staff' ratio; if you graph the data it forms the start of an exponential curve. It doesn't need to get far before it becomes a **** storm in most cases.