IBM ChipKill(tm) - How cool is this ?

Sure, the doc relates to POWER5 machines, but the same RAM can be used in existing x86 / x64 chipsets and system platforms today.

Figured I'd post it, as most people still think ECC with single-bit error correct, double-bit error detect is 'state of the art' memory protection.
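
For readers who have not met the scheme mentioned above, here is a minimal sketch of single-error-correct, double-error-detect (SEC-DED) coding, using an extended Hamming(8,4) code on a 4-bit value. Real ECC DIMMs typically protect 64 data bits with 8 check bits, and ChipKill-style schemes go further by surviving the loss of a whole memory chip; this toy only illustrates the SEC-DED idea, and the names `encode` and `decode` are made up for the example.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming positions that carry check bits


def encode(nibble):
    """Return 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, 8):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2  # extra bit that enables double-error detection
    return code


def decode(received):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    code = list(received)        # work on a copy
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos      # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1      # syndrome 0 means the overall parity bit itself flipped
        status = "corrected"
    else:
        return None, "double-bit error detected"
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[6] ^= 1

result_clean = decode(word)       # (11, "ok")
result_one = decode(one_flip)     # (11, "corrected")
result_two = decode(two_flips)    # (None, "double-bit error detected")
```

Two flipped bits are noticed but cannot be repaired, which is the limit the opening post is pointing at.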

You can build a ChipKill capable system today for under AU$ 5,000

8) - Tabris:DarkPeace.
  1. What do you think about FB-DIMM Tabris? :wink:

  2. I haven't drooled this much since Nina Hartley took on eight guys at Santa Monica Beach! :twisted:

    I tried ploughing through the 44 page PDF but I must have shaved my head this morning and applied "Jasmine Lubricant For HER Pleasure" on it, since most of it just WHOOOSHED over my head.

    I want one! I want one! How would this Power 5 compare with the 2xClovertown we're discussing in the other thread?
  3. Pretty cool indeed. Nothin but love for my IBM homies.
  4. Is it really necessary though? I don't see IT managers screaming for double ECC memory...
  5. Considering the cost of downtime versus the cost of the aforementioned approach, and being able to reduce downtime due to memory faults and the odd bit of corruption here and there, I would call it cost effective.

    As for FB-DIMMs, I think they bloody rock, because the density will scale far better than Registered (ECC) DDR2 DIMMs (RDIMMs).

    However as AMD solutions typically have more memory controllers, they should also be able to reach 128 GB (max) at around the same time.

    Once we get over this 32 GB useful, 64 GB maximum PAE-36 hump then PCs will really begin to shine.

    It won't be hardware people rave about any more after that, it'll be smarter, more efficient software (higher performance per watt of software, or the 'whole solution', including users, routers, etc).

    It is the best 'benchmark' of all.

    Take Sun Microsystems SWaP concept, but apply it to users, routers, switches, software + hardware performance, etc = 'The whole system'. Come 2012 that is all they'll care about.

    Need more performance, ask the system what to add and where the bottlenecks are.

    eg: Do we add more users ?, and if so where ?, or will slow network links congest actual productivity first ?, or is the software not designed quite right ?, or, or, or....
    I want to be able to analyse data like that, and just explain it in a simple way with a diagram to those who allocate the funds to solutions. Is this too much to ask for? Just one 4-dimensional, relational, database information management system, with multi-duplex relationships able to represent files, e-mails, network objects, etc. Seriously, it was possible software-wise quite some time ago, so where is the hold up?
    Once we hit 128 GB in commonplace servers, then high-end workstations, then desktops, I find it quite possible that IA-64 (Itanium, but not as we knew it before, under a new brand name, or a new iteration thereof) will make a come back. (It'll pack in more cores than x86/x64/EM64T within the same die space, and at 45nm - 32nm it'll outpace x86/x64 servers).

    Shame they don't produce it using the latest die shrinks really.

    It'll still be Sun, Intel and AMD/IBM, IBM (alone), and perhaps Intel/Apple and IBM (based designs)/Apple in the market.

    Someone is dragging their arse, Is it you ?
  6. I didn't understand all of it and, to be honest, I didn't exactly read through the whole thing, but some of that stuff looks nice :)
  7. AU$5,000? What is that, like $3.27 US? :lol:
  8. Quote: reduce downtime due to memory faults...

    Hmm. I don't think I've ever heard of a memory failure on a server for any company I've ever worked for. RAM in desktops yes. HD's, backplanes, entire SAN shelves, cabling, a PSU catch fire, but never an ECC RAM issue. That doesn't mean that they don't happen, but following the RAID model it seems like redundant failover boxes with moderate cost components would be cheaper and more reliable than a single bleeding edge, ultra high tech, ultra high cost box.

    Now, as densities increase and pathways narrow and gamma radiation from black holes starts bouncing electrons clean off of those FinFETs, memory error correction might become a bigger problem...
  9. Around US$ 3955.89

    However the US$ is slipping closer and closer to AU$, so it might be worth US$ 4250.00 sooner or later.

    This concept is not exclusive to IBM Power5+ machines either, most server mainboards do support it.


    Also try "AMD ChipKill", "Intel ChipKill", etc - under products, chipsets, documentation for a given server class chipset - also reveals that using ECC for just 'single bit error correction' is old school - ECC can offer so much more on even 'older' server platforms.

    However there is heaps of decent reading in that .PDF besides the ECC / ChipKill(tm) tech.

    Regarding downtime related to memory faults, How often is it immediately isolated to the memory ? - It isn't just 'downtime' it is more the cost of undetected corruption of a single bit, in a record that eventually 100 users will access and use, which may cause 1 client to get screwed around so much, and if that 1 client has political or media sway it could cost a fair bit to a company.

    I've got scenarios where a client PC w/o ECC and a server with ECC and smart software, with CRC32 on all network traffic, etc can still cause a record on the server to go corrupt and the ECC will never even consider a hardware malfunction.

    eg: A PC workstation sends [BAD] data to the server via its NIC. The NIC gets the data and adds a CRC32, the server NIC receives the data, confirms the CRC32 is OK, and passes the data into the server's ECC RAM, which the software eventually flushes to disk. At no point does the server 'realise' that it has processed and parsed bad data, then committed it to disk for other clients to read.

    However, in the above scenario the fault was not the server's lack of ECC or more advanced tech, but the client's lack of ECC and the server's (software) inability to truly know if the data is good or not.
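
The scenario above can be sketched in a few lines. This is a toy model, not any real client or server: `client_send`, `server_receive`, and the sample record are hypothetical, and `zlib.crc32` stands in for the link-level CRC32. What it shows is that a checksum computed after the bit has already flipped in client RAM will happily validate the corrupt payload.

```python
import zlib


def client_send(record, flip_bit=None):
    """Hypothetical client: a RAM fault may flip one bit BEFORE the CRC is added."""
    buf = bytearray(record)
    if flip_bit is not None:
        buf[flip_bit // 8] ^= 1 << (flip_bit % 8)
    payload = bytes(buf)
    return payload, zlib.crc32(payload)


def server_receive(payload, crc):
    """Hypothetical server: accepts the data if the CRC matches what arrived."""
    return zlib.crc32(payload) == crc


good_record = b"balance=1000"
payload, crc = client_send(good_record, flip_bit=67)  # bit 3 of byte 8: '1' becomes '9'
accepted = server_receive(payload, crc)               # True, although the record is wrong
```

The CRC protects the trip over the wire, not the contents of the sender's memory, so the server stores `balance=9000` without complaint.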

    There are ways around it on the software running on the client, but they are not being used.....

    As PCs get more and more RAM, ECC on PC workstations will become a 'requirement' for safe operation.

    That or client software is going to get much smarter, and need more CPU time to process the same workload (as far as user is concerned).

    Consider that in most workplaces the quantity of changes being processed on PC workstations with no ECC far outnumbers those on the servers with ECC. We are talking billions or trillions of transactions per year.

    If one workstation is slightly dodgy, in just one bit of RAM, over the course of [long period of time here], if all workstations access, read, write, change, send back changes to server, then over time, eventually all records in the database (or whatever) will become corrupt.

    It will take a long time, but eventually all of them will be.

    Unless of course the integrity of the file (as a container) gets so bad the server side software can no longer read and parse the data within it.

    Say only 3% of the records in a database are corrupt - The database is thus next to useless. This will not require 100% corruption of 100% of records to have a business impact. Minor corruption of even 0.3% of records will cause a massive problem. (That grows over time).

    Think about it, and what sort of problems* large scale businesses have with their databases, middleware, etc.

    *(regardless of frequency, as not all are reported, and recorded as statistics)

    It is one reason why Microsoft (and most other large OS vendors) have a 'soft' recommendation that people use only ECC memory when working with 2+ GB of RAM.
  10. If a company is concerned about uptime, they will have a hot spare. When there is an ongoing problem on a box, they will shut it down and replace it with a spare, then troubleshoot it. Any company that troubleshoots a hardware problem on a running production box has issues in planning, management, financial or all of them. Now, some guy running his servers in his garage might not have this luxury, but he probably can't afford the $3.27 to get a chipkill setup either.

    Nice theory on data corruption, but that isn't exactly how it works. First, 'data' as you describe it is a tiny fraction of the contents of RAM, often something like .1% or less. Second, a 'dodgy bit' will probably fail in a pattern, maybe averaging something like every 10 million or 100 million reads. 10 million reads at modern RAM speeds could take place in as little as .04 seconds, but let's say realistically that the dodgy bit comes around once an hour, or that one in every 100 billion random access reads hits that exact bit while that exact bit is wrong. So, your client application with the dodgy RAM will have an instability or exception every hour, and if you use your database application 8 hours a day for 5 days a week, every hour the application will throw an exception or Windows will show a fatal error, and on average every 6 months a single bit of 'data' will get flipped. This erroneous data doesn't get immediately transmitted to a database, but gets handled and checked explicitly and implicitly on multiple levels by application code, OS level functions, and even actual x86 instructions, never mind the checks that the database performs, and unless it is a raw datatype like a byte, it stands an excellent chance of causing a data type exception on some level. Even using a very high rate of 50% of these errors not being caught, a user would have to put up with a computer that crashes every hour for a year before a single data item had a single corruption.
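
The back-of-envelope figures in the paragraph above work out as follows. The inputs are the poster's own assumptions (one faulty read per hour, about 0.1% of RAM holding user data, a 40-hour week); only the variable names are invented here.

```python
faults_per_hour = 1          # one bad read per hour of use
faults_per_data_hit = 1000   # 0.1% of RAM is 'data', so 1 fault in 1000 lands on it
hours_per_week = 8 * 5       # 8 hours a day, 5 days a week

hours_per_data_fault = faults_per_data_hit // faults_per_hour   # 1000 hours
weeks_per_data_fault = hours_per_data_fault / hours_per_week    # 25 weeks, about 6 months

# If only half of those flipped data bits slip past every check:
weeks_per_uncaught_fault = weeks_per_data_fault * 2             # 50 weeks, about a year
```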

    Possible, yes, but generally errors in RAM lead to crashes and data instability, not databases being corrupt.

    Chipkill is an interesting idea, but to me it seems like a waste to have an additional spare chip on a memory card that already has an additional 2 bits per byte of ECC, for a total of 26 bits of overhead for 64 bits of storage. Why couldn't this 'smart algorithm' just send a warning saying DIMM 1 has a bad chip? Plus, if you do have a DIMM with a bad chip and chipkill engages, you no longer have a spare, and you can't replace just the spare chip, so a better system would probably be DIMMkill, where you have a hot spare DIMM and you can replace the ones that fail.
  11. Well clustering would make more sense to me than having hot spares taking up space in a data centre doing nothing.

    Bad / faulty RAM does not usually cause a typical user's PC to crash, give application faults, or appear any less stable than the next. It *can* cause all of these in the worst cases, but only slightly dodgy RAM tends to corrupt data and rarely (if ever) causes a BSOD.

    If the above data is per client, times it by the percentage of clients with slightly flaky RAM, or anything in a worse state, (for a BSOD to occur the dodgy bit needs to be at certain addresses that the OS is actually using) then multiply that by 100,000 - 1,000,000+ client workstations (users) and average the frequency of 'minor problems' then on a per day/per week basis.

    As the OS will allocate blocks for either instructions or data, never both (anymore), the applications don't always throw exceptions. In fact most of the time they'll parse the bad data as is without checking it too much (the overheads are considered not worth the gains), so there is rarely an application exception either... they just parse the data "as is" with only the most basic of checks on it.

    The figure of a single bit of data being flipped every 6 months on average is for machines with 64 MB of SDRAM; for machines with 2 GB of RAM of the same quality, expect it to occur 32 times more often. Consider also that the OS does not check the integrity of data pushed to the OS disk cache before committing it to disk (or to a network drive, via the network client's 'virtual disk' cache, which may be as large as 512 MB if not larger). This leads to one minor bit being inverted as frequently as (worst case scenario used) every 5 to 14 business (operating) days, although it is less likely in practice, as the more powerful the machine the lower the load on it will be. Consider that this is for PCs with RAM that actually passes memory tests. PCs in worse states than "once every 6 months minor failures" will only push the stats the other way.
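
The 32x scaling claimed above is simply the ratio of the two memory sizes, assuming the fault rate per megabyte stays constant (the poster's assumption, not a measured figure). A quick check, taking six months as roughly 182.5 calendar days:

```python
old_ram_mb = 64
new_ram_mb = 2 * 1024

scale = new_ram_mb // old_ram_mb        # 32
days_between_flips = 182.5 / scale      # about 5.7 days
```

That lands at the low end of the 5 to 14 day range quoted.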

    Even if the applications do all the checks, when 'committed to disk' (via network clients write-back cache, which is dynamic in size, and may even be a 2nd buffer 'behind' the OS's core disk cache) the data can become corrupted. (This partially depends how the application is coded, what it is doing, etc).

    Then factor in the typical amount of page operations (even just to other RAM) that occur in a typical application. The virtual address is always the same within each running process, but the physical addresses will be changing quite often.

    Most of the 'enterprise quality (cough) applications' I've seen don't check the data very thoroughly at all, they only care about throughput at both the client and server end. They are also coded as though the client workstation (user) PCs have near unlimited resources (add this fact to the last 3 major paragraphs).

    I'll agree it is most minor, however:
    - One 'normal' PC has a minor fault every 6 months.
    - There are 100,000 to 1,000,000 client PCs
    - Equates to (approx) between 547 and 5475 (for 1 million) 'minor' problems per day
    - The quantity of minor problems starts to outweigh the impact of a more critical fault that only affects say 50 - 250 users
    - In a given 6 month period that is around 68,993 - 689,938 'minor' problems. Many of these do lead to corruption of records in large databases, which are then used by other 'staff' who need to put up with 'bad data', which impacts customer service, and so on. This snowballs: once a minority of records are even slightly screwed it can affect a large number of 'users' and 'customers' (thus service and public image suffer).
    - Assuming fields in records are always loaded to the same physical addresses (not unlikely), and if only 3% of PCs are faulty, and each staff member processes 50+ records a day (if not far more), then the problem snowballs at an even more rapid rate, to the point where it is harder to deal with using 'isolated cases' treatment.
    - By the time management acknowledge there is a problem, 3%+ of all records are already corrupt. That doesn't sound like much but the impact is huge. (And how do you 'un-corrupt' data? You can't really, so the best protection is prevention - anyone with a better idea let me know, so I can patent it - 8) ).
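
The fleet-wide daily figures in the list above follow from dividing the number of PCs by the number of days between faults. A sketch, assuming one minor fault per PC every six months (about 182.5 days); the function name is illustrative only:

```python
DAYS_PER_HALF_YEAR = 365 / 2


def minor_faults_per_day(pc_count):
    """Expected fleet-wide faults per day if each PC faults once per half year."""
    return pc_count / DAYS_PER_HALF_YEAR


small_fleet = minor_faults_per_day(100_000)     # roughly 548 per day
large_fleet = minor_faults_per_day(1_000_000)   # roughly 5,479 per day
```

These land close to the 547 and 5475 quoted in the list; the small gap for the larger fleet is presumably rounding.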

    I am yet to see applications on client PCs that truly check the data they are sending back to the server, via the system in my last post.

    If you can point me in the direction of any 'client workstation PC' applications that do, please let me know (as above, consider the page faults, virtual vs physical address issues, etc).

    Indicating the problem only affects each user maybe once every 6 months is wrong, as each user is loading records previously stored by other users, so the actual graph is exponential, not linear. That is why it snowballs faster than a 'per user, isolated case' perspective suggests; it should be looked upon as 'affecting several users, not isolated, and bad data stored by one user (or PC) can, and will, affect other staff with 'perfect' PCs'. (You get my gist now).

    I mean both perspectives are valid, and I appreciate everyone's input.

    Originally I figured, meh, isolated case, meh, isolated case, but after observing the issue at so many places, and not really having a viable solution to 'un-corrupt' data, thought I'd share my thoughts.

    I know there are ways to stop 'more' 'minor issues' from occurring, but it will not undo the damage already done (which is currently fairly manageable, but it gets worse slightly each passing day).

    8) - Tabris:DarkPeace

    PS: I've seen places that are already well beyond the '547 minor (hardware - and/or poor software checking - induced) problems per day per 100,000 staff' ratio, if you graph the data it forms the start of an exponential or logarithmic 'curve'. It doesn't need to get far before it becomes a **** storm in most cases.
  12. It's similar to the way they have more Cell's(SPU's) in the Cell CPU. All in the name of redundancy

    I do not consider it to be a bad idea at all....after all how much does it cost to add another 1/8th of the ram's capacity for redundancy.....Most times one chip goes bad and the rest could run for years...

    What do they do if the backup chip goes bad? move to the next one?
  13. What you're describing is shitty programming. But even poorly written apps will use the inherent datatyping of whatever language they are written in. So, if you're dealing with object types, your corrupt bit will likely happen in a pointer on the stack and that pointer will point to nowhere, memory exception, program dies. Or it will occur in a byte that is strongly typed, and has a good chance of making that type no longer processable. x86 instruction can't complete, fatal exception, application dies. And so forth.

    If you worked somewhere that had data corruption problems then you should most likely be blaming your programmers or vendors, not memory. 64 MB or 64 GB, if your system is crashing all the time you will know you have a problem long before you do a lot of damage to a database.

    From a warranty standpoint, the chipkill saves IBM some money, coz they can use lower cost components and not have a ton of service calls.
  14. Would it not be fair to say 30% of memory corruption affects x86/x64 instructions while 70% affects the data they process?

    In which case, assuming 'standard' application quality, it will parse bad data, not always, but neither will it detect it 100% of the time.

    I mean heck, if a memory error causes a program error it is piss easy to diagnose.

    It is the 70% of ones that don't that concern me. :wink:
  15. No it would not be fair. Open Word, or Adobe, or any application. A 30K Word or PDF gets allocated 30-100MB. That's about .1%-.03% utilization of memory for data. Ditto for most business applications.
  16. I am talking more enterprise applications, database servers and the like (and everything between the user and the absolute back of the back end).

    One eg: Firefox uses heaps of memory, as every GIF and JPEG (and .PDF by the Adobe Acrobat Plugin) it reads is given 3-4 bytes per pixel on screen as Firefox stores them in memory as Bitmaps, this is because they are faster to render when switching tabs this way.

    In some cases (eg: .PDFs in memory) even fonts may use 3-4 bytes of memory per pixel on screen (or rendered within the window rather). While the actual file is compressed, and stores text in a different format.
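
As a rough illustration of why decoded images dwarf their files: an uncompressed bitmap needs width x height x bytes-per-pixel of memory, regardless of how small the JPEG or GIF was on disk. The image size below is a made-up example, and this says nothing about how any particular browser version actually manages its image cache.

```python
def bitmap_bytes(width, height, bytes_per_pixel=4):
    """Memory needed to hold an uncompressed bitmap of the given size."""
    return width * height * bytes_per_pixel


full_screen = bitmap_bytes(1024, 768)            # 3,145,728 bytes
full_screen_mib = full_screen / (1024 * 1024)    # 3.0 MiB for one screen-sized image
```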

    This is why paging and virtual memory are a good idea. Every application in a Win32 env can only see virtual memory; that memory either has all pages in physical memory (quite often the case with 1 - 2 GB RAM systems), or may have several pages on disk. A page is typically 4 KB.

    Consider: The C:\WINDOWS\FONTS\ folder is 16 MB with 213 standard (well 125 of them are standard, 88 are added in my case) fonts installed. Each character (when on screen) uses a heck of a lot more than 1 or 2 bytes to represent, especially the closer you get to the actual frame buffer. (Acrobat Reader .PDF is the perfect example btw).

    Without getting complex, with metafile formats and the like, each font is, on average, just over 74 KB, assuming an average of 256 (could be 384 because of Unicode char sets) characters per font that means each character uses an average of 296 bytes in meta-image format just to represent in a GUI state (give or take).
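
The per-character figure above is a straight division, shown here for checking. The 74 KB average and the 256 characters per font are the poster's numbers.

```python
avg_font_bytes = 74 * 1024
chars_per_font = 256

bytes_per_char = avg_font_bytes / chars_per_font   # 296.0
```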

    That is one of many examples, but that 32 KB Word 97/2000 document file (on disk) uses far more memory than you're considering. (Word for DOS - well, yeah you'd be about right for it, as it pre-dates GUI standard fonts). On a side note, Mac OS X actually caches all font data in video memory instead of system memory (tricky to do, but worth it), which is why its GUI feels very 'fluid and responsive' compared to Windows (although each has its strong points).

    File sizes often have very little to do with memory consumption, unless you're storing compressed data in memory and uncompressing it on the fly during reads, and compressing on the fly with writes (which would slow memory I/O to a crawl).

    There was once a product that compressed pages before committing them to memory, and uncompressed them when reading them back. The point was to use memory more efficiently and reduce paging, and when/if paging was required it would need less disk I/O as the pages were compressed.

    - This product failed, although it would be more useful today than it was during the days when the Intel 80 486 SX processor was a more common household CPU - Which was about the time it was released.
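
The idea described above, compressing a page before it is stored and expanding it on the way back, can be sketched with `zlib`. This only illustrates the concept: it is not how the product worked internally, and the page contents and helper names are invented.

```python
import zlib

PAGE_SIZE = 4096  # typical page size mentioned earlier in the thread


def store_page(page):
    """Compress one page before keeping it in memory."""
    if len(page) != PAGE_SIZE:
        raise ValueError("expected exactly one page")
    return zlib.compress(page)


def load_page(stored):
    """Expand a stored page back to its original bytes."""
    return zlib.decompress(stored)


page = (b"record:0000;" * 400)[:PAGE_SIZE]   # a highly repetitive example page
stored = store_page(page)
restored = load_page(stored)
saved_bytes = PAGE_SIZE - len(stored)
```

Repetitive pages shrink a lot; already-compressed data (JPEGs, archives) would not, and every access pays CPU time for the round trip.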

    When an application loads, the memory it uses is mostly data, not code. The code then uses the data to its advantage. eg: I assume you are talking about memory usage in Windows envs, in which case the bulk of the memory each process is using is for the GUI. The GUI is classed as a data object, not code.

    To imply that the 'soffice.bin + soffice.exe' processes in OpenOffice have over 128 MB of actual x86 instructions running in memory (combined) is totally ludicrous to say the least. (Sorry, it really is.) The bulk of that 128 MB+ is immediately paged out: when not required by the OS / user for some time, it is paged to disk to free physical memory for a larger OS disk cache, which increases the cache hit rate of disk reads and permits delayed writes to the disk sub-system.

    To have 96 MB+ of 'code' in memory, once compiled, would require over 1 GB of source code (at least, try 4 GB+ of source code before compiled to be realistic). The fact of the matter is most of that memory is 'data' not 'code', and this is before the user even does anything using 'soffice.bin' [The OpenOffice SysTray icon].

    If computers were 99% code, and 1% data (when in memory), why would most large vendors (eg: Fujitsu SPARC systems) decrease the size of L1 instruction cache while increasing the size of data caches? (Because in your example that would hinder performance, not improve it).

    Fact of the matter is, most memory that is allocated is 30% instructions and 70% data (in any GUI env, even in some DOS 'Graphical Console Apps', like DOS-SHELL in a more advanced GUI/Text mode).

    30% of the memory allocated will be code, and 10-20% of that code will perform 80-90% of the work. Which is why instruction caches got introduced (external to the CPU) back in the 80 386 days.

    Yes instruction only caches, they did not cache data because that did not help performance as much, and there was far less code to cache than data (caches were 8 - 64 KB at this stage).

    The 80 486 introduced 'on die' L1 caches in every chip of the series. They did not have data caches, only Burst-Reads.

    In the Pentium (tm) 60 and 66 MHz, both L1 DATA and L1 INSTRUCTION caches got introduced. The CPU was mis-reported as being like 2 x 486's (which would mean 2 x Instruction caches), when in fact it was like 1 x 486 with a more advanced FPU and a mostly dedicated path for DATA in addition to CODE. This is what enabled applications to use far more DATA than they had previously used, skewing the ratio even more. (30% / 70% is more like 15% / 85% these days, but this was where it turned).

    By the time of the Pentium MMX (double L1 caches for both again), any system that lacked a L1 data cache was not scaling in performance, as data was really getting common. So common in fact that MMX enabled processing 4 blocks of data at once (not all data, but INTs in a small array), to permit even better scaling in future software that ran on it.

    Other CPUs concentrated on Out of Order (OoO) execution, which only affected instructions.

    (I am sure the point has got across by now).

    The only process that doesn't page 'data' is the 'System Idle Process', it uses under 32 KB but only because it can not make maximum utilization of several 4 KB pages, it really only has a few KB of instructions, and only a few bytes of which (The CPU.HALT instruction) are run in an endless loop. [It has no GUI at all, so it doesn't have any 'data' to deal with, only code].

    If it had code and data (a GUI, plus logging, etc) then the ratio of code/data would be 3% to 97%, just like many other applications.

    I suspect that you may have memory consumption confused with code/data ratios of that memory consumption.

    8) - Still 10 out of 10 for style, and having the guts to post in this thread...
    :lol: - But minus several million for (good ?) thinking.

    [That is a famous quote btw, don't take it literally, if you know where from feel free to chill here :mrgreen: ]
    [and yes, I did go overboard in this reply, but if too short someone might scroll over it too fast in future, and make exactly the same mistake]
  17. That software was:

    Some may know it as "SoftRAM"

    It was a novel idea, and the faster processors got (with the more cache), and the larger the delta between disk sub-system performance and RAM got, the more useful this concept becomes.

    However once 64 MB+ systems became commonplace, as RAM prices dropped heaps during this time, and most applications of the day only needed 8 - 24 MB to run, the product's sales dropped like a rock.

    It might have some practical purposes today, but with 4 GB as cheap as it is, vs the performance impact this concept has on a system (although.... quad-cores.... it might just be viable again).

    No-one would risk it, as it already failed once on the market.

    However large RamDrives compressed with Stacker 4.0 were popular in my day, as you could run Doom from RAM (the full non-shareware version) on high end PCs of the time. With space left over for a 'disk' cache (which improved performance still more as RAM Drive I/O wasn't efficient, but still faster than disk with excellent-o seek times).

    [Insane, yes, but people did it for performance and frags]
  18. Anyone remember: FILE-ID.DIZ ; ?

    I was reading some stuff, and it brought a ton of memories flooding back.

    People who [still] remember FILE-ID.DIZ are few and far between.

    God it has been so long.... so damn long.

    Must continue cru---sade---.
  19. I'm interested, what is/was this FILE-ID.DIZ?

    That quote from your earlier post, it's from The Hitchhiker's Guide to the Galaxy, right? I know they used that line in the movie, but I can't remember what book it's from.
  20. It used to be in almost every ZIP, ARJ, LZH (LHA), RAR, UC2 and other compressed shareware on BBS's

    FILE-ID.DIZ - Defined the contents of the archive for extraction to a database that people could search, browse over to find shareware of interest to them.
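
For anyone who never saw one, here is a small sketch of how a BBS-style tool might pull the description out of an archive. It handles ZIP only, via Python's `zipfile`, and builds a throwaway archive in memory so it runs on its own. The conventional on-disk name is `FILE_ID.DIZ` with an underscore; the sketch accepts the hyphenated spelling used in this thread as well. The helper name `read_description` and the sample contents are invented for the example.

```python
import io
import zipfile

DIZ_NAMES = {"file_id.diz", "file-id.diz"}


def read_description(archive_bytes):
    """Return the text of the description file in a ZIP archive, or None."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            if name.lower() in DIZ_NAMES:
                # DOS-era text was usually code page 437
                return zf.read(name).decode("cp437")
    return None


# Build a tiny example archive in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("FILE_ID.DIZ", "Example shareware v1.0\nA demo archive.")
    zf.writestr("GAME.EXE", b"\x00")

description = read_description(buf.getvalue())
```

A BBS would store that text in its file database so callers could search and browse without downloading the archive first.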

    And yeah, the 'style' quote was from book one I suspect.

    In a moment of madness I actually reconsidered using QEMM, DESQView, MagnaRAM, IBM's PC-DOS v7.0 with Stacker 4.0, and text-GUI software again. (Which I used to use when very young, on a NetWare 3.x / 4.x platform to help assist running a BBS).

    However to make up for it I've just downloaded:
    - VMware Server [All the eval versions, which are time unlimited and free]
    - Novell Open Enterprise Server SP2 EVAL, which includes:
    - Open Enterprise Server SP2 (CDs 1 - 5 inclusive)
    - SUSE Core Version 9 (CDs 1 - 5 inclusive)
    - Novell Client 1.1 for Novell Linux Desktop 9
    - Open Enterprise Server NetWare OS
    - Open Enterprise Server NetWare Products
    - Novell Identity Manager 3.0.1 Bundled Edition (part of above, sort of, but sort of not as well).
    - SUSE Linux Enterprise Server 10 VMware Workstation Virtual Machine (image),
    - VMware Server - all the free + unlimited time trial software for both Win32 and Linux platforms.

    I want to try and get a Web-Server up and running, while learning a little more about some of the new stuff.

    Will likely end up using Fedora (Core 4, 5 or 6), Red Hat or (Open) SUSE Linux 10.x to host it on ultimately though.

    I am looking for other people to learn it with (if that would even work as a concept).