Memory - Part II - "What memory does"

bilbat


other content:
Part I - "What memory is"
Part III - "Evaluation and selection"
Part IV - "Tweaking and tuning"

rotatingdisks.gif


Types of memory in your computer, and how they are used:

Your computer (any computer...) really does, at the most basic level, two things (besides the obvious - annoying the bejeezus out of us, turning electricity into copious amounts of heat [GTX 480[:huntluck:9], anyone?], and thoroughly proving the existence of "Murphy's Laws"!): it stores information, encoded in various ways, as 'bit patterns'; and it manipulates those stored patterns - and that's it!! If we think of all the 'encoded stuff' not as 'memory' but simply as 'storage', our viewpoint changes: we've always had 'encoded storage'! First, we invented language, and more importantly, alphabets, to 'encode' our words. Then, we figured out that if we 'scratched' the characters into a tray of mud, and let the mud dry, we had a 'document'! (I think there were originally fifteen commandments - Moses must have dropped a tablet on the way back down the mountain - I'm pretty sure, for instance, that eleven was: "get off that damned cell phone and drive!") The Egyptians invented papyrus and inks, Hollerith invented punched-card encoding and the 'punch-cards' to carry it, and the race was on!

The 'manipulation' part must be done by your CPU, with information that has, one way or another, been 'imported' into it. Thus, we have a 'ladder of access', or hierarchy of storage:

memorymapsmaller.jpg


The whole point of the memory hierarchy is to allow reasonably fast access to a large amount of memory/storage. If only a little memory were needed, we'd use fast static RAM (i.e., the stuff they make cache memory out of) for everything. If speed weren't necessary, we'd just use lower-cost dynamic RAM for everything. The trick is that we can take advantage of the principle of 'locality of reference' to move often-referenced data into fast memory and leave less-used data in slower memory. Unfortunately, which data is 'often-used' versus 'lesser-used' changes over the execution of any given program, usually moment by moment. Therefore, we cannot simply place our data at various levels in the memory hierarchy and leave it alone for the duration of the program. Instead, the memory subsystems need to be able to move data between themselves dynamically, to adjust for changes in 'locality of reference' as the program runs. There is a lot going on in your processor that you are totally unaware of - here's a 'map' of a single execution unit ('core'):
2000px-Intel_Nehalem_arch.svg.png
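
Here's a quick, concrete taste of 'locality of reference' (a minimal sketch, in plain C, with made-up array sizes): both loops touch exactly the same data, but the first walks it in the order it sits in memory, while the second hops 16 KB between accesses - run it, and the 'good locality' loop will typically finish several times faster:

/* locality.c - same work, very different memory behavior (illustrative sizes) */
#include <stdio.h>
#include <time.h>

#define ROWS 4096
#define COLS 4096

static int grid[ROWS][COLS];            /* ~64 MB - far larger than any cache */

int main(void)
{
    long sum = 0;
    clock_t t0, t1;

    t0 = clock();
    for (int r = 0; r < ROWS; r++)      /* row-major walk: sequential addresses, */
        for (int c = 0; c < COLS; c++)  /* good locality, mostly cache hits      */
            sum += grid[r][c];
    t1 = clock();
    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int c = 0; c < COLS; c++)      /* column-major walk: each access lands  */
        for (int r = 0; r < ROWS; r++)  /* 16 KB away from the last one - poor   */
            sum += grid[r][c];          /* locality, far more cache misses       */
    t1 = clock();
    printf("column-major: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return (int)(sum & 1);              /* keep the compiler from discarding sum */
}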

Individual processor instructions operate on information in 'registers'; register memory is built into each core, and can be accessed every 'execution' (CPU clock 'tick'), but is very limited in size, and expen$ive, both in cost and in 'die real-estate'! To get around this cost, and still make a somewhat larger amount of fast storage available, CPUs have 'cache memory' (static RAM, from Part I) in several speeds and sizes. Modern processor designs generally (there's that word again!) have three 'levels' of cache, named accordingly: L1 ("level one"), L2, and L3... These have their own 'hierarchy': L1 is the smallest, and fastest, and both L1 and L2 are built into each core; L3 is larger, slower, and is 'shared' by all the cores on the die... Cache speeds and sizes vary on a 'per architecture' basis; here is a table of relative speeds/latencies for a few server CPUs, given in 'core ticks':

cachecounts.jpg


Programs are largely 'unaware' of the memory/cache hierarchy. In fact, the program only explicitly controls access to main memory and those components of the memory hierarchy at the file storage level and below (since manipulating files is a program-specific operation). In particular, cache access and virtual memory operation are generally transparent to the program. That is, access to these levels of the memory hierarchy usually takes place without any intervention on the program's part; the program just accesses main memory, and the hardware (and operating system) takes care of the rest. Most cache memory is not organized as a group of individual bytes; instead, it is organized into 'cache lines', with each line containing some number of bytes (typically a small power of two, like 16, 32, or 64):
memoryarchitecturecache.jpg
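
To make the 'cache line' idea concrete, here's a tiny sketch of how an address gets carved up - assuming a completely made-up cache (64-byte lines, 512 sets, direct-mapped); real caches differ in the numbers, but the offset/set/tag split is the same idea:

/* cacheline.c - how an address maps onto a (hypothetical) direct-mapped cache:
   64-byte lines, 512 sets; the numbers are invented, the split is the point */
#include <stdio.h>

#define LINE_SIZE 64ULL     /* bytes per cache line (assumed)            */
#define NUM_SETS  512ULL    /* sets in this pretend direct-mapped cache  */

int main(void)
{
    unsigned long long addr = 0x2a4c9b3ULL;                     /* any address will do  */

    unsigned long long offset = addr % LINE_SIZE;               /* byte within the line */
    unsigned long long set    = (addr / LINE_SIZE) % NUM_SETS;  /* which set it lands in*/
    unsigned long long tag    = addr / (LINE_SIZE * NUM_SETS);  /* what the cache keeps
                                                                   to tell this line apart
                                                                   from others that map to
                                                                   the same set          */

    printf("address 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           addr, tag, set, offset);
    return 0;
}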

If every access really went all the way to main memory, programs would run quite slowly, since modern DRAM main memory subsystems are much slower than the CPU. The job of the cache memory subsystems (and the cache controller) is to move data between main memory and the cache so that the CPU can quickly access data in the cache. Likewise, if data is not available in main memory, but is available in slower virtual memory, the virtual memory subsystem is responsible for moving the data from hard disk to main memory (and then the caching subsystem may move the data from main memory to cache for even faster access by the CPU).

With few exceptions, most transparent memory subsystem accesses take place between one level of the memory hierarchy and the level immediately above or below it. For example, the CPU rarely accesses main memory directly. Instead, when the CPU requests data from memory, the L1 cache subsystem takes over. If the requested data is in the cache, then the L1 cache subsystem returns the data and that's the end of the memory access. On the other hand, if the data is not present in the L1 cache, then it passes the request on down to the L2 cache subsystem. If the L2 cache subsystem has the data, it returns this data to the L1 cache, which then returns the data to the CPU. Note that requests for this same data in the near future will come from the L1 cache rather than the L2 cache, since the L1 cache now has a copy of the data.

If neither the L1 nor L2 cache subsystems have a copy of the data, then the L3 cache is queried; if none of them gets a 'hit', then the memory subsystem goes to main memory to get the data. If found in main memory, the memory subsystems copy this data to the L3 cache, which passes it to the L2 cache, which passes it to the L1 cache, which gives it to the CPU. Once again, the data is now in the L1 cache, so any references to this data in the near future will come from it... This fairly complex-looking process is automated by various mechanisms: 'branch prediction' ("what instructions/data am I likely to need next?"); TLBs ('translation lookaside buffers' - "where can I find this information quickly?"); and 'flushing'.
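
Before we get to flushing, here's that 'walk down the ladder' as a toy simulation (plain C, with absurdly tiny, direct-mapped 'caches' and a pretend 'DRAM' array - nothing here is a real controller, it just acts out the read path described above):

/* cache_walk.c - a toy simulation of the read path described above:
   three tiny direct-mapped "caches" in front of a pretend DRAM array.
   Sizes and behavior are made up for illustration only. */
#include <stdio.h>

#define DRAM_WORDS 4096

typedef struct { int valid; unsigned tag; int data; } slot_t;

static int    dram[DRAM_WORDS];              /* "main memory"                */
static slot_t l1[8], l2[32], l3[128];        /* smaller+faster ... larger    */
static long   hits_l1, hits_l2, hits_l3, misses;

static int probe(slot_t *level, unsigned nslots, unsigned addr, int *out)
{
    slot_t *s = &level[addr % nslots];                 /* which slot it maps to */
    if (s->valid && s->tag == addr) { *out = s->data; return 1; }
    return 0;                                          /* miss                  */
}

static void install(slot_t *level, unsigned nslots, unsigned addr, int data)
{
    slot_t *s = &level[addr % nslots];                 /* evict whatever was there */
    s->valid = 1; s->tag = addr; s->data = data;
}

static int read_word(unsigned addr)
{
    int v;
    if (probe(l1,   8, addr, &v)) { hits_l1++; return v; }        /* L1 hit: done        */
    if (probe(l2,  32, addr, &v)) { hits_l2++; install(l1, 8, addr, v); return v; }
    if (probe(l3, 128, addr, &v)) { hits_l3++; install(l2, 32, addr, v);
                                               install(l1,  8, addr, v); return v; }
    misses++;                                           /* full miss: go to "DRAM",      */
    v = dram[addr % DRAM_WORDS];                        /* then copy the data back up    */
    install(l3, 128, addr, v);                          /* the ladder so the next access */
    install(l2,  32, addr, v);                          /* hits in L1                    */
    install(l1,   8, addr, v);
    return v;
}

int main(void)
{
    for (unsigned i = 0; i < DRAM_WORDS; i++) dram[i] = (int)i;

    for (int pass = 0; pass < 3; pass++)                /* re-read a small working set:   */
        for (unsigned a = 0; a < 8; a++)                /* the first pass misses all the  */
            (void)read_word(a);                         /* way to "DRAM", later passes    */
                                                        /* hit in L1                      */
    printf("L1 hits %ld, L2 hits %ld, L3 hits %ld, misses %ld\n",
           hits_l1, hits_l2, hits_l3, misses);
    return 0;
}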


Flushing: [:jaydeejohn:1]

The problem we've overlooked in this discussion of caches is "what happens when the CPU writes data to memory?" The simple answer is trivial: the CPU writes the data to the cache. However, what happens when the cache line containing this data is replaced by incoming data? If the contents of the cache line were not written back to main memory, then the data that was written will be lost - the next time the CPU reads that location, it will fetch the stale, original values from main memory!

Clearly, any data written to the cache must, ultimately, be written to main memory as well. There are two common write policies that caches use: write-back and write-through. Interestingly enough, it is sometimes possible to set the write policy under software control; it isn't hardwired into the cache controller like most of the rest of the cache design. However, don't get your hopes up: generally, the CPU only allows the BIOS or operating system to set the cache write policy - your applications don't get to mess with this. However, if you're the one writing the operating system... [:bilbat:8]

The write-through policy states that any time data is written to the cache, the cache immediately turns around and writes a copy of that cache line to main memory. Note that the CPU does not have to halt while the cache controller writes the data to memory. So unless the CPU needs to access main memory shortly after the write occurs, this writing takes place in parallel with the execution of the program. Still, writing a cache line to memory takes some time and it is likely that the CPU (or some CPU in a multiprocessor system) will want to access main memory during this time, so the write-through policy may not be a high performance solution to the problem. Worse, suppose the CPU reads and writes the value in a memory location several times in succession. With a write-through policy in place the CPU will saturate the bus with cache line writes and this will have a very negative impact on the program's performance. On the positive side, the write-through policy does update main memory with the new value as rapidly as possible. So if two different CPUs are communicating through the use of shared memory, the write-through policy is probably better because the second CPU will see the change to memory as rapidly as possible when using this policy.

The second common cache write policy is the write-back policy. In this mode, writes to the cache are not immediately written to main memory; instead, the cache controller updates memory at a later time. This scheme tends to be higher performance, because several writes to the same variable (or cache line) only update the cache line; they do not generate multiple writes to main memory.

Of course, at some point, the cache controller must write the data in cache to memory. To determine which cache lines must be written back to main memory, the cache controller usually maintains a 'dirty' bit with each cache line. The cache system sets this bit whenever it writes data to the cache. At some later time the cache controller checks this 'dirty' bit to determine if it must write the cache line to memory. Of course, whenever the cache controller replaces a cache line with other data from memory, it must first write that cache line to memory if the 'dirty' bit is set. Note that this increases the latency time when replacing a cache line. If the cache controller were able to write dirty cache lines to main memory while no other bus access was occurring, the system could reduce this latency during cache line replacement.
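
Here's the difference between the two policies, acted out in a toy single-level cache (everything - the sizes, the 'bus write' counter - is invented for illustration); hammering one location a thousand times generates a thousand bus writes under write-through, and none at all under write-back, until the dirty line eventually gets evicted:

/* writeback.c - toy single-level cache contrasting the two write policies above.
   Sizes and the 'bus write' counter are invented for illustration. */
#include <stdio.h>

#define NSLOTS 8

typedef struct { int valid, dirty; unsigned tag; int data; } slot_t;

static slot_t cache[NSLOTS];
static int    memory[256];          /* pretend main memory          */
static long   bus_writes;           /* how often we touch the "bus" */

static void write_word(unsigned addr, int value, int write_through)
{
    slot_t *s = &cache[addr % NSLOTS];

    if (s->valid && s->tag != addr) {       /* slot occupied by someone else:      */
        if (s->dirty) {                     /* write-back: flush the old line      */
            memory[s->tag] = s->data;       /* only now, when it is evicted        */
            bus_writes++;
        }
        s->valid = 0;
    }

    s->valid = 1; s->tag = addr; s->data = value;   /* the write itself lands in cache */

    if (write_through) {                    /* write-through: every write also     */
        memory[addr] = value;               /* goes straight out to main memory    */
        bus_writes++;
        s->dirty = 0;
    } else {
        s->dirty = 1;                       /* write-back: just mark the line      */
    }                                       /* 'dirty' and move on                 */
}

int main(void)
{
    for (int policy = 1; policy >= 0; policy--) {   /* 1 = write-through, 0 = write-back */
        bus_writes = 0;
        for (int i = 0; i < NSLOTS; i++) cache[i].valid = cache[i].dirty = 0;

        for (int n = 0; n < 1000; n++)              /* hammer the same location, as in   */
            write_word(5, n, policy);               /* the 'saturate the bus' case above */

        printf("%s: %ld bus writes\n",
               policy ? "write-through" : "write-back", bus_writes);
    }
    return 0;
}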


How CPU cache, ISRs, and the 'thread manager' combine to make memory speed (mostly) irrelevant:

Windoze looks like it is multi-tasking; in fact, its wide acceptance brought the term 'multi-tasking' into common usage (common abuse?); however, the 'magic behind the scenes' is much akin to the illusion that movies or TV use to convince us that thirty 'stills' shown per second are somehow 'moving'... If you're not very familiar with windoze 'innards', and want to be surprised, open: Control Panel > Performance Information and Tools > Resource Monitor > 'CPU' tab > 'Processes' window, especially the 'Thread' counts, and the 'Services' window; you'll find your processor has the ultimate 'schizophrenia problem':[:graywolf:7] "The One-Thousand-Two-Hundred-and-Fifty-Six Faces of EVE"!!

Windoze is a serial, time-slicing, pre-emptive multitasking operating system; its scheduler allows every task to run for a certain amount of time, called its time slice. If a process does not voluntarily yield the CPU (for example, by performing an I/O operation), a timer interrupt fires, and the operating system schedules another process for execution instead (pre-empting the current one). This ensures that the CPU cannot be monopolized by any one processor-intensive application.

Modern architectures are interrupt driven. This means that if the CPU requests data from a disk, for example, it does not need to 'busy-wait' until the read is over; it can issue the request and continue with some other execution, and when the read is over, the CPU can be interrupted and presented with the result. Devices are operated, inside an operating system/BIOS, by something called an 'interrupt mechanism'. Interrupts, to the system, are kind of like a little kid tugging on ma's skirt - "Ma, can I have a candy bar?"; "Ma, can I have a coloring book?"; "Ma, look at the funny man!" :??: Every time you hit a key on your keyboard, the hook-up generates an interrupt, telling the system "Hey, this is important - I've got a keystroke to process!" The BIOS/operating system have a set-up to then do what the interrupt requires - handle a keystroke, read a track from disk, whatever... Obviously, the interrupt mechanism has to 'stack up' the interrupts; it can't simply do 'this' while neglecting 'that', say, 'ignoring' your keystrokes - they all have to be taken care of, mostly by the DPC (deferred procedure call) stack, which you can read about here, at TheSycon
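
To picture how that 'stacking up' works, here's a much-simplified sketch (plain C - the 'ISR' is just an ordinary function standing in for a real interrupt handler, and the queue is the moral equivalent of a DPC list): the handler does the bare minimum - stamp the event into a buffer - and the slow work happens later, at a lower priority:

/* dpc_sketch.c - the 'stack up the interrupts, service them later' idea in miniature.
   keyboard_isr() stands in for a real interrupt handler (which would be entered by
   hardware, not called like this). */
#include <stdio.h>

#define QUEUE_LEN 64

static int queue[QUEUE_LEN];            /* pending keystrokes (scan codes)        */
static int head, tail;

static void keyboard_isr(int scancode)  /* runs "immediately": do as little as    */
{                                       /* possible, just record the event        */
    int next = (tail + 1) % QUEUE_LEN;
    if (next == head) return;           /* queue full: in real life, a lost event */
    queue[tail] = scancode;
    tail = next;
}

static void deferred_processing(void)   /* runs later, at leisure: the slow part  */
{
    while (head != tail) {
        int scancode = queue[head];
        head = (head + 1) % QUEUE_LEN;
        printf("processing keystroke %d (translate, echo, wake the app...)\n", scancode);
    }
}

int main(void)
{
    keyboard_isr(30);                   /* pretend three keys arrived while the   */
    keyboard_isr(48);                   /* CPU was busy with something else       */
    keyboard_isr(46);

    deferred_processing();              /* ...handled once it gets around to it   */
    return 0;
}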

Changing tasks, or servicing an interrupt, requires the CPU to do something called "context switching" - 'context' meaning, more or less, the 'environment' of that process: the number and list of its open file handles, its current progress in execution, the data about its 'objects', etc. In a switch, the state of the first process must be saved somehow, so that, when the scheduler gets back to the execution of the first process, it can restore this state and continue. The state of the process includes all the registers that the process may be using, especially the program counter, plus any other operating system specific data that may be necessary. This data is usually stored in a data structure called a process control block (PCB), or switchframe.

Now, in order to switch processes, the PCB for the first process must be created and saved. Since the operating system has effectively suspended the execution of the first process, it can now load the PCB and context of the second process. In doing so, the program counter from the PCB is loaded, and thus execution can continue in the new process. New processes are chosen from a queue or queues. Process and thread priority can influence which process continues execution, with processes of the highest priority checked first for ready threads to execute.
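
Here's a loose sketch of what a PCB might hold, and of the round-robin 'save one, load the next' dance (field names are invented, and the actual register save/restore is done in assembly by the kernel - this just narrates the bookkeeping):

/* pcb_sketch.c - a loose sketch of a process control block and a round-robin switch.
   Field names are invented; a real kernel saves/restores the registers in assembly. */
#include <stdio.h>

typedef struct {
    unsigned long program_counter;   /* where execution resumes        */
    unsigned long stack_pointer;
    unsigned long registers[16];     /* general-purpose register image */
    int           pid;
    int           priority;
    /* ...open file handles, memory map, scheduling statistics, etc.   */
} pcb_t;

#define NPROC 3
static pcb_t ready_queue[NPROC] = {
    { .pid = 1, .program_counter = 0x1000 },
    { .pid = 2, .program_counter = 0x2000 },
    { .pid = 3, .program_counter = 0x3000 },
};

static void context_switch(pcb_t *from, pcb_t *to)
{
    /* 1. save the outgoing task's state into its PCB (registers, PC, SP...)  */
    /* 2. load the incoming task's state from its PCB                         */
    /* here we just pretend, and narrate:                                     */
    printf("tick! saving pid %d at pc=0x%lx, resuming pid %d at pc=0x%lx\n",
           from->pid, from->program_counter, to->pid, to->program_counter);
}

int main(void)
{
    int current = 0;
    for (int tick = 0; tick < 6; tick++) {            /* the timer interrupt fires...   */
        int next = (current + 1) % NPROC;             /* ...scheduler picks the next    */
        ready_queue[current].program_counter += 4;    /* pretend the task made progress */
        context_switch(&ready_queue[current], &ready_queue[next]);
        current = next;
    }
    return 0;
}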

The modern CPU has built-in mechanisms to both simplify and speed up this context switching - which is happening many, many times each second; indeed, if something 'locks up', and a thread gets 'stuck', or a device 'floods' the system with interrupts, you notice it immediately! In fact, the most common cause of audio or video 'stuttering', 'drop-outs', and pops is an errant interrupt service routine - usually a 'naughty'[:lorbat:8] driver!

Another place where you'd think there'd be a tendency toward 'grabbing' pieces of memory from all over the place is actual multi-threading - getting a couple of your cores working on 'different' tasks; so far, this hasn't proved to be so... There are just not a lot of places where this is actually done! The problem is - most problems look like this: data "A" -> computational process "X" -> interim result "B" -> computational process "Y" -> interim result "C" -> computational process "Z" -> answer "D"!! So, you can't get a 'core each' working, in parallel, on 'computational processes' Y and Z, because those 'interim results' are just not available to them yet! Mostly, it turns out that the 'overhead' involved in process synchronization and inter-thread communication eats up the gains from the whole effort in the first place...

Some tasks do lend themselves to this, and you can easily see why: you'll find a lot of video transcoders that have a setting for 'how many threads to launch'. It's pretty easy for the main 'manager' thread to say "OK, we've got a 150,000 frame video here - you, take frames 1 through 50,000; you, in the middle, you've got 50,001 to 100,000; and you, over there on the end, you've got the rest; everybody signal me when you're done, and I'll 'stitch 'em back together'!" An OS is a 'busy place'; I think you will see more 'parallelization' in the future for operating system functions - the tools and techniques (especially the debuggers, 'tracers', and 'sub-optimizers' - Intel, themselves, have a staggeringly good Parallel Studio product...) are evolving rapidly[:lectrocrew:2]!
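
That 'conversation' is about all there is to it - here's the frame-splitting sketched with plain POSIX threads (transcode_frame() is a made-up stand-in for the real per-frame work):

/* split_work.c - the transcoder conversation above, in miniature.
   Build with: cc split_work.c -pthread */
#include <stdio.h>
#include <pthread.h>

#define TOTAL_FRAMES 150000
#define NUM_WORKERS  3

typedef struct { int id, first, last; } job_t;

static void transcode_frame(int frame) { (void)frame; /* pretend to do real work */ }

static void *worker(void *arg)
{
    job_t *job = arg;
    for (int f = job->first; f <= job->last; f++)
        transcode_frame(f);
    printf("worker %d: frames %d..%d done\n", job->id, job->first, job->last);
    return NULL;                                   /* "signal me when you're done"    */
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];
    job_t     jobs[NUM_WORKERS];
    int chunk = TOTAL_FRAMES / NUM_WORKERS;

    for (int i = 0; i < NUM_WORKERS; i++) {        /* carve the frame range into      */
        jobs[i].id    = i;                         /* independent chunks - no interim */
        jobs[i].first = i * chunk + 1;             /* results flow between workers,   */
        jobs[i].last  = (i == NUM_WORKERS - 1)     /* which is why this parallelizes  */
                        ? TOTAL_FRAMES : (i + 1) * chunk;
        pthread_create(&threads[i], NULL, worker, &jobs[i]);
    }

    for (int i = 0; i < NUM_WORKERS; i++)          /* 'stitch 'em back together':     */
        pthread_join(threads[i], NULL);            /* wait for everyone to finish     */

    printf("all %d frames transcoded\n", TOTAL_FRAMES);
    return 0;
}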

By now, I'm sure you are wondering: "so, what does this all have to do with memory?" Short answer: EVERYTHING! You have seen the hierarchy, the 'ladder of access', and that the CPU only wants to 'deal with' data at the very top of that ladder; you have seen that the task management and interrupt service functions of the OS 'force' the CPU to 'dump and re-load' tasks' environments rapidly and repeatedly; you have seen (in Part I) that memory is arranged in multi-dimensional arrays of rows, columns, and ranks - and that switching access between different 'dimensions' of that array carries with it the 'cost' of physical, inescapable delays - the latencies. This is why your system 'doesn't care much' how fast (frequency-wise) your memory is, but cares intensely how long your latencies are!!

There is a place where high speed, versus low latency, will be an advantage - any operations that require large, sustained, reads from and writes to RAM - like video transcoding... I always consider my 'pass/fail' system stress test to be: watch/pause one HDTV stream off a networked ATSC tuner, while recording a second stream off a PCIe NTSC tuner, while transcoding and 'de-commercialing' a third stream to an NAS media server... But, for the vast majority of people, for the vast majority of use, this is not the case. What is going on behind the scenes: the task scheduler is scurrying around, busier than a centipede learning to tap-dance, counting 'ticks':[:digitalprospecter] ...tick... yo - over there, you gotta finish up, your tick is over, push your environment, that's a good fella; oops - cache snoop says we've got an incoherency - grab me a page for him from over there; you - get me the address of the block being used by {F92BFB9B-59E9-4B65-8AA3-D004C26BA193}, will 'ya; yeah - UAC says he has permission - I dunno - we'll just have to trust him; dammit - everybody listen up, we've got a pending interrupt request, everyone drop what you're doing, and you - over there - query interrupt handler for a vector - this is important!!! ...tick.... And the most fascinating (scary) thing about it all, is that, at some synaptic, neural level, we're doin' the same thing! (...though, the older I get, the less dependable my interrupt return mechanism is - I repeatedly find myself at the bottom of the basement steps, wondering "now what did I come down here for?!")


DMA - direct memory access

There's another place where high speed memory is advantageous - and it's kind of a 'weird one'; I didn't really know where to go with it... Using interrupt-driven device drivers to transfer data to or from hardware devices works well when the amount of data is reasonably low. For high speed devices, such as hard disk controllers or ethernet adapters, the data transfer rate is a lot higher. Direct memory access, or DMA, was invented to solve this problem. A DMA controller allows devices to transfer data to or from the system's memory without the intervention of the processor. To initiate a data transfer, the device driver sets up the DMA channel's address and count registers, together with the direction of the data transfer, read or write. It then tells the device that it may start the DMA when it wishes. When the transfer is complete, the device interrupts the PC. Whilst the transfer is taking place, the CPU is free to do other things. The DMA process occurs transparently from the processor's point of view...
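
From the driver's point of view, 'setting up the DMA channel' looks roughly like the sketch below - the register block and its bits are entirely invented (every real controller has its own layout), but the sequence is the point: program source, destination, and count, kick it off, then go do something else until the 'done' interrupt arrives:

/* dma_sketch.c - programming an imaginary DMA controller. The register block below
   is a stand-in struct; on real hardware it would be memory-mapped device registers,
   and the fields/bits here are entirely invented for illustration. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    volatile uint64_t src_addr;     /* where to read from                */
    volatile uint64_t dst_addr;     /* where to write to                 */
    volatile uint32_t count;        /* how many bytes to move            */
    volatile uint32_t control;      /* bit 0: start                      */
    volatile uint32_t status;       /* bit 0: done                       */
} dma_regs_t;

static dma_regs_t fake_device;      /* stand-in for the real register block */

static void dma_start_read(dma_regs_t *dma, uint64_t device_buf,
                           void *dest, uint32_t nbytes)
{
    dma->src_addr = device_buf;                 /* program the channel...            */
    dma->dst_addr = (uint64_t)(uintptr_t)dest;
    dma->count    = nbytes;
    dma->control  = 0x1;                        /* ...then kick it off; the CPU is   */
}                                               /* now free to go do something else  */

int main(void)
{
    static char buffer[4096];

    dma_start_read(&fake_device, 0x10000, buffer, sizeof buffer);

    /* in a real driver nothing happens here: the device raises an interrupt when
       the transfer completes, and the interrupt handler sets a flag / wakes the
       waiting task.  We just pretend the hardware finished instantly:           */
    fake_device.status = 0x1;

    if (fake_device.status & 0x1)
        printf("transfer of %u bytes complete\n", (unsigned)fake_device.count);
    return 0;
}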

DMA has several advantages over polling and interrupts. DMA is fast because a dedicated piece of hardware transfers data from one computer location to another and only one or two bus read/write cycles are required per piece of data transferred. In addition, DMA is usually required to achieve maximum data transfer speed, and thus is useful for high speed data acquisition devices. DMA also minimizes latency in servicing a data acquisition device because the dedicated hardware responds more quickly than interrupts, and transfer time is short. Minimizing latency reduces the amount of temporary storage (memory) required on an I/O device.

DMA also off-loads the processor, which means the processor does not have to execute any instructions to transfer data. Therefore, the processor is not used for handling the data transfer activity and is available for other processing activity. And, in systems where the processor primarily operates out of its cache, data transfer is actually occurring in parallel, thus increasing overall system utilization. DMA 'happens' at the speed the memory is capable of, and the mechanism itself is optimized to 'strobe' successive, contiguous memory locations, making the transfers at the maximum 'burst speed' available from the hardware...


Why synthetic benchmarks are called 'synthetic'!

Reviewers, too often, IMHO, rely on 'synthetic' benchmark results in testing some products, and neglect the 'real-world' implications for actual performance... Memory, unfortunately, is often 'characterized' this way! A 'synthetic' benchmark does not actually perform a function outside of producing 'raw data' for comparison purposes. Whether or not these 'raw data' have any 'real-world' performance applicability is left up to the reader, who, all too often, hasn't the 'grounding' to understand them. Too many reviewers, and pretty much all of the computer press, have become 'cheerleaders' for advertisers - regardless of the consequences! So - when you read a review that says, in effect: "Blah-blah hardware can do this at 'xxx' rate", the very first question you must ask yourself is: "Does my computer ever do this?", and, "If, indeed, my computer never does this, why do I care about 'xxx'?", as well as, "Just what will 'xxx' cost me?"

Often, to produce 'synthetic benchmark' results, computers have been 'tweaked': a fresh OS install has been done, all background programs and services have been 'stripped out', and as many hardware subsystems as possible (LAN chips, disk controllers, etc.) have been disabled... This will work to give you nice 'big numbers', but what is it good for? In many respects, I see this situation as similar to the crew 'running up' processors under liquid nitrogen pots. Don't get me wrong - nothing bad about this - everyone needs a hobby, and the more expensive the better[:bilbat:7] - but, when we get back to the 'real-world', what is it actually good for? Pretty much, for comparing relatively useless high numbers with somebody else's relatively useless high numbers [:fixitbil:2] - and that's about it!! Try to avoid getting 'caught up' in this syndrome; if you actually intend to use the system, try to concentrate on actually usable improvements :sol:


Summary:

■mass storage/memory has a hierarchy, or 'ladder of access'...
■the 'mechanics' of each 'rung' of the ladder primarily concern movement to/from the 'rung' immediately above or below...
■rapidity of task switching and interrupt service require 'context swapping' by the CPU...
■'context swapping' requires access from non-contiguous areas of memory, incurring latency costs...
■few operations allow access to large, contiguous areas of memory, minimizing frequency's effect on transfer speed...
■many 'benchmarks' do not reflect 'real-world' operation of the involved sub-systems...

> Next: Part III - "Evaluation and selection"