Memory - Part II - "What memory does"
Other content:
Part I - "What memory is"
Part III - "Evaluation and selection"
Part IV - "Tweaking and tuning"
Types of memory in your computer, and how they are used:
Your computer (any computer...) really does, at the most basic level, two things (besides the obvious: annoy the bejeezus out of us, turn electricity into copious amounts of heat [GTX 480, anyone?], and thoroughly prove the existence of "Murphy's Laws"!): it stores information, encoded in various ways, into 'bit patterns'; and it manipulates those stored patterns - and that's it!! If we think of all the 'encoded stuff' as simply 'storage', instead of 'memory', our viewpoint changes: we've always had 'encoded storage'! First, we invented language, and more importantly, alphabets, to 'encode' our words. Then, we figured out that if we 'scratched' the characters into a tray of mud, and let the mud dry, we had a 'document'! (I think there were originally fifteen commandments - Moses must have dropped a tablet on the way back down the mountain - I'm pretty sure, for instance, that eleven was: "get off that damned cell phone and drive!") The Egyptians invented papyrus and inks, Hollerith invented punched-card coding and the tabulating machines to read it, and the race was on!
The 'manipulation' part must be done by your CPU, with information it has, one way or another, 'imported' into itself. Thus, we have a 'ladder of access', or hierarchy of storage:
The whole point of the memory hierarchy is to allow reasonably fast access to a large amount of memory/storage. If only a little memory were necessary, we'd use fast static RAM (i.e., the stuff they make cache memory out of) for everything. If speed weren't necessary, we'd just use lower-cost dynamic RAM for everything. The idea of the memory hierarchy is that we can take advantage of the principle of 'locality of reference' to move often-referenced data into fast memory and leave less-used data in slower memory. Unfortunately, which data is often-used and which is lesser-used varies over the execution of any given program, usually moment by moment. Therefore, we cannot simply place our data at various levels in the memory hierarchy and leave it alone throughout the execution of the program. Instead, the memory subsystems need to be able to move data between themselves dynamically to adjust for changes in 'locality of reference' during the program's execution. There is a lot going on in your processor that you are totally unaware of - here's a 'map' of a single execution unit ('core'):
[Image: 'map' of a single execution unit ('core')]
Individual processor instructions operate on information in 'registers'; register memory is built into each core, and can be accessed on each 'execution' (CPU clock 'tick'), but is very limited in size, and expen$ive, both in cost and in terms of 'die real-estate'! To get around this cost, and still make a somewhat larger amount of fast storage available, CPUs have 'cache memory' (static RAM, from Part I) in several speeds and sizes. Modern processor designs generally (there's that word again!) have three 'levels' of cache, named, accordingly, L1 ("level one"), L2, and L3... These have their own 'hierarchy': L1 is the smallest and fastest, and both L1 and L2 are built into each core. L3 is larger, slower, and is 'shared' by all the cores on the die... Cache speeds and sizes vary on a 'per architecture' basis; here is a table of relative speeds/latencies for a few server CPUs, given in 'core ticks':
Programs are largely 'unaware' of the memory/cache hierarchy. In fact, the program only explicitly controls access to main memory and those components of the memory hierarchy at the file storage level and below (since manipulating files is a program-specific operation). In particular, cache access and virtual memory operation are generally transparent to the program. That is, access to these levels of the memory hierarchy usually takes place without any intervention on the program's part. The program just accesses main memory, and the hardware (and operating system) takes care of the rest. Most cache memory is not organized as a group of individual bytes; instead, cache organization is usually in blocks called 'cache lines', with each line containing some number of bytes (typically a small power of two, like 16, 32, or 64):
[Image: cache lines]
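To make the 'cache line' idea concrete, here is a minimal sketch of a toy direct-mapped cache. Everything in it is an assumption made for illustration (the 64-byte line, the eight-line cache, the `count_hits` name); real caches are far larger and usually set-associative. It just counts how many accesses are served from a line that has already been fetched:

```python
LINE_SIZE = 64   # bytes per cache line (assumed; the text mentions 16, 32, or 64)
NUM_LINES = 8    # a deliberately tiny cache, so the effect is easy to see


def count_hits(addresses, line_size=LINE_SIZE, num_lines=NUM_LINES):
    """Count hits in a toy direct-mapped cache for a sequence of byte addresses."""
    slots = [None] * num_lines             # each slot remembers which line it holds
    hits = 0
    for addr in addresses:
        line_number = addr // line_size    # which line-sized block of memory
        slot = line_number % num_lines     # the one slot that block may occupy
        if slots[slot] == line_number:
            hits += 1                      # the line is already in the cache
        else:
            slots[slot] = line_number      # miss: fetch the whole line
    return hits


sequential = list(range(0, 1024))          # 1024 neighbouring bytes (16 lines)
strided = list(range(0, 1024 * 64, 64))    # 1024 bytes, each in a different line

print(count_hits(sequential))   # 1008
print(count_hits(strided))      # 0
```

Both patterns touch 1024 bytes, but the neighbouring bytes 'ride along' on lines that were already fetched, while the strided pattern pays for a new line on every single access. That is 'locality of reference' in miniature.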
If the program really accessed main memory on each access, programs would run quite slowly, since modern DRAM main memory subsystems are much slower than the CPU. The job of the cache memory subsystems (and the cache controller) is to move data between main memory and the cache so that the CPU can quickly access data in the cache. Likewise, if data is not available in main memory, but is available in slower virtual memory, the virtual memory subsystem is responsible for moving the data from hard disk to main memory (and then the caching subsystem may move the data from main memory to cache for even faster access by the CPU).
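A rough way to put numbers on "quite slowly" is the usual 'average access time' arithmetic: hit time, plus miss rate times miss penalty. The tick counts and the 5% miss rate below are made-up, round figures chosen only to show the shape of the calculation; they are not measurements of any real CPU:

```python
def average_access_ticks(hit_ticks, miss_rate, miss_penalty_ticks):
    """Average cost of one access: every access pays the hit time,
    and the fraction that miss also pay the trip to the slower level."""
    return hit_ticks + miss_rate * miss_penalty_ticks


# Hypothetical figures: a 4-tick cache in front of 200-tick DRAM, 5% misses.
no_cache = 200
with_cache = average_access_ticks(hit_ticks=4, miss_rate=0.05, miss_penalty_ticks=200)

print(f"{with_cache:.1f} ticks on average, versus {no_cache} with no cache")
```

Even with these invented numbers the point stands: as long as most accesses hit, the average cost sits much closer to the cache's speed than to the DRAM's.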
With few exceptions, transparent memory subsystem accesses take place between one level of the memory hierarchy and the level immediately below or above it. For example, the CPU rarely accesses main memory directly. Instead, when the CPU requests data from memory, the L1 cache subsystem takes over. If the requested data is in the cache, then the L1 cache subsystem returns the data and that's the end of the memory access. On the other hand, if the data is not present in the L1 cache, then it passes the request on down to the L2 cache subsystem. If the L2 cache subsystem has the data, it returns this data to the L1 cache, which then returns the data to the CPU. Note that requests for this same data in the near future will come from the L1 cache rather than the L2 cache, since the L1 cache now has a copy of the data.
If neither the L1 nor the L2 cache subsystem has a copy of the data, then the L3 cache is queried; if none of them gets a 'hit', then the memory subsystem goes to main memory to get the data. If found in main memory, then the memory subsystems copy this data to the L3 cache, which passes it to the L2 cache, which passes it to the L1 cache, which gives it to the CPU. Once again, the data is now in the L1 cache, so any references to this data in the near future will come from it... This fairly complex-looking process is automated in hardware, and helped along by various mechanisms: 'branch prediction' ("what instructions/data am I likely to need next?"); TLBs ('translation lookaside buffers' - "where can I find this information quickly?"); and 'flushing'.
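The 'pass it down, copy it back up' walk described above can be sketched in a few lines. This is a toy model under loud assumptions: the `Level` and `MainMemory` classes are hypothetical, each level has unlimited room (so nothing is ever evicted), and it tracks single addresses rather than whole cache lines:

```python
class MainMemory:
    """The bottom of the ladder in this sketch: it always has the data."""

    def __init__(self, contents):
        self.contents = contents
        self.reads = 0

    def read(self, address):
        self.reads += 1
        return self.contents[address]


class Level:
    """One cache level that asks the level below on a miss and keeps a copy."""

    def __init__(self, name, below):
        self.name = name
        self.below = below
        self.data = {}       # address -> value (unlimited capacity: an assumption)
        self.hits = 0

    def read(self, address):
        if address in self.data:
            self.hits += 1
            return self.data[address]
        value = self.below.read(address)   # miss: pass the request down
        self.data[address] = value         # keep a copy on the way back up
        return value


memory = MainMemory({0x10: "A", 0x20: "B"})
l3 = Level("L3", memory)
l2 = Level("L2", l3)
l1 = Level("L1", l2)

first = l1.read(0x10)    # misses L1, L2, and L3, so main memory is read once
second = l1.read(0x10)   # served straight from L1 this time

print(first, second, memory.reads, l1.hits)   # A A 1 1
```

The second read never gets past L1, which is exactly the "any references to this data in the near future will come from it" behaviour.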
Flushing:
The problem we've overlooked in this discussion on caches is "what happens when the CPU writes data to memory?" The simple answer is trivial: the CPU writes the data to the cache. However, what happens when the cache line containing this data is replaced by incoming data? If the contents of the cache line were not written back to main memory, then the data that was written would be lost. The next time the CPU read that data, it would fetch the original data values from main memory, and the value written would be gone!
Clearly, any data written to the cache must, ultimately, be written to main memory as well. There are two common write policies that caches use: write-back and write-through. Interestingly enough, it is sometimes possible to set the write policy under software control; these aren't hardwired into the cache controller like most of the rest of the cache design. However, don't get your hopes up. Generally the CPU only allows the BIOS or operating system to set the cache write policy; your applications don't get to mess with this. However, if you're the one writing the operating system...
The write-through policy states that any time data is written to the cache, the cache immediately turns around and writes a copy of that cache line to main memory. Note that the CPU does not have to halt while the cache controller writes the data to memory. So unless the CPU needs to access main memory shortly after the write occurs, this writing takes place in parallel with the execution of the program. Still, writing a cache line to memory takes some time and it is likely that the CPU (or some CPU in a multiprocessor system) will want to access main memory during this time, so the write-through policy may not be a high performance solution to the problem. Worse, suppose the CPU reads and writes the value in a memory location several times in succession. With a write-through policy in place the CPU will saturate the bus with cache line writes and this will have a very negative impact on the program's performance. On the positive side, the write-through policy does update main memory with the new value as rapidly as possible. So if two different CPUs are communicating through the use of shared memory, the write-through policy is probably better because the second CPU will see the change to memory as rapidly as possible when using this policy.
The second common cache write policy is the write-back policy. In this mode, writes to the cache are not immediately written to main memory; instead, the cache controller updates memory at a later time. This scheme tends to be higher performance because several writes to the same variable (or cache line) only update the cache line; they do not generate multiple writes to main memory.
Of course, at some point, the cache controller must write the data in cache to memory. To determine which cache lines must be written back to main memory, the cache controller usually maintains a 'dirty' bit with each cache line. The cache system sets this bit whenever it writes data to the cache. At some later time the cache controller checks this 'dirty' bit to determine if it must write the cache line to memory. Of course, whenever the cache controller replaces a cache line with other data from memory, it must first write that cache line to memory if the 'dirty' bit is set. Note that this increases the latency time when replacing a cache line. If the cache controller were able to write dirty cache lines to main memory while no other bus access was occurring, the system could reduce this latency during cache line replacement.
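The difference between the two policies is easy to see by counting trips to main memory. The sketch below is a toy, with assumptions throughout: `ToyCache` is a hypothetical class, a 'line' holds a single value, and eviction is triggered by hand rather than by a replacement policy:

```python
class ToyCache:
    """A cache that only models the write policy and the 'dirty' bit."""

    def __init__(self, policy):
        self.policy = policy       # "write-through" or "write-back"
        self.lines = {}            # line number -> cached value
        self.dirty = set()         # lines changed since they were last written out
        self.memory = {}           # stands in for main memory
        self.memory_writes = 0

    def _write_memory(self, line, value):
        self.memory[line] = value
        self.memory_writes += 1

    def write(self, line, value):
        self.lines[line] = value
        if self.policy == "write-through":
            self._write_memory(line, value)   # copy to memory on every write
        else:
            self.dirty.add(line)              # just mark it; memory waits

    def evict(self, line):
        if line in self.dirty:                # dirty bit set: write back first
            self._write_memory(line, self.lines[line])
            self.dirty.discard(line)
        self.lines.pop(line, None)


through = ToyCache("write-through")
back = ToyCache("write-back")
for cache in (through, back):
    for value in range(10):       # ten writes in a row to the same line
        cache.write(0, value)
    cache.evict(0)                # the line is finally replaced

print(through.memory_writes, back.memory_writes)   # 10 1
```

Both caches leave the same final value in memory; the write-back one simply got there with a single bus write, paid at eviction time.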
How CPU cache, ISRs, and the 'thread manager' combine to make memory speed (mostly) irrelevant:
Windoze looks like it is multi-tasking; in fact, its wide acceptance brought the term 'multi-tasking' into common usage (common abuse?); however, the 'magic behind the scenes' is much akin to the illusion that movies or TV use to convince us that thirty 'stills' shown per second are somehow 'moving'... If you're not very familiar with Windoze 'innards', and want to be surprised, open: Control Panel > Performance Information and Tools > Resource Monitor > 'CPU' tab > 'Processes' window, and look especially at the 'Thread' counts, and the 'Services' window; you'll find your processor has the ultimate 'schizophrenia problem': "The One-Thousand-Two-Hundred-and-Fifty-Six Faces of EVE"!!
Windoze is a serial, time-slicing, pre-emptive multitasking operating system; its scheduler allows every task to run for a certain amount of time, called its time slice. If a process does not voluntarily yield the CPU (for example, by performing an I/O operation), a timer interrupt fires, and the operating system schedules another process for execution instead (pre-empting the current one). This ensures that the CPU cannot be monopolized by any one processor-intensive application.
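Time-slicing is simple enough to sketch. What follows is a plain 'round-robin' toy, not the Windows scheduler (which also weighs priorities); the task names, the tick counts, and the `run_round_robin` function are all invented for illustration:

```python
from collections import deque


def run_round_robin(tasks, time_slice):
    """tasks maps a task name to the ticks of work it still needs.
    Returns the order in which tasks ran, as (name, ticks_run) pairs."""
    queue = deque(tasks.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_slice, remaining)    # run until done or the slice expires
        timeline.append((name, ran))
        if remaining > ran:
            # The 'timer interrupt' fired: pre-empt, and rejoin the back of the queue.
            queue.append((name, remaining - ran))
    return timeline


timeline = run_round_robin({"editor": 3, "transcoder": 7, "browser": 2}, time_slice=3)
for name, ran in timeline:
    print(f"{name} ran for {ran} tick(s)")
```

The greedy 'transcoder' never gets more than three ticks at a stretch, so the 'browser' gets its turn long before the transcoder has finished.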
Modern architectures are interrupt driven. This means that if the CPU requests data from a disk, for example, it does not need to 'busy-wait' until the read is over; it can issue the request and continue with some other execution, and when the read is over, the CPU can be interrupted and presented with the data. Devices are operated, inside an operating system/BIOS, by something called an 'interrupt mechanism'. Interrupts, to the system, are kind of like a little kid tugging on ma's skirt - "Ma, can I have a candy bar?"; "Ma, can I have a coloring book?"; "Ma, look at the funny man!"
Every time you hit a key on your keyboard, the hook-up generates an interrupt, telling the system "Hey, this is important - I've got a keystroke to process!" The BIOS/operating system has a set-up to then do what the interrupt requires - handle a keystroke, read a track from disk, whatever... Obviously, the interrupt mechanism has to 'stack up' the interrupts; it can't simply do 'this' while neglecting 'that', say, 'ignoring' your keystrokes - they all have to be taken care of, mostly by the DPC (deferred procedure call) queue, which you can read about at TheSycon.
Changing tasks, or servicing an interrupt, requires the CPU to do something called "context switching", 'context' meaning, more or less, the 'environment' of that process - the number and list of its open file handles, its current progress in execution, the data about its 'objects', etc. In a switch, the state of the first process must be saved somehow, so that, when the scheduler gets back to the execution of the first process, it can restore this state and continue. The state of the process includes all the registers that the process may be using, especially the program counter, plus any other operating system specific data that may be necessary. This data is usually stored in a data structure called a process control block (PCB), or switchframe.
Now, in order to switch processes, the PCB for the first process must be created and saved. Since the operating system has effectively suspended the execution of the first process, it can now load the PCB and context of the second process. In doing so, the program counter from the PCB is loaded, and thus execution can continue in the new process. New processes are chosen from a queue or queues. Process and thread priority can influence which process continues execution, with processes of the highest priority checked first for ready threads to execute.
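Here is the save-then-restore dance as a minimal sketch. The `PCB` and `ToyCPU` classes are hypothetical, and a real PCB carries far more than a program counter and a couple of registers (file handles, memory maps, priorities, and so on):

```python
from dataclasses import dataclass, field


@dataclass
class PCB:
    """A toy process control block: just enough state to resume a process."""
    name: str
    program_counter: int = 0
    registers: dict = field(default_factory=dict)


class ToyCPU:
    def __init__(self):
        self.program_counter = 0
        self.registers = {}
        self.current = None

    def switch_to(self, pcb):
        if self.current is not None:
            # Save the outgoing process's state into its own PCB.
            self.current.program_counter = self.program_counter
            self.current.registers = dict(self.registers)
        # Load the incoming process's state, program counter included.
        self.program_counter = pcb.program_counter
        self.registers = dict(pcb.registers)
        self.current = pcb


cpu = ToyCPU()
a = PCB("A")
b = PCB("B", program_counter=100)

cpu.switch_to(a)
cpu.program_counter = 42        # pretend process A ran for a while
cpu.registers["ax"] = 7

cpu.switch_to(b)                # A's state is tucked away in its PCB
pc_while_b_runs = cpu.program_counter

cpu.switch_to(a)                # A picks up exactly where it left off
print(pc_while_b_runs, cpu.program_counter, cpu.registers)   # 100 42 {'ax': 7}
```

Every one of those saves and loads is memory traffic, which is why the cost of a switch is tied so closely to how quickly the CPU can reach the saved state.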
The modern CPU has built-in mechanisms to both simplify and speed up this context switching - which is happening many, many times each second; indeed, if something 'locks up', and a thread gets 'stuck', or a device 'floods' the system with interrupts, you notice it immediately! In fact, the most common cause of audio or video 'stuttering', 'drop-outs', and pops is an errant interrupt service routine - usually a 'naughty' driver!
Another place where you'd think there'd be a tendency toward 'grabbing' pieces of memory from all over the place is actual multi-threading - getting a couple of your cores working on 'different' tasks; so far, this hasn't proved to be so... There are not a lot of places, so far, where this is actually done! The problem is - most problems look like this: data "A" -> computational process "X" -> interim result "B" -> computational process "Y" -> interim result "C" -> computational process "Z" -> answer "D"!! So, you can't get a 'core each' working, in parallel, on 'computational processes' Y and Z, because those 'interim results' are just not available to them! It mostly turns out that the 'overhead' involved in process synchronization and inter-thread communication obviates the whole effort in the first place...
Some tasks do lend themselves to this, and you can easily see why: you'll find a lot of video transcoders that have a setting for 'how many threads to launch'. It's pretty easy for the main 'manager' thread to say "OK, we've got a 150,000 frame video here - you, take frames 1 through 50,000; you, in the middle, you've got 50,001 to 100,000; and you, over there on the end, you've got the rest; everybody signal me when you're done, and I'll 'stitch 'em back together'!!" An OS is a 'busy place'; I think you will see more 'parallelization' in the future for operating system functions - the tools and techniques (especially the debuggers, 'tracers', and 'sub-optimizers' - Intel, themselves, have a staggeringly good Parallel Studio product...) are evolving rapidly!
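The 'manager thread' hand-out is about as simple as parallel code gets. In this sketch `split_frames` and `transcode` are hypothetical names, and `transcode` is only a placeholder that counts frames rather than doing any real work. (In CPython, threads won't actually speed up CPU-bound work; the point here is only the way the job is carved up and stitched back together.)

```python
from concurrent.futures import ThreadPoolExecutor


def split_frames(total_frames, workers):
    """Carve frames 1..total_frames into contiguous, near-equal (first, last) ranges."""
    base, extra = divmod(total_frames, workers)
    ranges = []
    start = 1
    for i in range(workers):
        count = base + (1 if i < extra else 0)   # spread any leftover frames around
        ranges.append((start, start + count - 1))
        start += count
    return ranges


def transcode(frame_range):
    """Placeholder for the real work: just report how many frames were 'done'."""
    first, last = frame_range
    return last - first + 1


ranges = split_frames(150_000, 3)
with ThreadPoolExecutor(max_workers=3) as pool:
    done = list(pool.map(transcode, ranges))     # results come back in order

print(ranges)      # [(1, 50000), (50001, 100000), (100001, 150000)]
print(sum(done))   # 150000
```

No worker needs anything from any other worker until the very end, which is exactly what the A -> X -> B -> Y -> C chain above lacks.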
By now, I'm sure you are wondering: "so, what does this all have to do with memory?" Short answer: EVERYTHING! You have seen the hierarchy, the 'ladder of access', and that the CPU only wants to 'deal with' data at the very top of that ladder; you have seen that the task management and interrupt service functions of the OS 'force' the CPU to 'dump and re-load' tasks' environments rapidly and repeatedly; you have seen (in Part I) that memory is arranged in multi-dimensional arrays of rows, columns, ranks - and that switching access between different 'dimensions' of that array carries with it the 'cost' of physical, inescapable delays - the latencies.
This is why your system 'doesn't care much' how fast (frequency-wise) your memory is, but cares intensely how long your latencies are!!
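A quick bit of arithmetic shows why the 'bigger number' on the box can mislead. The sketch assumes the usual conversion for DDR memory (the clock runs at half the advertised transfer rate, and latency in nanoseconds is cycles divided by clock), and it looks only at CAS latency; the two kits compared are hypothetical examples, not recommendations:

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """Convert a CAS latency in clock cycles to nanoseconds for DDR memory.
    DDR transfers twice per clock, so the clock is half the rated MT/s."""
    clock_mhz = data_rate_mts / 2
    return cas_cycles / clock_mhz * 1000


# Two hypothetical kits: a 'faster' one with loose timings, a 'slower' tight one.
fast_loose = cas_latency_ns(data_rate_mts=1600, cas_cycles=9)
slow_tight = cas_latency_ns(data_rate_mts=1333, cas_cycles=7)

print(f"1600 CL9: {fast_loose:.2f} ns")   # 11.25 ns
print(f"1333 CL7: {slow_tight:.2f} ns")   # 10.50 ns
```

With these figures, the 'slower' kit actually answers a fresh request sooner than the 'faster' one.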
There is a place where high speed, versus low latency, will be an advantage - any operations that require large, sustained reads from and writes to RAM - like video transcoding... I always consider my 'pass/fail' system stress test to be: watch/pause one HDTV stream off a networked ATSC tuner, while recording a second stream off a PCIe NTSC tuner, while transcoding and 'de-commercialing' a third stream to a NAS media server... But, for the vast majority of people, for the vast majority of use, this is not the case. What is going on behind the scenes: the task scheduler is scurrying around, busier than a centipede learning to tap-dance, counting 'ticks': ...tick... yo - over there, you gotta finish up, your tick is over, push your environment, that's a good fella; oops - cache snoop says we've got an incoherency - grab me a page for him from over there; you - get me the address of the block being used by {F92BFB9B-59E9-4B65-8AA3-D004C26BA193}, will ya; yeah - UAC says he has permission - I dunno - we'll just have to trust him; dammit - everybody listen up, we've got a pending interrupt request, everyone drop what you're doing, and you - over there - query interrupt handler for a vector - this is important!!! ...tick.... And the most fascinating (scary) thing about it all is that, at some synaptic, neural level, we're doin' the same thing! (...though, the older I get, the less dependable my interrupt return mechanism is - I repeatedly find myself at the bottom of the basement steps, wondering "now what did I come down here for?!")
DMA - direct memory access
There's another place where high speed memory is advantageous - and it's kind of a 'weird one'; I didn't really know where to go with it... Using interrupt-driven device drivers to transfer data to or from hardware devices works well when the amount of data is reasonably low. For high speed devices, such as hard disk controllers or ethernet devices, the data transfer rate is a lot higher, and servicing an interrupt for every small piece of data would swamp the CPU.
Direct memory access, or DMA, was invented to solve this problem. A DMA controller allows devices to transfer data to or from the system's memory without the intervention of the processor. To initiate a data transfer, the device driver sets up the DMA channel's address and count registers, together with the direction of the data transfer, read or write. It then tells the device that it may start the DMA when it wishes. When the transfer is complete, the device interrupts the CPU. Whilst the transfer is taking place, the CPU is free to do other things. The DMA process occurs transparently from the processor's point of view...
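The set-up steps above (address, count, direction, go, interrupt when done) can be mimicked in a toy model. None of this is a real driver API: `ToyDMAChannel`, its methods, and the callback standing in for the completion interrupt are all invented for illustration, and the 'transfer' is just a slice copy:

```python
class ToyDMAChannel:
    """Pretend DMA channel: the 'driver' programs it, then it copies on its own."""

    def __init__(self, memory):
        self.memory = memory        # a bytearray standing in for system RAM
        self.address = 0
        self.count = 0
        self.to_memory = True

    def program(self, address, count, to_memory):
        # What the device driver does: set address, count, and direction.
        self.address = address
        self.count = count
        self.to_memory = to_memory

    def run(self, device_buffer, on_complete):
        if len(device_buffer) < self.count:
            raise ValueError("device buffer is smaller than the programmed count")
        end = self.address + self.count
        if self.to_memory:
            self.memory[self.address:end] = device_buffer[:self.count]
        else:
            device_buffer[:self.count] = self.memory[self.address:end]
        on_complete()               # stands in for the 'transfer done' interrupt


memory = bytearray(16)
interrupts = []

channel = ToyDMAChannel(memory)
channel.program(address=4, count=4, to_memory=True)
channel.run(bytearray(b"DATA"), lambda: interrupts.append("transfer complete"))

print(bytes(memory[4:8]), interrupts)   # b'DATA' ['transfer complete']
```

Notice that the 'CPU' (the code calling `program` and `run`) never touches the individual bytes; it sets things up, and hears about it once, when the whole block has landed.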
DMA has several advantages over polling and interrupts. DMA is fast because a dedicated piece of hardware transfers data from one computer location to another, and only one or two bus read/write cycles are required per piece of data transferred. In addition, DMA is usually required to achieve maximum data transfer speed, and thus is useful for high speed data acquisition devices. DMA also minimizes latency in servicing a data acquisition device because the dedicated hardware responds more quickly than interrupts, and transfer time is short. Minimizing latency reduces the amount of temporary storage (memory) required on an I/O device.
DMA also off-loads the processor, which means the processor does not have to execute any instructions to transfer data. Therefore, the processor is not used for handling the data transfer activity and is available for other processing activity. And, in systems where the processor primarily operates out of its cache, data transfer is actually occurring in parallel, thus increasing overall system utilization. DMA 'happens' at the speed the memory is capable of, and the mechanism itself is optimized to 'strobe' successive, contiguous memory locations, making the transfers at the maximum 'burst speed' available from the hardware...
Why synthetic benchmarks are called 'synthetic'!
Reviewers, too often, IMHO, rely on 'synthetic' benchmark results in testing some products, and neglect the 'real-world' implications for actual performance... Memory, unfortunately, is often 'characterized' this way! A 'synthetic' benchmark does not actually perform a function outside of producing 'raw data' for comparison purposes. Whether or not these 'raw data' have any 'real-world' performance applicability is left up to the reader, who, all too often, has not the 'grounding' to understand them. Too many reviewers, and pretty much all of the computer press, have become 'cheerleaders' for advertisers - regardless of the consequences! So - when you read a review that says, in effect: "Blah-blah hardware can do this at 'xxx' rate", the very first question you must ask yourself is: "Does my computer ever do this?", and, "If, indeed, my computer never does this, why do I care about 'xxx'?", as well as, "Just what will 'xxx' cost me?"
Often computers, to produce 'synthetic benchmark' results, have been 'tweaked': a fresh OS install has been done, all background programs and services have been 'stripped out', and as many hardware subsystems as possible (LAN chips, disk controllers, etc.) have been disabled... This will work to give you nice 'big numbers', but what is it good for? In many respects, I see this situation as similar to the crew 'running up' processors under liquid nitrogen pots. Don't get me wrong - nothing bad about this - everyone needs a hobby, and the more expensive the better - but, when we get back to the 'real-world', what is it actually good for? Pretty much, for comparing relatively useless high numbers with somebody else's relatively useless high numbers - and that's about it!! Try to avoid getting 'caught up' in this syndrome; if you actually intend to use the system, try to concentrate on actually usable improvements.
Summary:
mass storage/memory has a hierarchy, or 'ladder of access'...
the 'mechanics' of each 'rung' of the ladder primarily concern movement to/from the 'rung' immediately above or below...
rapidity of task switching and interrupt service requires 'context swapping' by the CPU...
'context swapping' requires access to non-contiguous areas of memory, incurring latency costs...
few operations allow access to large, contiguous areas of memory, minimizing frequency's effect on transfer speed...
many 'benchmarks' do not reflect 'real-world' operation of the involved sub-systems...
> Next: Part III - "Evaluation and selection"