The most important single notion to get across is that everything we say here is transparent to the program. We're going to be talking about fields within a memory address, about which bits mean what, about breaking down the address... the program doesn't see any of this. So far as the program is concerned, it generates an address and delivers that address to the memory subsystem as part of either a read or a write request; the memory subsystem satisfies the request. How it goes about satisfying it is up to the memory subsystem.
Within the memory subsystem, we divide memory up into a bunch of blocks. It's going to turn out later that the block size is a crucial determinant of the cache system performance; a reasonable block size might be 16 bytes or so. What this is going to mean is that when we transfer data between memory and cache, we'll do it 16 bytes at a time. Notice that this has nothing whatever to do with CPU word size or alignment (though a block size that was smaller than the word size would be pretty weird). So now we can regard a 32 bit memory address as consisting of 28 bits of block number, and 4 bits of byte offset within the block.
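Just to make the split concrete, here's a tiny C sketch of the breakdown (the constants assume the 16-byte block size used here; the address is just an example):

    #include <stdint.h>
    #include <stdio.h>

    /* With 16-byte blocks, the low 4 bits of an address are the byte offset
       within the block, and the remaining 28 bits are the block number. */
    #define OFFSET_BITS 4

    int main(void) {
        uint32_t addr   = 0x1234abcdu;                       /* example address */
        uint32_t block  = addr >> OFFSET_BITS;               /* 28-bit block number */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);  /* 4-bit byte offset */
        printf("block number = 0x%07x, offset = 0x%x\n", block, offset);
        return 0;
    }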
The next step is going to be to define the cache. The cache block size is going to be the same as the memory block size, and we're going to transfer data back and forth between the memory and the cache. So when we have some memory we think we're going to use a lot, we move it up to the cache. When we decide we aren't going to use it so much, we'll move it back out to memory. We'll use a simple scheme to make these decisions: we'll just assume temporal locality holds, so whenever we access a memory location we'll move it into cache; we'll leave it there until the contents of another location lands in the same place in the cache and displaces it.
When we move data back and forth between memory and cache, we need a simple scheme to look in the cache for it (it has to be simple, because it has to be fast). The scheme we're going to use is about as simple as it gets: we're going to hash the address. The hashing algorithm is also as simple as possible: we just take the memory block number modulo the number of blocks in the cache. Cache sizes (and hence block counts) are always powers of two, so we can perform the modulo by just taking the low order bits of the memory block number.
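Because the block count is a power of two, the modulo really is just a mask of the low-order bits. A minimal sketch, assuming a hypothetical cache of 4096 blocks:

    #include <assert.h>
    #include <stdint.h>

    #define NUM_CACHE_BLOCKS 4096u    /* must be a power of two */

    /* "Hash" a memory block number to a cache index. */
    static uint32_t cache_index(uint32_t block_number) {
        /* For a power-of-two N, x % N == x & (N - 1). */
        return block_number & (NUM_CACHE_BLOCKS - 1);
    }

    int main(void) {
        uint32_t block_number = 0x1234abcu;
        assert(cache_index(block_number) == block_number % NUM_CACHE_BLOCKS);
        return 0;
    }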
This gives us the mapping we need to do a cache lookup: we take the original address and extract the cache index. We go to the cache in the location specified by the cache index. This gives us the data word. Here's an example: suppose we have the 32 bit address 0x1234abcd, a 16-byte block, and a 64K cache. 64K takes 16 bits to address, so we have a 4 bit byte offset, a 12 bit cache index, and 16 bits we haven't used yet (which we'll see in just a minute is called the tag). In the example, this gives us:

Offset | d    |
Index  | abc  |
Tag    | 1234 |
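Here's a quick C check of that breakdown (4-bit offset, 12-bit index, 16-bit tag, as above):

    #include <stdint.h>
    #include <stdio.h>

    enum { OFFSET_BITS = 4, INDEX_BITS = 12 };   /* 16-byte blocks, 4096 cache blocks */

    int main(void) {
        uint32_t addr   = 0x1234abcdu;
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
        /* prints: tag=0x1234 index=0xabc offset=0xd */
        return 0;
    }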
Now we have a new problem: given how we've defined the cache, many
different addresses map to a single block: in fact, any address of
the form xxxx xxxx xxxx xxxx 1010 1011 1100 xxxx
maps to the same cache block. The low order four bits aren't a
problem, since they distinguish the sixteen bytes of the block, but what
do we do about the most significant sixteen bits? For that matter, when we
first start executing, how do we tell that the data in the cache is
invalid?
We handle these problems by adding some bookkeeping to the cache.
Every data block will have some extra bits associated with it: a bit
called V
will tell if there's any valid data in the cache
block; when we first start executing, all the V bits will be 0. We
also attach the high order 16 bits of the original address; these are
called the tag (as we go on, we'll be adding some more
bookkeeping beyond this, too).
Now, how do we do a lookup? We take the address, extract the cache index, and go to that location in the cache. If the V bit is set and the tag stored there matches the tag bits of the address, we have a hit, and the byte offset selects the data within the block. If the V bit is clear, or the tags don't match, we have a miss.
In general, on a cache miss you need to perform the following steps:
1. Remove the data that's currently in the cache block (if there is any).
2. Fetch the block containing the requested address from memory.
3. Put it in the cache block, setting the V bit and the tag.
4. Satisfy the original request.
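Putting the lookup and the miss handling together, here's a rough C sketch of the bookkeeping for the direct-mapped cache described above. The names (cache_line, fetch_block_from_memory) are made up for illustration, the memory side is just a stub, and only the decision logic is modelled, not the hardware:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    enum { BLOCK_SIZE = 16, NUM_BLOCKS = 4096,    /* a 64K direct-mapped cache */
           OFFSET_BITS = 4, INDEX_BITS = 12 };

    struct cache_line {
        bool     valid;              /* the V bit: anything useful here yet? */
        uint32_t tag;                /* high-order bits of the block's address */
        uint8_t  data[BLOCK_SIZE];   /* our copy of the memory block */
    };

    static struct cache_line cache[NUM_BLOCKS];   /* all V bits start out 0 */

    /* Hypothetical stand-in for the memory side of the transaction. */
    static void fetch_block_from_memory(uint32_t block_addr, uint8_t *dst) {
        (void)block_addr;
        memset(dst, 0, BLOCK_SIZE);   /* pretend we read the block */
    }

    /* Read one byte through the cache, filling the line on a miss. */
    static uint8_t cache_read_byte(uint32_t addr) {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        struct cache_line *line = &cache[index];

        if (!line->valid || line->tag != tag) {
            /* Miss: remove whatever is in this block (more on that below),
               fetch the new block from memory, and fix up the bookkeeping. */
            fetch_block_from_memory(addr & ~((1u << OFFSET_BITS) - 1), line->data);
            line->valid = true;
            line->tag   = tag;
        }
        return line->data[offset];    /* a hit, possibly after the fill */
    }

    int main(void) {
        return cache_read_byte(0x1234abcdu);
    }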
Now we need to talk just a bit about what's meant by "remove the data from the cache." What's required here depends on another decision: what happens when we perform a write? More specifically, do we just update the data in the cache itself, or do we also write the data through to memory?
The easier answer is the second alternative: whenever we do a write, update both cache and memory. This answer is called a write-through cache; its advantage is its simplicity. Its disadvantage, though, is that writing to cache is limited by the speed of memory.
The harder answer is the first alternative: only write to cache, and let memory get "stale." Now, whenever we remove a block from the cache, we need to check whether the block has been modified since we brought it in; to do this, we add another bookkeeping bit, called the "Dirty" bit (manufacturers tend to prefer to call this a "modified" bit). If the block is dirty, we need to write it back to memory at this point. This is called a write-back cache, and it has complementary advantages and disadvantages to the write-through cache.
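A sketch of the two write policies side by side; the memory-interface functions are invented stubs, not a real API, and only the bookkeeping is shown:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct cache_line {
        bool     valid, dirty;   /* "dirty" (a.k.a. "modified"): cache newer than memory */
        uint32_t tag;
        uint8_t  data[16];
    };

    /* Stubs standing in for the real memory interface (hypothetical). */
    static void write_word_to_memory(uint32_t addr, uint32_t value) {
        printf("memory[%08x] <- %08x\n", addr, value);
    }
    static void write_block_to_memory(uint32_t block_addr, const uint8_t *src) {
        (void)src;
        printf("writing back block at %08x\n", block_addr);
    }

    /* Write-through: update both the cache and memory on every write. */
    static void store_write_through(struct cache_line *line, uint32_t addr,
                                    uint32_t offset, uint32_t value) {
        memcpy(&line->data[offset], &value, sizeof value);
        write_word_to_memory(addr, value);       /* pays the memory latency */
    }

    /* Write-back: update only the cache and mark the line dirty. */
    static void store_write_back(struct cache_line *line, uint32_t offset,
                                 uint32_t value) {
        memcpy(&line->data[offset], &value, sizeof value);
        line->dirty = true;                      /* memory is now stale */
    }

    /* On eviction, a dirty line must be written back to memory first. */
    static void evict(struct cache_line *line, uint32_t block_addr) {
        if (line->valid && line->dirty)
            write_block_to_memory(block_addr, line->data);
        line->valid = line->dirty = false;
    }

    int main(void) {
        struct cache_line line = { .valid = true, .tag = 0x1234 };
        store_write_through(&line, 0x1234abc0u, 0, 42);
        store_write_back(&line, 4, 43);
        evict(&line, 0x1234abc0u);
        return 0;
    }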
The greater the difference in speed between two levels in the memory hierarchy, the more compelling the case for a write-back cache. So, in modern systems you'll normally see a write-through L1 cache (since here we're transferring between two cache levels), but a write-back L2 cache (since memory is much slower than cache these days).
In general, increasing the block size makes better use of spatial locality and improves the hit rate. However, increasing the block size too much causes two side-effects: first, for a fixed cache size, bigger blocks mean fewer blocks, so more addresses compete for each one and the miss rate eventually goes back up; second, a bigger block takes longer to transfer from memory, so the miss penalty goes up.
The fact that a given memory block can only be mapped to a single location in the cache means we can easily have a situation in which several memory blocks, all mapping to the same cache block, continually displace each other (this is called "cache thrashing"). We've got a couple of things we can do about this. First, if we can identify different types of memory accesses, we can have special-purpose caches for them; an example is the separate instruction and data caches we've mentioned already. A fortunate side-effect is that we also get the effect of a multiported cache: instruction and data accesses can proceed in parallel.
Another approach is using a set-associative cache: here, instead of only allowing a memory block to be mapped to a single cache block, it maps to a set of cache blocks. The number of blocks in the set is the associativity.
This introduces a new question: which block in the set do we replace on a miss? The two most popular options are probably LRU (and approximations to LRU), and random replacement.
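To see how the set and the LRU bookkeeping fit together, here's a minimal C sketch of a two-way lookup. The parameters match the 64K, 32-byte-block example worked through below, only the tags are modelled (no data movement), and the one-bit LRU trick works only because there are two ways:

    #include <stdbool.h>
    #include <stdint.h>

    enum { WAYS = 2, NUM_SETS = 1024, OFFSET_BITS = 5, INDEX_BITS = 10 };

    struct line { bool valid; uint32_t tag; };
    struct set  { struct line way[WAYS]; int lru; /* least recently used way */ };

    static struct set cache[NUM_SETS];

    /* Returns true on a hit; on a miss, installs the block in the LRU way. */
    static bool cache_access(uint32_t addr) {
        uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
        struct set *s  = &cache[index];

        for (int w = 0; w < WAYS; w++) {
            if (s->way[w].valid && s->way[w].tag == tag) {
                s->lru = 1 - w;            /* the other way is now the LRU one */
                return true;               /* hit */
            }
        }
        int victim = s->lru;               /* miss: replace the LRU way */
        s->way[victim].valid = true;
        s->way[victim].tag   = tag;
        s->lru = 1 - victim;               /* the filled way was just used */
        return false;
    }

    int main(void) {
        cache_access(0x1234abcdu);                 /* miss, fills one way of set 0x15e */
        cache_access(0x2234abcdu);                 /* miss, fills the other way */
        return cache_access(0x1234abcdu) ? 0 : 1;  /* hit: both blocks still resident */
    }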
Either making a cache larger or increasing its associativity will improve the hit rate (unfortunately, both also slow the cache down). The question of which to do (double the size of the cache, or leave it the same size and double the associativity?) comes down to which one your simulations show doing the most good, and which one will cost less. Which route costs less is affected by whether the cache is on-chip or off-chip.
Keeping a discussion of caches up to date is challenging, to say the least. As of this writing (March 2003), recent PC processors and motherboards have had two levels of cache, both in the CPU module. The Mac G4 uses a three-level cache, with two levels in the processor module and one on the motherboard.
Suppose we have a 32-bit byte-addressed machine, and a 64K, two-way set-associative write-back cache with a 32-byte block size using LRU replacement. First, let's work out what the fields will be:
The byte offset field is five bits, since 2^5 = 32.
The sum of the byte offset and index fields would be 16 bits if the cache were direct-mapped, since 2^16 is 64K; since it's two-way set-associative it's only 15 bits. As the byte offset field is five bits, the index field is 10 bits.
The total width of the bit fields has to be 32 (since that's the address size), so the tag has to be 32 - 15 = 17 bits.
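The same arithmetic in a few lines of C, as a sanity check (the parameters are the ones from this example):

    #include <stdio.h>

    /* Integer log2 for powers of two. */
    static int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void) {
        int addr_bits  = 32;
        int cache_size = 64 * 1024;   /* bytes */
        int block_size = 32;          /* bytes */
        int ways       = 2;

        int sets        = cache_size / (block_size * ways);      /* 1024 */
        int offset_bits = log2i(block_size);                     /* 5  */
        int index_bits  = log2i(sets);                           /* 10 */
        int tag_bits    = addr_bits - offset_bits - index_bits;  /* 17 */

        printf("offset=%d index=%d tag=%d\n", offset_bits, index_bits, tag_bits);
        return 0;
    }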
Before we start, we assume the cache is empty.
We write four bytes at address 0x1234abcd (this is something an Intel processor would be able to do, but a MIPS would not, since the address isn't word-aligned). This divides into:

Offset | 0x0d   |
Index  | 0x15e  |
Tag    | 0x2469 |

So, 32 bytes are read from memory starting at 0x1234abc0 into one of the blocks in set 0x15e. This block is marked as Valid and Dirty, and its tag becomes 0x2469.
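A quick check of those numbers (same 5/10/17 field split as above):

    #include <stdint.h>
    #include <stdio.h>

    enum { OFFSET_BITS = 5, INDEX_BITS = 10 };   /* 32-byte blocks, 1024 sets */

    int main(void) {
        uint32_t a          = 0x1234abcdu;
        uint32_t offset     = a & ((1u << OFFSET_BITS) - 1);
        uint32_t index      = (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag        = a >> (OFFSET_BITS + INDEX_BITS);
        uint32_t block_base = a & ~((1u << OFFSET_BITS) - 1);
        printf("offset=0x%02x index=0x%03x tag=0x%05x block starts at 0x%08x\n",
               offset, index, tag, block_base);
        /* prints: offset=0x0d index=0x15e tag=0x02469 block starts at 0x1234abc0 */
        return 0;
    }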
Now let's try to read four bytes at 0x2234abcd. This ends up reading 32 bytes from memory starting at address 0x2234abc0; the block will be placed in the other block of set 0x15e, which will be marked Valid but not Dirty (its tag is 0x4469).
To finish up, let's read four bytes at 0x1234abdf. This will get nasty. For the first byte, the index and tag are the same as in the first example. But the data we're trying to read overlaps two blocks: one byte comes from the block starting at 0x1234abc0, and three from the block starting at 0x1234abe0 (which maps to the next set, 0x15f). So we get a hit in one cache block and a miss in the next, and we have to fetch the missing data from memory, put the pieces together, and send the result up to the CPU.
Intel does this whole thing in hardware; notice that this adds quite a bit of complexity to the memory system and slows things down; consequently, compilers typically make sure that data is properly word-aligned or even cache-block aligned.
More modern architectures tend to disallow it, on the grounds that it's a feature that shouldn't be taken advantage of even if it's present; in the case of both Sparc and MIPS, the result is a trap which is delivered to the program (on a Sparc running Solaris, the signal delivered is SIGBUS, or ``Bus error.''). The IBM POWER
architecture took an interesting approach; it allowed unaligned
accesses that didn't overlap cache blocks, and specifically assumed
that on an unaligned access that did overlap blocks the OS would put
the data together for the application. I don't know if later members
of the POWER family have continued that.
We can classify cache misses into three types: compulsory misses, capacity misses, and conflict misses.
These definitions try to capture the notion that there are some misses that are simply unavoidable (ie the compulsory misses; there is simply no way to avoid a miss the first time you access some data that isn't in the cache yet), some misses that can be avoided by increasing the cache size but which can't be avoided by increasing associativity (ie the capacity misses), and some misses that can be avoided by increasing the associativity (ie the conflict misses).
Unfortunately, the definitions of conflict and capacity misses tend to be a bit too ``crisp,'' in the sense that a miss isn't regarded as a capacity miss unless the cache is absolutely full, and you will have misses that are essentially due to the cache being too small before that has quite happened. It might be more useful to have some measure of the effectiveness of the cache's associativity based on how many conflict misses you're getting vs. how full the cache is; if the cache is nearly empty or contains mostly data that hasn't been accessed in quite a while it's evidence that those are "real" conflict misses, while if the cache is nearly full and most of the data has been used recently it's evidence that they're "really" capacity misses. But when classifying misses, we use the definitions above.
Keeping the cache and memory in agreement is part of the general problem of memory consistency, which can be thought of as maintaining the illusion that all the levels of the memory hierarchy contain the same data.
Think about a system with a cache, and with at least one DMA device. We have coherence problems on both reads and writes: on transfers from the device to memory, the cache can become stale; on transfers from memory to the device (with a write-back cache), the memory can be stale.
The general solution is to use a ``snoopy'' cache: the cache watches the memory bus, and responds to memory transactions. On a transfer to the device, the cache notices when the device requests an address that it holds cached and dirty, suppresses the memory's response, and supplies the data in its stead (it may also do a writeback at this point). On a transfer from the device, it watches for addresses that it has cached, and either snags a copy of the data (that's called an update protocol) or marks its copy as invalid (that's an invalidate protocol).
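Here's a very rough sketch of that snoop logic in C; the bus-interface functions are invented stand-ins (real snoop hardware works on bus signals, not function calls), and only the two cases just described are modelled:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { BLOCK_SIZE = 32 };

    struct cache_line { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_SIZE]; };

    /* Invented stand-ins for the bus interface. */
    static void suppress_memory_response(void)       { puts("memory response suppressed"); }
    static void supply_data_on_bus(const uint8_t *d) { (void)d; puts("cache supplies the data"); }

    /* The device is reading a block we may hold (memory -> device transfer). */
    static void snoop_device_read(struct cache_line *line, bool we_have_it) {
        if (we_have_it && line->valid && line->dirty) {
            suppress_memory_response();       /* memory's copy is stale */
            supply_data_on_bus(line->data);   /* (a writeback could happen here too) */
        }
    }

    /* The device is writing a block we may hold (device -> memory transfer). */
    static void snoop_device_write(struct cache_line *line, bool we_have_it,
                                   const uint8_t *new_data, bool update_protocol) {
        if (!we_have_it || !line->valid)
            return;
        if (update_protocol)
            memcpy(line->data, new_data, BLOCK_SIZE);   /* update: snag a copy */
        else
            line->valid = false;                        /* invalidate: drop our copy */
    }

    int main(void) {
        struct cache_line line = { .valid = true, .dirty = true, .tag = 0x2469 };
        uint8_t incoming[BLOCK_SIZE] = { 0 };
        snoop_device_read(&line, true);                    /* we answer instead of memory */
        snoop_device_write(&line, true, incoming, false);  /* invalidate protocol */
        return line.valid ? 1 : 0;
    }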