Getting all of the components of a memory subsystem to work together becomes quite a balancing act. Here are some issues:
When discussing cache, we carefully didn't mention whether the address presented to the cache is a virtual address or a physical address. It turns out that either is possible; in the former case we say the cache is in virtual space, in the latter that it is in physical space.
There are advantages to both: if we do the cache lookups in virtual space we can go faster, since we can do the cache lookup at the same time as the TLB lookup. Unfortunately, we have to completely flush the cache on a process switch, since we don't have a way to tell a virtual address produced by one process from an address produced by another process; this lowers the hit rate, and the larger the cache the more damage it does.
Both of the arguments in the last paragraph - that we have to have the cache in virtual space to do simultaneous cache/vm lookups, and that we can't tell what process an address comes from for cache lookups if the cache is in virtual space - turn out to have ways around them.
We can tell what process generates an address by maintaining a register to keep track of the process number (this is frequently called something like a "region" register); the contents of this register are prepended to the address. Now, whenever we change processes, we also change this register, so virtual addresses become distinguishable. We also have to have a concept of "active" vs. "inactive" processes, and some way to flush blocks from the cache so we can reuse region numbers.
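Here's a minimal sketch of the idea (the 8-bit register width and the name cache_key are my own assumptions, not any particular machine's): the region register is simply concatenated with the virtual address before the cache ever sees it.

    #include <stdint.h>

    /* A hypothetical 8-bit region (ASID) register; real machines differ.
     * It is reloaded on every process switch, so two processes that use
     * the same virtual address still produce distinct cache keys.       */
    static uint8_t region_reg;

    /* Key the cache with the region number prepended to the 32-bit address. */
    static inline uint64_t cache_key(uint32_t vaddr)
    {
        return ((uint64_t)region_reg << 32) | vaddr;
    }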
We can also do cache lookups in parallel with TLB lookups, if the cache parameters meet some requirements. In particular, we need to have the cache small enough (or associative enough) that the set# field and the cache offset field, taken together, are no wider than the page offset field. The reason for this requirement is that these bits won't be changed by the virtual memory lookup.
So this lets us do the translation from virtual to physical in parallel with the cache lookup, giving us the benefits of both cache in virtual space and cache in physical space, at the cost of a wiring nightmare.
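A minimal sketch of that size requirement, with parameter names of my own choosing: asking that the set index and block offset bits fit inside the page offset is the same as asking that one way of the cache be no bigger than a page.

    #include <stdbool.h>
    #include <stddef.h>

    /* True if an L1 cache with these parameters can be indexed with the
     * untranslated (page-offset) bits, i.e. in parallel with the TLB.   */
    static bool can_index_in_parallel(size_t cache_size, size_t associativity,
                                      size_t page_size)
    {
        /* set-index bits + block-offset bits <= page-offset bits
           is equivalent to: bytes per way <= bytes per page */
        return cache_size / associativity <= page_size;
    }

    /* Examples:  1K direct-mapped cache, 1K pages -> true  (this page's example)
                  8K 4-way cache, 4K pages         -> true  (2K per way)
                  64K 2-way cache, 4K pages        -> false (32K per way)        */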
Let's try to put together some examples of simultaneous TLB and L1 cache lookups. For a first example, let's look at the simplest case: we'll make both the TLB and the L1 cache direct-mapped. Let's assume the following specifications:
Virtual Memory
    Address width: 32 bits
    Page size: 1 K bytes
    Single-level page table
Physical Memory
    32-bit physical address space
Cache
    Block size: 16 bytes
    Cache size: 1 K bytes
    Associativity: direct mapped
Translation Lookaside Buffer
    Number of translations: 64
    Associativity: direct mapped
Yes, I've deliberately set out to make sure that the only specification here that matches Intel is the size of the virtual space and the size of the physical space (physical space only on earlier ones). I generally want to make sure it's different every way I can, but 32 bits is just too common to pass up. In a few years I expect it won't be, and I'll have to modify this page.... I've also simplified things by using a single-level page table; this simplification would only become important when the TLB misses.
OK. Now we need to see what the address fields look like for each of the three components. It goes like this:
1K/16 = 64 blocks. Since it's direct mapped we've got a six-bit cache index field, and the 16-byte blocks give a four-bit byte offset. Likewise, the 1 K byte pages give a ten-bit page offset, and the 64 direct-mapped TLB translations give a six-bit TLB index.
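As a quick sanity check on that arithmetic (the lg helper is mine, just a log2 for exact powers of two), every field width is the log2 of the corresponding size:

    #include <stdio.h>

    /* log2 for exact powers of two */
    static int lg(unsigned x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

    int main(void)
    {
        printf("byte offset : %d bits\n", lg(16));        /* 16-byte blocks  -> 4  */
        printf("cache index : %d bits\n", lg(1024 / 16)); /* 64 blocks       -> 6  */
        printf("page offset : %d bits\n", lg(1024));      /* 1K pages        -> 10 */
        printf("TLB index   : %d bits\n", lg(64));        /* 64 translations -> 6  */
        return 0;
    }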
OK, so this looks like the following:
Cache fields:
| Tag   | Cache Index | Byte Offset |
| 31-10 | 9-4         | 3-0         |

Virtual memory fields:
| Virtual Page Number | Byte Offset |
| 31-10               | 9-0         |

TLB fields (splitting the virtual page number):
| TLB Tag | TLB Index |
| 31-16   | 15-10     |
Here, remember the key point that's going to make all this work: none of the bits in the cache index field are also bits in the VPN field. Now we can go on to combine the cache fields with the TLB fields to get the final address breakdown:
| TLB Tag | TLB Index | Cache Index | Byte Offset |
| 31-16   | 15-10     | 9-4         | 3-0         |
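Here's a small sketch (the macro names are mine) that pulls those four fields out of a 32-bit virtual address with shifts and masks, matching the bit ranges in the table above; the test value is the one used in the example below.

    #include <stdint.h>
    #include <stdio.h>

    #define BYTE_OFFSET(va)  ((va) & 0xF)            /* bits 3-0   */
    #define CACHE_INDEX(va)  (((va) >> 4)  & 0x3F)   /* bits 9-4   */
    #define TLB_INDEX(va)    (((va) >> 10) & 0x3F)   /* bits 15-10 */
    #define TLB_TAG(va)      ((va) >> 16)            /* bits 31-16 */

    int main(void)
    {
        uint32_t va = 0x1234abcd;
        unsigned off = BYTE_OFFSET(va), idx = CACHE_INDEX(va),
                 tidx = TLB_INDEX(va), ttag = TLB_TAG(va);
        printf("offset=%x index=%x tlb_index=%x tlb_tag=%x\n", off, idx, tidx, ttag);
        /* prints: offset=d index=3c tlb_index=2a tlb_tag=1234 */
        return 0;
    }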
Now, to do a simultaneous cache/TLB lookup, execute the following steps. Everything I've got shown as a single step is done in parallel; notice that there are no more steps here than there would be for a plain cache lookup or a plain TLB lookup (although more is done in a step - but that costs hardware, not time).
Here's a figure that tries to show the simultaneous lookup in action. It's been simplified a bit to reduce the spaghetti quotient, so it doesn't show every detail.
Let's put specific numbers on this: we'll try to read one byte from virtual address 0x1234abcd.

The byte offset field contains d (bits 3-0 of the address).
The cache index field contains 3c (bits 9-4 of the address).
The TLB index field contains 2a (bits 15-10 of the address).
The TLB tag field contains 1234 (bits 31-16 of the address).
So now we go through the following steps:

1. We look up translation 2a in the TLB and cache line 3c in the cache.
2. We obtain the TLB tag from the TLB and the cache tag from the cache.
3. We ask whether 1234 (that's the TLB tag from our virtual address) matches the tag stored in that TLB entry, whether the entry holds a valid translation, and whether the physical page number it gives us matches the cache tag in line 3c (and whether that line is valid).
4. If the answer to all of the questions in Step 3 was "yes", we've both got a valid translation and a cache hit. We can either obtain our data from the cache or write our value to the cache.
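To make the step list concrete, here's a toy simulation of the combined lookup for this direct-mapped TLB and cache (the structure and field names are mine, and everything runs sequentially in C rather than in parallel hardware):

    #include <stdbool.h>
    #include <stdint.h>

    struct tlb_entry  { bool valid; uint16_t tag; uint32_t ppn; };      /* 64 entries */
    struct cache_line { bool valid; uint32_t tag; uint8_t data[16]; };  /* 64 lines   */

    struct tlb_entry  tlb[64];
    struct cache_line cache[64];

    /* Returns true on a TLB hit plus a cache hit; *byte receives the data.
     * Note the cache tag here is a physical tag: with 1K pages it is exactly
     * the physical page number the TLB hands back.                          */
    bool lookup(uint32_t va, uint8_t *byte)
    {
        uint32_t offset    = va & 0xF;            /* bits 3-0   */
        uint32_t index     = (va >> 4)  & 0x3F;   /* bits 9-4   */
        uint32_t tlb_index = (va >> 10) & 0x3F;   /* bits 15-10 */
        uint32_t tlb_tag   = va >> 16;            /* bits 31-16 */

        struct tlb_entry  t = tlb[tlb_index];     /* step 1: index both arrays */
        struct cache_line c = cache[index];

        if (!t.valid || t.tag != tlb_tag)         /* step 3: valid translation? */
            return false;                         /* TLB miss                   */
        if (!c.valid || c.tag != t.ppn)           /* step 3: physical tags match? */
            return false;                         /* cache miss                 */

        *byte = c.data[offset];                   /* step 4: hit, deliver the byte */
        return true;
    }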
Intel uses an L1 cache organization that is compatible with a simultaneous VM/cache lookup. In the case of the Pentium 4, the L1 data cache is only 8K, and is 4-way set-associative. This is actually smaller than the Pentium III's L1 caches, which were 16K (and 4-way set-associative). AMD, on the other hand, has opted to use a larger cache that isn't able to do the translation in parallel (the Athlon 64 FX has 64K, 2-way set-associative data and instruction caches).
We have two competing requirements: we'd like to bring an entire cache line in from memory in one transfer (for bandwidth), but we want to have as few data lines as possible (for cost).
There are really three feasible solutions here: the fastest (but most expensive) approach is to use a memory bus that's as wide as a cache line. Now, any time you have a miss, you can just do a single memory transfer. The cheapest (but slowest) approach is to use a memory bus that's narrower than a cache line; then, on a miss, we take several memory transfers to bring the whole line in.
The third approach is a compromise between the first two: use the narrower bus from the second approach, but find a way to overlap the memory accesses. The traditional way to implement this approach was to have several distinct memory modules: you'd start a read from each of them in turn, and the data would arrive from them on consecutive cycles.
The current solution to this problem is to use fast page mode DRAM or synchronous DRAM. With both of these technologies, we can make a transfer from the internal DRAM cells (comparatively slow) into some substantially faster static memory on the memory chip, and then transfer the data from that static memory much more quickly than we could from the DRAM array itself. PC100 and PC133 SDRAM use four transfers of 64 bits each to fill a cache line on a system with a 32 byte cache line.
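As a quick check of that arithmetic (the constants are just the numbers from this paragraph), the number of bus transfers needed to fill a line is the line size divided by the bus width:

    #include <stdio.h>

    int main(void)
    {
        int line_bytes = 32;     /* cache line size        */
        int bus_bytes  = 8;      /* 64-bit SDRAM data bus  */
        printf("transfers per line fill: %d\n", line_bytes / bus_bytes);
        /* prints 4, matching the "four transfers of 64 bits" above */
        return 0;
    }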
Notice that it takes some work for memory-mapped I/O and DMA to be compatible with a cache: you need to make sure that if data is in the cache and we do a transfer out, we get the cache data, not stale memory data; and that a DMA transfer in either goes into the cache or invalidates the affected cache lines.
Intel's page table entry has two bits that help with this: the Page Cache Disable and Page Write-Through bits.
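For reference, here's a sketch of how those two bits might be manipulated in an x86 page table entry; to the best of my knowledge PWT is bit 3 and PCD is bit 4, but treat the exact positions as something to verify against Intel's manuals.

    #include <stdint.h>

    #define PTE_PWT  (1u << 3)   /* Page Write-Through: write through instead of write back */
    #define PTE_PCD  (1u << 4)   /* Page Cache Disable: don't cache this page at all        */

    /* A device-register or DMA-buffer page might be mapped uncached: */
    static inline uint32_t make_uncached(uint32_t pte) { return pte | PTE_PCD; }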