CS 473

Homework 5: Superscalar and Out of Order Execution, and Cache

Due Monday, November 22, 2004

Superscalar Execution

Assume you've got a superscalar MIPS implementation, with the following characteristics (bear with me, now - there'll be some examples later. This really is the shortest and simplest set of assumptions I could come up with):

  1. The machine has two pipelines, which we'll refer to as the top pipe and the bottom pipe. We won't consider the branch pipe in this problem (if you'll look at the code sequence coming later, you'll notice there are no branches!)

  2. Here's a figure showing the two pipes. Most of the wires (like the paths to get the 16-bit constants where they're needed, for instance) aren't actually shown; you should assume all necessary wires are present.

    [Figure: dual pipeline for this problem]

    The top pipe has four stages: IF, ID, EX, WB. It can execute any instruction other than memory loads and stores.

    The bottom pipe has five stages: IF, ID, EX, MEM, WB. It can execute any instruction.

    Under ideal circumstances, two instructions can be sent down the pipes at a time. The first instruction of a pair is sent down the top pipe; the second is sent down the bottom pipe. Here's an example of ideal circumstances.

    [Figure: two instructions, one in each pipe]

    Cycle by cycle,

    1. The first two instructions are read.
    2. The first two instructions are decoded, and registers $2, $3, $5, and $6 are read. The first instruction is sent down the top pipe, the second down the bottom. Meanwhile, the second pair of instructions are read.
    3. $2 and $3 are added in the top ALU, while $5 and $6 are added in the bottom ALU. Meanwhile, registers $8, $9, and $11 are read.
    4. Register $1 is written. The second instruction is in the bottom pipe's MEM stage, where it does nothing. $8 is added to $9, and 100 is added to $11. The first instruction is now finished.
     5. Registers $4 and $7 are written, so the second and third instructions are finished. Data memory is read for the fourth instruction.
    6. Register $10 is written. The fourth instruction is finished.

  3. All possible forwarding exists between the two pipes (so, for instance, the result of the top pipe's EX stage is available to any stage in either pipe on the next cycle).

  4. If the bottom pipe has to stall, the top pipe can continue but no more instructions can be fetched until the stall is resolved (so we can't send a single instruction down the top pipe). If the top pipe has to stall, the bottom pipe stalls as well (this keeps the instructions running in-order). Stalls are shown by extending the ID stage.

  5. If a load or store instruction is fetched in a position where it would go down the top pipe, a NOP is inserted into the code stream so that the load/store goes down the bottom pipe instead.

Let's have a couple of examples.

  1. Let's make the second instruction depend on the first.

    The bottom pipe has to stall a cycle; then we can forward from the top pipe to the bottom. This delays fetching the second pair of instructions.

    [Figure: stalling and forwarding]

  2. Make the third instruction be a load, and the fourth use its result. Also put a fifth instruction in.

    This results in a NOP being inserted in the code stream (shown as a blank code line, but with activity in the top pipe) before the third instruction, to get it into the bottom pipe; effectively, the third instruction goes down the pipe by itself. The fourth instruction depends on the load's memory read, so it has to stall a cycle. Since the top pipe stalls, the bottom pipe has to stall too.

    [Figure: insert a NOP, stall and forward]

At Last, the Problem!

  1. (25 points) Draw a Gantt chart showing how the following code is executed in this machine:
    lw $2, 100($3)
    lw $4, 200($5)
    add $6, $2, $4
    sw $6, 300($7)
    addi $2, $2, 4
    addi $5, $5, 4
    addi $7, $7, 4
    
  2. (25 points) Reorder the instructions so this code will run as quickly as possible on the machine, subject to the constraint that it get the same results. Now show how your revised code executes on the machine.

Out-of-Order Execution (15 points)

Show how the following code would be executed on a CDC 6600. Assume all additions are floating-point.

X1 <- X6 + X7
X1 <- X2 * X3
X5 <- X1 * X4
X4 <- X3 + X2

Cache Problems (10 points each)

Text, questions 7.9 and 7.10. Also, perform the same exercise assuming a two-way set-associative cache using LRU replacement, with a total of 16 one-word blocks.

Cache Simulation! (50 points)

This should be quite a bit easier than the CPU simulations you've been doing this semester....

For this problem, I want you to perform a trace-based simulation of a cache. In a trace-based simulation, you start by recording the memory reads and writes issued by the CPU while executing a program. This is called a trace. Now, you can just feed the trace into a simulation of the memory system and watch what happens. The nice thing about this is you don't have to simulate the CPU; you don't have to worry about the actual data that's read or written. You can just pay attention to the "bookkeeping" part of the cache.

We are fortunate in that there is a repository of memory traces, called the NMSU TraceBase, over in the ECE department. It contains many traces for several architectures, along with code for reading those traces. You can find the TraceBase at http://tracebase.nmsu.edu/tracebase.html, though I'll be giving you the parts you need.

So, first you need the code to read the traces. You can find that at http://www.cs.nmsu.edu/~pfeiffer/classes/473/sem/f04/hw/pdtfetch.c. This code contains all the documentation you really need to use it, but full documentation on the format is available at http://www.cs.nmsu.edu/~pfeiffer/classes/473/notes/pdats.ps (it's a paper by Eric Johnson and a former student named Jiheng Ha).

To use the PDATS code, you need to have a call

initPDATS();
to initialize the PDATS library. It has to be called once, before any calls to pdtfetch().

Now you just repeatedly call

int pdtfetch(int *labelp, int *addrp)

Each call to pdtfetch() gets one memory access from standard input. The labelp argument returns a label describing the memory access type. While the file format supports a very rich set of access types, the particular trace you're going to look at uses only three (they show up as reference types 0, 1, and 2 in the statistics table at the end of this page).

The addrp argument returns the address of the memory read or write. Notice that you don't get the data! That isn't used in a trace-based simulation.

Finally, the actual return value from the call is either 2 or EOF. If the return is EOF, it means the end of file was read (and the labelp and addrp values aren't valid). If it returns 2, it means you got good data and can use it.
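Putting the two calls together, the read loop is short. Here's a minimal sketch, assuming only what's described above; the main() wrapper, the reference counter, and the re-declared prototypes are just for illustration (in your program the real declarations come from pdtfetch.c itself, and the return type of initPDATS() is a guess here):

    #include <stdio.h>

    /* Illustrative re-declarations; the real ones are in pdtfetch.c,
       and initPDATS()'s actual return type may differ. */
    void initPDATS(void);
    int  pdtfetch(int *labelp, int *addrp);

    int main(void)
    {
        int  label, addr;
        long refs = 0;

        initPDATS();    /* once, before any call to pdtfetch() */

        /* pdtfetch() reads one reference from standard input;
           it returns 2 on good data and EOF at end of trace. */
        while (pdtfetch(&label, &addr) != EOF) {
            /* hand (label, addr) to the cache simulator here */
            refs++;
        }

        printf("%ld references read\n", refs);
        return 0;
    }

Since pdtfetch() reads from standard input, you'd run the simulator with the trace redirected in, something like ./csim < tex.pdt (the program name is, of course, up to you).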

Your Program

You need to write a cache simulator for a machine with separate I- and D-caches and variable cache size, block size, and associativity. The D-cache is a write-back cache. You can do this by using a struct to represent a cache line; the struct has to contain the valid bit, the dirty bit, the tag, and a variable marking where the line is in the LRU queue. A set can then be an array of lines (you can probably simplify your code a bit by also explicitly keeping track of the most recently used line in the set). Finally, the cache itself is an array of sets.
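As a concrete (but purely illustrative) version of that layout, here is one possible set of declarations; all of the names and parameter values below are assumptions, not requirements:

    /* Illustrative parameters; see the note below about the 48 runs. */
    #define CACHE_SIZE  (64 * 1024)                         /* bytes          */
    #define BLOCK_SIZE  16                                  /* bytes per line */
    #define ASSOC       2                                   /* lines per set  */
    #define NUM_SETS    (CACHE_SIZE / (BLOCK_SIZE * ASSOC))

    struct line {
        int      valid;    /* valid bit                                   */
        int      dirty;    /* dirty bit (only matters for the D-cache)    */
        unsigned tag;      /* address tag                                 */
        int      lru;      /* this line's position in the set's LRU order */
    };

    struct set {
        struct line line[ASSOC];   /* a set is an array of lines           */
        int         mru;           /* index of the most recently used line */
    };

    struct set icache[NUM_SETS];   /* the cache itself is an array of sets */
    struct set dcache[NUM_SETS];

With power-of-two parameters like these, the set index for an address is (addr / BLOCK_SIZE) % NUM_SETS and the tag is addr / (BLOCK_SIZE * NUM_SETS).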

Your simulation needs to report the following information:

Last but not least: I want you to plot your results for cache sizes of 64K, 128K, 256K, and 512K bytes, line sizes of 4, 8, 16, and 32 bytes, and 1-way, 2-way, and 4-way set-associative caches (yes, this is 48 runs of your program with different parameters). If you write it appropriately, this will just take a trivial recompile for each run.
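One way to get the "trivial recompile" (just a sketch, using the same illustrative parameter names as the declarations above) is to guard the #defines so they can be overridden from the compiler command line:

    /* Defaults, used only when nothing is supplied on the command line. */
    #ifndef CACHE_SIZE
    #define CACHE_SIZE  (64 * 1024)
    #endif
    #ifndef BLOCK_SIZE
    #define BLOCK_SIZE  16
    #endif
    #ifndef ASSOC
    #define ASSOC       2
    #endif

Each of the 48 runs is then something like cc -DCACHE_SIZE=262144 -DBLOCK_SIZE=8 -DASSOC=4 csim.c pdtfetch.c -o csim (file names illustrative), with the output collected for plotting.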

Here's some test data for you to try your program on: file:///user/pfeiffer/classes/473/sem/f04/hw/tex.pdt (it's a trace of a MIPS executing the TeX text formatting program). According to the TraceBase site, it has the following statistics:
Ref Type      Count
0           9796162
1           6745333
2          50288264
Total      66829759

