Superscalar Processing

The whole notion of RISC processors was driven by the goal of executing one instruction per cycle. Once that was pretty much achieved, the question became how we could execute more than one instruction in a cycle... the answer, of course, is that we have to issue more than one instruction per cycle, which requires multiple pipelines.

Like so much else, we can see early examples of superscalar processors in the 1960s -- the CDC 6600 and the IBM 360/91 were both superscalar processors (they also both supported out-of-order execution, which we'll get to in a little bit).

So here's the basic idea of what we'll do: rather than have a single pipeline, we'll have two of them. We'll fetch two instructions from memory on a cycle, and attempt to start one of them down each pipeline.

Typically, there will be restrictions on what the two pipelines can do, which will affect our maximum speedup.

Also, notice that hazard detection and forwarding become downright painful.

So here's an explicit example: it's a bit more sophisticated than the example in the text on page 437, as it tries to be more realistic. Here's what we'll do:

  1. First, we'll observe that we're very close to having a separate pipeline for branches anyway. Let's Make It So.
  2. Now we'll add a second copy of most of the data part of the MIPS pipeline.
  3. But we don't want to copy all of it: if both pipes can talk to the data memory, we'll have to dual-port it. That'll cost us, and we aren't expecting a whole lot of memory operations anyway. (Notice the distinction here between the data memory and the registers: triple-porting the registers is also expensive, but not as expensive as dual-porting the cache would be, and we are expecting to do lots of register reads and writes, so there it's worth it.)
  4. So now we're up to three pipelines: one is the "branch pipe", the second is the "memory pipe", and the third is the "ALU pipe." The branch pipe can only do branches; the memory pipe can do arithmetic instructions, loads, and stores; and the ALU pipe can only do arithmetic instructions.
  5. Now, another point here is that instead of grabbing a single instruction from the instruction memory per cycle, we want to grab several: typically either two or four. And we probably want to put a new pipeline stage - a queue stage - between IF and ID.

    Here's a picture. I've only put enough lines in to show how things go; trying to draw them all would be a mess....

    [Figure: MIPS superscalar pipeline]

In the picture, the branch pipe is on top, the memory pipe is in the middle, and the ALU pipe is on the bottom. Let's take a look at all the stages in more detail.

IF
Either zero, one, or two instructions can be fetched from memory on a cycle. If the PC contains an address divisible by eight, two instructions are fetched; if it's only divisible by four, only one can be fetched. The fetched instructions are put in the farthest-back open slots in the IF/Q pipeline register (which we can also refer to as the instruction queue).

If there aren't enough open slots in the instruction queue for a full fetch, nothing is fetched. So if the PC is on a doubleword boundary (address divisible by eight) there must be space for two instructions; if only a word boundary there must be space for one.

The PC is incremented by four or eight, as appropriate: if the PC was on a doubleword boundary it is incremented by eight; if only a word boundary it is incremented by four. What's going on here is that in either case, the PC is incremented to the next doubleword boundary.

The PC increment and branch datapaths actually have to be more complicated than this description implies: if a branch is being executed, the PC must be incremented to four greater than the address of the branch instruction while the branch is proceeding down the pipeline, or else the branch will go to the wrong target. But this description captures the essence of what's going on.
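
To make those rules concrete, here's a minimal C sketch of the fetch-count and PC-increment logic (ignoring the branch complication just described). The function names and the open-slot count are made up for illustration; the real machine does all of this in hardware:

    #include <stdint.h>

    /* How many instructions to fetch this cycle: two on a doubleword
       boundary, one on a word boundary, and zero if the instruction
       queue doesn't have enough open slots for the full fetch. */
    int fetch_count(uint32_t pc, int open_slots) {
        int want = (pc % 8 == 0) ? 2 : 1;
        return (open_slots >= want) ? want : 0;
    }

    /* Increment the PC to the next doubleword boundary. */
    uint32_t next_pc(uint32_t pc) {
        return (pc % 8 == 0) ? pc + 8 : pc + 4;
    }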

Q
The Q stage might also be called a pre-decode. The instructions in the IF/Q pipeline register are examined and routed according to their type:

  • If the farthest-forward instruction is a branch, it can be sent to the branch pipe. If it's either a memory or an ALU instruction it can be routed to the memory pipe.
  • Now the second farthest-forward instruction is inspected. It can't be routed to the branch pipe; if it's a branch instruction it is stuck. But we'll get it unstuck in a second. If it's either a memory or an ALU instruction, and the first instruction wasn't routed to the memory pipe, it can be routed to the memory pipe. If the first instruction was routed to the memory pipe, and this is an ALU instruction, it can be routed to the ALU pipe.
  • Finally, the third instruction is inspected. The only thing it can do is, if neither previous instruction was routed to the ALU pipe and it's an ALU instruction, it can be routed to the ALU pipe.

One last thing happens in the Q stage: the remaining instructions in the IF/Q register are shifted forward to make room for newly fetched instructions.

As you can see, up to three instructions on a cycle can be "issued" from the Q stage, but they have to be a good mix to actually achieve anything like this. The machine will execute any valid MIPS code, but needs help from the compiler to execute quickly.
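
Here's a hedged C sketch of those routing rules (the instruction classes and names are invented for illustration; the real Q stage is combinational logic, and I'm reading the rules as keeping issue strictly in order, so routing stops at the first stuck instruction):

    typedef enum { BRANCH, MEMORY, ALU } InstrClass;

    /* Route queued instructions to pipes, oldest first, stopping at
       the first one that can't be routed anywhere.  Returns how many
       instructions issue this cycle (zero to three). */
    int q_stage_issue(const InstrClass q[], int count) {
        int branch_free = 1, mem_free = 1, alu_free = 1;
        int issued = 0;
        for (int i = 0; i < count && i < 3; i++) {
            if (i == 0 && q[i] == BRANCH && branch_free)
                branch_free = 0;  /* only the oldest slot may use the branch pipe */
            else if ((q[i] == MEMORY || q[i] == ALU) && mem_free)
                mem_free = 0;     /* memory pipe takes loads, stores, and ALU ops */
            else if (q[i] == ALU && alu_free)
                alu_free = 0;     /* ALU pipe takes arithmetic only */
            else
                break;            /* stuck; it shifts forward and tries again next cycle */
            issued++;
        }
        return issued;
    }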

ID
The ID stage is now able to do what it's always done: decode instructions. Of course, it'll be decoding up to three instructions at a time (the three that may have been sent down the pipeline on the previous cycle). Assuming the "improved" branch unit, this is the stage in which branch instructions in the top pipe decide whether to take the branch.

EX
The two bottom pipes perform arithmetic in this stage.

MEM-WB
This is the memory operation stage for the memory pipe, and the writeback stage for the ALU pipe.

WB
Finally, the writeback stage for the memory pipe.

Some Examples

Let's have some examples of the operation of the superscalar pipeline. First we'll have a trivial example just to get a feel for how things happen on the pipe; then we'll go on to a more realistic one.

Trivial Example

For the first example, we'll assume just a whole bunch of add instructions with no dependences. Things look like this:

[Figure: simple superscalar timing example]

Here's what's happening on each cycle:

  1. Instructions 1 and 2 are fetched and put in the first two entries in the instruction queue.

  2. Instructions 1 and 2 are pre-decoded. Since instruction 1 is an ALU instruction, it can (and does) go down the memory pipe. Instruction 2 will go down the ALU pipe.

    This frees up the first two entries in the instruction queue, so instructions 3 and 4 can be fetched there.

  3. Instructions 1 and 2 are decoded.

    Instructions 3 and 4 are pre-decoded. 3 will proceed down the memory pipe, and 4 down the ALU pipe.

    Instructions 5 and 6 are fetched into the first two entries of the instruction queue.

  4. Instructions 1 and 2 are executed, each in its own pipe's ALU.

    Instructions 3 and 4 are decoded.

    Instructions 5 and 6 are predecoded. Once again, 5 goes to the memory pipe and 6 to the ALU pipe.

    Instructions 7 and 8 are fetched into the first two entries of the instruction queue.

  5. Instruction 1 enters its memory stage. As with the old single-pipeline machine, any instruction going down the memory pipe has to go through this stage whether it's doing a memory operation or not.

    Instruction 2 writes its result back to its result register. At this point the instruction is complete; we say it is "retired" now.

    Instructions 3 and 4 are in their execute stage.

    All the other instructions continue down the pipe as we've seen before.

  6. Instruction 1 writes its result register, and is retired.

    Instruction 4 also writes its result register and is likewise retired.

    Instruction 3 is now in its memory stage.

    All the other instructions proceed as before.

And so forth until all the instructions have been retired.

More Complicated Example

For a more complicated example, we'll have some instructions that have dependences on each other. We'll also add some information about forwarding and stalling.

We'll assume all the forwarding we need exists. In other words, on any cycle, if an instruction needs a result from a previous instruction and that result is available anywhere in the CPU, we'll assume we can forward it (I don't want to draw all the wires to implement the forwarding, since there'd be so many that the figure would be hard to read).

On stalls, we'll assume stalling happens in a way that keeps the instructions executing in-order. First, we'll assume the only stage in which stalls can happen is the ID stage. That's where the stalls can be detected, and we note that we can resolve all hazards by stalling in this stage for long enough. So, if a stall happens:

  1. no more instruction pre-decode can occur until the stall ends.
  2. a higher pipeline is not stalled as a result, but a lower pipeline is. In other words, if the branch pipeline stalls then so do the memory and ALU pipelines. If the memory pipeline stalls, the ALU pipeline also stalls but the branch pipeline does not. If the ALU pipeline stalls, nothing else has to. (There's a small code sketch of this rule just after the list.)
  3. instructions can continue being fetched into the instruction queue until it is full.
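
Here's the promised sketch of the stall-priority rule, again in C with invented names (in hardware these are just a few gates on the stall signals):

    /* Rule 2: a stall in a higher pipe forces a stall in every lower
       pipe.  Priority order: branch, then memory, then ALU. */
    void propagate_stalls(int *branch_stall, int *mem_stall, int *alu_stall) {
        if (*branch_stall) *mem_stall = 1;  /* branch stall stalls the memory pipe */
        if (*mem_stall)    *alu_stall = 1;  /* memory stall stalls the ALU pipe */
    }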

OK, now let's look at an example. While the previous example was crafted to make the superscalar execution run as smoothly as possible, this one ends up throwing a huge number of dependences into the works, so it's going to end up making the machine stall a lot.

[Figure: complicated superscalar example]

Now let's look at what's happening on each cycle of this example.

  1. Instructions 1 and 2 are fetched.

  2. Instructions 1 and 2 are pre-decoded. 1 will go down the memory pipe, while 2 will go down the ALU pipe.

    Instructions 3 and 4 are fetched.

  3. Instruction 1 is decoded. Instruction 2 cannot be decoded, as it has to stall waiting for the result of instruction 1.

    Instructions 3 and 4 cannot be pre-decoded due to instruction 2's stall.

    Instruction 5 is fetched.

  4. Instruction 1 is executed.

    Instruction 2 is decoded.

    Instruction 3 is pre-decoded. It will use the memory pipeline.

    Instruction 4 cannot be pre-decoded, as it has to use the memory pipeline.

  5. Instruction 1 is in the memory stage.

    Instruction 2 is executed (and $1 is forwarded from Instruction 1 to Instruction 2).

    Instruction 3 is decoded. Note that it doesn't actually have to stall in spite of its dependence on Instruction 2, since it's going down the pipeline a cycle behind Instruction 2 anyway.

    Instructions 4 and 5 are pre-decoded. 4 will use the memory pipeline, and 5 will use the ALU pipeline. Notice that these two instructions were read on different cycles, but pre-decoded on the same cycle. That's the reason for having an instruction queue.

  6. Instructions 1 and 2 write their results back to their respective registers, and are retired.

    Instruction 3 is executed (and the result of Instruction 2 is forwarded to it).

    Instruction 4 is decoded.

    Instruction 5 has to stall due to its dependence on Instruction 4.

  7. Instruction 3 goes through its Memory stage.

    Instruction 4 is Executed.

    Instruction 5 is still stalled.

  8. Instruction 3 goes through its writeback and is retired.

    Instruction 4 goes through its Memory stage.

    Instruction 5 can now be decoded.

  9. Instruction 4 Writes back and is retired.

    Instruction 5 is executed (and the results from Instruction 4 are forwarded to it).

  10. Instruction 5 goes through its Memory stage.

  11. Instruction 5 goes through its Writeback.

(hmmm, in class somebody said that this took ten cycles. I'm getting 11 now....).

Optimizing the Complicated Example

It was very important for compilers to optimize code for superscalar pipelines. If we look at the code for the preceding example, we can see that it is logically divided into two "sets" of code that have no dependence on each other. We can interleave the instructions from these two sets of code and substantially reduce the stalls.

Here are two good general rules:

  1. Schedule loads as far in advance of their use as humanly possible. This is much more important with current architectures since memory operations are so expensive (a cache miss can easily cost hundreds of cycles), but is even a good idea with the example pipelines.
  2. Avoid using a value in the instruction immediately after the one that generates it.

Following those two rules, our example code, and its execution, looks like the following figure. Since we've only got five instructions to work with, there isn't a lot of opportunity for reordering the code; the only really useful change is moving the lw up as high as possible. But, as we'll see, even that is helpful.

[Figure: reordered code to speed up superscalar pipeline]

Here's what happens executing this code:

  1. Instructions 1 and 2 are Fetched.

  2. Instructions 1 and 2 are Predecoded. Instruction 1 will go down the Memory pipeline, and Instruction 2 down the ALU pipeline.

    Instructions 3 and 4 are Fetched.

  3. Instructions 1 and 2 are Decoded.

    Instructions 3 and 4 are Predecoded. Instruction 3 will use the Memory pipeline, and Instruction 4 will use the ALU pipeline.

    Instruction 5 is Fetched.

  4. Instructions 1 and 2 are Executed.

    Instruction 3 is Decoded. Even though there is a dependence on Instruction 2, it's already going down the pipeline a cycle later so no stall is needed.

    Instruction 4 has a dependence on Instruction 3, so it can't be Decoded on this cycle.

    Instruction 5 can't be Predecoded, as Instruction 4 is stuck in the Decode stage.

  5. Instruction 1 is in its Memory stage.

    Instruction 2 is in its Writeback stage, and is retired at the end of this cycle.

    Instruction 3 is in its Execute stage.

    Instruction 4 can be Decoded now, as it has waited long enough to resolve the hazard.

    Instruction 5 can be Predecoded, as Instruction 4 is moving out of the way. It will use the Memory pipeline.

  6. Instructions 1 and 3 are in their Writeback stages, and are retired at the end of this cycle.

    Instruction 4 is Executed (and there is a forward from Instruction 3).

    Instruction 5 is Decoded.

  7. Instruction 4 Writes back, and is retired.

    Instruction 5 is Executed (note that there is no forward required here, since Instruction 1 performed its Writeback on the previous cycle).

  8. Instruction 5 is in its Memory stage.

  9. Instruction 5 Writes back, and is retired.

Note that this simple optimization wound up saving nearly 20% of the execution time of the example (9 cycles instead of 11). Superscalar processors without out-of-order execution were extremely sensitive to the scheduling of instructions to optimize pipeline use. Once we get out-of-order execution working, these details become less important, as the processor itself schedules the code subject to its dependences.

