CS473 - Speeding Up Branches

Let's start by following a branch through the pipeline, and look at what else is going on?

beq $1, $1, target    _ _ _ _ _
add $2, $3, $4          _ _ _ _ _
add $5, $6, $7            _ _ _ _ _
add $8, $9, $10             _ _ _ _ _
add $11, $12, $13             _ _ _ _ _

Cycle by cycle, we get:

  1. Fetch branch instruction
  2. Decode branch instruction (fetch I+1)
  3. Compute branch target address, evaluate $1==$1 (fetch I+2)
  4. Load target address into PC (cancel I+1 and I+2, don't fetch I+3)
  5. Fetch instruction at target address

In effect, this instruction has taken four cycles to execute instead of the expected rate of instruction per cycle.

Branches taking extra time in pipelined processors is a big enough problem that it has a name: the branch penalty. We can divide the branch penalty into two sub-types: a branch-taken penalty and a branch-not-taken penalty (we're going to further subdivide it a little bit later). In this case, the branch-taken penalty is 3; the branch-not-taken penalty is 0.

So, how can we go about decreasing the branch-taken penalty? We'll have to answer three questions:

The answers to these questions depend critically on the detailed assumptions we make regarding the relative speeds of various operations in the machine. In general, we'll move a decision to an earlier point in the pipeline if we can do so without slowing down the clock enough to negate the advantage of moving the decision.

That said, we can make some rough but specific assumptions regarding the relative speeds of different hardware units. The basic assumptions we can make are:

With that as an introduction, here are some changes we can make:

  1. Compute the branch target during the ID stage. We can get away with this because the branch target ALU doesn't use data from the registers; it only uses data that is available at the start of the ID stage.
  2. Make the decision at the end of the ID stage. We can do this because checking for equality and then using an AND won't take nearly as long as a memory (well, cache) access, so we can do this without slowing the clock.

This will move the decision from the fourth stage to the second, and reduce the branch-taken delay to just one - not so coincidentally, the delay assumed by the delayed branch of the real MIPS! I'll also comment here that I get a strong sense that the original design of the branch path in the text is done specifically to show how speedups can be made.


Last modified: Mon Feb 28 08:52:49 MST 2005