CS 473 - Measuring Performance

  1. Clock Speed

    The worst possible way to measure performance is the one that seems to be most widely quoted: clock speed in MHz. It leaves far too much out to be at all useful (how many cycles does an instruction take? How fast is memory? What about I/O?)
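
    What MHz ignores is how much work each cycle accomplishes. The standard relation is

      execution time = (instruction count x CPI) / clock rate

    where CPI is the average number of clock cycles per instruction. A 200MHz machine averaging CPI = 2 is exactly as fast as a 100MHz machine averaging CPI = 1 on the same instructions.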

  2. Instruction Counts

    The second worst way seems to have lost a lot of favor recently: counting how many Millions of Instructions Per Second (MIPS) a processor is able to execute. Like MHz, it leaves far too much out to be useful. Worse, we can make the MIPS rating better while making actual performance worse: padding a program with NOPs, which typically complete in a single cycle, raises the instruction count and lowers the average CPI, so the MIPS rating goes up even as the run time gets longer. We can see how bad these ratings are by performing a little experiment: let's write a little program and see how it performs.

    First, the program. It doesn't accomplish anything at all; it just sits and spins.
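
    The original listing isn't reproduced here, but a minimal sketch of that sort of program (hypothetical source, assuming a simple billion-iteration empty loop) looks like this:

      /* perform.c -- a hypothetical stand-in for the original spin loop.
         It accomplishes nothing at all: it just counts to a billion. */
      int main(void)
      {
          long i;

          for (i = 0; i < 1000000000; i++)
              ;                       /* empty body -- just burn cycles */

          return 0;
      }

    With a compiler of that era, raising the optimization level tightens the loop body without eliminating it, which is consistent with the shrinking instruction counts in the table below.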

    I compiled it for Intel three times with different optimization levels and ran it on several of the CS department's computers. Here are the results:

    Name        Processor   Speed    Opt. Level  Compiled Code      #Instructions  Time (s)  MIPS   CPI
    Hemicuda    Pentium     133MHz   none        perform.386.s      4,000,000,000    45.22    88.5  1.5
    Hemicuda    Pentium     133MHz   -O1         perform.386.o1.s   3,000,000,000    15.12   198    0.67
    Hemicuda    Pentium     133MHz   -O2         perform.386.o2.s   2,000,000,000    37.97    52.7  2.52
    Anson       Pentium II  233MHz   none        perform.386.s      4,000,000,000    17.15   233    1
    Anson       Pentium II  233MHz   -O1         perform.386.o1.s   3,000,000,000     8.57   350    0.67
    Anson       Pentium II  233MHz   -O2         perform.386.o2.s   2,000,000,000     8.56   233    1
    Casablanca  Pentium II  400MHz   none        perform.386.s      4,000,000,000    10.0    400    1
    Casablanca  Pentium II  400MHz   -O1         perform.386.o1.s   3,000,000,000     5.0    600    0.67
    Casablanca  Pentium II  400MHz   -O2         perform.386.o2.s   2,000,000,000     5.0    400    1
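
    The last two columns are computed from the measurements: MIPS = instruction count / (time x 10^6), and CPI = clock rate / MIPS. For Hemicuda's unoptimized run, 4,000,000,000 / (45.22 x 10^6) = 88.5 MIPS, and 133MHz / 88.5 MIPS = 1.5 CPI. Notice how badly MIPS predicts performance: on Hemicuda, the -O2 code has the worst MIPS rating of the three (52.7) yet still finishes sooner than the unoptimized code.
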
  3. Tiny Benchmarks

    The little loop program up above is representative of another class of poor performance measures: tiny programs. These were very popular as benchmarks at one time (around the early 1980s), but fail to give good results because they tend to exercise only a small part of the instruction set. The point of a performance measurement (any of the ones already listed, or any sort of benchmark) is to predict the behavior of the system on real problems. If the benchmark doesn't exercise the system well enough to accomplish this, the benchmark fails.

  4. Synthetic benchmarks

    Examples: Whetstone (discussed below) and Dhrystone.

    The biggest flaw with these benchmarks is that a manufacturer can tune its compiler to give artificially good results on the exact code in the benchmark, even though the same tricks do nothing for normal programs.

    In the case of Whetstone, a key loop in the code contains the line

      X = SQRT(EXP(ALOG(X)/T1))
    It turns out that this is mathematically equivalent to the following, since SQRT(y) = EXP(ALOG(y)/2) and so the square root simply doubles the divisor:

      X = EXP(ALOG(X)/(2 * T1))

    A reviewer of the authors' other text found two manufacturers' compilers that actually found and exploited this!
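
    As a quick sanity check, the equivalence is easy to verify numerically (a standalone sketch; the sample values for X and T1 here are arbitrary, not Whetstone's):

      /* Verify SQRT(EXP(ALOG(X)/T1)) == EXP(ALOG(X)/(2*T1)).
         ALOG is Fortran's natural log, i.e. C's log(). */
      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          double x = 0.75, t1 = 0.5;             /* arbitrary sample values */
          double lhs = sqrt(exp(log(x) / t1));
          double rhs = exp(log(x) / (2.0 * t1));
          printf("%.15f\n%.15f\n", lhs, rhs);    /* same number twice */
          return 0;
      }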

  5. Program kernels

    Examples: the Livermore Loops and Linpack.
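
    To give the flavor of a kernel: Linpack spends nearly all of its time in one small loop, daxpy (y = a*x + y). A sketch in C (the benchmark itself is Fortran):

      /* daxpy: y = a*x + y -- the loop that dominates Linpack's run time. */
      void daxpy(int n, double a, const double *x, double *y)
      {
          int i;
          for (i = 0; i < n; i++)
              y[i] += a * x[i];
      }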

  6. Real programs and suites of programs

    Example: the Spec suites (Spec89, Spec92, and their successors).

    Note that the original Spec89 suite included a program, matrix300, which consisted of eight 300x300 matrix multiplies. 99% of the benchmark's execution time wound up being spent on a single line of code. IBM found a way to optimize this code for the Powerstation 550 that resulted in a factor of nine improvement in performance, so the program was removed from Spec92.
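
    For a sense of how the time could pile up on one line, here is a sketch of the sort of loop nest involved (hypothetical; not the actual matrix300 source):

      /* 300x300 matrix multiply: essentially all of the run time lands on
         the single multiply-accumulate line in the innermost loop. */
      #define N 300
      double a[N][N], b[N][N], c[N][N];

      void matmul(void)
      {
          int i, j, k;
          for (i = 0; i < N; i++)
              for (j = 0; j < N; j++)
                  for (k = 0; k < N; k++)
                      c[i][j] += a[i][k] * b[k][j];   /* the 99% line */
      }

    An optimizer that restructures such a nest for better cache behavior (blocking) can speed up that one line enormously, which is reportedly how the factor of nine was obtained.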

  7. Timings of your site's real workloads

    Even this assumes your workload doesn't change with the new equipment...

