The worst possible way to measure performance is the one that seems to be most widely quoted: MHz. It leaves far too much out to be at all useful. (How many cycles does an instruction take? How fast is memory? What about I/O?)
The second worst way seems to have lost a lot of favor recently: counting how many Millions of Instructions Per Second a processor is able to execute. Like MHz, it leaves far too much out to be useful. Worse, we can improve the MIPS rating while making actual performance worse, by padding the code with fast instructions like NOPs. We can see how bad these ratings are by performing a little experiment: let's write a little program, and see how it performs.
First, the program. It doesn't accomplish anything at all; it just sits and spins.
I compiled it for Intel three times at different optimization levels and ran it on several of the CS department's computers. Here are the results:
Name | Processor | Clock | Opt. Level | Compiled Code | # Instructions | Time (s) | MIPS | CPI |
---|---|---|---|---|---|---|---|---|
Hemicuda | Pentium | 133 MHz | none | perform.386.s | 4,000,000,000 | 45.22 | 88.5 | 1.5 |
Hemicuda | Pentium | 133 MHz | -O1 | perform.386.o1.s | 3,000,000,000 | 15.12 | 198 | 0.67 |
Hemicuda | Pentium | 133 MHz | -O2 | perform.386.o2.s | 2,000,000,000 | 37.97 | 52.7 | 2.52 |
Anson | Pentium II | 233 MHz | none | perform.386.s | 4,000,000,000 | 17.15 | 233 | 1 |
Anson | Pentium II | 233 MHz | -O1 | perform.386.o1.s | 3,000,000,000 | 8.57 | 350 | 0.67 |
Anson | Pentium II | 233 MHz | -O2 | perform.386.o2.s | 2,000,000,000 | 8.56 | 233 | 1 |
Casablanca | Pentium II | 400 MHz | none | perform.386.s | 4,000,000,000 | 10.0 | 400 | 1 |
Casablanca | Pentium II | 400 MHz | -O1 | perform.386.o1.s | 3,000,000,000 | 5.0 | 600 | 0.67 |
Casablanca | Pentium II | 400 MHz | -O2 | perform.386.o2.s | 2,000,000,000 | 5.0 | 400 | 1 |
The little loop program above is representative of another class of poor performance measures: tiny programs. These were very popular as benchmarks at one time (around the early 1980s), but fail to give good results because they tend to exercise only a small part of the instruction set. The point of a performance measurement (whether one of the ratings already listed or any sort of benchmark) is to predict the behavior of the system on real problems. If the benchmark doesn't exercise the system well enough to accomplish this, the benchmark fails.
The biggest flaw with synthetic benchmarks like these is that it is possible for a manufacturer to tune a compiler to give artificially good results on the code present in the benchmarks, even though the same gains will not be possible for normal programs.
In the case of Whetstone, a key loop in the code contains the line

X = SQRT(EXP(ALOG(X)/T1))

It turns out that this is mathematically equivalent to

X = EXP(ALOG(X)/(2 * T1))
A reviewer of the authors' other text found two manufacturers' compilers that actually found and exploited this!
Example:
Note that the original SPEC89 suite included a program, matrix300, which consisted of eight 300x300 matrix multiplies. 99% of the benchmark's execution time wound up being spent on a single line of code. IBM found a way to optimize this code for the Powerstation 550 that resulted in a factor-of-nine improvement in performance - the program was removed from SPEC92.
Even this assumes your workload doesn't change with the new equipment...