The IEEE standard has some extra features to help the programmer, including guard bits, several rounding modes, and defined behavior on exceptions.
Integers have many advantages over floating point numbers: simplicity, ease of fast implementation... and, especially, accuracy. An integer 1 has the value of exactly 1: not 1.000001 or 0.999999. With floating point numbers, we frequently find ourselves representing a number by its closest approximation, and as we perform arithmetic, the approximation gets worse and worse.
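To see the drift in practice, here's a minimal C sketch (the loop count is an arbitrary choice of mine) comparing repeated floating point and integer addition:

```c
#include <stdio.h>

int main(void) {
    /* 0.1 has no exact binary representation, so every addition
     * compounds the initial rounding error a little further. */
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;
    printf("float sum   = %.17f\n", sum);  /* 0.99999999999999989 */
    printf("sum == 1.0  : %s\n", sum == 1.0 ? "true" : "false");

    /* The integer version, by contrast, is exact. */
    int isum = 0;
    for (int i = 0; i < 10; i++)
        isum += 1;
    printf("isum == 10  : %s\n", isum == 10 ? "true" : "false");
    return 0;
}
```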
Suppose we had a simple floating point format that allowed three bits of precision, and we want to calculate 2.00 - 1.75. In binary scientific notation, this is 1.00 x 2^1 - 1.11 x 2^0. After denormalizing, the subtraction becomes 1.00 x 2^1 - 0.11 x 2^1 = 0.01 x 2^1 = 1.00 x 2^-1 (remember, when we denormalized the 1.75 we didn't get to keep all three bits).
If instead we had calculated the result with infinite precision, it would have been 1.00 x 2^1 - 0.111 x 2^1 = 0.001 x 2^1 = 1.00 x 2^-2, which is exactly 0.25.
Our error was 100%! This is called "catastrophic cancellation."
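Here's a minimal C sketch of that same calculation. The toy representation (a 3-bit significand held in an int, with value (sig / 4.0) x 2^exp) and the helper name `value` are my own, purely for illustration:

```c
#include <math.h>
#include <stdio.h>

/* Toy format: 3-bit significand in an int, value = (sig / 4.0) * 2^exp,
 * so 0x4 = 0b100 means 1.00 and 0x7 = 0b111 means 1.11. */
static double value(int sig, int exp) { return ldexp(sig / 4.0, exp); }

int main(void) {
    int sig_a = 0x4, exp_a = 1;   /* 2.00 = 1.00 x 2^1 */
    int sig_b = 0x7, exp_b = 0;   /* 1.75 = 1.11 x 2^0 */

    /* Align the smaller operand to the larger exponent.  With only
     * three bits to hold it, the low-order 1 of 1.11 falls off. */
    int aligned_b = sig_b >> (exp_a - exp_b);        /* 0b011 = 0.11 */

    int diff = sig_a - aligned_b;                    /* 0b001 = 0.01 */
    printf("three-bit result: %g\n", value(diff, exp_a));  /* 0.5  */
    printf("exact result:     %g\n", 2.00 - 1.75);         /* 0.25 */
    return 0;
}
```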
To reduce this problem, implementations of the IEEE standard maintain three extra bits called "guard", "round", and "sticky." These bits live in the floating point unit, and catch bits as they are right-shifted off the end of the word. The idea is that the significand looks like:
| fraction bits | G | R | S |
G and R just catch bits that have been shifted off, while S is the inclusive-or of any bits that went too far to the right for G and R.
In the example above, the guard bit by itself would eliminate the error; the "1" that was right-shifted out from the significand lands in the guard bit, so when we perform the subtraction we get the right answer.
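Extending the toy sketch above with one extra significand bit (again, my own representation, not real hardware): the 1 shifted out during alignment now survives as a guard bit, and the subtraction comes out exact.

```c
#include <math.h>
#include <stdio.h>

/* As before, but significands now carry one extra (guard) bit:
 * four bits in all, value = (sig / 8.0) * 2^exp. */
static double value(int sig, int exp) { return ldexp(sig / 8.0, exp); }

int main(void) {
    int sig_a = 0x8, exp_a = 1;   /* 2.00 = 1.000 x 2^1 */
    int sig_b = 0xE, exp_b = 0;   /* 1.75 = 1.110 x 2^0 */

    /* The 1 shifted off during alignment lands in the guard bit
     * instead of vanishing. */
    int aligned_b = sig_b >> (exp_a - exp_b);        /* 0b0111 = 0.111 */

    int diff = sig_a - aligned_b;                    /* 0b0001 = 0.001 */
    printf("with guard bit: %g\n", value(diff, exp_a));  /* 0.25, exact */
    return 0;
}
```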
Of course, these extra bits don't completely eliminate rounding error. But they help.
It turns out that the way we were taught to round (anything less than .5 rounds down, .5 and above rounds up) is only one possible rounding mode. The IEEE format supports four different modes, as follows:

- Round toward zero (just truncate)
- Round to nearest, with ties going to the even significand (the default)
- Round toward positive infinity
- Round toward negative infinity
You can see why the extra bits above are needed to implement these rounding modes. We could implement "round toward zero" with no extra bits at all: since we just truncate, anything shifted into them goes unused (and I find it hard to imagine anyone wanting this mode!).
We need the other bits to implement the last three modes: how we round depends on their contents.
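To watch the modes in action, here's a small C program using the standard <fenv.h> interface; fesetround and rint are C99, though whether rounding-mode changes take effect everywhere is compiler- and platform-dependent:

```c
#include <fenv.h>
#include <math.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON   /* we change the FP environment;
                                 not every compiler honors this */

int main(void) {
    /* volatile keeps the compiler from computing 1/3 at compile
     * time, where the runtime rounding mode wouldn't apply. */
    volatile double one = 1.0, three = 3.0, x = 2.5;

    struct { int mode; const char *name; } modes[] = {
        { FE_TOWARDZERO, "toward zero" },
        { FE_TONEAREST,  "to nearest " },
        { FE_UPWARD,     "toward +inf" },
        { FE_DOWNWARD,   "toward -inf" },
    };

    for (int i = 0; i < 4; i++) {
        fesetround(modes[i].mode);
        /* rint() rounds to an integer in the current mode; the
         * division shows the mode acting on the last bit of 1/3. */
        printf("%s: rint(2.5) = %.1f   1/3 = %.20f\n",
               modes[i].name, rint(x), one / three);
    }
    return 0;
}
```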
I claimed last time that anything with an exponent of all ones, and a significand of anything other than 0, was a NaN. That was true, but there's actually a little more to say about it: namely, if the most significant bit of the significand is a 0 then the number is a "signaling NaN", while if it is a 1 it is a "quiet NaN."
A "quiet NaN" can be used in arithmetic; it just quietly propagates itself through the operation and ends up as the result.
A "signaling NaN" can be copied, assigned, and compared without an exception; however, using it in an arithmetic expression triggers a hardware exception.
Finally, the standard defines five exceptions that can be enabled or disabled:

- Invalid operation (for example, 0/0 or the square root of a negative number)
- Division by zero
- Overflow
- Underflow
- Inexact (the result had to be rounded)
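In C, the standard <fenv.h> flags record which exceptions an operation raised, and a program can test them as sketched below (feclearexcept and fetestexcept are C99; actually trapping on an exception, e.g. via glibc's feenableexcept, is a nonstandard extension):

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

static void report(const char *label) {
    printf("%s raised:%s%s%s%s%s\n", label,
           fetestexcept(FE_INVALID)   ? " invalid"   : "",
           fetestexcept(FE_DIVBYZERO) ? " divbyzero" : "",
           fetestexcept(FE_OVERFLOW)  ? " overflow"  : "",
           fetestexcept(FE_UNDERFLOW) ? " underflow" : "",
           fetestexcept(FE_INEXACT)   ? " inexact"   : "");
}

int main(void) {
    /* volatile keeps the compiler from folding these at compile time. */
    volatile double zero = 0.0, big = 1e308, r;

    feclearexcept(FE_ALL_EXCEPT);
    r = 1.0 / zero;                   /* produces infinity */
    report("1.0 / 0.0");

    feclearexcept(FE_ALL_EXCEPT);
    r = zero / zero;                  /* produces a quiet NaN */
    report("0.0 / 0.0");

    feclearexcept(FE_ALL_EXCEPT);
    r = big * big;                    /* too big: overflows */
    report("1e308 * 1e308");

    (void)r;
    return 0;
}
```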
Here's a good paper I found on the net that discusses the picky details of IEEE floating point.