Archive for February 2008
floating point ST15
Let fl(x) denote the floating point number which is determined by a machine from x. Other than that, e0 is the smallest floating point number with radix a, t long significand, and k- as the smallest exponent: e0=ak--1. Let (1+e1) be the floating point number (which numbers are rational numbers) after 1, e1 equals to a1-t. The mathematical meaning of a floating point number, or the number it represents is like:
t
----- m
k \ i
a ) --
/ i
----- a
i = 1
Theorem.
|fl(x)-x| is less then or equal to e1|x| if |x|>e0
Proof.
It is enough to see the proof for a x>0, the x<0 case is similar.
Note that x can be written in a form
/ \ | t infinity | | ----- m --------- m | k | \ i \ i | a | ) -- + ) -- | | / i / i | | ----- a --------- a | | i = 1 i = t + 1 | \ /
After truncation the second sum’ll disappear so that is the distance of fl(x) and x. Now let’s see e1x:
infinity
-------- m
k \ i - t + 1
a ) ----------
/ i
-------- a
i = t
Because of normalization, m1>0.
QuED..