Bordley: Decompose floating-point number

Tuesday, 27 August 2013

Given a floating-point number, I would like to separate it into a sum of
parts, each with a given number of bits. For example, given 3.1415926535
and told to separate it into base-10 parts of 4 digits each, it would
return 3.141 + 5.926E-4 + 5.350E-8. Actually, I want to separate a double
(which has 53 bits of precision, 52 of them stored explicitly) into three
parts with 18 bits of precision each, but it was easier to explain with a
base-10 example. I am
not necessarily averse to tricks that use the internal representation of a
standard double-precision IEEE float, but I would really prefer a solution
that stayed purely in the floating-point realm so as to avoid any issues
with endianness or non-standard floating-point representations.
No, this is not a homework problem, and, yes, this has a practical use. If
you want to ensure that floating-point multiplications are exact, you need
to make sure that each of the two numbers you multiply has no more than
half the digits that your floating-point type has room for.
Starting with this kind of decomposition, then multiplying all the parts
and convolving, is one way to do that. Yes, I could also use an
arbitrary-precision floating-point library, but this approach is likely to
be faster when only a few parts are involved, and it will definitely be
lighter-weight.
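
To make the goal concrete, here is a minimal sketch of one way such a split might look. It is written in Python, assuming its float is an IEEE double (true on common platforms) and that the input is finite. The names split_bits and fits_in_bits are made up for illustration. The sketch uses only math.frexp, math.ldexp and truncation, so it never inspects the byte layout of the number.

```python
import math
from fractions import Fraction


def split_bits(x: float, bits: int = 18, parts: int = 3) -> list[float]:
    """Split a finite x into `parts` floats of at most `bits` significant bits each.

    The parts sum back to x exactly when bits * parts covers the 53-bit
    significand of a double (for example 3 parts of 18 bits).
    """
    out = []
    rest = x
    for _ in range(parts):
        if rest == 0.0:
            out.append(0.0)
            continue
        m, e = math.frexp(rest)        # rest == m * 2**e with 0.5 <= |m| < 1
        scaled = math.ldexp(m, bits)   # shift the top `bits` bits above the point
        head = math.ldexp(float(math.trunc(scaled)), e - bits)
        out.append(head)
        rest -= head                   # exact: this only strips leading bits
    return out


def fits_in_bits(v: float, bits: int) -> bool:
    """Return True if v can be written with at most `bits` significant bits."""
    if v == 0.0:
        return True
    m, _ = math.frexp(v)
    return math.ldexp(m, bits).is_integer()


pieces = split_bits(3.1415926535)
print(pieces)
print(all(fits_in_bits(p, 18) for p in pieces))

# Products of two 18-bit pieces need at most 36 bits, so they should be exact;
# Fraction(float) converts exactly, which lets us check that claim.
print(all(Fraction(a * b) == Fraction(a) * Fraction(b)
          for a in pieces for b in pieces))
```

Truncating toward zero keeps every piece the same sign as the input. Rounding to nearest instead would give pieces of mixed sign, which might save a bit but makes the bookkeeping slightly less obvious.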
