Velocity Reviews > Semi OT: Binary representation of floating point numbers

Semi OT: Binary representation of floating point numbers

Dilip
Guest
Posts: n/a

 12-27-2006

Hello Wise men

This question is purely academic as I am just trying to understand this
stuff a little better. Therefore please free to instruct me to other
suitabe NGs if necessary.

Recently I started a thread (at comp.lang.c) on converting a single
precision floating point number to double precision after running into
a problem at work. Little did I realize it would spawn close to 100
replies. Needless to say I learnt a lot in the discussion.

Considering I will probably run into many such problems in the future I
set out to study the IEEE 754 representation of these numbers. I have
understood it to an extent but some gaps remain. I would be extremely
grateful if one of you more knowledgeable types can set me straight.

What I understand so far: (for single precision)

IEEE says the following is the bit pattern of such numbers:

<1 bit for sign><8 bits for exponent><23 bits for significand>

Mathematically a normalized binary floating point number is represented
as:

(-1)^s * 1.f * 2^(e-127)

where s is the sign bit
f is the 23 bits of significand **fraction**
and e is the biased exponent which can range from 0 to 255

So given a number that is represented like this:

1.00000000000000000000001 * 2^18

which has 23 zeros following the binary decimal point.

we turn that into:

1000000000000000000.00001

To convert that into decimal:

1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
which is basically
262,144.03125

In other words the bit pattern for such a number will look like this:

0 10010001 00000000000000000000001

(sign bit, exponent and significand fraction) [exponent part is
10010001 because e-127 = 18 and hence e = 145 in decimal]

which in hex is 0x48800001

So far so good?

Now my question is given how do I go in the reverse direction?

Given a number 59.889999 how do I retrace back to what its binary
representation looks like?

I seem to be missing a step in between.

thanks!

Joe Wright
Guest
Posts: n/a

 12-28-2006
Dilip wrote:
> Hello Wise men
>
> This question is purely academic as I am just trying to understand this
> stuff a little better. Therefore please free to instruct me to other
> suitabe NGs if necessary.
>
> Recently I started a thread (at comp.lang.c) on converting a single
> precision floating point number to double precision after running into
> a problem at work. Little did I realize it would spawn close to 100
> replies. Needless to say I learnt a lot in the discussion.
>
> Considering I will probably run into many such problems in the future I
> set out to study the IEEE 754 representation of these numbers. I have
> understood it to an extent but some gaps remain. I would be extremely
> grateful if one of you more knowledgeable types can set me straight.
>
> What I understand so far: (for single precision)
>
> IEEE says the following is the bit pattern of such numbers:
>
> <1 bit for sign><8 bits for exponent><23 bits for significand>
>
> Mathematically a normalized binary floating point number is represented
> as:
>
> (-1)^s * 1.f * 2^(e-127)
>
> where s is the sign bit
> f is the 23 bits of significand **fraction**
> and e is the biased exponent which can range from 0 to 255
>
> So given a number that is represented like this:
>
> 1.00000000000000000000001 * 2^18
>
> which has 23 zeros following the binary decimal point.
>
> we turn that into:
>
> 1000000000000000000.00001
>
> To convert that into decimal:
>
> 1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
> which is basically
> 262,144.03125
>
> In other words the bit pattern for such a number will look like this:
>
> 0 10010001 00000000000000000000001
>
> (sign bit, exponent and significand fraction) [exponent part is
> 10010001 because e-127 = 18 and hence e = 145 in decimal]
>
> which in hex is 0x48800001
>
> So far so good?
>
> Now my question is given how do I go in the reverse direction?
>
> Given a number 59.889999 how do I retrace back to what its binary
> representation looks like?
>
> I seem to be missing a step in between.
>
> thanks!
>

Our usual IEEE float consists of a 24-bit mantissa, an 8-bit exponent
and a one-bit sign. Yes, I know that's 33 bits but bear with me a while.
The mantissa is assumed to be normalized which means the msb of its
value is shifted into the msb of the mantissa (b23). Well, if b23 is
always 1 there is no reason to inspect it. We'll assume it's 1 and
assign the low order exponent bit to b23. Because of this trick we get
24 mantissa bits and 8 exponent bits into 31 bits. The sign is still its
own bit.

The exponent is said to be unsigned but for sake of argument it is
biased to a value of 126.

Regard the value 59.

b31 (sign) is positive
01000010 01101100 00000000 00000000 The binary of the float itself.
Exp = 132 (6) b30..b23 is the exponent. 132 - 126 = 6
00000110
Man = .11101100 00000000 00000000 Mantissa with b23 forced 1.
Now shifting the mantissa left 6 (the exponent) we get..
111011.00 which is 59 or..
5.90000000e+01

As float precise to 24 bits.
01000010 01101111 10001111 01011100
Exp = 132 (6)
00000110
Man = .11101111 10001111 01011100
5.98899994e+01

Converted to double, still precise to 24 bits. Conversion does not add
precision.
01000000 01001101 11110001 11101011 10000000 00000000 00000000 00000000
Exp = 1028 (6)
000 00000110
Man = .11101 11110001 11101011 10000000 00000000 00000000 00000000
5.9889999389648438e+01

Now with the x*=1000, x/=1000 trick. Added precision is an illusion?
01000000 01001101 11110001 11101011 10000101 00011110 10111000 01010010
Exp = 1028 (6)
000 00000110
Man = .11101 11110001 11101011 10000101 00011110 10111000 01010010
5.9890000000000001e+01

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Joe Wright
Guest
Posts: n/a

 12-28-2006
Joe Wright wrote:
> Dilip wrote:
>> Hello Wise men
>>
>> This question is purely academic as I am just trying to understand this
>> stuff a little better. Therefore please free to instruct me to other
>> suitabe NGs if necessary.
>>
>> Recently I started a thread (at comp.lang.c) on converting a single
>> precision floating point number to double precision after running into
>> a problem at work. Little did I realize it would spawn close to 100
>> replies. Needless to say I learnt a lot in the discussion.
>>
>> Considering I will probably run into many such problems in the future I
>> set out to study the IEEE 754 representation of these numbers. I have
>> understood it to an extent but some gaps remain. I would be extremely
>> grateful if one of you more knowledgeable types can set me straight.
>>
>> What I understand so far: (for single precision)
>>
>> IEEE says the following is the bit pattern of such numbers:
>>
>> <1 bit for sign><8 bits for exponent><23 bits for significand>
>>
>> Mathematically a normalized binary floating point number is represented
>> as:
>>
>> (-1)^s * 1.f * 2^(e-127)
>>
>> where s is the sign bit
>> f is the 23 bits of significand **fraction**
>> and e is the biased exponent which can range from 0 to 255
>>
>> So given a number that is represented like this:
>>
>> 1.00000000000000000000001 * 2^18
>>
>> which has 23 zeros following the binary decimal point.
>>
>> we turn that into:
>>
>> 1000000000000000000.00001
>>
>> To convert that into decimal:
>>
>> 1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
>> which is basically
>> 262,144.03125
>>
>> In other words the bit pattern for such a number will look like this:
>>
>> 0 10010001 00000000000000000000001
>>
>> (sign bit, exponent and significand fraction) [exponent part is
>> 10010001 because e-127 = 18 and hence e = 145 in decimal]
>>
>> which in hex is 0x48800001
>>
>> So far so good?
>>
>> Now my question is given how do I go in the reverse direction?
>>
>> Given a number 59.889999 how do I retrace back to what its binary
>> representation looks like?
>>
>> I seem to be missing a step in between.
>>
>> thanks!
>>

>
> Our usual IEEE float consists of a 24-bit mantissa, an 8-bit exponent
> and a one-bit sign. Yes, I know that's 33 bits but bear with me a while.
> The mantissa is assumed to be normalized which means the msb of its
> value is shifted into the msb of the mantissa (b23). Well, if b23 is
> always 1 there is no reason to inspect it. We'll assume it's 1 and
> assign the low order exponent bit to b23. Because of this trick we get
> 24 mantissa bits and 8 exponent bits into 31 bits. The sign is still its
> own bit.
>
> The exponent is said to be unsigned but for sake of argument it is
> biased to a value of 126.
>
> Regard the value 59.
>
> b31 (sign) is positive
> 01000010 01101100 00000000 00000000 The binary of the float itself.
> Exp = 132 (6) b30..b23 is the exponent. 132 - 126 = 6
> 00000110
> Man = .11101100 00000000 00000000 Mantissa with b23 forced 1.
> Now shifting the mantissa left 6 (the exponent) we get..
> 111011.00 which is 59 or..
> 5.90000000e+01
>
>
> As float precise to 24 bits.
> 01000010 01101111 10001111 01011100
> Exp = 132 (6)
> 00000110
> Man = .11101111 10001111 01011100
> 5.98899994e+01
>
> Converted to double, still precise to 24 bits. Conversion does not add
> precision.
> 01000000 01001101 11110001 11101011 10000000 00000000 00000000 00000000
> Exp = 1028 (6)
> 000 00000110
> Man = .11101 11110001 11101011 10000000 00000000 00000000 00000000
> 5.9889999389648438e+01
>
> Now with the x*=1000, x/=1000 trick. Added precision is an illusion?
> 01000000 01001101 11110001 11101011 10000101 00011110 10111000 01010010
> Exp = 1028 (6)
> 000 00000110
> Man = .11101 11110001 11101011 10000101 00011110 10111000 01010010
> 5.9890000000000001e+01
>

I forgot to note for the double, sign is 1-bit, exponent is 11-bits and
mantissa is 53-bits. Exponent bias is 1022.

--
Joe Wright
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Julian V. Noble
Guest
Posts: n/a

 12-28-2006
Dilip wrote:
> Hello Wise men
>
> This question is purely academic as I am just trying to understand this
> stuff a little better. Therefore please free to instruct me to other
> suitabe NGs if necessary.
>
> Recently I started a thread (at comp.lang.c) on converting a single
> precision floating point number to double precision after running into
> a problem at work. Little did I realize it would spawn close to 100
> replies. Needless to say I learnt a lot in the discussion.
>
> Considering I will probably run into many such problems in the future I
> set out to study the IEEE 754 representation of these numbers. I have
> understood it to an extent but some gaps remain. I would be extremely
> grateful if one of you more knowledgeable types can set me straight.
>
> What I understand so far: (for single precision)
>
> IEEE says the following is the bit pattern of such numbers:
>
> <1 bit for sign><8 bits for exponent><23 bits for significand>
>
> Mathematically a normalized binary floating point number is represented
> as:
>
> (-1)^s * 1.f * 2^(e-127)
>
> where s is the sign bit
> f is the 23 bits of significand **fraction**
> and e is the biased exponent which can range from 0 to 255
>
> So given a number that is represented like this:
>
> 1.00000000000000000000001 * 2^18
>
> which has 23 zeros following the binary decimal point.
>
> we turn that into:
>
> 1000000000000000000.00001
>
> To convert that into decimal:
>
> 1 * 2^18 +...... +..... +..... + 1 * 1/2^5 which is 262144 + 0.03125
> which is basically
> 262,144.03125
>
> In other words the bit pattern for such a number will look like this:
>
> 0 10010001 00000000000000000000001
>
> (sign bit, exponent and significand fraction) [exponent part is
> 10010001 because e-127 = 18 and hence e = 145 in decimal]
>
> which in hex is 0x48800001
>
> So far so good?
>
> Now my question is given how do I go in the reverse direction?
>
> Given a number 59.889999 how do I retrace back to what its binary
> representation looks like?
>
> I seem to be missing a step in between.
>
> thanks!
>

Have a look at "What every computer scientist should know about floating
point arithmetic" by D. Goldberg. It's available on the web in various

http://galileo.phys.virginia.edu/cla...51.jvn.fall01/

--
Julian V. Noble
Professor Emeritus of Physics
University of Virginia

Dik T. Winter
Guest
Posts: n/a

 12-28-2006
In article <(E-Mail Removed). com> "Dilip" <(E-Mail Removed)> writes:
....
> Now my question is given how do I go in the reverse direction?
>
> Given a number 59.889999 how do I retrace back to what its binary
> representation looks like?

First convert it to binary. You will find that it does not have a finite
binary expansion but it has a repeating expansion when you go far enough.
But when converting it to float you need initially only 24 + some extra
bits. The following 29 will do nicely:
111011.11100011110101101111100
If you truncate that to 23 bits you get:
111011.111000111101011011
that is to small, the next larger representable number is:
111011.111000111101011100
which is too large, but the second is closer to the original
than the first.
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/

christian.bau
Guest
Posts: n/a

 12-28-2006
Dilip wrote:

> Mathematically a normalized binary floating point number is represented
> as:
>
> (-1)^s * 1.f * 2^(e-127)
>
> where s is the sign bit
> f is the 23 bits of significand **fraction**
> and e is the biased exponent which can range from 0 to 255

> Given a number 59.889999 how do I retrace back to what its binary
> representation looks like?

To do this by hand: First you want (-1)^s, 1.f and (e-127). With the
example 59.889999:

Clearly (-1)^s = 1, so s = 0.
Since 2^5 <= 59.889999 < 2^6, e-127 equals 5, or e = 132.
1.f * 2^5 = 59.889999, or 1.f = 59.889999 / 2^5.
1.f is a multiple of 2^-23, so 2^23 * (1.f) is an integer.

2^23 * 1.f is the integer nearest to (59.889999 / 2^5) * 2^23 which is
about 15699803.898, so 2^23 * 1.f equals 15,699,804. Write this as a 24
bit binary number; the first binary bit must be a 1.

You need to be careful if you start with a value just below a power of
two, like 63.999999: The value 2^23 * 1.f that you calculate is always
less than 2^24, but rounding to the nearest integer could round it to
2^24 which would make 1.f = 2. In that case you have to choose e one
higher, and 1.f = 1.

David T. Ashley
Guest
Posts: n/a

 12-28-2006
"Dilip" <(E-Mail Removed)> wrote in message
news:(E-Mail Removed) ups.com...
>
> Given a number 59.889999 how do I retrace back to what its binary
> representation looks like?

I'm assuming you can calculate the binary representation of the integer
"59".

Since you know where the radix point is, you also know the exponent and the
number of bits you have after the radix point to express the fractional
part.

Assume you have 10 bits after the radix point ...

Then you just solve:

889999 / 1000000 = x / 2^10.

and convert x to binary.

Chris Torek
Guest
Posts: n/a

 12-28-2006
In article <(E-Mail Removed). com>
Dilip <(E-Mail Removed)> wrote:
>Considering I will probably run into many such problems in the future I
>set out to study the IEEE 754 representation of these numbers. I have
>understood it to an extent but some gaps remain. ...

You have the basics down, it seems. Others have already covered
the basics on conversions from decimal representations.

My article at <http://web.torek.net/torek/c/numbers.html> is
intentionally somewhat lightweight but may also be helpful. In
particular, note the special cases for zero, "sub-normal" or
"denormalized" numbers, Infinities, and NaNs.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
Reading email is like searching for food in the garbage, thanks to spammers.

Ernie Wright
Guest
Posts: n/a

 12-28-2006
Dilip wrote:

> Now my question is given how do I go in the reverse direction?
>
> Given a number 59.889999 how do I retrace back to what its binary
> representation looks like?

Despite the impression you might get from some of the answers, it's
actually fairly easy to do this, and it doesn't require doing any binary
arithmetic, or in fact anything you can't do pretty quickly with an
ordinary calculator.

Think of a floating-point number f as

f = -1^s * (1 + m/2^b) * 2^e

where s is a sign bit, m is (part of) the mantissa or significand, e is
the exponent, and b is the number of bits of precision. For 32-bit IEEE
floats, b is 23. We start out by finding s, m, and e and worry about
how they fit into the binary representation later.

1. Find the sign. Trivial, but by peeling this off first, the rest of
the work can be done using the absolute value of f.

s = SIGN(f) 1 if negative, otherwise 0
f = |f|

2. Find the base-2 log of f.

Most calculators don't have a button for this, but it's easy to do with
the log buttons they do have.

log2(f) = log(f) / log(2) = 5.9042432031146...

3. Round this DOWN to find the exponent e.

e = 5

This is the largest power of 2 that's less than f.

4. To find the significand, divide f by 2^e.

59.889999 / 32 = 1.87156246875

At this point you have (the absolute value of) f in the form of a
significand between 1.0 and 2.0, multiplied by a power of 2.

1.87156246875 * 2^5

Now you have to represent each component as an integer and pack them
into 32 bits.

5. The sign is the single bit s.

6. The exponent is 8 bits equal to e + 127.

7. The significand integer m is 23 bits:

m = round(( significand - 1 ) * 2^23 )
= round(( 1.87156246875 - 1 ) * 2^23 )
= round( 7311195.897856 )
= 7311196

If, after rounding, m == 2^23, set m = 0 and increment e.

At this point, you can check that you've got the right s, e, and m by
plugging them back into the formula.

1 * (1 + 7311196 / 838860 * 32 = 59.8899993896484375

8. Finally, pack these three integers into 32 bits.

s << 31 | (e + 127) << 23 | m

s * 2^31 + ((e + 127) * 2^23) + m

The procedure for doubles is nearly identical. Use b = 52 and build the
bit pattern using

s << 63 | (e + 1023) << 52 | m

- Ernie http://home.comcast.net/~erniew