A number can be thought of as a projection, used for example for measuring, be it time or distance, or for counting. A number is a discrete way of measuring, discrete as opposed to continuous, which can be useful under certain conditions.
In mathematics, multiple kinds of numbers exist. There are, for example, the nonnegative whole numbers, such as 0 or 1, which are represented by the set N.
There are the integers, which are formed of 0, the negative whole numbers such as -1 or -2, and the positive whole numbers. The integers are represented by the set Z.
There are also the rational numbers, of the form p/q, where p and q are integers and q is different from 0. Rational numbers are represented by the set Q.
Numbers are represented in the computer by using an encoding. A finite subset of N is represented using the unsigned number representation, a finite subset of Z is represented using the signed number representation, and a finite subset of Q is represented using the floating point number representation.
Limitations of positional numeral systems
Some numbers cannot be represented using a finite sequence of fractional digits in a positional numeral system.
Irrational numbers, such as pi, cannot be represented using a finite sequence of digits in a positional numeral system with an integer base.
A rational number p/q, written in lowest terms, cannot be represented using a finite sequence of fractional digits, such as .12, when q has a prime factor that does not divide the chosen base. Equivalently, the representation is finite only when q divides some power of the base.
As an example, 1/4 can be represented in the decimal positional numeral system, because the only prime factor of 4 is 2, and 2 divides 10. Indeed, 10 * 10 = 100 is divisible by 4, so 1/4 = 25/100, which is written in the decimal positional numeral system as 0.25.
1/3 is representable in the ternary positional numeral system, because 3 divides the base 3, so 1/3 can be written as 0.1 in base 3.
1/3 has a repeating representation in the decimal positional numeral system, because 3 does not divide 10, and so does not divide any power of 10. As such, 1/3 is written in the decimal positional numeral system as the repeating sequence 0.33333... .
It can be proven that any repeating fractional number can be written as a rational number. For example, if x = 0.33333..., then 10 * x = 3.33333..., so 10 * x - x = 3, and x = 3/9 = 1/3.
Problems arise when limiting the fractional part of a number, in a positional numeral system, to a limited number of digits.
When the number of digits is limited, only a limited number of fractional values can be generated. The question to ask, hence, is how to represent a fractional value that is not generated: by one of the generated values, or by stating that it cannot be represented.
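Let’s take as an example the set of fractional parts generated in base 2, when limiting the number of fractional bits to three. The eight generated values are .000, .001, .010, .011, .100, .101, .110, and .111, which are equal in decimal to 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, and 0.875.
0.1 in base 10 has the following repeating representation in base 2: 0.0001100110011... .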
0.1 is smaller than any nonzero value in the selected base 2 subset, whose smallest nonzero value is .001, or 0.125 in decimal. As such, it cannot be represented in this subset. This is called an underflow.
Values larger than 0.1 that are not present in the generated set can only be represented by approximation. 0.4, for example, can be represented, by dropping the bits that do not fit, as .011, which is equal to 0.375 in decimal. The difference between the actual value and the stored value is then 0.4 - 0.375, which is equal to 0.025.
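When the bits that do not fit are dropped, the stored value of 0.77 is .110, and the stored value of 0.57 is .100. The stored value of 0.57, when subtracted from the stored value of 0.77, is equal to .110 - .100, which is equal to .010. The stored value of 0.3 is also .010, whereas the stored value of 0.2 is .001. As such, in our representation of the fractional parts in base 2, 0.77 - 0.57 is equal to 0.3, and is not equal to 0.2.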
Adding more bits to the fractional part will improve the precision of the represented numbers and of the calculations. For example, when 4 bits are used for the fractional part, 0.1 in decimal no longer underflows: it can be approximated as 0.0001 in binary, which is equal to 0.0625 in decimal.
Even with a larger number of digits, the problems discussed earlier persist. Some numbers, such as the irrational numbers, cannot be represented exactly using a limited number of digits. And with a limited number of digits, only a limited number of fractional parts is available, hence a limited precision.
IEEE floating point format
When using the IEEE floating point number representation, in addition to the precision bits, which are the bits selected for the fractional part, there are a sign bit and the exponent bits.
What this means is that additional values can be generated, because of the presence of the exponent and the sign bit. Simply put, the exponent scales the fractional part by a power of two, which extends the range of magnitudes that can be represented with the same number of precision bits.
An IEEE floating point number has a bit representation and a numerical value. The IEEE specifies the bit format that a floating point number must have, and by which its numerical value is to be interpreted. This bit format can be adapted to different word lengths. A word is formed of a number of bits, so this format can be put to work on 32 bits, 64 bits, 128 bits, and so on. It works the same way in each case.
A single precision floating point number has a 32-bit representation, a double precision floating point number has a 64-bit representation, and a quadruple precision floating point number has a 128-bit representation.
Infinite values
When the exponent bits are all set to 1, and the precision bits are all set to 0, the bit sequence of the floating point number represents an infinite value.
There are two infinite values, positive and negative infinity. When the sign bit is 0, this is positive infinity; when the sign bit is 1, this is negative infinity.
Infinity can result, for example, from dividing a nonzero number by 0, or from an operation whose result, after rounding, is too large in magnitude to be represented as a finite value.
Not A Number value
When the exponent bits are all set to 1, and the precision bits are set to anything but all 0, the bit sequence of the floating point number represents NAN.
The value NAN means not a number, and it can result, for example, from dividing 0 by 0, or from adding positive and negative infinity.
There is no negative or positive NAN, so the sign bit does not matter.
Denormalized values
When the exponent bits are all set to 0, the floating point number is in what is called a denormalized form.
The rational value of the floating point number is equal to the sign value, multiplied by 2 to the power of the exponent value, multiplied by the precision value.
The sign value is either 1, if the sign bit is 0, or -1, if the sign bit is 1.
The exponent value is calculated using the formula 1 - bias. The bias is calculated using the formula: bias = 2^(k - 1) - 1, where k is the number of exponent bits.
The precision value is the same as the precision bits value. The precision bits value is calculated as a fractional positional binary number.
To illustrate this, let us say that the floating point numbers are encoded using 7 bits: 1 bit is a sign bit, 3 bits are exponent bits, and 3 bits are precision bits.
The sign value is equal to 1 when the sign bit is 0, and to -1 when the sign bit is 1.
The bias is equal to 2^(3 - 1) - 1, which is 3, so the exponent value is equal to 1 - bias, which is equal to -2.
The precision bits values are calculated as positional binary fractions: the bits 000, 001, 010, ..., 111 have the values 0, 0.125, 0.25, ..., 0.875. The precision values are the same as the precision bits values.
As such, when in denormalized form, the possible rational values of the 7-bit floating point numbers are the precision values multiplied by 2^-2: 0, 0.03125, 0.0625, 0.09375, 0.125, 0.15625, 0.1875, 0.21875, and their negatives.
Normalized values
When the exponent bits are neither all set to 0 nor all set to 1, the floating point number is in normalized form.
The rational value of the floating point number is equal to the sign value, multiplied by two to the power of the exponent value, multiplied by the precision value.
The sign value is equal to 1 when the sign bit is 0, and is equal to -1 when the sign bit is 1.
The exponent value is equal to the exponent bits value minus the bias. The exponent bits value is calculated as if the exponent bits are an unsigned number in the binary positional numeral system. The bias is calculated as it was in denormalized form.
The precision value is equal to one plus the precision bits value. The precision bits value is calculated as if the precision bits are a fractional base two number.
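To illustrate this, let’s say that the floating point number is encoded using 7 bits: 1 bit is used for the sign, 3 bits are used for the exponent, and 3 bits are used for the precision.
The sign value is equal to 1 when the sign bit is 0, and to -1 when the sign bit is 1.
The bias is equal to 2^(3 - 1) - 1, which is 3. The exponent bits 001 to 110 have the values 1 to 6, so the possible exponent values are: -2, -1, 0, 1, 2, and 3.
The possible precision values are: 1, 1.125, 1.25, 1.375, 1.5, 1.625, 1.75, and 1.875.
The possible normalized form rational values of the 7-bit floating point number are every product of a sign value, two to the power of an exponent value, and a precision value. The positive ones range from 0.25, which is 2^-2 * 1, to 15, which is 2^3 * 1.875.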
Visualizing the possible floating point values when the encoding is done on 7 bits
Commutativity, associativity, and distributivity
IEEE floating point addition
    commutative: a + b = b + a
    not associative: ( 5 + 1e40 ) - 1e40 != 5 + ( 1e40 - 1e40 ) because 0 != 5
IEEE floating point multiplication
    commutative: a * b = b * a
    not associative: 1e40 * ( 1e300 * 1e-300 ) != ( 1e40 * 1e300 ) * 1e-300 because 1e40 != Infinity
    not distributive over addition and subtraction:
        0 * ( 1e308 + 1e308 ) != ( 0 * 1e308 ) + ( 0 * 1e308 ) because NaN != 0
        1e30 * ( 1e300 - 1e300 ) != 1e30 * 1e300 - 1e30 * 1e300 because 0 != NaN
IEEE floating point division
    not commutative: ( 1.0 / 2 ) != ( 2.0 / 1 )
    not associative: ( 1.0 / 2 ) / 3 != 1 / ( 2.0 / 3 ) because 0.1666 != 1.5
IEEE floating point subtraction
    not commutative: 1.0 - 2 != 2 - 1.0
    not associative: -4 - ( -4 - -3.0 ) != ( -4 - -4.0 ) - -3
        -4 - ( -4 - -3 ) = -4 - -1 = -3
        ( -4 - -4 ) - -3 = 0 - -3 = 3
Floating points format in C
Floating point formats are used to represent real, or rational, numbers in the computer. The C standard has three data types to be used with a floating point format: float, double, and long double.
The C standard does not specify which floating point format is to be used, but usually the IEEE floating point format is used. As such, float is usually mapped to the IEEE single precision floating point format, and double is usually mapped to the IEEE double precision floating point format.
By default, a floating point literal such as 1.0 has the type double, unless it is suffixed with f or F, in which case it has the type float, or suffixed with l or L, in which case it has the type long double.
To detect if a floating point value is NaN, the isnan macro can be used, and to detect if a floating point number is infinity, the isinf macro can be used. Both return a nonzero value when the test succeeds.
The isnan and isinf macros are both defined in the math.h header.
#include <stdio.h>
#include <math.h>

int main(void) {
    /* 0.0 divided by 0.0 results in not a number,
       as such isnan returns a nonzero value. */
    printf("%d\n", isnan(0.0 / 0.0) != 0);
    // Output : 1

    /* 1.0 divided by 0.0 results in + infinity,
       as such isinf returns a nonzero value. */
    printf("%d\n", isinf(1.0 / 0.0) != 0);
    // Output : 1

    return 0;
}