A __number__ can be thought of as a projection, used for example for measuring, be it time or distance, or for counting. A number is a discrete way of measuring, discrete as opposed to continuous, which can be useful under certain conditions.

In mathematics, __multiple kinds of numbers__ exist. There are, for example, the *nonnegative whole numbers*, such as `0` or `1`, which are represented by the set `N`.

There are the __integer numbers__, which are formed of `0`, the negative whole numbers such as `-1` or `-2`, and the positive whole numbers. The integer numbers are represented by the set `Z`.

There are also the __rational numbers__, of the form `p/q`, where `p` and `q` are integers and `q` is different from `0`. Rational numbers are represented by the set `Q`.

Numbers are represented in the computer by using an algorithm. The set `N` is represented by using the unsigned number representation, the set `Z` is represented by using the signed number representation, and the set `Q` is represented using the floating point number representation.


## Limitations of positional numeral systems

Some numbers cannot be represented using a __finite sequence of fractional digits__ in a positional numeral system.

__Irrational numbers__, such as `pi`, simply cannot be represented using a finite sequence of digits in a positional numeral system.

__Rational numbers__ of the form `p/q` cannot be represented using a limited sequence of fractional digits, such as `.12`, when none of the remainders `1, 2 ... q-1` that occur in the long division of `p` by `q`, multiplied by the chosen base, is divisible by `q`.

As an example, `1/4` can be represented in the __decimal positional numeral system__, because the long division reaches the remainder `2`, and `10 * 2` is divisible by `4`. As such, `1/4` can be written in the decimal positional numeral system as `0.25`.

`1/3` is representable in the __ternary positional numeral system__, because when borrowing in this case we are borrowing multiples of 3, so `1/3` can be written as `0.1` in base `3`.

`1/3` has a __repeating representation__ in the decimal positional numeral system, because neither `10 * 1` nor `10 * 2` is divisible by `3`. As such, `1/3` is represented in the decimal positional numeral system as the repeating sequence `0.33333...`.

It can be proven that __any repeating fractional number__ can be written as a rational number.
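As a sketch of why a repeating fraction is rational, take the repeating decimal `0.333...` and call it `x` (the symbol `x` is introduced here only for illustration). Multiplying by the base shifts the digits by one place, and subtracting removes the repeating part:

$$
\begin{aligned}
x &= 0.333\ldots \\
10x &= 3.333\ldots \\
10x - x &= 3 \\
x &= \frac{3}{9} = \frac{1}{3}
\end{aligned}
$$

The same steps work for a longer repeating block, by multiplying with a higher power of the base.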

__Problems arise__ when limiting the fractional part of a number in a positional numeral system to a limited number of digits.

When the number of digits is limited, only a limited number of fractional values can be generated. The question to ask, hence, is how to represent a fractional value that is not generated: by one of the generated values, or by stating that it cannot be represented.

Let’s take as an example the set of fractional parts generated in base 2, *when limiting the number of fractional bits to three*. This set has eight values, from `.000` to `.111`, spaced `0.125` apart.

`0.1` in base 10 has a repeating representation in base `2`: `0.000110011...`.

`0.1` is smaller than any nonzero value in the selected base `2` subset, which has a limited number of fractional digits; as such, it cannot be represented in this subset. This is __called an underflow__.

Larger values, that is values larger than `0.1` which are not present in the generated set, can only be represented by approximation. `0.4`, for example, can be represented as `.011`, which is equal to `0.375` in decimal. The __difference between the stored value and the actual value__ is as such `0.4 - 0.375`, which is equal to `0.025`.

The stored value of `0.57`, when subtracted from the stored value of `0.77`, is equal to `.110 - .100`, which is equal to `.010`. The stored value of `0.3` is `.010`; as such, in our representation of the fractional parts in base 2, `0.77 - 0.57` is equal to `0.3` and is *not equal to* `0.2`.

Adding more bits to the fractional part will __improve the precision__ of the represented numbers and of the calculations. For example, when `4` bits are used for the fractional part, `0.1` in decimal can be represented as `0.0001` in binary, which is equal to `0.0625` in decimal.

Even with a larger number of digits, the __problems discussed earlier will always persist__. Some numbers, such as the irrational numbers, cannot be represented accurately using a limited number of digits, and with a limited number of digits only a limited number of fractional parts is available, hence a limited precision.

## IEEE floating point format

When using the IEEE floating point number representation, __in addition to the precision__, which is the number of bits selected for the fractional part, there is the sign bit and the exponent bits.

What this means is that additional values can be generated because of the presence of the exponent and sign bits; simply put, this is __just a way to improve__ the precision and the range of the values that can be represented.

An IEEE floating point number has a bit representation and a numerical value. The IEEE specifies the __bit format__ that a floating point number must have, and by which its numerical value is to be interpreted. This bit format can be adapted to different word lengths. A word is formed of a number of bits, so this format can be put to work on 32 bits, 64 bits, 128 bits, and so on. It works the same way for each.

A __single precision__ floating point number has a 32 bit representation, a *double precision* floating point number has a 64 bit representation, and a *quadruple precision* floating point number has a 128 bit representation.

### Infinite values

When the exponent bits are all set to `1`, and the precision bits are all set to `0`, the floating point number bit sequence represents an infinite value.

There are __two infinite values__, positive and negative infinity. When the sign bit is `0`, this is positive infinity; when the sign bit is `1`, this is negative infinity.

Infinity __can be caused__, for example, by dividing a nonzero number by `0`, or when the result of an operation, after rounding, overflows to the infinite value.

### Not A Number value

When the exponent bits are all set to `1`, and the precision bits are set to anything but `0`, the bit sequence of the floating point number __represents__ `NaN`.

The value `NaN` means not a number, and __it can happen__, for example, when dividing `0` by `0`, or when adding positive and negative infinity.

There is no negative or positive `NaN`, so the __sign bit does not matter__.

### Denormalized values

When the exponent bits are all set to `0`, the floating point number is in what is called a __denormalized form__.

__The rational value__ of the floating point number is equal to the sign value, multiplied by `2` to the power of the exponent value, multiplied by the precision value.

The __sign value__ is either `1`, if the sign bit value is `0`, or `-1`, if the sign bit value is `1`.

The __exponent value__ is calculated using the formula `1 - bias`. The bias is calculated using the formula `2^(k - 1) - 1`, where `k` is the number of exponent bits.

The __precision value__ is the same as the precision bits value. The precision bits value is calculated as a fractional positional binary number.

__To illustrate this__, let us say that the floating point numbers are encoded using `7` bits: `1` bit is a sign bit, `3` bits are exponent bits, and `3` bits are precision bits.

The __sign value is__ equal to `1` when the sign bit is `0`, and to `-1` when the sign bit is `1`.

The __exponent value__ is equal to `1 - bias`, which is equal to `1 - 3`, that is `-2`.

The __precision bits values__ are calculated as positional binary fractions, so they range from `.000`, which is `0`, to `.111`, which is `0.875`, in steps of `0.125`.

The __precision values__ are the same as the precision bits values.

As such, when in denormalized form, the __possible rational values__ of the `7` bit floating point numbers are the sign value, multiplied by `2` to the power of `-2`, multiplied by each of the precision values; they range in magnitude from `0` to `0.21875`.

### Normalized values

__When the exponent bits__ are neither all set to `0` nor all set to `1`, the floating point number bits are in normalized form.

The floating point number __rational values__ are equal to the sign value, multiplied by two to the power of the exponent value, multiplied by the precision value.

The __sign value__ is equal to `1` when the sign bit is `0`, and is equal to `-1` when the sign bit is set to `1`.

The __exponent value__ is equal to the exponent bits value minus the bias. The exponent bits value is calculated as if the exponent bits are in the binary positional numeral system. The bias is calculated as it was calculated in denormalized form.

The __precision value__ is equal to one plus the precision bits value. The precision bits value is calculated as if the precision bits are a fractional base two number.

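To __illustrate this__, let’s say that the floating point number is encoded using `7` bits: `1` bit is used for the sign, `3` bits are used for the exponent, and `3` bits are used for the precision.

The __sign value is equal__ to `1` when the sign bit is `0`, and `-1` when the sign bit is `1`.

The __possible exponent values__ are the exponent bits values `1` to `6`, minus the bias `3`, that is `-2` to `3`.

The __possible precision values__ are one plus the precision bits values, that is `1` to `1.875` in steps of `0.125`.

The possible __normalized form rational values__ of the 7 bit floating point number are the sign value, multiplied by two to the power of each exponent value, multiplied by each precision value; they range in magnitude from `0.25` to `15`.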

### Visualizing the possible floating point values when the encoding is done on 7 bits

### Commutativity, associativity, and distributivity

IEEE floating point addition is commutative: `a + b = b + a`. It is not associative: `( 5 + 1e40 ) - 1e40 != 5 + ( 1e40 - 1e40 )`, because `0 != 5`.

IEEE floating point multiplication is commutative: `a * b = b * a`. It is not associative: `1e40 * ( 1e300 * 1e-300 ) != ( 1e40 * 1e300 ) * 1e-300`, because `1e40 != Infinity`. It is not distributive over addition and subtraction: `0 * ( 1e308 + 1e308 ) != ( 0 * 1e308 ) + ( 0 * 1e308 )`, because `NaN != 0`, and `1e30 * ( 1e300 - 1e300 ) != 1e30 * 1e300 - 1e30 * 1e300`, because `0 != NaN`.

IEEE floating point division is not commutative: `( 1.0 / 2 ) != ( 2.0 / 1 )`. It is not associative: `( 1.0 / 2 ) / 3 != 1 / ( 2.0 / 3 )`, because `0.1666 != 1.5`.

IEEE floating point subtraction is not commutative: `1.0 - 2 != 2 - 1.0`. It is not associative: `-4 - ( -4 - -3.0 ) != ( -4 - -4.0 ) - -3`, because `-4 - ( -4 - -3 ) = -4 - -1 = -3`, while `( -4 - -4 ) - -3 = 0 - -3 = 3`.

## Floating points format in C

Floating point formats are used to represent real, or rational, numbers in the computer. The C standard has __three data types__ to be used with a floating point standard; they are `float`, `double`, and `long double`.

The C standard does not specify __which floating point standard__ is to be used, but usually the IEEE floating point format is used; as such, `float` is mapped to the IEEE single precision floating point format, and `double` is mapped to the IEEE double precision floating point format.

By default, a floating point literal such as `1.0` has the type `double`, unless suffixed with `f`, in which case it will have the type `float`, or suffixed with `l`, in which case it will have the type `long double`.

To *detect if a floating point value is* `NaN`, `isnan` can be used, and to detect if a floating point number is infinity, `isinf` can be used.

`isnan` and `isinf`, which C provides as macros, are both defined in the `math.h` header.

    #include <stdio.h>
    #include <math.h>

    int main ( void ){
        printf( "%d\n" , isnan( 0.0 / 0.0 ) != 0 );
        /* 0.0 divided by 0.0 , results in not a number ,
           as such isnan returns a nonzero value . */
        // Output : 1

        printf( "%d\n" , isinf( 1.0 / 0.0 ) != 0 );
        /* 1.0 divided by 0.0 results in + infinity ,
           as such isinf returns a nonzero value . */
        // Output : 1

        return 0;
    }