A __number__ can be thought of as a projection, used for example for measuring, be it time or distance, or for counting. A number is a discrete way of measuring, discrete as opposed to continuous, which can be useful under certain conditions.

In mathematics, __multiple kinds of numbers__ exist. There are, for example, the *nonnegative whole numbers*, such as `0` or `1`, which are represented by the set `N`.

There are the __integer numbers__, which are formed of `0`, the negative whole numbers such as `-1` or `-2`, and the positive whole numbers. The integer numbers are represented by the set `Z`.

There are also the __rational numbers__, of the form `p/q`, where `p` and `q` are integers and `q` is different from `0`. Rational numbers are represented by the set `Q`.

Numbers are represented in the computer by using an algorithm. The set `N` is represented using the unsigned number representation, the set `Z` using the signed number representation, and the set `Q` using the floating point number representation. In each case only a finite subset of the set can be represented, because a fixed number of bits is used.


## Limitations of positional numeral systems

Some numbers cannot be represented using a __finite sequence of fractional digits__ in a positional numeral system.

__Irrational numbers__, such as `pi`, simply cannot be represented using a finite sequence of digits in a positional numeral system with an integer base.

A __rational number__ of the form `p/q`, written in lowest terms, can be represented using a finite sequence of fractional digits, such as `.12`, only when `q` divides some power of the chosen base. This is the case exactly when every prime factor of `q` is also a prime factor of the base.

As an example, `1/4` can be represented in the __decimal positional numeral system__, because `10 * 10`, which is `100`, is divisible by `4`. As such, `1/4` can be written in the decimal positional numeral system as `0.25`.

`1/3` is representable in the __ternary positional numeral system__, because the base `3` is itself divisible by `3`: the first fractional digit counts multiples of `1/3`, so `1/3` can be written as `0.1` in base `3`.

`1/3` has a __repeating representation__ in the decimal positional numeral system, because no power of `10` is divisible by `3`. As such, `1/3` is represented in the decimal positional numeral system as the repeating sequence `0.33333...`.

It can be proven that __any repeating fractional number__ can be written as a rational number.
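As a worked illustration of the claim about repeating fractional numbers, the usual approach is to multiply the number by a power of the base that shifts it by one full period, then subtract the original so that the repeating tail cancels. The two numbers below are chosen here for illustration and are not taken from the original text:

$$
\begin{aligned}
x &= 0.333\ldots & y &= 0.121212\ldots \\
10x &= 3.333\ldots & 100y &= 12.121212\ldots \\
10x - x &= 3 & 100y - y &= 12 \\
x &= \frac{3}{9} = \frac{1}{3} & y &= \frac{12}{99} = \frac{4}{33}
\end{aligned}
$$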

__Problems arise__ when limiting the fractional part of a number, in a positional numeral system, to a fixed number of digits.

When the number of digits is limited, only a limited number of fractional values can be generated. The question to ask, hence, is how to represent a fractional value that is not generated: by one of the generated values, or by stating that it cannot be represented.

Let's take as an example the set of fractional parts generated in base 2 *when limiting the number of fractional bits to three*. The eight generated values are `.000`, `.001`, `.010`, `.011`, `.100`, `.101`, `.110` and `.111`, which are equal in decimal to `0`, `0.125`, `0.25`, `0.375`, `0.5`, `0.625`, `0.75` and `0.875`. In what follows, a value that is not in this set is stored by truncation, that is, by keeping only its first three fractional bits.

`0.1` in base 10 has the repeating representation `0.000110011...` in base `2`.

`0.1` is smaller than any nonzero value in this base `2` subset, the smallest of which is `.001`, or `0.125`. As such it cannot be represented in this subset. This is __called an underflow__.

Larger values that are not present in the generated set can only be represented by approximation. `0.4`, for example, is stored as `.011`, which is equal to `0.375` in decimal. The __difference between the__ actual value and the stored value is `0.4 - 0.375`, which is equal to `0.025`.

The stored value of `0.57`, when subtracted from the stored value of `0.77`, gives `.110 - .100`, which is equal to `.010`. The stored value of `0.3` is also `.010`, whereas the stored value of `0.2` is `.001`. As such, in our representation of the fractional parts in base 2, `0.77 - 0.57` is equal to `0.3` and is *not equal to* `0.2`.

Adding more bits to the fractional part will __improve the precision__ of the represented numbers and of the calculations. For example, when `4` bits are used for the fractional part, `0.1` in decimal can be approximated by `0.0001` in binary, which is equal to `0.0625`, instead of underflowing.

Even with a larger number of digits, the __problems discussed earlier will always persist__. Some numbers, such as the irrational numbers, cannot be represented exactly using a limited number of digits. A limited number of digits also yields only a limited number of fractional values, and as such a limited precision.

## IEEE floating point format

When using the IEEE floating point number representation, __in addition to the precision__ bits, which are the bits selected for the fractional part, there is a sign bit, and there are exponent bits.

What this means is that additional values can be generated because of the presence of the exponent and sign bits. The exponent scales the fractional part by a power of two, so, simply put, this is __just a way to improve__ the range and the precision that a given number of bits can offer.

An IEEE floating point number has a bit representation and a numerical value. The IEEE 754 standard specifies the __bit format__ that a floating point number must have, and by which its numerical value is to be interpreted. This bit format can be adapted to different word lengths. A word is formed of a number of bits, so the format can be put to work on 32 bits, 64 bits, 128 bits and so on. The interpretation works the same way; only the number of exponent and precision bits changes.

A __single precision__ floating point number has a 32 bit representation, a *double precision* floating point number has a 64 bit representation, and a *quadruple precision* floating point number has a 128 bit representation.

### Infinite values

When the exponent bits are all set to `1`, and the precision bits are all set to `0`, the bit sequence of the floating point number represents an infinite value.

There are __two infinite values__, positive and negative infinity. When the sign bit is `0`, this is positive infinity; when the sign bit is `1`, this is negative infinity.

Infinity __can be caused__, for example, by dividing a nonzero number by `0`, or by an operation whose result, after rounding, is too large in magnitude to be represented as a finite value.

### Not A Number value

When the exponent bits are all set to `1`, and the precision bits are set to anything but all `0`, the bit sequence of the floating point number __represents__ `NAN`.

The value `NAN` means not a number, and __it can happen__, for example, when dividing `0` by `0`, or when adding positive and negative infinity.

There is no negative or positive `NAN`, so the __sign bit does not matter__.

### Denormalized values

When the exponent bits are all set to `0`, the floating point number is in what is called a __denormalized form__.

__The rational value__ of the floating point number is equal to the sign value, multiplied by `2` to the power of the exponent value, multiplied by the precision value.

The __sign value__ is either `1`, if the sign bit value is `0`, or `-1`, if the sign bit value is `1`.

The __exponent value__ is calculated using the formula `1 - bias`. The bias is calculated using the formula `bias = 2^(k - 1) - 1`, where `k` is the number of exponent bits.

The __precision value__ is the same as the precision bits value. The precision bits value is calculated as a fractional positional binary number.

__To illustrate this__, let us say that the floating point numbers are encoded using `7` bits: `1` bit is a sign bit, `3` bits are exponent bits, and `3` bits are precision bits.

The __sign value is__ equal to `1` when the sign bit is `0`, and to `-1` when the sign bit is `1`.

The bias is equal to `2^(3 - 1) - 1`, which is `3`, so the __exponent value__ is equal to `1 - bias`, which is equal to `-2`.

The __precision bits values__ are calculated as positional binary fractions: `000`, `001`, `010`, `011`, `100`, `101`, `110` and `111` are equal to `0`, `0.125`, `0.25`, `0.375`, `0.5`, `0.625`, `0.75` and `0.875`.

The __precision values__ are the same as the precision bits values.

As such, when in denormalized form, the __possible rational values__ of the `7` bit floating point numbers are the precision values multiplied by `2^-2`, with either sign: `0`, `0.03125`, `0.0625`, `0.09375`, `0.125`, `0.15625`, `0.1875` and `0.21875`, and their negatives.

### Normalized values

__When the exponent bits__ are neither all set to `0` nor all set to `1`, the floating point number is in normalized form.

The __rational value__ of the floating point number is equal to the sign value, multiplied by two to the power of the exponent value, multiplied by the precision value.

The __sign value__ is equal to `1` when the sign bit is `0`, and is equal to `-1` when the sign bit is set to `1`.

The __exponent value__ is equal to the exponent bits value minus the bias. The exponent bits value is calculated as if the exponent bits were an unsigned number in the binary positional numeral system. The bias is calculated as it was in denormalized form.

The __precision value__ is equal to one plus the precision bits value. The precision bits value is calculated as if the precision bits were a fractional base two number.

To __illustrate this__, let's say that the floating point number is encoded using `7` bits: `1` bit is used for the sign, `3` bits are used for the exponent, and `3` bits are used for the precision.

The __sign value is equal__ to `1` when the sign bit is `0`, and `-1` when the sign bit is `1`.

The __possible exponent values__, for the exponent bits `001` to `110` and a bias of `3`, are `-2`, `-1`, `0`, `1`, `2` and `3`.

The __possible precision values__ are `1`, `1.125`, `1.25`, `1.375`, `1.5`, `1.625`, `1.75` and `1.875`.

The possible __normalized form rational values__ of the 7 bit floating point number are each precision value multiplied by two to the power of each exponent value, with either sign. Their magnitudes range from `0.25`, which is `1 * 2^-2`, to `15`, which is `1.875 * 2^3`.

### Visualizing the possible floating point values when the encoding is done on 7 bits

### Commutativity, associativity, and distributivity

The examples below use double precision values.

IEEE floating point addition is:

- commutative: `a + b = b + a`
- not associative: `( 5 + 1e40 ) - 1e40 != 5 + ( 1e40 - 1e40 )`, because `0 != 5`

IEEE floating point multiplication is:

- commutative: `a * b = b * a`
- not associative: `1e40 * ( 1e300 * 1e-300 ) != ( 1e40 * 1e300 ) * 1e-300`, because `1e40 != Infinity`
- not distributive over addition and subtraction:
  - `0 * ( 1e308 + 1e308 ) != ( 0 * 1e308 ) + ( 0 * 1e308 )`, because `NaN != 0`
  - `1e30 * ( 1e300 - 1e300 ) != 1e30 * 1e300 - 1e30 * 1e300`, because `0 != NaN`

IEEE floating point division is:

- not commutative: `( 1.0 / 2 ) != ( 2.0 / 1 )`
- not associative: `( 1.0 / 2 ) / 3 != 1 / ( 2.0 / 3 )`, because `0.1666... != 1.5`

IEEE floating point subtraction is:

- not commutative: `1.0 - 2 != 2 - 1.0`
- not associative: `-4 - ( -4 - -3.0 ) != ( -4 - -4.0 ) - -3`, because `-4 - ( -4 - -3 ) = -4 - -1 = -3`, while `( -4 - -4 ) - -3 = 0 - -3 = 3`

## Floating point formats in C

Floating point formats are used to represent real, or rational, numbers in the computer. The C standard has __three data types__ to be used with a floating point format: `float`, `double`, and `long double`.

The C standard does not specify __which floating point standard__ is to be used, but usually the IEEE floating point format is used. As such, `float` is mapped to the IEEE single precision floating point format, and `double` is mapped to the IEEE double precision floating point format. The format of `long double` varies between platforms.

By default, a floating point literal such as `1.0` has the type `double`, unless suffixed with `f` or `F`, in which case it has the type `float`, or suffixed with `l` or `L`, in which case it has the type `long double`.

To *detect if a floating point value is* `NaN`, the `isnan` macro can be used, and to detect if a floating point value is an infinity, the `isinf` macro can be used.

`isnan` and `isinf` are both defined in the `math.h` header.

```c
#include <stdio.h>
#include <math.h>

int main ( void ){
    printf( "%d\n" , isnan( 0.0/0.0 ) );
    /* 0.0 divided by 0.0 results in not a number ,
       as such isnan returns a nonzero value . */
    // Output : 1

    printf( "%d\n" , isinf( 1.0/0.0 ) );
    /* 1.0 divided by 0.0 results in + infinity ,
       as such isinf returns a nonzero value . */
    // Output : 1
}
```