## Floating-Point Numbers

“Floating point” refers to a set of data types that encode real numbers, including fractions and decimals. Floating-point data types allow for a varying number of digits after the decimal point, while fixed-point data types have a specific number of digits reserved before and after the decimal point. So, floating-point data types can represent a wider range of numbers than fixed-point data types.

Due to limited memory for number representation and storage, computers can represent a finite set of floating-point numbers that have finite precision. This finite precision can limit accuracy for floating-point computations that require exact values or high precision, as some numbers are not represented exactly. Despite their limitations, floating-point numbers are widely used due to their fast calculations and sufficient precision and range for solving real-world problems.

### Floating-Point Numbers in MATLAB

MATLAB^{®} has data types for double-precision (`double`) and single-precision (`single`) floating-point numbers following IEEE^{®} Standard 754. By default, MATLAB represents floating-point numbers in double precision. Double precision allows you to represent numbers to greater precision but requires more memory than single precision. To conserve memory, you can convert a number to single precision by using the `single` function.

You can store numbers between approximately –3.4 × 10^{38} and
3.4 × 10^{38} using either double or single precision. If you have
numbers outside of that range, store them using double precision.

#### Create Double-Precision Data

Because the default numeric type in MATLAB is `double`, you can create a double-precision floating-point number with a simple assignment statement.

x = 10; c = class(x)

c = 'double'

You can convert numeric data, characters or strings, and logical data to double precision by using the `double` function. For example, convert a signed integer to a double-precision floating-point number.

x = int8(-113); y = double(x)

y = -113

#### Create Single-Precision Data

To create a single-precision number, use the `single` function.

x = single(25.783);

You can also convert numeric data, characters or strings, and logical data to single precision by using the `single` function. For example, convert a signed integer to a single-precision floating-point number.

x = int8(-113); y = single(x)

y = single -113

#### How MATLAB Stores Floating-Point Numbers

MATLAB constructs its `double` and `single` floating-point data types according to IEEE format and follows the *round to nearest, ties to even* rounding mode by default.

A floating-point number *x* has the form:

$$x={(-1)}^{s}\cdot (1+f)\cdot {2}^{e}$$

where:

- *s* determines the sign.
- *f* is the fraction, or *mantissa*, which satisfies 0 ≤ *f* < 1.
- *e* is the exponent.

*s*, *f*, and *e* are each
determined by a finite number of bits in memory, with *f* and
*e* depending on the precision of the data type.

Storage of a `double` number requires 64 bits, as shown in this table.

| Bits | Width | Usage |
| --- | --- | --- |
| `63` | `1` | Stores the sign, where `0` is positive and `1` is negative |
| `62` to `52` | `11` | Stores the exponent, biased by `1023` |
| `51` to `0` | `52` | Stores the mantissa |

Storage of a `single` number requires 32 bits, as shown in this table.

| Bits | Width | Usage |
| --- | --- | --- |
| `31` | `1` | Stores the sign, where `0` is positive and `1` is negative |
| `30` to `23` | `8` | Stores the exponent, biased by `127` |
| `22` to `0` | `23` | Stores the mantissa |
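
These bit fields can be inspected directly in any IEEE 754 environment. As an illustrative sketch (in Python, whose built-in `float` uses the same 64-bit format as MATLAB's `double`), the helper below unpacks the sign, biased exponent, and mantissa fields of a number:

```python
import struct

def double_bits(x):
    """Split a 64-bit IEEE 754 double into sign, biased exponent, and mantissa fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # bit 63
    exponent = (bits >> 52) & 0x7FF     # bits 62 to 52, biased by 1023
    mantissa = bits & ((1 << 52) - 1)   # bits 51 to 0
    return sign, exponent, mantissa

# -6.5 = -(1 + 0.625) * 2^2, so the stored exponent is 2 + 1023 = 1025
sign, exponent, mantissa = double_bits(-6.5)
print(sign, exponent - 1023, mantissa / 2**52)   # 1 2 0.625
```

The `double_bits` helper is only for illustration; MATLAB itself exposes similar information through functions such as `num2hex`.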

### Largest and Smallest Values for Floating-Point Data Types

The double- and single-precision data types have a largest and smallest value that you can represent. Numbers outside of the representable range are assigned positive or negative infinity. However, some numbers within the representable range cannot be stored exactly due to the gaps between consecutive floating-point numbers, and these numbers can have round-off errors.

#### Largest and Smallest Double-Precision Values

Find the largest and smallest positive values that can be represented with the `double` data type by using the `realmax` and `realmin` functions, respectively.

m = realmax

m = 1.7977e+308

n = realmin

n = 2.2251e-308

`realmax` and `realmin` return normalized IEEE values. You can find the largest and smallest negative values by multiplying `realmax` and `realmin` by `-1`. Numbers greater than `realmax` or less than `-realmax` are assigned the values of positive or negative infinity, respectively.

#### Largest and Smallest Single-Precision Values

Find the largest and smallest positive values that can be represented with the `single` data type by calling the `realmax` and `realmin` functions with the argument `"single"`.

`m = realmax("single")`

m = single 3.4028e+38

`n = realmin("single")`

n = single 1.1755e-38

You can find the largest and smallest negative values by multiplying `realmax("single")` and `realmin("single")` by `-1`. Numbers greater than `realmax("single")` or less than `-realmax("single")` are assigned the values of positive or negative infinity, respectively.
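
The same IEEE limits appear in other environments. As a sketch in Python (using the standard `sys.float_info` and NumPy's `finfo`, as stand-ins for `realmax` and `realmin`):

```python
import sys
import numpy as np

# Double-precision limits, matching realmax and realmin
print(sys.float_info.max)   # about 1.7977e+308
print(sys.float_info.min)   # about 2.2251e-308, the smallest normalized value

# Single-precision limits, matching realmax("single") and realmin("single")
f32 = np.finfo(np.float32)
print(f32.max)              # about 3.4028e+38
print(f32.tiny)             # about 1.1755e-38

# Arithmetic beyond the largest representable value overflows to infinity
print(sys.float_info.max * 2)   # inf
```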

#### Largest Consecutive Floating-Point Integers

Not all integers are representable using floating-point data types. The *largest consecutive integer*, *x*, is the greatest integer for which all integers less than or equal to *x* can be exactly represented, but *x* + 1 cannot be represented in floating-point format. The `flintmax` function returns the largest consecutive integer. For example, find the largest consecutive integer in double-precision floating-point format, which is 2^{53}, by using the `flintmax` function.

x = flintmax

x = 9.0072e+15

Find the largest consecutive integer in single-precision floating-point format,
which is 2^{24}.

`y = flintmax("single")`

y = single 16777216

When you convert an integer data type to a floating-point data type, integers that are not exactly representable in floating-point format lose accuracy. `flintmax`, which is a floating-point number, is less than the greatest integer representable by integer data types using the same number of bits. For example, `flintmax` for double precision is 2^{53}, while the maximum value for type `int64` is 2^{64} - 1. Therefore, converting an integer greater than 2^{53} to double precision can result in a loss of accuracy.
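
This loss can be sketched with Python, whose integers have arbitrary precision, so the rounding appears only after conversion to a 64-bit float (the same format as MATLAB's `double`):

```python
# Python's float is an IEEE 754 double, so its "flintmax" is also 2^53
flintmax = 2**53

print(float(flintmax) == float(flintmax + 1))   # True: 2^53 + 1 is not representable
print(float(flintmax + 2))                      # 2^53 + 2 is representable exactly

# Converting an integer greater than 2^53 to a double can silently lose accuracy
n = 2**53 + 1
print(int(float(n)) == n)                       # False
```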

### Accuracy of Floating-Point Data

The accuracy of floating-point data can be affected by several factors:

- Limitations of your computer hardware — For example, hardware with insufficient memory truncates the results of floating-point calculations.
- Gaps between each floating-point number and the next larger floating-point number — These gaps are present on any computer and limit precision.

#### Gaps Between Floating-Point Numbers

You can determine the size of a gap between consecutive floating-point numbers by using the `eps` function. For example, find the distance between `5` and the next larger double-precision number.

e = eps(5)

e = 8.8818e-16

You cannot represent numbers between `5` and `5 + eps(5)` in double-precision format. If a double-precision computation returns the answer `5`, the result is accurate to within `eps(5)`. This radius of accuracy is often called *machine epsilon*.

The gaps between floating-point numbers are not equal. For example, the gap between `1e10` and the next larger double-precision number is larger than the gap between `5` and the next larger double-precision number.

e = eps(1e10)

e = 1.9073e-06

Similarly, find the distance between `5` and the next larger single-precision number.

x = single(5); e = eps(x)

e = single 4.7684e-07

Gaps between single-precision numbers are larger than the gaps between double-precision numbers because there are fewer single-precision numbers. So, results of single-precision calculations are less precise than results of double-precision calculations.

When you convert a double-precision number to a single-precision number, you can determine the upper bound for the amount the number is rounded by using the `eps` function. For example, when you convert the double-precision number `3.14` to single precision, the number is rounded by at most `eps(single(3.14))`.
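
Outside MATLAB, analogous gap functions exist; as a sketch, Python's `math.ulp` and NumPy's `spacing` reproduce the same `eps` values for double and single precision:

```python
import math
import numpy as np

# math.ulp behaves like MATLAB's eps for doubles
print(math.ulp(5.0))     # 8.881784197001252e-16, the gap above 5
print(math.ulp(1e10))    # 1.9073486328125e-06, a much larger gap

# numpy.spacing gives the gap for single precision
print(np.spacing(np.float32(5)))   # about 4.7684e-07

# Adding less than half the gap leaves the number unchanged
print(5.0 + 1e-16 == 5.0)   # True
```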

#### Gaps Between Consecutive Floating-Point Integers

The `flintmax` function returns the largest consecutive integer in floating-point format. Above this value, consecutive floating-point integers have a gap greater than `1`.

Find the gap between `flintmax` and the next floating-point number by using `eps`:

```
format long
x = flintmax
```

x = 9.007199254740992e+15

e = eps(x)

e = 2

Because `eps(x)` is `2`, the next larger floating-point number that can be represented exactly is `x + 2`.

y = x + e

y = 9.007199254740994e+15

If you add `1` to `x`, the result is rounded to `x`.

z = x + 1

z = 9.007199254740992e+15

### Arithmetic Operations on Floating-Point Numbers

You can use a range of data types in arithmetic operations with floating-point numbers, and the data type of the result depends on the input types. However, when you perform operations with different data types, some calculations may not be exact due to approximations or intermediate conversions.

#### Double-Precision Operands

You can perform basic arithmetic operations with `double` and any of the following data types. If one or more operands are an integer scalar or array, the `double` operand must be a scalar. The result is of type `double`, except where noted otherwise.

- `single` — The result is of type `single`.
- `double`
- `int8`, `int16`, `int32`, `int64` — The result is of the same data type as the integer operand.
- `uint8`, `uint16`, `uint32`, `uint64` — The result is of the same data type as the integer operand.
- `char`
- `logical`

#### Single-Precision Operands

You can perform basic arithmetic operations with `single` and any of the following data types. The result is of type `single`.

- `single`
- `double`
- `char`
- `logical`

### Unexpected Results with Floating-Point Arithmetic

Almost all operations in MATLAB are performed in double-precision arithmetic conforming to IEEE Standard 754. Because computers represent numbers to a finite precision, some computations can yield mathematically nonintuitive results. Some common issues that can arise while computing with floating-point numbers are round-off error, cancellation, swamping, and intermediate conversions. The unexpected results are not bugs in MATLAB and occur in any software that uses floating-point numbers. For exact rational representations of numbers, consider using the Symbolic Math Toolbox™.

#### Round-Off Error

Round-off error can occur due to the finite-precision representation of floating-point numbers. For example, the number `4/3` cannot be represented exactly as a binary fraction. As such, this calculation returns the quantity `eps(1)`, rather than `0`.

e = 1 - 3*(4/3 - 1)

e = 2.2204e-16

Similarly, because `pi` is not an exact representation of π, `sin(pi)` is not exactly zero.

x = sin(pi)

x = 1.2246e-16

Round-off error is most noticeable when many operations are performed on floating-point numbers, allowing errors to accumulate and compound. A best practice is to minimize the number of operations whenever possible.
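
The accumulation effect holds for IEEE 754 doubles in any language. As a sketch in Python, ten separate additions of `0.1` accumulate error that a single multiplication (one operation, one rounding step) avoids:

```python
# Summing 0.1 ten times performs ten rounded additions
total = 0.0
for _ in range(10):
    total += 0.1

print(total)            # 0.9999999999999999
print(total == 1.0)     # False: the errors accumulated
print(10 * 0.1 == 1.0)  # True: one operation, one rounding
```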

#### Cancellation

Cancellation can occur when you subtract a number from another number of roughly the same magnitude, as measured by `eps`. For example, `eps(2^53)` is `2`, so the numbers `2^53 + 1` and `2^53` have the same floating-point representation.

x = (2^53 + 1) - 2^53

x = 0

When possible, try rewriting computations in an equivalent form that avoids cancellations.
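
As a sketch of such a rewrite (in Python, using IEEE 754 doubles), computing `sqrt(x + 1) - sqrt(x)` directly cancels completely for large `x`, while the algebraically equivalent reciprocal form preserves the answer:

```python
import math

x = 1e16  # above 2^53, so x + 1 rounds to x

# Direct form: the two square roots are identical, so everything cancels
direct = math.sqrt(x + 1) - math.sqrt(x)
print(direct)       # 0.0

# Equivalent form with no subtraction of nearly equal numbers:
# sqrt(x + 1) - sqrt(x) = 1 / (sqrt(x + 1) + sqrt(x))
rewritten = 1 / (math.sqrt(x + 1) + math.sqrt(x))
print(rewritten)    # 5e-09, close to the true value
```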

#### Swamping

Swamping can occur when you perform operations on floating-point numbers that differ by many orders of magnitude. For example, this calculation shows a loss of precision that makes the addition insignificant.

x = 1 + 1e-16

x = 1
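
One common mitigation, sketched below in Python with IEEE 754 doubles, is to accumulate the small terms before combining them with large ones, so that they are not swamped one at a time:

```python
# Adding tiny terms to a large accumulator swamps each one individually
large_first = 1.0
for _ in range(10000):
    large_first += 1e-16   # each addition rounds back to 1.0

# Accumulating the small terms first lets their sum grow large enough to register
small_first = sum([1e-16] * 10000) + 1.0

print(large_first)  # 1.0: the contributions were lost
print(small_first)  # slightly above 1.0: the contributions survive
```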

#### Intermediate Conversions

When you perform arithmetic with different data types, intermediate calculations and conversions can yield unexpected results. For example, although `x` and `y` are both `0.2`, subtracting them yields a nonzero result. The reason is that `y` is first converted to `double` before the subtraction is performed. The result of the subtraction is then converted to `single` and stored in `z`.

```
format long
x = 0.2
```

x = 0.200000000000000

y = single(0.2)

y = single 0.2000000

z = x - y

z = single -2.9802323e-09
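
A similar experiment can be sketched with NumPy. One assumed difference: NumPy keeps the mixed-precision result as `float64` rather than converting it back to single as MATLAB does, but the nonzero value arises the same way, from the two rounded representations of `0.2`:

```python
import numpy as np

x = np.float64(0.2)   # 0.2 rounded to double precision
y = np.float32(0.2)   # 0.2 rounded to single precision

# y is widened to double before the subtraction, exposing the
# difference between the two rounded representations of 0.2
z = x - y
print(z)   # about -2.98e-09
```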

#### Linear Algebra

Common issues in floating-point arithmetic, such as the ones described above, can compound when applied to linear algebra problems because the related calculations typically consist of multiple steps. For example, when solving the system of linear equations `Ax = b`, MATLAB warns that the results may be inaccurate because the operand matrix `A` is ill conditioned.

A = diag([2 eps]); b = [2; eps]; x = A\b;

Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = 1.110223e-16.
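
The degree of ill conditioning can be quantified by the condition number, which bounds how much relative error in the inputs can be amplified in the solution. A NumPy sketch of the same system (NumPy does not emit MATLAB's warning, so the condition number is checked explicitly):

```python
import numpy as np

eps = np.finfo(float).eps

# The same nearly singular system
A = np.diag([2.0, eps])
b = np.array([2.0, eps])
x = np.linalg.solve(A, b)
print(x)   # [1. 1.]: exact for this diagonal system, but the large
           # condition number warns that small perturbations in A or b
           # could change the answer drastically

# Condition number near 1/eps means roughly all 16 digits are at risk
print(np.linalg.cond(A))   # about 9e15
```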
