# Documentation

## Precision

### Limitations on Precision

Computer words consist of a finite number of bits. This means that the binary encoding of a variable is only an approximation of an arbitrarily precise real-world value. Therefore, the limitations of the binary representation automatically introduce limitations on the precision of the value. For a general discussion of range and precision, refer to Range and Precision.

The precision of a fixed-point word depends on the word size and binary point location. Extending the precision of a word can always be accomplished with more bits, but you face practical limitations with this approach. Instead, you must carefully select the data type, word size, and scaling such that numbers are accurately represented. Rounding and padding with trailing zeros are typical methods implemented on processors to deal with the precision of binary words.

### Rounding

The result of any operation on a fixed-point number is typically stored in a register that is longer than the number's original format. When the result is put back into the original format, the extra bits must be disposed of. That is, the result must be rounded. Rounding involves going from high precision to lower precision and produces quantization errors and computational noise.

### Choose a Rounding Mode

To choose the most suitable rounding mode for your application, you need to consider your system requirements and the properties of each rounding mode. The most important properties to consider are:

• Cost — Independent of the hardware being used, how much processing expense does the rounding method require?

• Bias — What is the expected value of the rounded values minus the original values?

• Possibility of overflow — Does the rounding method introduce the possibility of overflow?

For more information on when to use each rounding mode, see Rounding Methods in the Fixed-Point Designer™ User's Guide.

#### Choosing a Rounding Mode for Diagnostic Purposes

Rounding toward ceiling and rounding toward floor are sometimes useful for diagnostic purposes. For example, after a series of arithmetic operations, you may not know the exact answer because of word-size limitations, which introduce rounding. If every operation in the series is performed twice, once rounding to positive infinity and once rounding to negative infinity, you obtain an upper limit and a lower limit on the correct answer. You can then decide if the result is sufficiently accurate or if additional analysis is necessary.
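As a rough illustration of this bracketing technique, the following Python sketch accumulates the same sum twice on a grid with 4 fractional bits, once rounding every intermediate result toward floor and once toward ceiling. The helper names (`quantize_floor`, `quantize_ceil`, `bracket_sum`), the step size, and the choice of addition as the operation are assumptions made for illustration; they are not part of Simulink or the Fixed-Point Designer software.

```python
from fractions import Fraction
import math

STEP = Fraction(1, 16)  # assumed precision: 4 fractional bits


def quantize_floor(x):
    """Round x to the grid in the direction of negative infinity."""
    return math.floor(x / STEP) * STEP


def quantize_ceil(x):
    """Round x to the grid in the direction of positive infinity."""
    return math.ceil(x / STEP) * STEP


def bracket_sum(values):
    """Accumulate values twice and return (lower, upper) bounds on the exact sum."""
    lower = upper = Fraction(0)
    for v in values:
        lower = quantize_floor(lower + v)  # every step errs downward
        upper = quantize_ceil(upper + v)   # every step errs upward
    return lower, upper


values = [Fraction(1, 3), Fraction(2, 7), Fraction(-1, 5)]
lower, upper = bracket_sum(values)
exact = sum(values)
print(float(lower), float(exact), float(upper))
```

The width `upper - lower` gives a quick indication of how much accuracy the word size has cost. The simple per-step argument used here holds for additions; operations such as multiplication by a negative value would need the two bounds to be swapped.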

### Rounding Modes for Fixed-Point Simulink Blocks

Fixed-point Simulink® blocks support the rounding modes shown in the expanded drop-down menu of the following dialog box.

The following table illustrates the differences between these rounding modes:

| Rounding Mode | Description | Tie Handling |
| --- | --- | --- |
| Ceiling | Rounds to the nearest representable number in the direction of positive infinity. | N/A |
| Floor | Rounds to the nearest representable number in the direction of negative infinity. | N/A |
| Zero | Rounds to the nearest representable number in the direction of zero. | N/A |
| Convergent | Rounds to the nearest representable number. | Ties are rounded toward the nearest even integer. |
| Nearest | Rounds to the nearest representable number. | Ties are rounded to the closest representable number in the direction of positive infinity. |
| Round | Rounds to the nearest representable number. | For positive numbers, ties are rounded toward the closest representable number in the direction of positive infinity. For negative numbers, ties are rounded toward the closest representable number in the direction of negative infinity. |
| Simplest | Automatically chooses between Floor and Zero to produce generated code that is as efficient as possible. | N/A |
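The differences in the table are easiest to see on tie values. The following Python sketch is one possible reading of the six fixed rounding modes, applied on an integer grid (a precision of 1). The function `round_value` and the tuple `MODES` are hypothetical names introduced here; this is not how Simulink implements the modes. Simplest is omitted because it selects between Floor and Zero rather than defining its own behavior.

```python
from fractions import Fraction
import math

MODES = ("Ceiling", "Floor", "Zero", "Convergent", "Nearest", "Round")


def round_value(x, mode):
    """Round x to an integer using one of the rounding modes in the table."""
    if mode not in MODES:
        raise ValueError(f"unknown rounding mode: {mode}")
    x = Fraction(x)
    lo, hi = math.floor(x), math.ceil(x)
    if mode == "Ceiling":
        return hi
    if mode == "Floor":
        return lo
    if mode == "Zero":
        return hi if x < 0 else lo
    # The remaining modes round to nearest and differ only on ties.
    if x - lo != hi - x:
        return lo if x - lo < hi - x else hi
    if mode == "Convergent":
        return lo if lo % 2 == 0 else hi   # tie goes to the even neighbor
    if mode == "Nearest":
        return hi                          # tie goes toward positive infinity
    return hi if x > 0 else lo             # Round: tie goes away from zero


for x in ("2.5", "-2.5", "2.25"):
    print(x, {mode: round_value(x, mode) for mode in MODES})
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the tie test is not disturbed by binary floating-point representation of the inputs.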

### Rounding Mode: Ceiling

When you round toward ceiling, both positive and negative numbers are rounded toward positive infinity. As a result, a positive cumulative bias is introduced in the number.

In the MATLAB® software, you can round to ceiling using the ceil function. Rounding toward ceiling is shown in the following figure.

### Rounding Mode: Convergent

Convergent rounds toward the nearest representable value with ties rounding toward the nearest even integer. It eliminates bias due to rounding. However, it introduces the possibility of overflow.

In the MATLAB software, you can perform convergent rounding using the convergent function. Convergent rounding is shown in the following figure.

### Rounding Mode: Floor

When you round toward floor, both positive and negative numbers are rounded toward negative infinity. As a result, a negative cumulative bias is introduced in the number.

In the MATLAB software, you can round to floor using the floor function. Rounding toward floor is shown in the following figure.

### Rounding Mode: Nearest

When you round toward nearest, the number is rounded to the nearest representable value. In the case of a tie, nearest rounds to the closest representable number in the direction of positive infinity.

In the Fixed-Point Designer software, you can round to nearest using the nearest function. Rounding toward nearest is shown in the following figure.

### Rounding Mode: Round

Round rounds to the closest representable number. In the case of a tie, it rounds:

• Positive numbers to the closest representable number in the direction of positive infinity.

• Negative numbers to the closest representable number in the direction of negative infinity.

As a result:

• A small negative bias is introduced for negative samples.

• No bias is introduced for samples with evenly distributed positive and negative values.

• A small positive bias is introduced for positive samples.

In the MATLAB software, you can perform this type of rounding using the round function. The rounding mode Round is shown in the following figure.
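The bias behavior listed above can be checked numerically. The sketch below is a minimal illustration, assuming a coarse sample grid with a spacing of 0.25 so that one sample in four is a tie; the helper names `round_half_away` and `mean_error` are hypothetical. It averages the rounding error (rounded value minus original value) over positive, negative, and symmetric sample sets.

```python
from fractions import Fraction
import math


def round_half_away(x):
    """Round to the nearest integer; ties go away from zero (the Round mode)."""
    magnitude = math.floor(abs(x) + Fraction(1, 2))
    return magnitude if x >= 0 else -magnitude


def mean_error(samples):
    """Average of (rounded - original) over the samples."""
    return sum(round_half_away(s) - s for s in samples) / len(samples)


positive = [Fraction(k, 4) for k in range(1, 41)]   # 0.25, 0.50, ..., 10.00
negative = [-s for s in positive]
symmetric = positive + negative

print(mean_error(positive), mean_error(negative), mean_error(symmetric))
```

On this grid the mean error is positive for the positive samples, negative for the negative samples, and zero for the symmetric set. The magnitude of the bias depends on how often ties occur, so it is much smaller for finely resolved data than for this deliberately coarse example.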

### Rounding Mode: Simplest

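The simplest rounding mode attempts to reduce or eliminate the need for extra rounding code in your generated code using a combination of techniques, discussed in the following sections:

• Optimize Rounding for Casts

• Optimize Rounding for High-Level Arithmetic Operations

• Optimize Rounding for Intermediate Arithmetic Operations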

In nearly all cases, the simplest rounding mode produces the most efficient generated code. For a very specialized case of division that meets three specific criteria, round to floor might be more efficient. These three criteria are:

• Fixed-point/integer signed division

• Denominator is an invariant constant

• Denominator is an exact power of two

For this case, set the rounding mode to floor and the Model Configuration Parameters > Hardware Implementation > Production Hardware > Signed integer division rounds to parameter to describe the rounding behavior of your production target.

#### Optimize Rounding for Casts

The Data Type Conversion block casts a signal with one data type to another data type. When the block casts the signal to a data type with a shorter word length than the original data type, precision is lost and rounding occurs. The simplest rounding mode automatically chooses the best rounding for these cases based on the following rules:

• When casting from one integer or fixed-point data type to another, the simplest mode rounds toward floor.

• When casting from a floating-point data type to an integer or fixed-point data type, the simplest mode rounds toward zero.

#### Optimize Rounding for High-Level Arithmetic Operations

The simplest rounding mode chooses the best rounding for each high-level arithmetic operation. For example, consider the operation y = u1 × u2 / u3 implemented using a Product block:

As stated in the C standard, the most efficient rounding mode for multiplication operations is always floor. However, the C standard does not specify the rounding mode for division in cases where at least one of the operands is negative. Therefore, the most efficient rounding mode for a divide operation with signed data types can be floor or zero, depending on your production target.

The simplest rounding mode:

• Rounds to floor for all nondivision operations.

• Rounds to zero or floor for division, depending on the setting of the Model Configuration Parameters > Hardware Implementation > Production Hardware > Signed integer division rounds to parameter.

To get the most efficient code, you must set the Signed integer division rounds to parameter to specify whether your production target rounds to zero or to floor for integer division. Most production targets round to zero for integer division operations. Note that Simplest rounding enables "mixed-mode" rounding for such cases, as it rounds to floor for multiplication and to zero for division.

If the Signed integer division rounds to parameter is set to Undefined, the simplest rounding mode might not be able to produce the most efficient code. The simplest mode rounds to zero for division for this case, but it cannot rely on your production target to perform the rounding, because the parameter is Undefined. Therefore, you need additional rounding code to ensure rounding to zero behavior.

Note: For signed fixed-point division where the denominator is an invariant constant power of 2, the simplest rounding mode does not generate the most efficient code. In this case, set the rounding mode to floor.
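To make the difference between the two division behaviors concrete, the following Python sketch compares a quotient rounded toward floor with one rounded toward zero. Python's `//` operator rounds toward negative infinity, so the round-to-zero variant is built from magnitudes. The function names `div_floor` and `div_zero` are hypothetical and only illustrate the arithmetic; they say nothing about the code that is actually generated for a given target.

```python
def div_floor(a, b):
    """Integer division that rounds the quotient toward negative infinity."""
    return a // b


def div_zero(a, b):
    """Integer division that rounds the quotient toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


for a, b in [(7, 2), (-7, 2), (7, -2), (-7, -2)]:
    print(a, b, div_floor(a, b), div_zero(a, b))

# Dividing by a constant power of two with floor rounding matches a single
# arithmetic right shift, which is why floor can be cheaper in that case.
shifted = -7 >> 1
print(shifted, div_floor(-7, 2))
```

The two functions agree whenever both operands have the same sign or the division is exact. They differ only when exactly one operand is negative and there is a remainder.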

#### Optimize Rounding for Intermediate Arithmetic Operations

For fixed-point arithmetic with nonzero slope and bias, the simplest rounding mode also chooses the best rounding for each intermediate arithmetic operation. For example, consider the operation y = u1 / u2 implemented using a Product block, where u1 and u2 are fixed-point quantities:

As discussed in Fixed-Point Numbers, each fixed-point quantity is calculated using its slope, bias, and stored integer. So in this example, not only is there the high-level divide called for by the block operation, but intermediate additions and multiplies are performed:

$y=\frac{{u}_{1}}{{u}_{2}}=\frac{{S}_{1}{Q}_{1}+{B}_{1}}{{S}_{2}{Q}_{2}+{B}_{2}}$

The simplest rounding mode performs the best rounding for each of these operations, high-level and intermediate, to produce the most efficient code. The rules used to select the appropriate rounding for intermediate arithmetic operations are the same as those described in Optimize Rounding for High-Level Arithmetic Operations. Again, this enables mixed-mode rounding, with the most common case being round toward floor used for additions, subtractions, and multiplies, and round toward zero used for divides.

Remember that generating the most efficient code using the simplest rounding mode requires you to set the Model Configuration Parameters > Hardware Implementation > Production Hardware > Signed integer division rounds to parameter to describe the rounding behavior of your production target.

 Note:   For signed fixed-point division where the denominator is an invariant constant power of 2, the simplest rounding mode does not generate the most efficient code. In this case, set the rounding mode to floor.

### Rounding Mode: Zero

Rounding toward zero is the simplest rounding mode computationally. All digits beyond the number required are dropped. Rounding toward zero results in a number whose magnitude is always less than or equal to the more precise original value. In the MATLAB software, you can round to zero using the fix function.

Rounding toward zero introduces a cumulative downward bias in the result for positive numbers and a cumulative upward bias in the result for negative numbers. That is, positive numbers are rounded to smaller positive numbers, while negative numbers are rounded to negative numbers of smaller magnitude. Rounding toward zero is shown in the following figure.

#### Rounding to Zero Versus Truncation

Rounding to zero and truncation or chopping are sometimes thought to mean the same thing. However, the results produced by rounding to zero and truncation are different for unsigned and two's complement numbers. For this reason, the ambiguous term "truncation" is not used in this guide, and explicit rounding modes are used instead.

To illustrate this point, consider rounding a 5-bit unsigned number to zero by dropping (truncating) the two least significant bits. For example, the unsigned number 100.01 = 4.25 is truncated to 100 = 4. Therefore, truncating an unsigned number is equivalent to rounding to zero or rounding to floor.

Now consider rounding a 5-bit two's complement number by dropping the two least significant bits. At first glance, you may think truncating a two's complement number is the same as rounding to zero. For example, dropping the last two digits of -3.75 yields -3.00. However, digital hardware performing two's complement arithmetic yields a different result. Specifically, the number 100.01 = -3.75 truncates to 100 = -4, which is rounding to floor.
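The two cases can be reproduced with a few lines of integer arithmetic. The Python sketch below takes the bit pattern 100.01 from the text, interprets it once as unsigned and once as two's complement, and drops the two least significant bits. The helper `to_signed` is a hypothetical name introduced for this illustration.

```python
def to_signed(bits, width):
    """Interpret an unsigned bit pattern as a two's complement integer."""
    return bits - (1 << width) if bits & (1 << (width - 1)) else bits


pattern = 0b10001          # the bit pattern 100.01 from the text
frac_bits = 2              # two bits to the right of the binary point

unsigned_value = pattern / 2**frac_bits                 # 4.25
signed_value = to_signed(pattern, 5) / 2**frac_bits     # -3.75

# Dropping the two least significant bits leaves the 3-bit pattern 100.
dropped = pattern >> frac_bits
unsigned_truncated = dropped                            # 4
signed_truncated = to_signed(dropped, 3)                # -4

# Rounding the signed value toward zero gives a different answer.
signed_toward_zero = int(signed_value)                  # -3

print(unsigned_value, unsigned_truncated)
print(signed_value, signed_truncated, signed_toward_zero)
```

For the unsigned interpretation, dropping bits, rounding to floor, and rounding to zero all agree. For the two's complement interpretation, dropping bits matches floor but not zero.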

### Padding with Trailing Zeros

Padding with trailing zeros involves extending the least significant bit (LSB) of a number with extra bits. This method involves going from low precision to higher precision.

For example, suppose two numbers are subtracted from each other. First, the exponents must be aligned, which typically involves a right shift of the number with the smaller value. In performing this shift, significant digits can "fall off" to the right. However, when the appropriate number of extra bits is appended, the precision of the result is maximized. Consider two 8-bit fixed-point numbers that are close in value and subtracted from each other:

$1.0000000\times {2}^{q}-1.1111111\times {2}^{q-1},$

where q is an integer. To perform this operation, the exponents must be equal:

$\begin{array}{r}1.0000000\times {2}^{q}\\ -0.1111111\times {2}^{q}\\ \hline 0.0000001\times {2}^{q}.\end{array}$

If the top number is padded by two zeros and the bottom number is padded with one zero, then the above equation becomes

$\begin{array}{r}1.000000000\times {2}^{q}\\ -0.111111110\times {2}^{q}\\ \hline 0.000000010\times {2}^{q},\end{array}$

which produces a more precise result. An example of padding with trailing zeros in a Simulink model is illustrated in Digital Controller Realization.
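The subtraction above can be replayed with stored integers. The following Python sketch takes q = 0 and compares the result obtained when both operands are aligned to 7 fractional bits (no padding) with the result obtained when they are aligned to 9 fractional bits (padding with trailing zeros). The helper `align` and the variable names are assumptions made for this illustration.

```python
from fractions import Fraction


def align(stored, frac_bits, target_frac_bits):
    """Move a stored integer to a new fraction length.

    Shifting left pads with trailing zeros; shifting right drops bits.
    """
    shift = target_frac_bits - frac_bits
    return stored << shift if shift >= 0 else stored >> -shift


# Take q = 0. Both operands hold 8 significant bits.
a_stored, a_frac = 0b10000000, 7   # 1.0000000 x 2^0
b_stored, b_frac = 0b11111111, 8   # 1.1111111 x 2^-1 = 0.11111111 x 2^0

exact = Fraction(a_stored, 2**a_frac) - Fraction(b_stored, 2**b_frac)

# Without padding: align both operands to 7 fractional bits.
unpadded = Fraction(align(a_stored, a_frac, 7) - align(b_stored, b_frac, 7), 2**7)

# With padding: align both operands to 9 fractional bits.
padded = Fraction(align(a_stored, a_frac, 9) - align(b_stored, b_frac, 9), 2**9)

print(exact, unpadded, padded)
```

Without padding, the right shift discards the last bit of the second operand and the difference comes out twice as large as the exact value. With padding, no bits are lost and the difference is exact.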

### Limitations on Precision and Errors

Fixed-point variables have a limited precision because digital systems represent numbers with a finite number of bits. For example, suppose you must represent the real-world number 35.375 with a fixed-point number. Using the encoding scheme described in Scaling, the representation is

$V\approx \tilde{V}=SQ+B={2}^{-2}Q+32,$

where V = 35.375.

The two closest approximations to the real-world value are Q = 13 and Q = 14:

$\begin{array}{l}\tilde{V}={2}^{-2}\left(13\right)+32=35.25,\\ \tilde{V}={2}^{-2}\left(14\right)+32=35.50.\end{array}$

In either case, the absolute error is the same:

$|\tilde{V}-V|=0.125=\frac{S}{2}=\frac{F{2}^{E}}{2}.$

For fixed-point values within the limited range, this represents the worst-case error if round-to-nearest is used. If other rounding modes are used, the worst-case error can be twice as large:

$|\tilde{V}-V|<F{2}^{E}.$
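The numbers in this example can be verified with a short calculation. The Python sketch below encodes V = 35.375 with slope 2^-2 and bias 32 using floor and ceiling, and then encodes a second value, 35.49, with floor to show an error that exceeds S/2 while staying below S. The functions `encode` and `decode` and the value 35.49 are assumptions chosen for illustration.

```python
from fractions import Fraction
import math

S = Fraction(1, 4)   # slope 2^-2
B = 32               # bias


def encode(V, rounding):
    """Return the stored integer Q for real-world value V."""
    return rounding((Fraction(V) - B) / S)


def decode(Q):
    """Return the approximate real-world value represented by Q."""
    return S * Q + B


V = Fraction("35.375")
for rounding in (math.floor, math.ceil):
    Q = encode(V, rounding)
    print(Q, float(decode(Q)), float(abs(decode(Q) - V)))

# With floor, a value just below a representable point has an error close to S.
V2 = Fraction("35.49")
floor_error = abs(decode(encode(V2, math.floor)) - V2)
print(float(floor_error))
```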

### Maximize Precision

Precision is limited by slope. To achieve maximum precision, you should make the slope as small as possible while keeping the range adequately large. The bias is adjusted in coordination with the slope.

Assume the maximum and minimum real-world values are given by max(V) and min(V), respectively. These limits might be known based on physical principles or engineering considerations. To maximize the precision, you must decide upon a rounding scheme and whether overflows saturate or wrap. To simplify matters, this example assumes the minimum real-world value corresponds to the minimum encoded value, and the maximum real-world value corresponds to the maximum encoded value. Using the encoding scheme described in Scaling, these values are given by

$\begin{array}{c}\mathrm{max}\left(V\right)=F{2}^{E}\left(\mathrm{max}\left(Q\right)\right)+B\\ \mathrm{min}\left(V\right)=F{2}^{E}\left(\mathrm{min}\left(Q\right)\right)+B.\end{array}$

Solving for the slope, you get

$F{2}^{E}=\frac{\mathrm{max}\left(V\right)-\mathrm{min}\left(V\right)}{\mathrm{max}\left(Q\right)-\mathrm{min}\left(Q\right)}=\frac{\mathrm{max}\left(V\right)-\mathrm{min}\left(V\right)}{{2}^{ws}-1}.$

This formula is independent of rounding and overflow issues. For a given range of real-world values, the slope depends only on the word size, ws.
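As a worked instance of the slope formula, the following Python sketch computes the slope for a given real-world range and word size, and then solves the min(V) equation above for the bias. The function name `max_precision_scaling` and the `signed` flag are hypothetical. The resulting slope is generally not a power of two, so this corresponds to general slope and bias scaling.

```python
from fractions import Fraction


def max_precision_scaling(v_min, v_max, ws, signed=False):
    """Return (slope, bias, q_min, q_max) mapping [v_min, v_max] onto the full stored-integer range."""
    q_min = -(2 ** (ws - 1)) if signed else 0
    q_max = q_min + 2**ws - 1
    slope = Fraction(v_max - v_min) / (2**ws - 1)
    bias = v_min - slope * q_min   # from min(V) = slope * min(Q) + B
    return slope, bias, q_min, q_max


slope, bias, q_min, q_max = max_precision_scaling(0, 10, 8)
print(slope, bias, slope * q_min + bias, slope * q_max + bias)

s_slope, s_bias, s_q_min, s_q_max = max_precision_scaling(0, 10, 8, signed=True)
print(s_slope, s_bias, s_slope * s_q_min + s_bias, s_slope * s_q_max + s_bias)
```

The slope is the same for the unsigned and signed cases, because only the number of available codes enters the formula. Only the bias changes to shift the stored-integer range onto the real-world range.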

### Net Slope and Net Bias Precision

#### What are Net Slope and Net Bias?

You can represent a fixed-point number by a general slope and bias encoding scheme

$V\approx \tilde{V}=SQ+B,$

where:

• $V$ is an arbitrarily precise real-world value.

• $\tilde{V}$ is the approximate real-world value.

• Q, the stored value, is an integer that encodes V.

• $S=F{2}^{E}$ is the slope.

• B is the bias.

For a cast operation,

${S}_{a}{Q}_{a}+{B}_{a}={S}_{b}{Q}_{b}+{B}_{b}$

or

${Q}_{a}=\frac{{S}_{b}{Q}_{b}}{{S}_{a}}+\left(\frac{{B}_{b}-{B}_{a}}{{S}_{a}}\right),$

where:

• $\frac{{S}_{b}}{{S}_{a}}$ is the net slope.

• $\frac{{B}_{b}-{B}_{a}}{{S}_{a}}$ is the net bias.
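The cast equation can be evaluated directly. The Python sketch below computes the net slope and net bias for a pair of example data types and uses them to convert a stored integer. All numeric values and the function names `net_slope_and_bias` and `cast_stored_integer` are assumptions chosen for illustration. In general the result is not an integer and must still be rounded, which is where the precision issues described in the next section originate.

```python
from fractions import Fraction


def net_slope_and_bias(S_a, B_a, S_b, B_b):
    """Return (net slope, net bias) for a cast from type b to type a."""
    S_a, B_a, S_b, B_b = (Fraction(x) for x in (S_a, B_a, S_b, B_b))
    return S_b / S_a, (B_b - B_a) / S_a


def cast_stored_integer(Q_b, S_a, B_a, S_b, B_b):
    """Ideal (unrounded) stored integer Q_a that represents the same value as Q_b."""
    net_slope, net_bias = net_slope_and_bias(S_a, B_a, S_b, B_b)
    return net_slope * Q_b + net_bias


# Example types: b has slope 2^-4 and bias 1; a has slope 2^-2 and bias 0.
S_a, B_a = Fraction(1, 4), 0
S_b, B_b = Fraction(1, 16), 1
Q_b = 8

Q_a = cast_stored_integer(Q_b, S_a, B_a, S_b, B_b)
print(net_slope_and_bias(S_a, B_a, S_b, B_b), Q_a)
print(S_a * Q_a + B_a, S_b * Q_b + B_b)   # both sides of the cast equation
```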

#### Detecting Net Slope and Net Bias Precision Issues

Precision issues might occur in the fixed-point constants, net slope and net bias, due to quantization errors when you convert from floating point to fixed point. These fixed-point constant precision issues can result in numerical inaccuracy in your model.

You can configure your model to alert you when fixed-point constant precision issues occur. For more information, see Detect Net Slope and Bias Precision Issues. The Fixed-Point Designer software provides the following information:

• The type of precision issue: underflow, overflow, or precision loss.

• The original value of the fixed-point constant.

• The quantized value of the fixed-point constant.

• The error in the value of the fixed-point constant.

• The block that introduced the error.

This information warns you that the outputs from this block are not accurate. If possible, change the data types in your model to fix the issue.

#### Fixed-Point Constant Underflow

Fixed-point constant underflow occurs when the Fixed-Point Designer software encounters a fixed-point constant whose data type does not have enough precision to represent the ideal value of the constant, because the ideal value is too close to zero. Casting the ideal value to the fixed-point data type causes the value of the fixed-point constant to become zero. Therefore the value of the fixed-point constant differs from its ideal value.

#### Fixed-Point Constant Overflow

Fixed-point constant overflow occurs when the Fixed-Point Designer software converts a fixed-point constant to a data type whose range is not large enough to accommodate the ideal value of the constant with reasonable precision. The data type cannot accurately represent the ideal value because the ideal value is either too large or too small. Casting the ideal value to the fixed-point data type causes overflow. For example, suppose the ideal value is 200 and the converted data type is int8. Overflow occurs in this case because the maximum value that int8 can represent is 127.

The Fixed-Point Designer software reports an overflow error if the quantized value differs from the ideal value by more than the precision for the data type. The precision for a data type is approximately equal to the default scaling (for more information, see Fixed-Point Data Type Parameters). Therefore, for positive values, the Fixed-Point Designer software treats errors greater than the slope as overflows. For negative values, it treats errors greater than or equal to the slope as overflows.

For example, the maximum value that int8 can represent is 127. The precision for int8 is 1.0. An ideal value of 127.3 quantizes to 127 with an absolute error of 0.3. Although the ideal value 127.3 is greater than the maximum representable value for int8, the quantization error is small relative to the precision of int8. Therefore the Fixed-Point Designer software does not report an overflow. However, an ideal value of 128.1 does cause an overflow because the quantization error is 1.1, which is larger than the precision for int8.

 Note:   Fixed-point constant overflow differs from fixed-point constant precision loss. Precision loss occurs when the ideal fixed-point constant value is within the range of the current data type and scaling, but the software cannot represent this value exactly.

#### Fixed-Point Constant Precision Loss

Fixed-point constant precision loss occurs when the Fixed-Point Designer software converts a fixed-point constant to a data type without enough precision to represent the exact value of the constant. As a result, the quantized value differs from the ideal value. For an example of this behavior, see Detect Fixed-Point Constant Precision Loss.

 Note:   Fixed-point constant precision loss differs from fixed-point constant overflow. Overflow occurs when the range of the parameter data type, that is, the maximum value that it can represent, is smaller than the ideal value of the parameter.
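The three categories described in the preceding sections can be summarized in a small classifier. The Python sketch below is only one reading of the rules stated in the text; it is not the algorithm used by the Fixed-Point Designer software. The function `classify_constant`, its default arguments (which describe int8), the round-to-nearest quantization, and the saturation on overflow are all assumptions.

```python
from fractions import Fraction
import math


def classify_constant(ideal, slope=1, q_min=-128, q_max=127):
    """Quantize an ideal constant and label the resulting precision issue.

    Defaults describe int8. Quantization rounds to nearest (ties toward
    positive infinity) and saturates; both choices are assumptions.
    """
    ideal, slope = Fraction(ideal), Fraction(slope)
    q = math.floor(ideal / slope + Fraction(1, 2))
    q = max(q_min, min(q_max, q))
    quantized = q * slope
    error = abs(quantized - ideal)
    if error == 0:
        return quantized, "none"
    # Overflow rule from the text: > slope for positive, >= slope for negative.
    if (ideal > 0 and error > slope) or (ideal < 0 and error >= slope):
        return quantized, "overflow"
    if quantized == 0:
        return quantized, "underflow"
    return quantized, "precision loss"


for value in ("127.3", "128.1", "200", "-129"):
    print(value, classify_constant(value))
print("0.001", classify_constant("0.001", slope=Fraction(1, 16)))
```

With these assumptions, 127.3 is labeled as precision loss rather than overflow because its error of 0.3 is below the int8 precision of 1.0, while 128.1 and 200 are labeled as overflow, which matches the examples in the text.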

### Detect Net Slope and Bias Precision Issues

To receive alerts when fixed-point constant precision issues occur, use these options available in the Simulink Configuration Parameters dialog box, on the Diagnostics > Type Conversion pane. Set the parameters to warning or error so that Simulink alerts you when precision issues occur.

| Configuration Parameter | Specifies | Default |
| --- | --- | --- |
| Detect underflow | Diagnostic action when a fixed-point constant underflow occurs during simulation | Does not generate a warning or error. |
| Detect overflow | Diagnostic action when a fixed-point constant overflow occurs during simulation | Does not generate a warning or error. |
| Detect precision loss | Diagnostic action when a fixed-point constant precision loss occurs during simulation | Does not generate a warning or error. |

### Detect Fixed-Point Constant Precision Loss

This example shows how to detect fixed-point constant precision loss. The example uses the following model.

For the Data Type Conversion block in this model, the:

• Input slope, ${S}_{U}=1$

• Output slope, ${S}_{Y}=1.000001$

• Net slope, ${S}_{U}/{S}_{Y}=1/1.000001$

When you simulate the model, a net slope quantization error occurs.

To set up the model and run the simulation:

1. For the Inport block, set the Output data type to int16.

2. For the Data Type Conversion block, set the Output data type to fixdt(1,16, 1.000001, 0).

3. Set the Diagnostics > Type Conversion > Detect precision loss configuration parameter to error.
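The reason this model triggers a net slope quantization error can be seen without running a simulation. The Python sketch below shows that the net slope 1/1.000001 has a reduced denominator that is not a power of two, so no finite binary word can hold it exactly. It then quantizes the net slope to a hypothetical 16 fractional bits, rounding toward floor, to show a nonzero error. The word length and rounding that the software actually uses for the net slope are not stated in the text, so both are assumptions.

```python
from fractions import Fraction
import math

S_U = Fraction(1)
S_Y = Fraction("1.000001")
net_slope = S_U / S_Y                      # 1000000/1000001

# A value is exactly representable in binary only if its reduced
# denominator is a power of two.
d = net_slope.denominator
is_exact_in_binary = (d & (d - 1)) == 0

# Quantize to an assumed 16 fractional bits, rounding toward floor.
frac_bits = 16
quantized = Fraction(math.floor(net_slope * 2**frac_bits), 2**frac_bits)
error = net_slope - quantized

print(net_slope, is_exact_in_binary, quantized, float(error))
```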