Fixed-point arithmetic refers to how signed or unsigned binary words are operated on. The simplicity of fixed-point arithmetic functions such as addition and subtraction allows for cost-effective hardware implementations.
The sections that follow describe the rules that the Simulink^{®} software follows when arithmetic operations are performed on inputs and parameters. These rules are organized into four groups based on the operations involved: addition and subtraction, multiplication, division, and shifts. For each of these four groups, the rules for performing the specified operation are presented with an example using the rules.
The core architecture of many processors contains several computational units including arithmetic logic units (ALUs), multiply and accumulate units (MACs), and shifters. These computational units process the binary data directly and provide support for arithmetic computations of varying precision. The ALU performs a standard set of arithmetic and logic operations as well as division. The MAC performs multiply, multiply/add, and multiply/subtract operations. The shifter performs logical and arithmetic shifts, normalization, denormalization, and other operations.
Addition is the most common arithmetic operation a processor performs. When two n-bit numbers are added together, it is always possible to produce a result with n + 1 digits due to a carry out of the leftmost digit. For two's complement addition of two numbers, there are three cases to consider:
If both numbers are positive and the result of their addition has a sign bit of 1, then overflow has occurred; otherwise the result is correct.
If both numbers are negative and the sign of the result is 0, then overflow has occurred; otherwise the result is correct.
If the numbers are of unlike sign, overflow cannot occur and the result is always correct.
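The three sign-bit rules above can be sketched in C for 8-bit words. This is an illustrative check, not production overflow handling; the unsigned cast makes the wraparound of the raw sum well defined.

```c
#include <stdint.h>
#include <stdbool.h>

/* Detect overflow in 8-bit two's complement addition using the
   sign-bit rules: overflow occurs only when both operands share a
   sign and the raw sum's sign bit differs from theirs. */
bool add_overflows_s8(int8_t b, int8_t c) {
    uint8_t ub = (uint8_t)b, uc = (uint8_t)c;
    uint8_t sum = (uint8_t)(ub + uc);       /* wraps modulo 2^8 */
    bool same_sign = !((ub ^ uc) & 0x80u);  /* operands agree in sign */
    bool flipped   = ((ub ^ sum) & 0x80u) != 0; /* result sign differs */
    return same_sign && flipped;
}
```

For example, 100 + 100 overflows an 8-bit signed word, while operands of unlike sign never can.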
Consider the summation of two numbers. Ideally, the real-world values obey the equation
$${V}_{a}=\pm {V}_{b}\pm {V}_{c},$$
where V_{b} and V_{c} are the input values and V_{a} is the output value. To see how the summation is actually implemented, the three ideal values should be replaced by the general [Slope Bias] encoding scheme described in Scaling:
$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$
The equation in Addition gives the solution of the resulting equation for the stored integer, Q_{a}. Using shorthand notation, that equation becomes
$${Q}_{a}=\pm {F}_{sb}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}\pm {F}_{sc}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}+{B}_{net},$$
where F_{sb} and F_{sc} are the adjusted fractional slopes and B_{net} is the net bias. The offline conversions and online conversions and operations are discussed below.
Offline Conversions. F_{sb}, F_{sc}, and B_{net} are computed offline using round-to-nearest and saturation. Furthermore, B_{net} is stored using the output data type.
Online Conversions and Operations. The remaining operations are performed online by the fixed-point processor, and depend on the slopes and biases for the input and output data types. The worst (most inefficient) case occurs when the slopes and biases are mismatched. The worst-case conversions and operations are given by these steps:
The initial value for Q_{a} is given by the net bias, B_{net}:
$${Q}_{a}={B}_{net}.$$
The first input integer value, Q_{b}, is multiplied by the adjusted slope, F_{sb}:
$${Q}_{RawProduct}={F}_{sb}{Q}_{b}.$$
The previous product is converted to the modified output data type where the slope is one and the bias is zero:
$${Q}_{Temp}=convert\left({Q}_{RawProduct}\right).$$
This conversion includes any necessary bit shifting, rounding, or overflow handling.
The summation operation is performed:
$${Q}_{a}={Q}_{a}\pm {Q}_{Temp}.$$
This summation includes any necessary overflow handling.
It is important to note that bit shifting, rounding, and overflow handling are applied to the intermediate steps (3 and 4) and not to the overall sum.
If the scaling of the input and output signals is matched, the number of summation operations is reduced from the worst (most inefficient) case. For example, when an input has the same fractional slope as the output, step 2 reduces to multiplication by one and can be eliminated. Trivial steps in the summation process are eliminated for both simulation and code generation. Exclusive use of binary-point-only scaling for both input signals and output signals is a common way to eliminate mismatched slopes and biases, and results in the most efficient simulations and generated code.
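The worst-case steps can be sketched for a single input. This is an illustrative model, not Simulink's implementation: the adjusted slope F_sb is assumed precomputed offline and stored as a hypothetical integer `Fs_q` with exponent `Fs_e`, and convert() is modeled as a floor right shift by `sh` = E_a − E_b − Fs_e (assumed non-negative here).

```c
#include <stdint.h>

/* One accumulation step of the worst-case [Slope Bias] addition.
   Qa starts at B_net (step 1); this routine performs steps 2-4 for
   one input: multiply by the stored slope integer, convert with a
   floor shift, then accumulate. */
int32_t sb_add_one_input(int32_t Qa, int32_t Qb, int32_t Fs_q, int sh) {
    int64_t raw = (int64_t)Fs_q * Qb;       /* step 2: F_sb * Q_b */
    int32_t Qtemp = (int32_t)(raw >> sh);   /* step 3: convert()  */
    return Qa + Qtemp;                      /* step 4: accumulate */
}
```

With matched fractional slopes (Fs_q = 1, Fs_e = 0), converting an input at 2^-4 scaling to an output at 2^-3 is just a one-bit shift: the stored integer 87 (5.4375) becomes 43 (5.375).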
Suppose you want to sum three numbers. Each of these numbers is represented by an 8-bit word, and each has a different binary-point-only scaling. Additionally, the output is restricted to an 8-bit word with binary-point-only scaling of 2^{-3}.
The summation is shown in the following model for the input values 19.875, 5.4375, and 4.84375.
Applying the rules from the previous section, the sum follows these steps:
Because the biases are matched, the initial value of Q_{a} is trivial:
$${Q}_{a}=\mathrm{00000.000.}$$
The first number to be summed (19.875) has a fractional slope that matches the output fractional slope. Furthermore, the binary points and storage types are identical, so the conversion is trivial:
$$\begin{array}{l}{Q}_{b}=10011.111,\\ {Q}_{Temp}={Q}_{b}.\end{array}$$
The summation operation is performed:
$${Q}_{a}={Q}_{a}+{Q}_{Temp}=\mathrm{10011.111.}$$
The second number to be summed (5.4375) has a fractional slope that matches the output fractional slope, so a slope adjustment is not needed. The storage data types also match, but the difference in binary points requires that both the bits and the binary point be shifted one place to the right:
$$\begin{array}{l}{Q}_{c}=0101.0111,\\ {Q}_{Temp}=convert\left({Q}_{c}\right)\\ {Q}_{Temp}=\mathrm{00101.011.}\end{array}$$
Note that a loss in precision of one bit occurs, with the resulting value of Q_{Temp} determined by the rounding mode. For this example, round-to-floor is used. Overflow cannot occur in this case because the bits and binary point are both shifted to the right.
The summation operation is performed:
$$\begin{array}{r}{Q}_{a}={Q}_{a}+{Q}_{Temp}\\ 10011.111\\ \underline{+\ 00101.011}\\ 11001.010=25.250.\end{array}$$
Note that overflow did not occur, but it is possible for this operation.
The third number to be summed (4.84375) has a fractional slope that matches the output fractional slope, so a slope adjustment is not needed. The storage data types also match, but the difference in binary points requires that both the bits and the binary point be shifted two places to the right:
$$\begin{array}{l}{Q}_{d}=100.11011,\\ {Q}_{Temp}=convert\left({Q}_{d}\right)\\ {Q}_{Temp}=\mathrm{00100.110.}\end{array}$$
Note that a loss in precision of two bits occurs, with the resulting value of Q_{Temp} determined by the rounding mode. For this example, round-to-floor is used. Overflow cannot occur in this case because the bits and binary point are both shifted to the right.
The summation operation is performed:
$$\begin{array}{r}{Q}_{a}={Q}_{a}+{Q}_{Temp}\\ 11001.010\\ \underline{+\ 00100.110}\\ 11110.000=30.000.\end{array}$$
Note that overflow did not occur, but it is possible for this operation.
As shown here, the result of the final summation differs from the ideal sum:
$$\begin{array}{r}10011.111\\ 0101.0111\\ \underline{+\ 100.11011}\\ 11110.001=30.125.\end{array}$$
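The stepwise sum above can be reproduced with plain integer operations. The stored integers for the three inputs are 159 (19.875 at 2^-3), 87 (5.4375 at 2^-4), and 155 (4.84375 at 2^-5); in C, a right shift of a non-negative value rounds toward floor, matching the example's rounding mode.

```c
#include <stdint.h>

/* Reproduce the three-number sum with integer shifts. Each right
   shift realigns a binary point to the output's 2^-3 scaling. */
int16_t sum_example(void) {
    int16_t Qa = 0;          /* biases match, so Qa starts at 0        */
    Qa += 159;               /* 19.875 @ 2^-3: conversion is trivial   */
    Qa += 87 >> 1;           /* 5.4375 @ 2^-4 -> 2^-3 (1 bit lost)     */
    Qa += 155 >> 2;          /* 4.84375 @ 2^-5 -> 2^-3 (2 bits lost)   */
    return Qa;               /* 240, i.e. 240 * 2^-3 = 30.000          */
}
```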
Blocks that perform addition and subtraction include the Sum, Gain, and Discrete FIR Filter blocks.
The multiplication of an n-bit binary number with an m-bit binary number results in a product that is up to m + n bits in length for both signed and unsigned words. Most processors perform n-bit by n-bit multiplication and produce a 2n-bit result (double bits) assuming there is no overflow condition.
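For example, a 16-bit by 16-bit multiply with a full 32-bit (double-bit) result can be written in C by widening the operands before multiplying, so the product cannot overflow:

```c
#include <stdint.h>

/* n-bit by n-bit multiply with a 2n-bit result: widening the
   operands first guarantees the full product fits. */
int32_t mul_wide_s16(int16_t b, int16_t c) {
    return (int32_t)b * (int32_t)c;   /* 16 x 16 -> 32 bits */
}
```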
Consider the multiplication of two numbers. Ideally, the real-world values obey the equation
$${V}_{a}={V}_{b}{V}_{c},$$
where V_{b} and V_{c} are the input values and V_{a} is the output value. To see how the multiplication is actually implemented, the three ideal values should be replaced by the general [Slope Bias] encoding scheme described in Scaling:
$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$
The solution of the resulting equation for the output stored integer, Q_{a}, is given below:
$$\begin{array}{c}{Q}_{a}=\frac{{F}_{b}{F}_{c}}{{F}_{a}}{2}^{{E}_{b}+{E}_{c}-{E}_{a}}{Q}_{b}{Q}_{c}+\frac{{F}_{b}{B}_{c}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}\\ +\frac{{F}_{c}{B}_{b}}{{F}_{a}}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}+\frac{{B}_{b}{B}_{c}-{B}_{a}}{{F}_{a}}{2}^{-{E}_{a}}.\end{array}$$
Multiplication with Nonzero Biases and Mismatched Fractional Slopes. The worst-case implementation of the above equation occurs when the slopes and biases of the input and output signals are mismatched. In such cases, several low-level integer operations are required to carry out the high-level multiplication (or division). Implementation choices made about these low-level computations can affect the computational efficiency, rounding errors, and overflow.
In Simulink blocks, the actual multiplication or division operation is always performed on fixed-point variables that have zero biases. If an input has nonzero bias, it is converted to a representation that has binary-point-only scaling before the operation. If the result is to have nonzero bias, the operation is first performed with temporary variables that have binary-point-only scaling. The result is then converted to the data type and scaling of the final output.
If both the inputs and the output have nonzero biases, then the operation is broken down as follows:
$$\begin{array}{l}{V}_{1Temp}={V}_{1},\\ {V}_{2Temp}={V}_{2},\\ {V}_{3Temp}={V}_{1Temp}{V}_{2Temp},\\ {V}_{3}={V}_{3Temp},\end{array}$$
where
$$\begin{array}{l}{V}_{1Temp}={2}^{{E}_{1Temp}}{Q}_{1Temp},\\ {V}_{2Temp}={2}^{{E}_{2Temp}}{Q}_{2Temp},\\ {V}_{3Temp}={2}^{{E}_{3Temp}}{Q}_{3Temp}.\end{array}$$
These equations show that the temporary variables have binary-point-only scaling. However, the equations do not indicate the signedness, word lengths, or values of the fixed exponent of these variables. The Simulink software assigns these properties to the temporary variables based on the following goals:
Represent the original value without overflow.
The data type and scaling of the original value define a maximum and minimum real-world value:
$${V}_{Max}=F{2}^{E}{Q}_{MaxInteger}+B,$$
$${V}_{Min}=F{2}^{E}{Q}_{MinInteger}+B.$$
The data type and scaling of the temporary value must be able to represent this range without overflow. Precision loss is possible, but overflow is never allowed.
Use a data type that leads to efficient operations.
This goal is relative to the target that you will use for production deployment of your design. For example, suppose that you will implement the design on a 16-bit fixed-point processor that provides a 32-bit long, 16-bit int, and 8-bit short or char. For such a target, preserving efficiency means that no more than 32 bits are used, and the smaller sizes of 8 or 16 bits are used if they are sufficient to maintain precision.
Maintain precision.
Ideally, every possible value defined by the original data type and scaling is represented perfectly by the temporary variable. However, this can require more bits than is efficient. Bits are discarded, resulting in a loss of precision, to the extent required to preserve efficiency.
For example, consider the following, assuming a 16-bit microprocessor target:
$${V}_{Original}={Q}_{Original}-43.25,$$
where Q_{Original} is an 8-bit, unsigned data type. For this data type,
$$\begin{array}{c}{Q}_{MaxInteger}=255,\\ {Q}_{MinInteger}=0,\end{array}$$
so
$$\begin{array}{c}{V}_{Max}=211.75,\\ {V}_{Min}=-\mathrm{43.25.}\end{array}$$
The minimum possible value is negative, so the temporary variable must be a signed integer data type. The original variable has a slope of 1, but the bias is expressed with greater precision with two digits after the binary point. To get full precision, the fixed exponent of the temporary variable has to be -2 or less. The Simulink software selects the least possible precision, which is generally the most efficient, unless overflow issues arise. For a scaling of 2^{-2}, selecting signed 16-bit or signed 32-bit avoids overflow. For efficiency, the Simulink software selects the smaller choice of 16 bits. If the original variable is an input, then the equations to convert to the temporary variable are
$$\begin{array}{l}\begin{array}{cc}\text{uint8\_T}& {Q}_{Original},\\ \text{int16\_T}& {Q}_{Temp},\end{array}\\ {Q}_{Temp}=\left(\left(\text{int16\_T}\right){Q}_{Original}\ll 2\right)-173.\end{array}$$
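The conversion can be checked numerically. Per the text, the temporary variable is a signed 16-bit integer with scaling 2^-2, so Q_{Temp} = 4·Q_{Original} − 173 (the bias −43.25 re-encoded as −173 quarter units):

```c
#include <stdint.h>

/* Re-encode V = Q_Original - 43.25 with binary-point-only scaling
   2^-2: Q_Temp = 4*Q_Original - 173, and V = Q_Temp * 0.25. */
int16_t to_temp(uint8_t q_original) {
    return (int16_t)(((int16_t)q_original << 2) - 173);
}
```

The extremes land on the range computed above: Q_{Original} = 0 gives −173 (V_{Min} = −43.25) and Q_{Original} = 255 gives 847 (V_{Max} = 211.75), both well within a signed 16-bit word.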
Multiplication with Zero Biases and Mismatched Fractional Slopes. When the biases are zero and the fractional slopes are mismatched, the implementation reduces to
$${Q}_{a}=\frac{{F}_{b}{F}_{c}}{{F}_{a}}{2}^{{E}_{b}+{E}_{c}-{E}_{a}}{Q}_{b}{Q}_{c}.$$
The quantity
$${F}_{Net}=\frac{{F}_{b}{F}_{c}}{{F}_{a}}$$
is calculated offline using round-to-nearest and saturation. F_{Net} is stored using a fixed-point data type of the form
$${2}^{{E}_{Net}}{Q}_{Net},$$
where E_{Net} and Q_{Net} are selected automatically to best represent F_{Net}.
Online Conversions and Operations
The integer values Q_{b} and Q_{c} are multiplied:
$${Q}_{RawProduct}={Q}_{b}{Q}_{c}.$$
To maintain the full precision of the product, the binary point of Q_{RawProduct} is given by the sum of the binary points of Q_{b} and Q_{c}.
The previous product is converted to the output data type:
$${Q}_{Temp}=convert\left({Q}_{RawProduct}\right).$$
This conversion includes any necessary bit shifting, rounding, or overflow handling. Signal Conversions discusses conversions.
The previous result is multiplied by the stored integer Q_{Net}:
$${Q}_{2RawProduct}={Q}_{Temp}{Q}_{Net}.$$
The previous product is converted to the output data type:
$${Q}_{a}=convert\left({Q}_{2RawProduct}\right).$$
This conversion includes any necessary bit shifting, rounding, or overflow handling. Signal Conversions discusses conversions.
Steps 1 through 4 are repeated for each additional number to be multiplied.
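Steps 1 through 4 can be sketched as one C routine. This is an illustrative model, not Simulink's implementation: both convert() steps are modeled as floor right shifts, and the shift amounts `shift1` and `shift2` (which collect the exponent terms) are hypothetical parameters.

```c
#include <stdint.h>

/* Zero-bias, mismatched-slope multiply: raw product, convert,
   scale by the stored integer of F_Net, convert again. */
int32_t mul_mismatched(int32_t Qb, int32_t Qc,
                       int32_t Qnet, int shift1, int shift2) {
    int32_t raw = Qb * Qc;            /* step 1: full-precision product */
    int32_t Qtemp = raw >> shift1;    /* step 2: convert to output type */
    int32_t raw2 = Qtemp * Qnet;      /* step 3: scale by Q_Net         */
    return raw2 >> shift2;            /* step 4: final convert          */
}
```

With the illustrative values Qb = 6, Qc = 5, Qnet = 3, shift1 = 1, shift2 = 2, the routine computes ((30 >> 1) · 3) >> 2 = 11.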
Multiplication with Zero Biases and Matching Fractional Slopes. When the biases are zero and the fractional slopes match, the implementation reduces to
$${Q}_{a}={2}^{{E}_{b}+{E}_{c}-{E}_{a}}{Q}_{b}{Q}_{c}.$$
No offline conversions are performed.
Online Conversions and Operations
The integer values Q_{b} and Q_{c} are multiplied:
$${Q}_{RawProduct}={Q}_{b}{Q}_{c}.$$
To maintain the full precision of the product, the binary point of Q_{RawProduct} is given by the sum of the binary points of Q_{b} and Q_{c}.
The previous product is converted to the output data type:
$${Q}_{a}=convert\left({Q}_{RawProduct}\right).$$
This conversion includes any necessary bit shifting, rounding, or overflow handling. Signal Conversions discusses conversions.
Steps 1 and 2 are repeated for each additional number to be multiplied.
Suppose you want to multiply three numbers. Each of these numbers is represented by a 5-bit word, and each has a different binary-point-only scaling. Additionally, the output is restricted to a 10-bit word with binary-point-only scaling of 2^{-4}. The multiplication is shown in the following model for the input values 5.75, 2.375, and 1.8125.
Applying the rules from the previous section, the multiplication follows these steps:
The first two numbers (5.75 and 2.375) are multiplied:
$$\begin{array}{r}{Q}_{RawProduct}=101.11\times 10.011\\ 101.11\cdot {2}^{-3}\\ 101.11\cdot {2}^{-2}\\ \underline{+\ 101.11\cdot {2}^{1}}\\ 01101.10101=13.65625.\end{array}$$
Note that the binary point of the product is given by the sum of the binary points of the multiplied numbers.
The result of step 1 is converted to the output data type:
$$\begin{array}{c}{Q}_{Temp}=convert\left({Q}_{RawProduct}\right)\\ =001101.1010=\mathrm{13.6250.}\end{array}$$
Signal Conversions discusses conversions. Note that a loss in precision of one bit occurs, with the resulting value of Q_{Temp} determined by the rounding mode. For this example, round-to-floor is used. Furthermore, overflow did not occur but is possible for this operation.
The result of step 2 and the third number (1.8125) are multiplied:
$$\begin{array}{r}{Q}_{RawProduct}=01101.1010\times 1.1101\\ 1101.1010\cdot {2}^{-4}\\ 1101.1010\cdot {2}^{-2}\\ 1101.1010\cdot {2}^{-1}\\ \underline{+\ 1101.1010\cdot {2}^{0}}\\ 0011000.10110010=24.6953125.\end{array}$$
Note that the binary point of the product is given by the sum of the binary points of the multiplied numbers.
The product is converted to the output data type:
$$\begin{array}{c}{Q}_{a}=convert\left({Q}_{RawProduct}\right)\\ =011000.1011=\mathrm{24.6875.}\end{array}$$
Signal Conversions discusses conversions. Note that a loss in precision of four bits occurs, with the resulting value of Q_{a} determined by the rounding mode. For this example, round-to-floor is used. Furthermore, overflow did not occur, but it is possible for this operation.
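The worked product above maps directly onto integer operations. The stored integers are 23 (5.75 at 2^-2), 19 (2.375 at 2^-3), and 29 (1.8125 at 2^-4); right shifts realign binary points with round-to-floor, as in the text.

```c
#include <stdint.h>

/* Reproduce the three-number product with integer operations. */
int16_t product_example(void) {
    int16_t raw = 23 * 19;        /* 437 @ 2^-5 (binary points add)   */
    int16_t qtemp = raw >> 1;     /* convert to 2^-4: 218 = 13.625    */
    int32_t raw2 = qtemp * 29;    /* 6322 @ 2^-8                      */
    return (int16_t)(raw2 >> 4);  /* convert to 2^-4: 395 = 24.6875   */
}
```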
Blocks that perform multiplication include the Product, Discrete FIR Filter, and Gain blocks.
This section discusses the division of quantities with zero bias.
Note: When any input to a division calculation has nonzero bias, the operations performed exactly match those for multiplication described in Multiplication with Nonzero Biases and Mismatched Fractional Slopes.
Consider the division of two numbers. Ideally, the real-world values obey the equation
$${V}_{a}={V}_{b}/{V}_{c},$$
where V_{b} and V_{c} are the input values and V_{a} is the output value. To see how the division is actually implemented, the three ideal values should be replaced by the general [Slope Bias] encoding scheme described in Scaling:
$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$
For the case where the slope adjustment factors are one and the biases are zero for all signals, the solution of the resulting equation for the output stored integer, Q_{a}, is given by the following equation:
$${Q}_{a}={2}^{{E}_{b}-{E}_{c}-{E}_{a}}\left({Q}_{b}/{Q}_{c}\right).$$
This equation involves an integer division and some bit shifts. If E_{a} > E_{b}–E_{c}, then any bit shifts are to the right and the implementation is simple. However, if E_{a} < E_{b}–E_{c}, then the bit shifts are to the left and the implementation can be more complicated. The essential issue is that the output has more precision than the integer division provides. To get full precision, a fractional division is needed. The C programming language provides access to integer division only for fixed-point data types. Depending on the size of the numerator, you can obtain some of the fractional bits by performing a shift prior to the integer division. In the worst case, it might be necessary to resort to repeated subtractions in software.
In general, division of values is an operation that should be avoided in fixed-point embedded systems. Division where the output has more precision than the integer division (i.e., E_{a} < E_{b}–E_{c}) should be used with even greater reluctance.
Suppose you want to divide two numbers. Each of these numbers is represented by an 8-bit word, and each has a binary-point-only scaling of 2^{-4}. Additionally, the output is restricted to an 8-bit word with binary-point-only scaling of 2^{-4}.
The division of 9.1875 by 1.5000 is shown in the following model.
For this example,
$$\begin{array}{c}{Q}_{a}={2}^{-4-\left(-4\right)-\left(-4\right)}\left({Q}_{b}/{Q}_{c}\right)\\ ={2}^{4}\left({Q}_{b}/{Q}_{c}\right).\end{array}$$
Assuming a large data type was available, this could be implemented as
$${Q}_{a}=\frac{\left({2}^{4}{Q}_{b}\right)}{{Q}_{c}},$$
where the numerator uses the larger data type. If a larger data type was not available, integer division combined with four repeated subtractions would be used. Both approaches produce the same result, with the former being more efficient.
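The wider-type approach can be sketched in C for this example, where all three signals share 2^-4 scaling. The stored integers are 147 (9.1875) and 24 (1.5000), and the 32-bit intermediate keeps the shifted numerator from overflowing:

```c
#include <stdint.h>

/* Pre-shift division: Q_a = (Q_b << 4) / Q_c, with a wider
   intermediate so the left shift cannot overflow. */
int16_t div_example(int16_t Qb, int16_t Qc) {
    return (int16_t)(((int32_t)Qb << 4) / Qc);
}
```

Here 9.1875 / 1.5000 = 6.125, stored as 98 at 2^-4 scaling.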
Nearly all microprocessors and digital signal processors support well-defined bit-shift (or simply shift) operations for integers. For example, consider the 8-bit unsigned integer 00110101. The results of a 2-bit shift to the left and a 2-bit shift to the right are shown in the following table.
Shift Operation | Binary Value | Decimal Value |
---|---|---|
No shift (original number) | 00110101 | 53 |
Shift left by 2 bits | 11010100 | 212 |
Shift right by 2 bits | 00001101 | 13 |
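In C, the table's shifts on the unsigned value 00110101 (0x35, decimal 53) look like this; the left shift discards bits that leave the 8-bit word, and the unsigned right shift zero-fills from the left:

```c
#include <stdint.h>

uint8_t shl2(uint8_t x) { return (uint8_t)(x << 2); }  /* drops the two MSBs  */
uint8_t shr2(uint8_t x) { return x >> 2; }             /* zero-fills the MSBs */
```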
You can perform a shift using the Simulink Shift Arithmetic block. Use this block to perform a bit shift, a binary point shift, or both.
The special case of shifting bits to the right requires consideration of the treatment of the leftmost bit, which can contain sign information. A shift to the right can be classified either as a logical shift right or an arithmetic shift right. For a logical shift right, a 0 is incorporated into the most significant bit for each bit shift. For an arithmetic shift right, the most significant bit is recycled for each bit shift.
The Shift Arithmetic block performs an arithmetic shift right and, therefore, recycles the most significant bit for each bit shift right. For example, given the fixed-point number 11001.011 (-6.625), a bit shift two places to the right with the binary point unmoved yields the number 11110.010 (-1.75), as shown in the model below:
To perform a logical shift right on a signed number using the Shift Arithmetic block, use the Data Type Conversion block to cast the number as an unsigned number of equivalent length and scaling, as shown in the following model. The model shows that the fixed-point signed number 11001.011 (-6.625) becomes 00110.010 (6.25).
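Both right-shift flavors can be illustrated in C on the stored integer -53 (bit pattern 11001011, i.e. -6.625 at 2^-3 scaling). One caution: in C, `>>` on a negative signed value is implementation-defined, so the arithmetic shift below is written in a portable form that computes floor(x / 4) explicitly.

```c
#include <stdint.h>

/* Arithmetic shift right by 2: recycle the sign bit.
   Written as floor(x / 4) to avoid implementation-defined >>
   on negative values. */
int8_t asr2(int8_t x) {
    return (int8_t)(x / 4 - ((x % 4 < 0) ? 1 : 0));
}

/* Logical shift right by 2: cast to unsigned, zero-fill. */
uint8_t lsr2(int8_t x) {
    return (uint8_t)x >> 2;
}
```

For -53, the arithmetic shift yields -14 (11110010, i.e. -1.75 at 2^-3), matching the Shift Arithmetic block's behavior, while the logical shift yields 50 (00110010, i.e. 6.25).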