The sections that follow describe the relationship between arithmetic operations and fixed-point scaling, and offer some basic recommendations that may be appropriate for your fixed-point design. For each arithmetic operation,
The general [Slope Bias] encoding scheme described in Scaling is used.
The scaling of the result is automatically selected based on the scaling of the two inputs. In other words, the scaling is inherited.
Scaling choices are based on
Minimizing the number of arithmetic operations needed to compute the result
Maximizing the precision of the result
Additionally, binary-point-only scaling is presented as a special case of the general encoding scheme.
In embedded systems, the scaling of variables at the hardware interface (the ADC or DAC) is fixed. However, for most other variables, you can choose the scaling that gives the best design. When scaling fixed-point variables, it is important to remember that
Your scaling choices depend on the particular design you are simulating.
There is no single best scaling approach; every choice has associated advantages and disadvantages. The goal of this section is to expose those advantages and disadvantages.
Consider the addition of two real-world values:
$${V}_{a}={V}_{b}+{V}_{c}.$$
These values are represented by the general [Slope Bias] encoding scheme described in Scaling:
$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$
In a fixed-point system, the addition of values results in finding the variable Q_{a}:
$${Q}_{a}=\frac{{F}_{b}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}+\frac{{F}_{c}}{{F}_{a}}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}+\frac{{B}_{b}+{B}_{c}-{B}_{a}}{{F}_{a}}{2}^{-{E}_{a}}.$$
This formula shows
In general, Q_{a} is not computed through a simple addition of Q_{b} and Q_{c}.
In general, there are two multiplications of a constant and a variable, two additions, and some additional bit shifting.
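As a hypothetical numeric check, the general formula can be exercised directly. All slopes, biases, and exponents below are arbitrary example choices, not recommendations:

```python
def encode(v, F, E, B):
    """Quantize a real-world value v to its stored integer Q (round to nearest)."""
    return round((v - B) / (F * 2**E))

def decode(q, F, E, B):
    """Recover the real-world value represented by a stored integer Q."""
    return F * 2**E * q + B

# Arbitrary example encodings for V_b, V_c, and the sum V_a
Fb, Eb, Bb = 1.25, -4, 0.5
Fc, Ec, Bc = 1.50, -3, -1.0
Fa, Ea, Ba = 1.00, -3, 0.0

Vb, Vc = 3.7, 2.2
Qb = encode(Vb, Fb, Eb, Bb)   # 41
Qc = encode(Vc, Fc, Ec, Bc)   # 17

# Two constant*variable multiplications, two additions, and bit shifts
# (the powers of two), exactly as the formula predicts
Qa = ((Fb / Fa) * 2**(Eb - Ea) * Qb
      + (Fc / Fa) * 2**(Ec - Ea) * Qc
      + (Bb + Bc - Ba) / Fa * 2**(-Ea))

Va = decode(round(Qa), Fa, Ea, Ba)
print(Va)   # 5.875, within quantization error of Vb + Vc = 5.9
```

The floating-point powers of two stand in for the bit shifts a real implementation would use; the structure of the computation is what the formula dictates.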
In the process of finding the scaling of the sum, one reasonable goal is to simplify the calculations. Simplifying the calculations should reduce the number of operations, thereby increasing execution speed. The following choices can help to minimize the number of arithmetic operations:
Set B_{a} = B_{b} + B_{c}. This eliminates one addition.
Set F_{a} = F_{b} or F_{a} = F_{c}. Either choice eliminates one of the two constant times variable multiplications.
The resulting formula is
$${Q}_{a}={2}^{{E}_{b}-{E}_{a}}{Q}_{b}+\frac{{F}_{c}}{{F}_{a}}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}$$
or
$${Q}_{a}=\frac{{F}_{b}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}+{2}^{{E}_{c}-{E}_{a}}{Q}_{c}.$$
These equations appear equivalent, but your choice of rounding and the precision of the inputs may make one preferable to the other. To further simplify matters, you can choose E_{a} = E_{c} or E_{a} = E_{b}, which eliminates some bit shifting.
In the process of finding the scaling of the sum, one reasonable goal is maximum precision. You can determine the maximum-precision scaling if the range of the variable is known. Maximize Precision shows that you can determine the range of a fixed-point operation from max(V_{a}) and min(V_{a}). For a summation, you can determine the range from
$$\begin{array}{l}\mathrm{min}\left({\tilde{V}}_{a}\right)=\mathrm{min}\left({\tilde{V}}_{b}\right)+\mathrm{min}\left({\tilde{V}}_{c}\right),\\ \mathrm{max}\left({\tilde{V}}_{a}\right)=\mathrm{max}\left({\tilde{V}}_{b}\right)+\mathrm{max}\left({\tilde{V}}_{c}\right).\end{array}$$
You can now derive the maximum-precision slope:
$$\begin{array}{c}{F}_{a}{2}^{{E}_{a}}=\frac{\mathrm{max}\left({\tilde{V}}_{a}\right)-\mathrm{min}\left({\tilde{V}}_{a}\right)}{{2}^{w{s}_{a}}-1}\\ =\frac{{F}_{b}{2}^{{E}_{b}}\left({2}^{w{s}_{b}}-1\right)+{F}_{c}{2}^{{E}_{c}}\left({2}^{w{s}_{c}}-1\right)}{{2}^{w{s}_{a}}-1}.\end{array}$$
In most cases the input and output word sizes are much greater than one, and the slope becomes
$${F}_{a}{2}^{{E}_{a}}\approx {F}_{b}{2}^{{E}_{b}+w{s}_{b}-w{s}_{a}}+{F}_{c}{2}^{{E}_{c}+w{s}_{c}-w{s}_{a}},$$
which depends only on the input slopes and on the input and output word sizes. The corresponding bias is
$${B}_{a}=\mathrm{min}\left({\tilde{V}}_{a}\right)-{F}_{a}{2}^{{E}_{a}}\mathrm{min}\left({Q}_{a}\right).$$
The value of the bias depends on whether the inputs and output are signed or unsigned numbers.
If the inputs and output are all unsigned, then the minimum values for these variables are all zero and the bias reduces to a particularly simple form:
$${B}_{a}={B}_{b}+{B}_{c}.$$
If the inputs and the output are all signed, then the bias becomes
$$\begin{array}{l}{B}_{a}\approx {B}_{b}+{B}_{c}+{F}_{b}{2}^{{E}_{b}}\left(-{2}^{w{s}_{b}-1}+{2}^{w{s}_{b}-1}\right)+{F}_{c}{2}^{{E}_{c}}\left(-{2}^{w{s}_{c}-1}+{2}^{w{s}_{c}-1}\right),\\ {B}_{a}\approx {B}_{b}+{B}_{c}.\end{array}$$
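A quick numeric check of the slope approximation, assuming hypothetical 16-bit inputs, a 17-bit output, and unit fractional slopes (all values arbitrary):

```python
# Exact vs. approximate maximum-precision slope for a sum
Fb, Eb, wsb = 1.0, -6, 16
Fc, Ec, wsc = 1.0, -5, 16
wsa = 17

exact = (Fb * 2**Eb * (2**wsb - 1) + Fc * 2**Ec * (2**wsc - 1)) / (2**wsa - 1)
approx = Fb * 2**(Eb + wsb - wsa) + Fc * 2**(Ec + wsc - wsa)

print(exact, approx)   # the two slopes agree to roughly five decimal digits
```

For word sizes this large, dropping the "- 1" terms changes the slope by only a few parts per million.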
For binary-point-only scaling, finding Q_{a} results in this simple expression:
$${Q}_{a}={2}^{{E}_{b}-{E}_{a}}{Q}_{b}+{2}^{{E}_{c}-{E}_{a}}{Q}_{c}.$$
This scaling choice results in only one addition and some bit shifting. The avoidance of any multiplications is a big advantage of binary-point-only scaling.
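A minimal sketch of binary-point-only addition, with arbitrary binary-point positions chosen for illustration:

```python
# Binary-point-only addition (all F = 1, B = 0): one add plus shifts
Eb, Ec, Ea = -5, -3, -4   # arbitrary binary-point positions

def shift(q, n):
    """Multiply a stored integer by 2**n using shifts (n may be negative)."""
    return q << n if n >= 0 else q >> -n

Qb, Qc = 96, 20           # Vb = 96 * 2**-5 = 3.0, Vc = 20 * 2**-3 = 2.5

Qa = shift(Qb, Eb - Ea) + shift(Qc, Ec - Ea)
Va = Qa * 2**Ea
print(Qa, Va)             # 88 5.5, exactly Vb + Vc in this example
```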
The accumulation of values is closely associated with addition:
$${V}_{a\_new}={V}_{a\_old}+{V}_{b}.$$
Finding Q_{a_new} involves one multiplication of a constant and a variable, two additions, and some bit shifting:
$${Q}_{a\_new}={Q}_{a\_old}+\frac{{F}_{b}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}+\frac{{B}_{b}}{{F}_{a}}{2}^{-{E}_{a}}.$$
The important difference for fixed-point implementations is that the scaling of the output is identical to the scaling of the first input.
For binary-point-only scaling, finding Q_{a_new} results in this simple expression:
$${Q}_{a\_new}={Q}_{a\_old}+{2}^{{E}_{b}-{E}_{a}}{Q}_{b}.$$
This scaling option only involves one addition and some bit shifting.
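The accumulation loop can be sketched as follows; the sample values and scalings below are arbitrary, chosen so the shift amount is a right shift of 2:

```python
# Binary-point-only accumulation: Q_a_new = Q_a_old + 2**(Eb - Ea) * Q_b
Eb, Ea = -8, -6           # accumulator uses a coarser LSB for headroom

acc = 0                   # Q_a starts at zero
for qb in [512, 256, 768, 128]:    # stored integers; V = qb * 2**-8
    acc += qb >> (Ea - Eb)         # one addition and a shift per sample
total = acc * 2**Ea
print(acc, total)         # 416 6.5
```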
Note: The negative accumulation of values produces results that are analogous to those produced by the accumulation of values.
Consider the multiplication of two real-world values:
$${V}_{a}={V}_{b}{V}_{c}.$$
These values are represented by the general [Slope Bias] encoding scheme described in Scaling:
$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$
In a fixed-point system, the multiplication of values results in finding the variable Q_{a}:
$$\begin{array}{c}{Q}_{a}=\frac{{F}_{b}{F}_{c}}{{F}_{a}}{2}^{{E}_{b}+{E}_{c}-{E}_{a}}{Q}_{b}{Q}_{c}+\frac{{F}_{b}{B}_{c}}{{F}_{a}}{2}^{{E}_{b}-{E}_{a}}{Q}_{b}\\ +\frac{{F}_{c}{B}_{b}}{{F}_{a}}{2}^{{E}_{c}-{E}_{a}}{Q}_{c}+\frac{{B}_{b}{B}_{c}-{B}_{a}}{{F}_{a}}{2}^{-{E}_{a}}.\end{array}$$
This formula shows
In general, Q_{a} is not computed through a simple multiplication of Q_{b} and Q_{c}.
In general, there is one multiplication of a constant and two variables, two multiplications of a constant and a variable, three additions, and some additional bit shifting.
The number of arithmetic operations can be reduced with these choices:
Set B_{a} = B_{b}B_{c}. This eliminates one addition operation.
Set F_{a} = F_{b}F_{c}. This simplifies the triple multiplication, certainly the most difficult part of the equation to implement.
Set E_{a} = E_{b} + E_{c}. This eliminates some of the bit shifting.
The resulting formula is
$${Q}_{a}={Q}_{b}{Q}_{c}+\frac{{B}_{c}}{{F}_{c}}{2}^{-{E}_{c}}{Q}_{b}+\frac{{B}_{b}}{{F}_{b}}{2}^{-{E}_{b}}{Q}_{c}.$$
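A numeric check of the simplified product, using the recommended choices B_a = B_b B_c, F_a = F_b F_c, and E_a = E_b + E_c. All encodings below are arbitrary examples:

```python
Fb, Eb, Bb = 1.0, -4, 2.0
Fc, Ec, Bc = 1.0, -5, -1.0
Fa, Ea, Ba = Fb * Fc, Eb + Ec, Bb * Bc   # the recommended choices

Vb, Vc = 3.5, 1.75
Qb = round((Vb - Bb) / (Fb * 2**Eb))   # 24
Qc = round((Vc - Bc) / (Fc * 2**Ec))   # 88

# The two coefficients (Bc/Fc)*2**-Ec and (Bb/Fb)*2**-Eb can be computed offline
Qa = Qb * Qc + (Bc / Fc) * 2**-Ec * Qb + (Bb / Fb) * 2**-Eb * Qc
Va = Fa * 2**Ea * Qa + Ba
print(Va)   # 6.125, exactly Vb * Vc in this example
```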
You can determine the maximum-precision scaling if the range of the variable is known. Maximize Precision shows that you can determine the range of a fixed-point operation from
$$\mathrm{max}\left({\tilde{V}}_{a}\right)$$
and
$$\mathrm{min}\left({\tilde{V}}_{a}\right).$$
For multiplication, you can determine the range from
$$\begin{array}{c}\mathrm{min}\left({\tilde{V}}_{a}\right)=\mathrm{min}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\\ \mathrm{max}\left({\tilde{V}}_{a}\right)=\mathrm{max}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\end{array}$$
where
$$\begin{array}{l}{V}_{LL}=\mathrm{min}\left({\tilde{V}}_{b}\right)\cdot \mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{LH}=\mathrm{min}\left({\tilde{V}}_{b}\right)\cdot \mathrm{max}\left({\tilde{V}}_{c}\right),\\ {V}_{HL}=\mathrm{max}\left({\tilde{V}}_{b}\right)\cdot \mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{HH}=\mathrm{max}\left({\tilde{V}}_{b}\right)\cdot \mathrm{max}\left({\tilde{V}}_{c}\right).\end{array}$$
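The corner-product range computation is short enough to sketch directly, with arbitrary example input ranges:

```python
# Range of a product from the four corner products
vb_min, vb_max = -2.0, 3.0
vc_min, vc_max = -5.0, 4.0

corners = [vb_min * vc_min, vb_min * vc_max,
           vb_max * vc_min, vb_max * vc_max]
va_min, va_max = min(corners), max(corners)
print(va_min, va_max)   # -15.0 12.0
```

Note that when either input range straddles zero, the extremes need not come from matching endpoints, which is why all four corners must be checked.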
For binary-point-only scaling, finding Q_{a} results in this simple expression:
$${Q}_{a}={2}^{{E}_{b}+{E}_{c}-{E}_{a}}{Q}_{b}{Q}_{c}.$$
Consider the multiplication of a constant and a variable
$${V}_{a}=K\text{\hspace{0.05em}}{V}_{b},$$
where K is a constant called the gain. Since V_{a} results from the multiplication of a constant and a variable, finding Q_{a} is a simplified version of the general fixed-point multiplication formula:
$${Q}_{a}=\left(\frac{K{F}_{b}{2}^{{E}_{b}}}{{F}_{a}{2}^{{E}_{a}}}\right)\text{\hspace{0.05em}}{Q}_{b}+\left(\frac{K{B}_{b}-{B}_{a}}{{F}_{a}{2}^{{E}_{a}}}\right)\text{\hspace{0.17em}}.$$
Note that the terms in the parentheses can be calculated offline. Therefore, there is only one multiplication of a constant and a variable and one addition.
To implement the above equation without changing it to a more complicated form, the constants need to be encoded using a binary-point-only format. For each of these constants, the range is the trivial case of only one value. Despite the trivial range, the binary point formulas for maximum precision are still valid. The maximum-precision representations are the most useful choices unless there is an overriding need to avoid any shifting. The encoding of the constants is
$$\begin{array}{l}\left(\frac{K{F}_{b}{2}^{{E}_{b}}}{{F}_{a}{2}^{{E}_{a}}}\right)={2}^{{E}_{X}}{Q}_{X}\\ \left(\frac{K{B}_{b}-{B}_{a}}{{F}_{a}{2}^{{E}_{a}}}\right)={2}^{{E}_{Y}}{Q}_{Y}\end{array}$$
resulting in the formula
$${Q}_{a}={2}^{{E}_{X}}{Q}_{X}{Q}_{b}+{2}^{{E}_{Y}}{Q}_{Y}.$$
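The following sketch encodes the two offline constants in a binary-point-only, maximum-precision format and then applies the gain. The gain value, scalings, and `encode_const` helper are all hypothetical:

```python
import math

def encode_const(c, word_size=16):
    """Binary-point-only, maximum-precision encoding of a constant: c ~ Q * 2**E."""
    E = math.floor(math.log2(abs(c))) - (word_size - 2)  # keep the sign bit free
    return round(c / 2**E), E

# Hypothetical gain stage Va = K*Vb; all scalings below are arbitrary examples
K = 2.5
Fb, Eb, Bb = 1.0, -4, 1.0
Fa, Ea, Ba = 1.0, -3, 0.0

X = K * Fb * 2**Eb / (Fa * 2**Ea)   # 1.25, computed offline
Y = (K * Bb - Ba) / (Fa * 2**Ea)    # 20.0, computed offline
QX, EX = encode_const(X)
QY, EY = encode_const(Y)

Qb = 40                                    # Vb = 40 * 2**-4 + 1 = 3.5
Qa = round(2**EX * QX * Qb + 2**EY * QY)   # one multiply, one add, shifts
Va = Fa * 2**Ea * Qa + Ba
print(Va)   # 8.75, exactly K * Vb here
```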
The number of arithmetic operations can be reduced with these choices:
Set B_{a} = KB_{b}. This eliminates one constant term.
Set F_{a} = KF_{b} and E_{a} = E_{b}. This sets the other constant term to unity.
The resulting formula is simply
$${Q}_{a}={Q}_{b}.$$
If the number of bits is different, the only operations involved are handling potential overflows and performing sign extensions.
The scaling for maximum precision does not need to be different from the scaling for speed unless the output has fewer bits than the input. If this is the case, then saturation should be avoided by multiplying the slope by 2 for each lost bit. This prevents saturation but causes rounding to occur.
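Sketched below is a hypothetical reduction of a 16-bit value to an 8-bit output word; the output slope is changed by a factor of 2 per lost bit so that the shifted value still fits, at the cost of rounding. All values are arbitrary:

```python
wsb, wsa = 16, 8
Eb = -10
Ea = Eb + (wsb - wsa)     # 8 lost bits -> output slope is 2**8 times coarser

Qb = 51117                # Vb = 51117 * 2**-10 ~ 49.92, fits in 16 bits
Qa = Qb >> (Ea - Eb)      # discard the 8 LSBs; Qa now fits in 8 bits
print(Qa, Qa * 2**Ea)     # 199 49.75, close to Vb but rounded
```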
Division of values is an operation that should be avoided in fixed-point embedded systems, but it can still arise. Therefore, consider the division of two real-world values:
$${V}_{a}={V}_{b}/{V}_{c}.$$
These values are represented by the general [Slope Bias] encoding scheme described in Scaling:
$${V}_{i}={F}_{i}{2}^{{E}_{i}}{Q}_{i}+{B}_{i}.$$
In a fixed-point system, the division of values results in finding the variable Q_{a}:
$${Q}_{a}=\frac{{F}_{b}{2}^{{E}_{b}}{Q}_{b}+{B}_{b}}{{F}_{c}{F}_{a}{2}^{{E}_{c}+{E}_{a}}{Q}_{c}+{B}_{c}{F}_{a}{2}^{{E}_{a}}}-\frac{{B}_{a}}{{F}_{a}}{2}^{-{E}_{a}}.$$
This formula shows
In general, Q_{a} is not computed through a simple division of Q_{b} by Q_{c}.
In general, there are two multiplications of a constant and a variable, two additions, one division of a variable by a variable, one division of a constant by a variable, and some additional bit shifting.
The number of arithmetic operations can be reduced with these choices:
Set B_{a} = 0. This eliminates one addition operation.
If B_{c} = 0, then set the fractional slope F_{a} = F_{b}/F_{c}. This eliminates one constant times variable multiplication.
The resulting formula is
$${Q}_{a}=\frac{{Q}_{b}}{{Q}_{c}}{2}^{{E}_{b}-{E}_{c}-{E}_{a}}+\frac{\left({B}_{b}/{F}_{b}\right)}{{Q}_{c}}{2}^{-{E}_{c}-{E}_{a}}.$$
If B_{c} ≠ 0, then no clear recommendation can be made.
You can determine the maximum-precision scaling if the range of the variable is known. Maximize Precision shows that you can determine the range of a fixed-point operation from
$$\mathrm{max}\left({\tilde{V}}_{a}\right)$$
and
$$\mathrm{min}\left({\tilde{V}}_{a}\right).$$
For division, you can determine the range from
$$\begin{array}{c}\mathrm{min}\left({\tilde{V}}_{a}\right)=\mathrm{min}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\\ \mathrm{max}\left({\tilde{V}}_{a}\right)=\mathrm{max}\left({V}_{LL},{V}_{LH},{V}_{HL},{V}_{HH}\right),\end{array}$$
where for nonzero denominators
$$\begin{array}{l}{V}_{LL}=\mathrm{min}\left({\tilde{V}}_{b}\right)/\mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{LH}=\mathrm{min}\left({\tilde{V}}_{b}\right)/\mathrm{max}\left({\tilde{V}}_{c}\right),\\ {V}_{HL}=\mathrm{max}\left({\tilde{V}}_{b}\right)/\mathrm{min}\left({\tilde{V}}_{c}\right),\\ {V}_{HH}=\mathrm{max}\left({\tilde{V}}_{b}\right)/\mathrm{max}\left({\tilde{V}}_{c}\right).\end{array}$$
For binary-point-only scaling, finding Q_{a} results in this simple expression:
$${Q}_{a}=\frac{{Q}_{b}}{{Q}_{c}}{2}^{{E}_{b}-{E}_{c}-{E}_{a}}.$$
Note: For the last two formulas involving Q_{a}, a divide by zero and zero divided by zero are possible. In these cases, the hardware gives some default behavior, but you must verify that these default responses give meaningful results for the embedded system.
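A minimal sketch of binary-point-only division, with arbitrary scalings chosen so the shift is a left shift of the numerator. Pre-shifting before the integer divide limits truncation loss, and the zero denominator case from the note above is guarded explicitly:

```python
Eb, Ec, Ea = -8, -4, -6

Qb, Qc = 2400, 48          # Vb = 2400 * 2**-8 = 9.375, Vc = 48 * 2**-4 = 3.0

if Qc == 0:                # guard the divide-by-zero case
    raise ZeroDivisionError("Qc is zero; supply a meaningful default instead")

n = Eb - Ec - Ea           # +2 here, so the shift lands on the numerator
Qa = (Qb << n) // Qc       # one shift and one integer division
Va = Qa * 2**Ea
print(Qa, Va)              # 200 3.125, exactly Vb / Vc in this example
```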
From the previous analysis of fixed-point variables scaled within the general [Slope Bias] encoding scheme, you can conclude
Addition, subtraction, multiplication, and division can be very involved unless certain choices are made for the biases and slopes.
Binary-point-only scaling guarantees simpler math, but generally sacrifices some precision.
Note that the previous formulas don't show the following:
Constants and variables are represented with a finite number of bits.
Variables are either signed or unsigned.
The rounding and overflow handling schemes that are used. You must make these decisions before an actual fixed-point realization is achieved.