Fixed-Point Filter Design

Overview of Fixed-Point Filters

Fixed-point filters are most commonly used in DSP chips, where data storage is limited, and in embedded systems and devices where low power consumption is necessary. For example, the input data may come from a 12-bit ADC, the data bus may be 16 bits wide, and the multiplier may have 24 bits. Within these constraints, DSP System Toolbox™ software enables you to design the best possible fixed-point filter.

What Is a Fixed-Point Filter?

A fixed-point filter uses fixed-point arithmetic and is represented by an equation with fixed-point coefficients. To learn about fixed-point arithmetic, see Arithmetic Operations.

Data Types for Filter Functions

Data Type Support

There are three different data types supported in DSP System Toolbox software:

  • Fixed — Requires Fixed-Point Designer™ and is supported by packages listed in Fixed Data Type Support.

  • Double — Double precision, floating point and is the default data type for DSP System Toolbox software; accepted by all functions

  • Single — Single precision, floating point and is supported by specific packages outlined in Single Data Type Support.

Fixed Data Type Support

To use the fixed data type, you must have Fixed-Point Designer. Type ver at the MATLAB® command prompt to get a listing of all installed products.

The fixed data type is reserved for any filter whose Arithmetic property is set to fixed. Furthermore, all functions that work with this filter, whether in analysis or design, also accept and support the fixed data type.

To set the filter's arithmetic property:

f = fdesign.bandpass(.35,.45,.55,.65,60,1,60);
Hf = design(f, 'equiripple');
Hf.Arithmetic = 'fixed';

Single Data Type Support

The support of the single data types comes in two varieties. First, input data of type single can be fed into a double filter, where it is immediately converted to double. Thus, while the filter still operates in the double mode, the single data type input does not break it. The second variety is where the filter itself is set to single precision. In this case, it accepts only single data type input, performs all calculations, and outputs data in single precision. Furthermore, such analyses as noisepsd and freqrespest also operate in single precision.

To set the filter to single precision:

f = fdesign.bandpass(.35,.45,.55,.65,60,1,60);
Hf = design(f, 'equiripple');
Hf.Arithmetic = 'single';

Floating-Point to Fixed-Point Filter Conversion

Process Overview

The conversion from floating point to fixed point consists of two main parts: quantizing the coefficients and performing the dynamic range analysis. Quantizing the coefficients is a process of converting the coefficients to fixed-point numbers. The dynamic range analysis is a process of fine tuning the scaling of each node to ensure that the fraction lengths are set for full input range coverage and maximum precision. The following steps describe this conversion process.

Design the Filter

Start by designing a regular, floating-point, equiripple bandpass filter, as shown in the following figure.

In this design, the passband extends from 0.45 to 0.55 of normalized frequency, the acceptable passband ripple is 1 dB, the first stopband extends from 0 to 0.35 (normalized), the second stopband extends from 0.65 to 1 (normalized), and both stopbands provide 60 dB of attenuation.

To design this filter, evaluate the following code, or type it at the MATLAB command prompt:

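f = fdesign.bandpass(.35,.45,.55,.65,60,1,60);  % stopband/passband edges, then Astop1, Apass, Astop2 in dB
Hd = design(f, 'equiripple');                   % double-precision equiripple FIR design
fvtool(Hd)                                      % open the Filter Visualization Tool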

The last line of code invokes the Filter Visualization Tool, which displays the designed filter. You use Hd, which is a double, floating-point filter, both as the baseline and a starting point for the conversion.

Quantize the Coefficients

The first step in quantizing the coefficients is to find the valid word length for the coefficients. Here again, the hardware usually dictates the maximum allowable setting. However, if this constraint is large enough, there is room for some trial and error. Start with the coefficient word length of 8 and determine if the resulting filter is sufficient for your needs.

To set the coefficient word length of 8, evaluate or type the following code at the MATLAB command prompt:

Hf = Hd;
Hf.Arithmetic = 'fixed';
set(Hf, 'CoeffWordLength', 8);

The resulting filter is shown in the following figure.

As the figure shows, the filter design constraints are not met. The attenuation is not complete, and there is noise at the edges of the stopbands. You can experiment with different coefficient word lengths if you like. For this example, however, the word length of 12 is sufficient.

To set the coefficient word length of 12, evaluate or type the following code at the MATLAB command prompt:

set(Hf, 'CoeffWordLength', 12);

The resulting filter satisfies the design constraints, as shown in the following figure.
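To get a feel for what changing the coefficient word length does, the following base MATLAB sketch quantizes a set of coefficients to 8 and 12 bits and reports the worst-case coefficient error. It does not use the filter object. The windowed-sinc prototype h and the helpers fracLen and quantizeCoeffs are hypothetical stand-ins, and the quantization rule (signed, round to nearest, largest fraction length that does not overflow) is an assumption rather than a description of what the toolbox does internally.

```matlab
% Hypothetical 48-tap lowpass prototype (Hamming-windowed sinc), no toolbox needed.
N = 47;                                      % filter order
n = (0:N)' - N/2;                            % symmetric time index, never zero for odd N
h = sin(pi*0.25*n) ./ (pi*n);                % ideal lowpass, cutoff 0.25 (Nyquist = 1)
h = h .* (0.54 - 0.46*cos(2*pi*(0:N)'/N));   % Hamming window

% Assumed quantizer: signed, round to nearest, largest fraction length without overflow.
fracLen        = @(c, WL) floor(log2((2^(WL-1) - 1) / max(abs(c))));
quantizeCoeffs = @(c, WL) round(c * 2^fracLen(c, WL)) / 2^fracLen(c, WL);

FL8  = fracLen(h, 8);    h8  = quantizeCoeffs(h, 8);
FL12 = fracLen(h, 12);   h12 = quantizeCoeffs(h, 12);

err8  = max(abs(h - h8));     % bounded by half a quantization step, 2^(-FL8-1)
err12 = max(abs(h - h12));    % bounded by 2^(-FL12-1)
fprintf('8 bits : FL = %d, max error = %.3g\n', FL8,  err8);
fprintf('12 bits: FL = %d, max error = %.3g\n', FL12, err12);
```

Each additional coefficient bit halves the quantization step, which is why the move from 8 to 12 bits can recover stopband attenuation that the coarser coefficients could not provide.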

Now that the coefficient word length is set, there are other data width constraints that might require attention. Type the following at the MATLAB command prompt:

>> info(Hf)
Discrete-Time FIR Filter (real)
Filter Structure  : Direct-Form FIR
Filter Length     : 48             
Stable            : Yes            
Linear Phase      : Yes (Type 2)   
Arithmetic        : fixed          
Numerator         : s12,14 -> [-1.250000e-001 1.250000e-001)
Input             : s16,15 -> [-1 1)
Filter Internals  : Full Precision  
  Output          : s31,29 -> [-2 2)  (auto determined)
  Product         : s27,29 -> [-1.250000e-001 1.250000e-001)  (auto determined)
  Accumulator     : s31,29 -> [-2 2)  (auto determined)
  Round Mode      : No rounding
  Overflow Mode   : No overflow

The output requires 31 bits, the accumulator requires 31 bits, and the multiplier requires 27 bits. A typical piece of hardware might have a 16-bit data bus, a 24-bit multiplier, and an accumulator with 4 guard bits. Another reasonable assumption is that the data comes from a 12-bit ADC. To reflect these constraints, type or evaluate the following code:

set(Hf, 'InputWordLength', 12);
set(Hf, 'FilterInternals', 'SpecifyPrecision');
set(Hf, 'ProductWordLength', 24);
set(Hf, 'AccumWordLength', 28);
set(Hf, 'OutputWordLength', 16);
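The word lengths in the info listing and in the hardware constraints can be related with a little arithmetic. The sketch below uses only base MATLAB. The product rule it applies (word lengths add, minus one bit; fraction lengths add) is an assumption inferred from the listing, not a documented guarantee, and all variable names are illustrative.

```matlab
% Word and fraction lengths taken from the info(Hf) listing.
inWL   = 16;   inFL   = 15;     % Input     : s16,15
coefWL = 12;   coefFL = 14;     % Numerator : s12,14

% Assumed rule, consistent with the listing: a signed product needs
% inWL + coefWL - 1 bits, and the fraction lengths add.
prodWL = inWL + coefWL - 1;     % 27, matches Product : s27,29
prodFL = inFL + coefFL;         % 29

% Hardware assumed in the text: 12-bit ADC, 24-bit multiplier, 4 guard bits.
adcWL     = 12;
hwProdWL  = 24;
guardBits = 4;
prodWL12  = adcWL + coefWL - 1;       % product size once the input is 12 bits
hwAccumWL = hwProdWL + guardBits;     % 28, the AccumWordLength set above
fprintf('product: s%d,%d; with 12-bit input: %d bits; accumulator: %d bits\n', ...
        prodWL, prodFL, prodWL12, hwAccumWL);
```

Under this assumed rule, a 12-bit input and 12-bit coefficients yield a 23-bit product, which fits in the 24-bit multiplier.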

Although the filter is nearly complete, filtering data with it at this stage may produce erroneous results due to overflows. These overflows occur because you have defined the word-length constraints but have not yet tuned the scaling to handle the range of input data over which the filter is designed to operate. Dynamic range analysis is the next step, and it ensures that no overflows occur.

Dynamic Range Analysis

The purpose of the dynamic range analysis is to fine tune the scaling of the coefficients. The ideal set of coefficients is valid for the full range of input data, while the fraction lengths maximize precision. Consider carefully the range of input data to use for this step. If you provide data that covers the largest dynamic range in the filter, the resulting scaling is more conservative, and some precision is lost. If you provide data that covers a very narrow input range, the precision can be much greater, but an input out of the design range may produce an overflow. In this example, you use the worst-case input signal, covering a full dynamic range, in order to ensure that no overflow ever occurs. This worst-case input signal is a scaled version of the sign of the flipped impulse response.

To scale the coefficients based on the full dynamic range, type or evaluate the following code:

x = 1.9*sign(flip(impz(Hf)));  % flip works whether impz returns a row or a column
Hf = autoscale(Hf, x);
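The reason the sign of the flipped impulse response is the worst case can be checked with a few lines of base MATLAB. In the sketch below, h is a small, deliberately asymmetric set of hypothetical coefficients (not the filter designed above), and conv stands in for the filter. With this input, every product in the convolution sum has the same sign at one output sample, so the output reaches the largest magnitude that any input bounded by A can produce.

```matlab
h = [0.05; -0.2; 0.6; 0.3; -0.1];   % hypothetical, deliberately asymmetric FIR coefficients
A = 1.9;                            % input amplitude, as in the text
x = A * sign(flipud(h));            % worst-case input: sign of the flipped impulse response
y = conv(h, x);                     % full FIR response to x

peak  = max(abs(y));                % occurs where h and the flipped x fully overlap
bound = A * sum(abs(h));            % largest |output| for any input with |x| <= A
fprintf('peak = %.4f, bound = %.4f\n', peak, bound);
```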

To check that the coefficients are in range (no overflows) and have maximum possible precision, type or evaluate the following code:

fipref('LoggingMode', 'on', 'DataTypeOverride', 'ForceOff');
y = filter(Hf, x);
fipref('LoggingMode', 'off');
R = qreport(Hf)

Where R is shown in the following figure:

The report shows no overflows, and all data falls within the designed range. The conversion has completed successfully.

Compare Magnitude Response and Magnitude Response Estimate

You can use the fvtool GUI to perform analysis on your quantized filter, for example to see the effects of quantization on stopband attenuation. Two important final checks when analyzing a quantized filter are the Magnitude Response Estimate and the Round-off Noise Power Spectrum. The following example shows the value of the Magnitude Response Estimate analysis.

View the Magnitude Response Estimate

Begin by designing a simple lowpass filter using the following command:

h = design(fdesign.lowpass, 'butter','SOSScaleNorm','Linf');

Now set the arithmetic to fixed point:

h.Arithmetic = 'fixed';

Open the filter using fvtool:

fvtool(h)

When fvtool displays the filter using the Magnitude response view, the quantized filter seems to match the original filter quite well.

However, if you look at the Magnitude Response Estimate plot from the Analysis menu, you will see that the actual filter created may not perform nearly as well as the Magnitude Response plot indicates.

This is because the noise-based method of the Magnitude Response Estimate estimates the complex frequency response of your filter by applying a noise-like signal to the filter input. Magnitude Response Estimate uses Monte Carlo trials to generate a noise signal that contains complete frequency content across the range 0 to Fs. For more information about analyzing filters in this way, refer to the section titled Analyzing Filters with a Noise-Based Method in the User Guide.

For more information, refer to McClellan, et al., Computer-Based Exercises for Signal Processing Using MATLAB 5, Prentice-Hall, 1998. See Project 5: Quantization Noise in Digital Filters, page 231.

Create an FIR Filter Using Integer Coefficients

Review of Fixed-Point Numbers

Terminology of Fixed-Point Numbers.  DSP System Toolbox functions assume fixed-point quantities are represented in two's complement format, and are described using the WordLength and FracLength parameters. It is common to represent fractional quantities of WordLength 16 with the leftmost bit representing the sign and the remaining bits representing the fraction to the right of the binary point. Often the FracLength is thought of as the number of bits to the right of the binary point. However, there is a problem with this interpretation when the FracLength is larger than the WordLength, or when the FracLength is negative.

To work around these cases, you can use the following interpretation of a fixed-point quantity:

The register has a WordLength of B; in other words, it has B bits. The bits are numbered from right to left, from 0 to B-1. The most significant bit (MSB) is the leftmost bit, $b_{B-1}$. The least significant bit (LSB) is the rightmost bit, $b_0$. You can think of the FracLength as a quantity specifying how to interpret the stored bits and resolve the value they represent. The value represented by the bits is determined by assigning a weight to each bit:

In this figure, L is the integer FracLength. It can assume any value, depending on the quantization step size. L is necessary to interpret the value that the bits represent. This value is given by the equation

$$\text{value} = -b_{B-1}\,2^{B-1-L} + \sum_{k=0}^{B-2} b_k\,2^{k-L}$$

The value $2^{-L}$ is the smallest possible difference between two numbers represented in this format, otherwise known as the quantization step. For this reason, it is preferable to think of the FracLength as the negative of the exponent used to weight the rightmost, or least significant, bit of the fixed-point number.
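The interpretation above can be tried out in base MATLAB without any fixed-point objects. This sketch assumes a two's complement register, so the MSB carries a negative weight, and it computes the represented value as the stored integer multiplied by 2^(-L). The bit pattern and the variable names are made up for illustration.

```matlab
bits = '1011';                     % b3 b2 b1 b0, MSB first (hypothetical 4-bit register)
b    = bits - '0';                 % numeric row vector [1 0 1 1]
B    = numel(b);                   % WordLength

w    = 2.^(B-1:-1:0);              % bit weights for L = 0: [8 4 2 1]
w(1) = -w(1);                      % two's complement: the MSB weight is negative
storedInt = sum(b .* w);           % -8 + 2 + 1 = -5

value = @(L) storedInt * 2^(-L);   % same bits, interpreted with FracLength L
fprintf('L =  0: %g\n', value(0));     % plain integer
fprintf('L =  2: %g\n', value(2));     % two fractional bits
fprintf('L = -2: %g\n', value(-2));    % negative FracLength
fprintf('L =  6: %g\n', value(6));     % FracLength larger than WordLength
```

The last two calls cover the cases where counting bits to the right of the binary point stops making sense, yet the weighting rule still gives a well-defined value.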

To reduce the number of bits used to represent a given quantity, you can discard the least-significant bits. This method minimizes the quantization error since the bits you are removing carry the least weight. For instance, the following figure illustrates reducing the number of bits from 4 to 2:

This means that the FracLength has changed from L to L – 2.

You can think of integers as being represented with a FracLength of L = 0, so that the quantization step becomes $2^{-L} = 2^{0} = 1$.

Suppose B = 16 and L = 0. Then the numbers that can be represented are the integers $\{-32768, -32767, \ldots, -1, 0, 1, \ldots, 32766, 32767\}$.

If you need to quantize these numbers to use only 8 bits, discard the LSBs as described above, so that B = 8 and L = 0 - 8 = -8. The increment, or quantization step, then becomes $2^{-(-8)} = 2^{8} = 256$. You still have the same range of values, but with less precision, and the numbers that can be represented become $\{-32768, -32512, \ldots, -256, 0, 256, \ldots, 32256, 32512\}$.

With this quantization, the largest possible error becomes about 256/2 when rounding to the nearest value, with a special case for 32767.
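The 16-bit to 8-bit example can be reproduced exhaustively in base MATLAB. The sketch below assumes round-to-nearest with saturation at the ends of the 8-bit range, which is one plausible quantizer behavior rather than the only one, and it uses plain doubles instead of fixed-point objects.

```matlab
B    = 8;                      % new word length
L    = -8;                     % new fraction length
step = 2^(-L);                 % quantization step: 256

v  = (-32768:32767)';                        % every 16-bit integer (B = 16, L = 0)
q  = round(v / step);                        % stored 8-bit integer, round to nearest
q  = min(max(q, -2^(B-1)), 2^(B-1) - 1);     % saturate to [-128, 127]
vq = q * step;                               % value represented after requantization

err = abs(v - vq);
fprintf('distinct values: %d\n', numel(unique(vq)));
fprintf('largest error up to 32640: %d\n', max(err(v <= 32640)));
fprintf('error at 32767: %d\n', err(end));
```

The error at 32767 is larger than half a step because the nearest multiple of 256 above it, 32768, is not representable, so the value falls back to 32512.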

Integers and Fixed-Point Filters

This section provides an example of how you can create a filter with integer coefficients. In this example, a raised-cosine filter with floating-point coefficients is created, and the filter coefficients are then converted to integers.

Define the Filter Coefficients.  To illustrate the concepts of using integers with fixed-point filters, this example will use a raised-cosine filter:

b = rcosdesign(.25, 12.5, 8, 'sqrt');

The coefficients of b are normalized so that the passband gain is equal to 1, and are all smaller than 1. To make them integers, you need to scale them. If you want to use 18 bits for each coefficient, the range of possible values for the coefficients becomes $[-2^{17},\ 2^{17}-1] = [-131072,\ 131071]$.

Because the largest coefficient of b is positive, it will need to be scaled as close as possible to 131071 (without overflowing) in order to minimize quantization error. You can determine the exponent of the scale factor by executing:

B = 18; % Number of bits
L = floor(log2((2^(B-1)-1)/max(b)));  % Round towards zero to avoid overflow
bsc = b*2^L;

Alternatively, you can use the fixed-point numbers autoscaling tool as follows:

bq = fi(b, true, B);  % signed = true, B = 18 bits
L = bq.FractionLength;

It is a coincidence that B and L are both 18 in this case, because of the value of the largest coefficient of b. If, for example, the maximum value of b were 0.124, L would be 20 while B (the number of bits) would remain 18.
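The dependence of L on the largest coefficient can be checked with the same formula used for the scale factor, in base MATLAB only. In this sketch, scaleExp is a hypothetical helper name, and 0.3777 is a made-up peak value chosen to land on L = 18; only 0.124 comes from the text.

```matlab
B = 18;                                                      % word length, as in the text
scaleExp = @(maxCoeff) floor(log2((2^(B-1) - 1) / maxCoeff)); % exponent of the scale factor

L1 = scaleExp(0.3777);     % hypothetical peak coefficient
L2 = scaleExp(0.124);      % peak value quoted in the text

largest1 = round(0.3777 * 2^L1);   % scaled peak, must not exceed 2^17 - 1
largest2 = round(0.124  * 2^L2);
fprintf('max 0.3777 -> L = %d, scaled peak = %d\n', L1, largest1);
fprintf('max 0.124  -> L = %d, scaled peak = %d\n', L2, largest2);
```

A smaller peak coefficient leaves room for a larger scale factor, which is why L can exceed B while the word length stays at 18 bits.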

Build the FIR Filter.  First create the filter using the direct form, tapped delay line structure:

h = dfilt.dffir(bsc);

In order to set the required parameters, the arithmetic must be set to fixed-point:

h.Arithmetic = 'fixed';
h.CoeffWordLength = 18;

You can check that the coefficients of h are all integers:

all(h.Numerator == round(h.Numerator))

ans =

     1

Now you can examine the magnitude response of the filter using fvtool:

fvtool(h, 'Color', 'white')

This shows a large gain of 117 dB in the passband, which is due to the large values of the coefficients. This gain causes the output of the filter to be much larger than the input. A method of addressing this is discussed in the following sections.

Set the Filter Parameters to Work with Integers.  You need to set the input parameters of your filter to appropriate values for working with integers. For example, if the input to the filter is from an A/D converter with 12-bit resolution, set the input as follows:

h.InputWordLength = 12;
h.InputFracLength = 0;

The info method returns a summary of the filter settings:

info(h)

Discrete-Time FIR Filter (real)              
Filter Structure  : Direct-Form FIR          
Filter Length     : 101     
Stable            : Yes     
Linear Phase      : Yes (Type 1)             
Arithmetic        : fixed   
Numerator         : s18,0 -> [-131072 131072)
Input             : s12,0 -> [-2048 2048)    
Filter Internals  : Full Precision           
  Output          : s31,0 -> [-1073741824 1073741824)  (auto determined)
  Product         : s29,0 -> [-268435456 268435456)  (auto determined)  
  Accumulator     : s31,0 -> [-1073741824 1073741824)  (auto determined)
  Round Mode      : No rounding              
  Overflow Mode   : No overflow   

In this case, all the fractional lengths are now set to zero, meaning that the filter h is set up to handle integers.

Create a Test Signal for the Filter.  You can generate an input signal for the filter by quantizing to 12 bits using the autoscaling feature, or you can follow the same procedure that was used for the coefficients, discussed previously. In this example, create a signal with two sinusoids:

n = 0:999;
f1 = 0.1*pi;  % Normalized frequency of first sinusoid
f2 = 0.8*pi;  % Normalized frequency of second sinusoid
x = 0.9*sin(f1*n) + 0.9*sin(f2*n);
xq = fi(x, true, 12);  % signed = true, B = 12
xsc = fi(xq.int, true, 12, 0);  % stored integers of xq, reinterpreted with fraction length 0

Filter the Test Signal.  To filter the input signal generated above, enter the following:

ysc = filter(h, xsc);

Here ysc is a full-precision output, meaning that no bits have been discarded in the computation. This makes ysc the best possible output you can achieve given the 12-bit input and the 18-bit coefficients. You can verify this by filtering in double-precision floating point and comparing the results of the two filtering operations:

hd = double(h);
xd = double(xsc);
yd = filter(hd, xd);
norm(yd - double(ysc))

ans =

     0

Now you can examine the output compared to the input. This example is plotting only the last few samples to minimize the effect of transients:

idx = 800:950;
xscext = double(xsc(idx)');
gd = grpdelay(h, [f1 f2]);
yidx = idx + gd(1);
yscext = double(ysc(yidx)');
stem(n(idx)', [xscext, yscext]);
axis([800 950 -2.5e8 2.5e8]);
legend('input', 'output');
set(gcf, 'color', 'white');

It is difficult to compare the two signals in this figure because of the large difference in scales. This is due to the large gain of the filter, so you will need to compensate for the filter gain:

stem(n(idx)', [2^18*xscext, yscext]);
axis([800 950 -5e8 5e8]);
legend('scaled input', 'output');

You can see how the signals compare much more easily once the scaling has been done, as seen in the above figure.

Truncate the Output WordLength.  If you examine the output wordlength,

ysc.WordLength

ans =

    31

you will notice that the number of bits in the output is considerably greater than in the input. Because such growth in the number of bits representing the data may not be desirable, you may need to truncate the wordlength of the output. As discussed in Terminology of Fixed-Point Numbers, the best way to do this is to discard the least significant bits, in order to minimize error. However, if you know there are unused high-order bits, you should discard those bits as well.

To determine whether there are unused most significant bits (MSBs), look at where the growth in WordLength arises in the computation. In this case, the bit growth accommodates the results of adding products of the input (12 bits) and the coefficients (18 bits). Each of these products is 29 bits long (you can verify this using info(h)). The bit growth due to the accumulation of the products depends on the filter length and the coefficient values. However, this is a worst-case determination, in the sense that no assumption is made about the input signal, and as a result there may be unused MSBs. Be careful, though: MSBs that are incorrectly deemed unused will cause overflows.
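A worst-case estimate of accumulator bit growth can be sketched in base MATLAB as follows. The coefficients and word lengths here are small hypothetical values chosen so the arithmetic is easy to follow, and the product rule (word lengths add, minus one bit) is an assumption that matches the 12-bit input, 18-bit coefficient, 29-bit product figures quoted above.

```matlab
c      = [3 -5 7 -5 3];     % hypothetical integer coefficients (fit in 4 signed bits)
inWL   = 4;                 % hypothetical signed input word length, range [-8, 7]
coefWL = 4;                 % coefficient word length

prodWL = inWL + coefWL - 1;             % bits per product under the assumed rule: 7
worst  = sum(abs(c)) * 2^(inWL - 1);    % upper bound on |output|: 23 * 8 = 184
accWL  = floor(log2(worst)) + 2;        % smallest signed word length that holds +/-worst
growth = accWL - prodWL;                % bits added by the accumulation
fprintf('product: %d bits, accumulator: %d bits, growth: %d bits\n', ...
        prodWL, accWL, growth);
```

Because the bound assumes that every input sample sits at full scale with the least favorable sign, a real signal may never reach it, which is where unused MSBs come from.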

Suppose you want to keep 16 bits for the output. In this case, there is no bit-growth due to the additions, so the output bit setting will be 16 for the wordlength and –14 for the fraction length.

Since the filtering has already been done, you can discard some bits from ysc:

yout = fi(ysc, true, 16, -14);

Alternatively, you can set the filter output bit lengths directly (this is useful if you plan on filtering many signals):

h.OutputWordLength = 16;
h.OutputFracLength = -14;
yout2 = filter(h, xsc);

You can verify that the results are the same either way:

norm(double(yout) - double(yout2))

ans =

     0

However, if you compare this to the full-precision output, you will notice that there is rounding error due to the discarded bits:

norm(double(yout) - double(ysc))

ans =


In this case the differences are hard to spot when plotting the data, as seen below:

stem(n(yidx), [double(yout(yidx)'), double(ysc(yidx)')]);
axis([850 950 -2.5e8 2.5e8]);
legend('truncated output', 'full-precision output');
set(gcf, 'color', 'white');

Scale the Output.  Because the filter in this example has such a large gain, the output is at a different scale than the input. This scaling is purely a matter of interpretation, however, and you can scale the data however you like. In this case, you have 16 bits for the output, but you can attach whatever scaling you choose. It would be natural to reinterpret the output so that the LSB has a weight of 2^0 (that is, L = 0). This is equivalent to scaling the output signal by a factor of 2^(-14). However, no computation or rounding error is involved. You can do this by executing the following:

yri = fi(yout.int, true, 16, 0);  % same stored integers as yout, fraction length 0
stem(n(idx)', [xscext, double(yri(yidx)')]);
axis([800 950 -1.5e4 1.5e4]);
legend('input', 'rescaled output');

This plot shows that the output is still larger than the input. If you had done the filtering in double-precision floating point, this would not be the case. Here, more bits are used for the output than for the input, so the MSBs are weighted differently. You can see this another way by looking at the magnitude response of the scaled filter:

[H,w] = freqz(h);
plot(w/pi, 20*log10(2^(-14)*abs(H)));

This plot shows that the passband gain is still above 0 dB.

To put the input and output on the same scale, the MSBs must be weighted equally. The input MSB has a weight of 2^11, whereas the scaled output MSB has a weight of 2^(29–14) = 2^15. You need to give the output MSB a weight of 2^11 as follows:

yf = fi(zeros(size(yri)), true, 16, 4);
yf.bin = yri.bin;
stem(n(idx)', [xscext, double(yf(yidx)')]);
legend('input', 'rescaled output');

This operation is equivalent to scaling the filter gain by 2^(-18).

[H,w] = freqz(h);
plot(w/pi, 20*log10(2^(-18)*abs(H)));

The above plot shows a 0 dB gain in the passband, as desired.
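The chain of reinterpretations can be followed with plain doubles in base MATLAB. The sketch below assumes round-to-nearest when the output is reduced to 16 bits, and the full-precision values in y are hypothetical. It shows that changing the fraction length only changes the weight attached to the same stored integers, so the final result equals the full-precision output scaled by 2^(-18), to within half of the final quantization step.

```matlab
y = [123456789; -98765432; 2^28 - 1];   % hypothetical full-precision integer outputs

stored = round(y / 2^14);       % 16-bit stored integers for fraction length -14
valM14 = stored * 2^14;         % interpreted with fraction length -14
val0   = stored;                % same bits, fraction length 0
val4   = stored * 2^(-4);       % same bits, fraction length 4

maxDev = max(abs(val4 - y * 2^(-18)));   % deviation from exact scaling by 2^(-18)
fprintf('largest deviation: %g (half step is %g)\n', maxDev, 2^(-5));
```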

With this final version of the output, yf is no longer an integer. However, this is only due to the interpretation: the integers represented by the bits in yf are identical to the ones represented by the bits in yri. You can verify this by comparing them:

max(abs(yf.int - yri.int))

ans =

     0

Configure Filter Parameters to Work with Integers Using the set2int Method

Set the Filter Parameters to Work with Integers.  The set2int method provides a convenient way of setting filter parameters to work with integers. The method works by scaling the coefficients to integer numbers and setting the coefficient and input fraction lengths to zero. This makes it possible for you to use floating-point coefficients directly.

h = dfilt.dffir(b);
h.Arithmetic = 'fixed';

The coefficients are represented with 18 bits and the input signal is represented with 12 bits:

g = set2int(h, 18, 12);
g_dB = 20*log10(g)

g_dB =

  108.3708

The set2int method returns the gain introduced by scaling the coefficients to integers, so the gain is always a power of 2. You can verify that this gain is consistent with the gain of the filter obtained previously. You can also check that the filter h is set up properly to work with integers:

info(h)

Discrete-Time FIR Filter (real)              
Filter Structure  : Direct-Form FIR          
Filter Length     : 101     
Stable            : Yes     
Linear Phase      : Yes (Type 1)             
Arithmetic        : fixed   
Numerator         : s18,0 -> [-131072 131072)
Input             : s12,0 -> [-2048 2048)    
Filter Internals  : Full Precision           
  Output          : s31,0 -> [-1073741824 1073741824)  (auto determined)
  Product         : s29,0 -> [-268435456 268435456)  (auto determined)
  Accumulator     : s31,0 -> [-1073741824 1073741824)  (auto determined)
  Round Mode      : No rounding              
  Overflow Mode   : No overflow        

Here you can see that all fractional lengths are now set to zero, so this filter is set up properly for working with integers.

Reinterpret the Output.  You can compare the output to the double-precision floating-point reference output, and verify that the computation done by the filter h is done in full precision.

yint = filter(h, xsc);
norm(yd - double(yint))

ans =

     0

You can then truncate the output to only 16 bits:

yout = fi(yint, true, 16);
stem(n(yidx), [xscext, double(yout(yidx)')]);
axis([850 950 -2.5e8 2.5e8]);
legend('input', 'output');

Once again, the plot shows that the input and output are at different scales. To scale the output so that the signals can be compared more easily in a plot, you need to weight the MSBs appropriately. You can compute the new fraction length using the gain of the filter when the coefficients were integer numbers:

WL = yout.WordLength;
FL = yout.FractionLength + log2(g);
yf2 = fi(zeros(size(yout)), true, WL, FL);
yf2.bin = yout.bin;

stem(n(idx)', [xscext, double(yf2(yidx)')]);
axis([800 950 -2e3 2e3]);
legend('input', 'rescaled output');

This final plot shows the filtered data re-scaled to match the input scale.

Fixed-Point Filtering in Simulink

Fixed-Point Filtering Blocks

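The following DSP System Toolbox blocks enable you to design and/or realize a variety of fixed-point filters:

  • CIC Decimation

  • CIC Interpolation

  • Digital Filter

  • Filter Realization Wizard

  • FIR Decimation

  • FIR Interpolation

  • Two-Channel Analysis Subband Filter

  • Two-Channel Synthesis Subband Filter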

Filter Implementation Blocks

The FIR Decimation, FIR Interpolation, Two-Channel Analysis Subband Filter, Two-Channel Synthesis Subband Filter, and Digital Filter blocks are all implementation blocks. They allow you to implement filters for which you already know the filter coefficients. The first four blocks each implement their respective filter type, while the Digital Filter block can create a variety of filter structures. All filter structures supported by the Digital Filter block support fixed-point signals.

Filter Design and Implementation Blocks

The Filter Realization Wizard block invokes part of the Filter Design and Analysis Tool from Signal Processing Toolbox™ software. This block allows you both to design new filters and to implement filters for which you already know the coefficients. In its implementation stage, the Filter Realization Wizard creates a filter realization using Sum, Gain, and Delay blocks. You can use this block to design and/or implement numerous types of fixed-point and floating-point single-channel filters. See the Filter Realization Wizard reference page for more information about this block.

The CIC Decimation and CIC Interpolation blocks allow you to design and implement Cascaded Integrator-Comb filters. See their block reference pages for more information.
