HDL Distributed Arithmetic for FIR Filters

Open Live Script

This example illustrates how to generate HDL code for a lowpass FIR filter with Distributed Arithmetic (DA) architecture.

Distributed Arithmetic

Distributed Arithmetic is a popular architecture for implementing FIR filters without the use of multipliers. DA realizes the sum of products computation required for FIR filters efficiently using LUTs, shifters and adders. Since these operations map efficiently onto an FPGA, DA is a favored architecture on these devices.

Design the Filter

Use a sampling rate of 48 kHz, passband edge frequency of 9.6 kHz and stop frequency of 12k. Set the allowable peak-to-peak passband ripple to 1 dB and the stopband attenuation to -90 dB. Then, design the filter using fdesign.lowpass, and create the System object filter as a direct form FIR filter.

Fs           = 48e3;         % Sampling Frequency in Hz
Fpass        = 9.6e3;        % Passband Frequency in Hz
Fstop        = 12e3;         % Stopband Frequency in Hz
Apass        = 1;            % Passband Ripple in dB
Astop        = 90;           % Stopband Attenuation in dB

lpSpec = fdesign.lowpass( 'Fp,Fst,Ap,Ast',...
    Fpass, Fstop, Apass, Astop, Fs);

lpFilter = design(lpSpec, 'equiripple', 'filterstructure', 'dffir',...
    'SystemObject', true);

Quantize the Filter

Since DA implements the FIR filter by serializing the input data bits, it requires a quantized filter. Assume that 12 bit input and output word lengths with 11 fractional bits are required (due to of fixed data path requirements or input ADC/output DAC widths). Apply these fixed point settings.

inputDataType = numerictype(1,12,11);
outputDataType = inputDataType;
coeffsDataType = numerictype(1,16,16);

lpFilter.FullPrecisionOverride = false;
lpFilter.CoefficientsDataType = 'Custom';
lpFilter.CustomCoefficientsDataType = coeffsDataType;
lpFilter.OutputDataType = 'Custom';
lpFilter.CustomOutputDataType = outputDataType;

% Now check the filter response with fvtool.
fvtool(lpFilter,'Fs',Fs,'Arithmetic','fixed');

Generate HDL Code with DA Architecture

To generate HDL Code with DA architecture, invoke the generatehdl command, passing in a valid value to the 'DALUTPartition' property. The 'DALUTPartition' property directs the code generator to use DA architecture, and divides the LUT into a specified number of partitions. The 'DALUTPartition' property specifies the number of LUT partitions, and the number of the taps associated with each partition. For a filter with many taps it is best to divide the taps into a number of LUTs, with each LUT storing the sum of coefficients for only the taps associated with it. The sum of the LUT outputs is computed in a tree structure of adders.

Check the filter length by getting the number of coefficients.

FL = length(lpFilter.Numerator);

Assume that you have 8 input LUTs; calculate the value of the DALUTPartition property such that you use as many of these LUTs as possible per partition.

dalut = [ones(1, floor(FL/8))*8, mod(FL, 8)];

Generate HDL with DA architecture. By default, VHDL code is generated. To generate Verilog code, pass in the 'TargetLanguage' property with the value 'Verilog'.

workingdir = tempname;
generatehdl(lpFilter, 'DALUTPartition', dalut, ...
                'TargetDirectory', workingdir, ...
                'InputDataType', inputDataType);

### Structure fir has symmetric coefficients, consider converting to structure symmetricfir for reduced area.
### Starting VHDL code generation process for filter: firfilt
### Generating: /tmp/Bdoc24a_2528353_248711/tpf02517ae_0de5_4e2f_93a6_cbd26d308939/firfilt.vhd
### Starting generation of firfilt VHDL entity
### Starting generation of firfilt VHDL architecture
### Clock rate is 12 times the input sample rate for this architecture.
### Successful completion of VHDL code generation process for filter: firfilt
### HDL latency is 3 samples

Convert the Filter Structure to 'Direct form symmetric' and Generate HDL

A symmetrical filter structure offers advantages in hardware, as it halves the number of coefficients to work with. This reduces the hardware complexity substantially. Create a new FIR filter System object 'lpSymFilter' with a 'Direct form symmetric' structure and the same fixed point settings.

lpSymFilter = design(lpSpec, 'equiripple', 'filterstructure', 'dfsymfir',...
    'SystemObject', true);

lpSymFilter.FullPrecisionOverride = false;
lpSymFilter.CoefficientsDataType = 'Custom';
lpSymFilter.CustomCoefficientsDataType = coeffsDataType;
lpSymFilter.OutputDataType = 'Custom';
lpSymFilter.CustomOutputDataType = outputDataType;

% Calculate filter length FL for lpSymFilter for the purpose of calculating 'DALUTPartition'
FL = ceil(length(lpSymFilter.Numerator)/2);

% Generate the  value for 'DALUTPartition' as done previously for lpFilter.
dalut_sym = [ones(1, floor(FL/8))*8, mod(FL, 8)];

% Generate HDL code for default radix of 2
generatehdl(lpSymFilter, 'DALUTPartition', dalut_sym, ...
                   'TargetDirectory', workingdir, ...
                   'InputDataType', inputDataType);

### Starting VHDL code generation process for filter: firfilt
### Generating: /tmp/Bdoc24a_2528353_248711/tpf02517ae_0de5_4e2f_93a6_cbd26d308939/firfilt.vhd
### Starting generation of firfilt VHDL entity
### Starting generation of firfilt VHDL architecture
### Clock rate is 13 times the input sample rate for this architecture.
### Successful completion of VHDL code generation process for filter: firfilt
### HDL latency is 3 samples

Notice that a symmetrical filter takes one additional clock cycle before the output is obtained. This is because of the carry bit that is added to the input word length as the input data from the symmetrical taps are summed together. The clock rate for 'lpSymFilter' is 13 times the input sample rate, whereas for 'lpFilter' the clock rate was 12 times the input sample rate.

DARadix

The default architecture is a Radix 2 implementation, which operates on one bit of input data on each clock cycle. The number of clock cycles elapsed before an output is obtained is equal to the number of bits in the input data. Thus DA can potentially limit the throughput. To improve the throughput of DA, you can configure DA to process multiple bits in parallel. The 'DARadix' property is provided for this purpose. For example, you can set 'DARadix' to 2^3 to operate on 3 bits in parallel. For a 12 bit input word length, you can specify processing of 1, 2, 3, 4, 6 or 12 bits at a time by specifying corresponding 'DARadix' values of 2^1, 2^2, 2^3, 2^4, 2^6, or 2^12 respectively.

In selecting different 'DARadix' values, you trade off speed vs. area within the DA architecture. The number of bits operated in parallel determines the factor by which the clock rate needs to be increased. This is known as folding factor. For example, the default 'DARadix of 2^1, implying 1 bit at a time, results in a clock rate 12 times the input sample rate or a folding factor of 12. A 'DARadix' of 2^3 results in a clock rate only 4 times the input sample rate, but requires 3 identical sets of LUTs, one for each bit being processed in parallel.

Information Regarding DA Architecture

As explained in previous section, DA architecture presents a lot of options both in terms of LUT sizes and the folding factor. You can use hdlfilterdainfo function to get information regarding various filter lengths based on the value of coefficients. This function also displays two other tables, one for all possible values of DARadix property with corresponding folding factors. The second table displays details of LUT sets with the corresponding values of DALUTPartition property.

hdlfilterdainfo(lpFilter, 'InputDataType', inputDataType);

   | Total Coefficients | Zeros | Effective |
   ------------------------------------------
   |         58         |   0   |     58    |

Effective filter length for SerialPartition value is 58.

  Table of 'DARadix' values with corresponding values of 
  folding factor and multiple for LUT sets for the given filter.

   | Folding Factor | LUT-Sets Multiple | DARadix |
   ------------------------------------------------
   |        1       |         12        |   2^12  |
   |        2       |         6         |   2^6   |
   |        3       |         4         |   2^4   |
   |        4       |         3         |   2^3   |
   |        6       |         2         |   2^2   |
   |       12       |         1         |   2^1   |

  Details of LUTs with corresponding 'DALUTPartition' values.

   | Max Address Width | Size(bits) |                          LUT Details                         |    DALUTPartition   |
   -----------------------------------------------------------------------------------------------------------------------
   |         12        |   259072   |1x1024x13, 1x4096x13, 1x4096x14, 1x4096x15, 1x4096x18         |[12 12 12 12 10]     |
   |         11        |   147544   |2x2048x13, 2x2048x14, 1x2048x18, 1x8x11                       |[11 11 11 11 11 3]   |
   |         10        |    78080   |3x1024x13, 1x1024x16, 1x1024x18, 1x256x13                     |[10 10 10 10 10 8]   |
   |         9         |    43712   |1x16x12, 1x512x12, 2x512x13, 1x512x14, 1x512x15, 1x512x18     |[9 9 9 9 9 9 4]      |
   |         8         |    25384   |4x256x13, 1x256x14, 1x256x15, 1x256x18, 1x4x10                |[8 8 8 8 8 8 8 2]    |
   |         7         |    14248   |2x128x12, 3x128x13, 1x128x14, 1x128x16, 1x128x18, 1x4x10      |[7 7 7 7 7 7 7 7 2]  |
   |         6         |    8000    |1x16x12, 4x64x12, 1x64x13, 2x64x14, 1x64x16, 1x64x17          |[ones(1,9)*6, 4]     |
   |         5         |    4696    |1x32x11, 4x32x12, 3x32x13, 1x32x14, 1x32x15, 1x32x17, 1x8x11  |[ones(1,11)*5, 3]    |
   |         4         |    2904    |3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10  |[ones(1,14)*4, 2]    |
   |         3         |    1926    |1x2x7, 5x8x11, 8x8x12, 1x8x13, 2x8x14, 2x8x15, 1x8x17         |[ones(1,19)*3, 1]    |
   |         2         |    1412    |2x4x10, 12x4x11, 6x4x12, 2x4x13, 4x4x14, 2x4x15, 1x4x17       |ones(1,29)*2         |

Notes:
1. LUT Details indicates number of LUTs with their sizes. e.g. 1x1024x18
   implies 1 LUT of 1024 18-bit wide locations.

You can use optional properties for LUT and folding factors to display specific information. You can choose one of the two LUT properties, 'LUTInputs' or 'DALUTPartition' to display all the folding factor options available for the specific LUT inputs.

hdlfilterdainfo(lpFilter, 'InputDataType', inputDataType, ...
                    'LUTInputs', 4);

   | Folding Factor | LUT Inputs | LUT Size |                             LUT Details                             |
   ----------------------------------------------------------------------------------------------------------------
   |        1       |      4     |   34848  |12 x (3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10)  |
   |        2       |      4     |   17424  |6 x (3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10)   |
   |        3       |      4     |   11616  |4 x (3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10)   |
   |        4       |      4     |   8712   |3 x (3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10)   |
   |        6       |      4     |   5808   |2 x (3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10)   |
   |       12       |      4     |   2904   |1 x (3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10)   |

You can also choose one of the two folding factor related properties, 'FoldingFactor' or 'DARadix' to display all the LUT options for the specific folding factor.

hdlfilterdainfo(lpFilter, 'InputDataType', inputDataType, ...
                    'Foldingfactor', 6);

   | Folding Factor | LUT Inputs | LUT Size |                             LUT Details                            |
   ---------------------------------------------------------------------------------------------------------------
   |        6       |     12     |  518144  |2 x (1x1024x13, 1x4096x13, 1x4096x14, 1x4096x15, 1x4096x18)         |
   |        6       |     11     |  295088  |2 x (2x2048x13, 2x2048x14, 1x2048x18, 1x8x11)                       |
   |        6       |     10     |  156160  |2 x (3x1024x13, 1x1024x16, 1x1024x18, 1x256x13)                     |
   |        6       |      9     |   87424  |2 x (1x16x12, 1x512x12, 2x512x13, 1x512x14, 1x512x15, 1x512x18)     |
   |        6       |      8     |   50768  |2 x (4x256x13, 1x256x14, 1x256x15, 1x256x18, 1x4x10)                |
   |        6       |      7     |   28496  |2 x (2x128x12, 3x128x13, 1x128x14, 1x128x16, 1x128x18, 1x4x10)      |
   |        6       |      6     |   16000  |2 x (1x16x12, 4x64x12, 1x64x13, 2x64x14, 1x64x16, 1x64x17)          |
   |        6       |      5     |   9392   |2 x (1x32x11, 4x32x12, 3x32x13, 1x32x14, 1x32x15, 1x32x17, 1x8x11)  |
   |        6       |      4     |   5808   |2 x (3x16x11, 5x16x12, 2x16x13, 2x16x14, 1x16x15, 1x16x17, 1x4x10)  |
   |        6       |      3     |   3852   |2 x (1x2x7, 5x8x11, 8x8x12, 1x8x13, 2x8x14, 2x8x15, 1x8x17)         |
   |        6       |      2     |   2824   |2 x (2x4x10, 12x4x11, 6x4x12, 2x4x13, 4x4x14, 2x4x15, 1x4x17)       |

Notice that LUT details indicate a factor by which the LUT sets need to be replicated to achieve the corresponding folding factor. Also, total LUT size is calculated with above factor.

You can use output arguments to return the values of DALUTPartition and DARadix for a specific configuration and use it with generatehdl command. Let us assume that you can intend to raise the clock rate by 4 times the sample rate and want to use 6 input LUTs. You can verify that the LUT details meet your area requirements.

hdlfilterdainfo(lpFilter, 'InputDataType', inputDataType, ...
                    'FoldingFactor', 4, ...
                    'LUTInputs',  6);

   | Folding Factor |     LUT Size     |                         LUT Details                        |  DALUTPartition  | DARAdix |
   -------------------------------------------------------------------------------------------------------------------------------
   |        4       | 3 x 8000 = 24000 |3 x (1x16x12, 4x64x12, 1x64x13, 2x64x14, 1x64x16, 1x64x17)  | [ones(1,9)*6, 4] |   2^3   |

Now generate HDL with the above constraints by first storing the required values of DALUTPartition and DARadix in variables by using the output arguments to the hdlfilterdainfo function. You can then invoke generatehdl command using these variables.

[dalut, dr] = hdlfilterdainfo(lpFilter, 'InputDataType', inputDataType, ...
                                  'FoldingFactor', 4, ...
                                  'LUTInputs', 6);

generatehdl(lpFilter, 'InputDataType', inputDataType, ...
                'DALUTPartition', dalut, ...
                'DAradix', dr, ...
                'TargetDirectory', workingdir);

### Structure fir has symmetric coefficients, consider converting to structure symmetricfir for reduced area.
### Starting VHDL code generation process for filter: firfilt
### Generating: /tmp/Bdoc24a_2528353_248711/tpf02517ae_0de5_4e2f_93a6_cbd26d308939/firfilt.vhd
### Starting generation of firfilt VHDL entity
### Starting generation of firfilt VHDL architecture
### Clock rate is 4 times the input sample rate for this architecture.
### Successful completion of VHDL code generation process for filter: firfilt
### HDL latency is 3 samples

Conclusion

You designed a lowpass direct form FIR filter to meet the given specification. You then quantized and checked your design. You generated VHDL code for DA with various radices and explored speed vs. area trade-offs within DA by replicating LUTs and operating on multiple bits in parallel.

You can generate a test bench with a standard stimulus and/or your own defined stimulus, and use an HDL Simulator to verify the generated HDL code for DA architectures. You can use a synthesis tool to compare the area and speed of these architectures.