Fast Fourier transform—optimized for HDL code generation
DSP System Toolbox/Transforms
dspxfrm3
The FFT HDL Optimized block provides two architectures to optimize either throughput or area. Use the streaming Radix 2^2 architecture for high-throughput applications. This architecture supports scalar or vector input data. You can achieve gigasample-per-second (GSPS) throughput using vector input. Use the burst Radix 2 architecture for a minimum resource implementation, especially with large FFT sizes. Your system must be able to tolerate bursty data and higher latency. This architecture supports only scalar input data. The block accepts real or complex data, provides hardware-friendly control signals, and has optional output frame control signals.
The FFT HDL Optimized block replaces the HDL Streaming FFT block and the HDL Minimum Resource FFT block.
This FFT HDL Optimized block icon shows all optional ports.
Port  Direction  Description  Data Type

dataIn  Input  Scalar or column vector of real or complex input data. Vector input is supported with the Streaming Radix 2^2 architecture only. The vector size must be a power of 2 from 1 through 64 that is not greater than the FFT length.
validIn  Input  Indicates that the input data is valid. When validIn is true, the block captures the value on dataIn.  boolean
reset  Input  Optional. Reset internal state. When reset is true, the block stops the current calculation and clears all internal state. The block begins fresh calculations when reset is false and validIn starts a new frame.  boolean
dataOut  Output  Frequency channel output data. The output order is bit-reversed by default.  Same as dataIn. If scaling is disabled, the output word length grows to avoid overflow. See the Divide butterfly outputs by two parameter.
validOut  Output  Indicates that the output data is valid. The block sets validOut to true with each valid sample on dataOut.  boolean
ready  Output  This port appears when you select the burst architecture. Indicates when the block has memory available for new input data.  boolean
startOut  Output  Optional. When this port is enabled, the block sets startOut to true during the first valid cycle of a frame of output data.  boolean
endOut  Output  Optional. When this port is enabled, the block sets endOut to true during the last valid cycle of a frame of output data.  boolean
Specify the number of data points used for one FFT calculation. The default value is 1024. For HDL code generation, the FFT length must be a power of 2 between 2^3 and 2^16.
Streaming Radix 2^2 (default): Low-latency architecture. Supports gigasample-per-second (GSPS) throughput when you use vector input.
Burst Radix 2: Minimum resource architecture. Vector input is not supported when you select this architecture.
For details of both architectures, see Algorithm.
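The constraints stated on this page (FFT length, vector size, and scalar-only input for the burst architecture) can be collected into a small pre-check. The following is a minimal sketch in Python, not part of the block or of any MathWorks API; the function name `check_fft_config` and its arguments are hypothetical.

```python
def check_fft_config(fft_length, vector_size=1, architecture="Streaming Radix 2^2"):
    """Return a list of problems with a proposed configuration (empty if none).

    Hypothetical helper: it only encodes the limits quoted in this page.
    """
    def is_power_of_two(n):
        return n >= 1 and (n & (n - 1)) == 0

    problems = []

    # FFT length: power of 2 between 2^3 and 2^16 for HDL code generation.
    if not (is_power_of_two(fft_length) and 2**3 <= fft_length <= 2**16):
        problems.append("FFT length must be a power of 2 from 2^3 through 2^16")

    # Vector size: power of 2 from 1 through 64, not greater than the FFT length.
    if not (is_power_of_two(vector_size) and vector_size <= 64):
        problems.append("vector size must be a power of 2 from 1 through 64")
    elif vector_size > fft_length:
        problems.append("vector size must not exceed the FFT length")

    # The burst architecture accepts scalar input only.
    if architecture == "Burst Radix 2" and vector_size != 1:
        problems.append("Burst Radix 2 supports scalar input only")

    return problems


print(check_fft_config(1024, 16))                 # []
print(check_fft_config(1024, 2, "Burst Radix 2"))
```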
Select the HDL implementation of complex multipliers. Each multiplication is implemented with either 3 multipliers and 5 adders, or 4 multipliers and 2 adders. Which option is faster or smaller depends on your synthesis tool and target device. This option applies only when you set Architecture to Streaming Radix 2^2.
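To illustrate the trade-off between the two options, the sketch below computes (a + jb)(c + jd) both ways in Python. The 3-multiplier form shown is one well-known factorization; this page does not state which arrangement the generated HDL uses, so treat the function names and the exact grouping as assumptions.

```python
def cmul_4mult_2add(a, b, c, d):
    """(a + jb)(c + jd) using 4 real multiplications and 2 additions."""
    real = a * c - b * d
    imag = a * d + b * c
    return real, imag


def cmul_3mult_5add(a, b, c, d):
    """(a + jb)(c + jd) using 3 real multiplications and 5 additions."""
    k1 = c * (a + b)   # addition 1, multiplication 1
    k2 = a * (d - c)   # addition 2, multiplication 2
    k3 = b * (c + d)   # addition 3, multiplication 3
    real = k1 - k3     # addition 4: ac - bd
    imag = k1 + k2     # addition 5: ad + bc
    return real, imag


print(cmul_4mult_2add(3, 4, 5, 6))  # (-9, 38)
print(cmul_3mult_5add(3, 4, 5, 6))  # (-9, 38)
```

The 3-multiplier form trades one multiplier for three extra adders, which is why the better choice depends on the multiplier and logic resources of the target device.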
When you select this check box, the output elements are bit-reversed relative to the input order. Clear the check box to output elements in linear order. By default, the check box is selected. The FFT algorithm calculates output in the reverse order to the input. If you specify the output to be in the same order as the input, the algorithm performs an extra reversal operation. For vector data, input and output data must be in opposite orders, so select only one of Output in bit-reversed order or Input in bit-reversed order. For more information, see Linear and Bit-Reversed Output Order.
When you select this check box, the block expects input data in bit-reversed order. By default, the check box is cleared and input is expected in linear order. The FFT algorithm calculates output in the reverse order to the input. If you specify the output to be in the same order as the input, the algorithm performs an extra reversal operation. For vector data, input and output data must be in opposite orders, so select only one of Output in bit-reversed order or Input in bit-reversed order. For more information, see Linear and Bit-Reversed Output Order.
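As an illustration of what bit-reversed order means for scalar data, the following Python sketch lists the order in which frequency bins appear for a given FFT length. It shows the standard bit-reversal permutation only; the helper names are hypothetical, and the ordering for vector data is not modeled here.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(fft_length):
    """Bin indices in bit-reversed order; fft_length must be a power of 2."""
    num_bits = fft_length.bit_length() - 1  # log2(fft_length)
    return [bit_reverse(i, num_bits) for i in range(fft_length)]


print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Applying the permutation twice restores linear order, which is the extra reversal operation the text refers to.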
When you select this check box, the block implements an overall 1/N scale factor by dividing the output of each butterfly multiplication by 2. This adjustment keeps the output of the FFT in the same amplitude range as its input. If scaling is disabled, the block avoids overflow by increasing the word length by one bit after each butterfly multiplication. The bit growth is the same for both architectures. By default, the check box is not selected.
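The effect of this parameter on word length can be estimated with a short calculation. The sketch below is in Python and assumes that an N-point FFT has log2(N) butterfly stages, so that one bit of growth per stage gives log2(N) extra bits; that stage count and both function names are assumptions, not values read from the block.

```python
from fractions import Fraction


def output_word_length(input_word_length, fft_length, divide_by_two):
    """Estimated output word length in bits (fft_length is a power of 2)."""
    stages = fft_length.bit_length() - 1  # assumed: log2(N) butterfly stages
    if divide_by_two:
        return input_word_length          # scaling keeps the input word length
    return input_word_length + stages     # one bit of growth per stage


def overall_scale(fft_length, divide_by_two):
    """Overall scale factor applied to the FFT result."""
    stages = fft_length.bit_length() - 1
    return Fraction(1, 2) ** stages if divide_by_two else Fraction(1)


print(output_word_length(16, 1024, divide_by_two=False))  # 26
print(output_word_length(16, 1024, divide_by_two=True))   # 16
print(overall_scale(1024, divide_by_two=True))            # 1/1024
```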
The default rounding method for internal fixed-point calculations is Floor. When the input is any integer or fixed-point data type, the FFT block uses fixed-point arithmetic for internal calculations. This option does not apply when the input is single or double type. Rounding applies to twiddle factor multiplication and scaling operations.
Select this check box to enable the reset port. When reset is true, the block stops the current calculation and clears all internal state. The block begins fresh calculations when reset is false and validIn starts a new frame. By default, the check box is not selected.
Select this check box to enable the startOut port. This output signal is asserted (true) for the first cycle of an output frame. By default, the check box is not selected.
Select this check box to enable the endOut port. This output signal is asserted (true) for the last cycle of an output frame. By default, the check box is not selected.
The streaming Radix 2^2 architecture is a low-latency implementation. It saves resources compared to a streaming Radix 2 implementation by factoring and grouping the FFT equation. The architecture has log_4(N) stages. Each stage contains two single-path delay feedback (SDF) butterflies with memory controllers. When you use vector input, each stage operates on fewer input samples, so some stages reduce to a simple butterfly, without SDF.
The first SDF stage is a regular butterfly. The second stage multiplies by -j without using a multiplier: it swaps the real and imaginary parts of the input, and then swaps the imaginary parts of the outputs. Each stage rounds the result of the twiddle factor multiplication to the input word length. The twiddle factors have the same bit width as the input data. They use two integer bits, and the remainder are fractional bits.
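The multiplier-free -j step can be checked numerically. The Python sketch below follows one reading of the description above: swap the real and imaginary parts of the butterfly's second input, run a regular butterfly, then exchange the imaginary parts of the two outputs. The function names are hypothetical, and the sketch uses floating-point complex values rather than the block's fixed-point arithmetic.

```python
def butterfly(x, y):
    """Regular radix-2 butterfly: sum and difference."""
    return x + y, x - y


def butterfly_minus_j(x, y):
    """Compute (x + (-j)*y, x - (-j)*y) using only swaps, adds, and subtracts."""
    # Swap the real and imaginary parts of the input.
    y_swapped = complex(y.imag, y.real)
    total, diff = butterfly(x, y_swapped)
    # Swap the imaginary parts of the two outputs.
    return complex(total.real, diff.imag), complex(diff.real, total.imag)


x, y = 1 + 2j, 3 + 4j
print(butterfly_minus_j(x, y))  # ((5-1j), (-3+5j))
print(butterfly(x, -1j * y))    # same values, computed with a multiplication
```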
If you enable scaling, the algorithm divides the result of each butterfly stage by 2. Scaling at each stage avoids overflow, keeps the word length the same as the input, and results in an overall scale factor of 1/N. If scaling is disabled, the algorithm avoids overflow by increasing the word length by 1 bit at each stage. The diagram shows the butterflies and internal word lengths of each stage, not including the memory.
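The relationship between per-stage scaling and the overall 1/N factor can be reproduced with a small floating-point model. The sketch below is a plain radix-2 decimation-in-frequency FFT in Python, not the block's Radix 2^2 SDF architecture and not fixed point; it is meant only to show that dividing by 2 after every butterfly stage yields the unscaled result divided by N, and that this form of the algorithm naturally produces bit-reversed output. All function names are hypothetical.

```python
import cmath


def direct_dft(samples):
    """Reference O(N^2) DFT with output in linear order."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_dif_bit_reversed(samples, divide_by_two=False):
    """Radix-2 decimation-in-frequency FFT; output is in bit-reversed order."""
    data = [complex(s) for s in samples]
    n = len(data)                      # must be a power of 2
    span = n // 2
    while span >= 1:                   # one pass per butterfly stage
        for start in range(0, n, 2 * span):
            for k in range(span):
                twiddle = cmath.exp(-2j * cmath.pi * k / (2 * span))
                upper = data[start + k]
                lower = data[start + k + span]
                total = upper + lower
                diff = (upper - lower) * twiddle
                if divide_by_two:      # per-stage scaling
                    total, diff = total / 2, diff / 2
                data[start + k] = total
                data[start + k + span] = diff
        span //= 2
    return data


x = [1, 2, 3, 4, 0, -1, -2, 1j]
reference = direct_dft(x)
scaled = fft_dif_bit_reversed(x, divide_by_two=True)
# Output position p holds bin bit_reverse(p), scaled by 1/N.
error = max(abs(scaled[p] - reference[bit_reverse(p, 3)] / len(x))
            for p in range(len(x)))
print(error < 1e-9)  # True
```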
The burst Radix 2 architecture implements the FFT by using a single complex butterfly multiplier. The algorithm cannot start until it has stored the entire input frame, and it cannot accept the next frame until computations are complete. The ready output signal indicates when the algorithm is ready for new data. The diagram shows the burst architecture, with pipeline registers.
The algorithm processes input data only when validIn is high. Output data is valid only when validOut is high.
When the optional reset input signal is high, the algorithm stops the current calculation and clears all internal state. The algorithm begins fresh calculations when reset is low and validIn starts a new frame.
This diagram shows validIn and validOut signals for contiguous scalar input data, streaming Radix 2^2 architecture, an FFT length of 1024, and a vector size of 16.
The diagram also shows the optional startOut and endOut signals that indicate frame boundaries. If you enable startOut, it pulses for one cycle with the first validOut of the frame. If you enable endOut, it pulses for one cycle with the last validOut of the frame.
If you apply continuous input frames, the output is also continuous after the initial latency.
The validIn signal can be noncontiguous. Data accompanied by a validIn signal is processed as it arrives, and the output is stored until a frame is filled. Then the algorithm returns contiguous output samples in a frame of N (FFT length) cycles. This diagram shows noncontiguous input and contiguous output for an FFT length of 512 and a vector size of 16.
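The framing behavior described above can be sketched as a purely functional model. The Python generator below accepts (data, validIn) pairs for scalar input, ignores cycles where validIn is low, and emits each completed frame as contiguous samples with validOut, startOut, and endOut flags. It does not model latency, idle output cycles, vector input, or the ready signal, and the `transform` argument is a hypothetical stand-in for the FFT computation.

```python
def frame_control(input_cycles, fft_length, transform=lambda frame: frame):
    """Yield (data, validOut, startOut, endOut) for each valid output cycle."""
    frame = []
    for data, valid_in in input_cycles:
        if not valid_in:
            continue                     # input is captured only when validIn is high
        frame.append(data)
        if len(frame) == fft_length:     # a full frame has been collected
            result = transform(frame)
            for i, value in enumerate(result):
                start_out = (i == 0)               # first valid cycle of the frame
                end_out = (i == fft_length - 1)    # last valid cycle of the frame
                yield value, True, start_out, end_out
            frame = []


# Noncontiguous input: the samples with validIn low (99, 98) are ignored.
cycles = [(0, True), (99, False), (1, True), (2, True), (98, False), (3, True)]
for record in frame_control(cycles, fft_length=4):
    print(record)
```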
When you use the burst architecture, you cannot provide the next frame of input data until memory space is available. The ready signal indicates when the algorithm can accept new input data.
The latency varies with the FFT length and input vector size. After you update the model, the block icon displays the latency. The displayed latency is the number of cycles between the first valid input and the first valid output, assuming the input is contiguous.
When using the burst architecture with contiguous input, if your design waits for ready=0 before deasserting validIn, then one extra cycle of data arrives at the input. This data sample is the first sample of the next frame. The algorithm can save one sample while processing the current frame. Due to this one-sample advance, the observed latency of the later frames (validIn to validOut) is one cycle shorter than the reported latency. The number of cycles between ready low and validOut high is always latency - FFTLength.
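As a worked example of these relationships, the Python sketch below uses the burst figures quoted in the synthesis results on this page (a latency of 5811 cycles for an FFT length of 1024). The function names are hypothetical; the arithmetic follows the statements above.

```python
def cycles_ready_low_to_valid_out(latency, fft_length):
    """Cycles between ready going low and validOut going high (burst architecture)."""
    return latency - fft_length


def observed_latency_later_frames(latency):
    """Observed validIn-to-validOut latency after the first frame, when the
    design waits for ready=0 before deasserting validIn."""
    return latency - 1


print(cycles_ready_low_to_valid_out(5811, 1024))  # 4787
print(observed_latency_later_frames(5811))        # 5810
```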
This block supports HDL code generation using HDL Coder™. HDL Coder provides additional configuration options that affect HDL implementation and synthesized logic. For more information on implementations, properties, and restrictions for HDL code generation, see FFT HDL Optimized in the HDL Coder documentation.
These resource and performance data are the synthesis results from the generated HDL targeted to a Xilinx® Virtex®-6 (XC6VLX75T-1FF484) FPGA. The examples in the tables have this configuration:
1024 FFT length (default)
Complex multiplication using 4 multipliers, 2 adders
Output scaling enabled
16-bit complex input data
Clock enables minimized (HDL Coder parameter)
Performance of the synthesized HDL code varies with your target and synthesis options. For instance, natural-order output uses more RAM than bit-reversed output, and real input uses less RAM than complex input.
For a scalar input Radix 2^2 configuration, the design achieves 326 MHz clock frequency. The latency is 1116 cycles. The design uses these resources.
Resource  Number Used 

LUT  4597 
FFS  5353 
Xilinx LogiCORE® DSP48  12
Block RAM (16K)  6 
When you vectorize the same Radix 2^2 implementation to process two 16-bit input samples in parallel, the design achieves 316 MHz clock frequency. The latency is 600 cycles. The design uses these resources.
Resource  Number Used 

LUT  7653 
FFS  9322 
Xilinx LogiCORE DSP48  24 
Block RAM (16K)  8 
The burst Radix 2 implementation is supported with scalar input data only. The burst design achieves 309 MHz clock frequency. The latency is 5811 cycles. The design uses these resources.
Resource  Number Used 

LUT  971 
FFS  1254 
Xilinx LogiCORE DSP48  3 
Block RAM (16K)  6 
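The clock frequencies and latencies above can be converted into approximate sample rates and latency times. The Python sketch below assumes that the streaming architecture accepts one input vector on every clock cycle when the input is contiguous; that assumption, and the function names, are not taken from the synthesis report. Burst throughput is not estimated here, because it also depends on the processing time between frames.

```python
def streaming_sample_rate_msps(clock_mhz, vector_size=1):
    """Approximate sustained input rate in megasamples per second,
    assuming one vector is accepted per clock cycle."""
    return clock_mhz * vector_size


def latency_microseconds(latency_cycles, clock_mhz):
    """Latency in microseconds for a given cycle count and clock frequency."""
    return latency_cycles / clock_mhz


print(streaming_sample_rate_msps(326))                # scalar streaming: 326
print(streaming_sample_rate_msps(316, vector_size=2)) # two-sample vector: 632
print(round(latency_microseconds(1116, 326), 2))      # scalar streaming: 3.42
```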
[1] Algnabi, Y.S., F.A. Aldaamee, R. Teymourzadeh, M. Othman, and M.S. Islam. “Novel architecture of pipeline Radix 2^2 SDF FFT Based on digit-slicing technique.” 10th IEEE International Conference on Semiconductor Electronics (ICSE). 2012, pp. 470–474.