This example shows how to implement the contrast-limited adaptive histogram equalization (CLAHE) algorithm for FPGA, including an external memory interface.
Xilinx® Zynq® ZC706 evaluation kit + FMC-HDMI-CAM mezzanine card
Video processing algorithms often store a full frame of video data in memory. Implementing this storage on an FPGA increases BRAM utilization and can result in input video resolution constraints. This example shows how to implement vision algorithms on FPGAs by using an external memory resource to reduce use of BRAM and enable processing of higher resolution input video.
The external memory interface in this example uses AXI4 protocols and verifies the design against memory contention. The AXI4 Random Access interface provides a simple, direct interface to the memory interconnect. This protocol enables the algorithm to act as a memory master by providing the addresses and managing the burst transfer directly. The AXI4 Master Write Controller and AXI4 Master Read Controller blocks in this example model a simplified AXI-4 interface in Simulink™. When you generate HDL code using the HDL Coder™ product, the generated code includes a fully compliant AXI4 interface IP.
You can use SoC Blockset™ blocks and visualization tools for modeling, simulating, and analyzing hardware and software architectures for ASICs, FPGAs, and systems on a chip (SoC). These features can help you build system architecture using memory models, bus models, and interface models and help you simulate the architecture together with the algorithms. This example models external memory using two blocks from the SoC Blockset libraries:
Memory Channel: This block streams data through external memory. It models data transfer between the read and write master algorithms through shared memory. This example uses the AXI4 Random Access channel type.
Memory Controller: This block arbitrates between masters and grants them unique access to shared memory. It is configured to support multiple channels with various arbitration protocols. This block also logs and displays memory performance data. This feature enables you to analyze and debug the performance of the system at simulation time.
The CLAHE algorithm has three steps: tiling, histogram equalization, and bilinear interpolation. The bilinear interpolation step uses the pixel intensities from the input frame. Storing the full input frame of video data until the bilinear interpolation step requires external memory.
The figure shows the top level of the example model. The HDMI Rx block processes the video input and passes it to the CLAHEAlgorithm_fpga subsystem. The HDMI Rx block converts raw video data to a YCbCr 4:2:2 pixel stream format. The output data is a pixel stream suitable for hardware algorithm design. The HDMI Rx block also directs the SoC Builder tool to generate the IP blocks necessary to receive video data from the FMC-HDMI-CAM card that is attached to the hardware board.
In the model, the AXI4-Master Write Controller and AXI4-Master Read Controller blocks model the AXI4 memory mapped interfaces. The AXI4-Master Write Controller block writes the input frame into the external memory, and the AXI4-Master Read Controller block reads the frame from the external memory for bilinear interpolation. The AXI Read FIFO block sends the output pixel stream to the HDMI Tx block. The HDMI Tx block converts a pixel stream in YCbCr 4:2:2 format to raw video data for display during simulation. This block also directs the SoC Builder tool to generate the IP blocks that transmit video data back to the FMC-HDMI-CAM card. To indicate the status of the AXI Read FIFO and AXI Write FIFO blocks when running the design on hardware, four debug signals from these blocks are connected to LEDs on the board.
The next figure shows the CLAHEAlgorithm_fpga reference model. The input pixel stream connects to a Video Stream Connector block. This block provides a video streaming interface to connect any two IPs in the FPGA implementation. In this example, the Video Stream Connector blocks connect the HDMI input and output blocks with the rest of the FPGA algorithm.
The next figure shows the CLAHEAlgorithm_fpga/CLAHE subsystem, which implements the AXI write and read from external memory, and the CLAHE algorithm.
The subsystem contains these areas: * AXI Write to Memory: This section writes the input data into the DDR. It consists of an AXI4 Master Write Controller block that receives the input video control information from the HDMI Rx block and models the AXI4 memory mapped interface for writing data into the DDR. It generates five signals:
wr_valid signal is an input to the AXI Write FIFO block, which stores the incoming pixel intensities. The SoC Bus Creator block generates the
wrCtrlOut master to slave bus for writing the data into the DDR. The model writes one line of data per burst. After writing tileHeight/ 2 lines (where tileHeight corresponds to the height of each tile in CLAHE), the model asserts the
rd_start signal to begin the read request. The
frame signal indicates the input frame count.
AXI Read from Memory: This section reads the data from the DDR. It consists of an AXI4-Master Read Controller block that receives the
rd_start signal from the AXI4-Master Write Controller block. The AXI4-Master Read Controller block generates the
rd_dready signals. An SoC Bus Creator block combines these signals into a bus. The AXI4-Master Read Controller block also generates the
pixelcontrol bus corresponding to the
rd_data. The model slices the 32-bit
rd_data signal to retrieve the 8-bit (LSB) luminance component and then writes it into the cache memory block of the CLAHE algorithm.
CLAHE: For a detailed description of the implementation of the CLAHE algorithm for hardware, see the Contrast Limited Adaptive Histogram Equalization example. The CLAHEHDLAlgorithm subsystem operates on 8-bit grayscale images, which is why the 8-bit luminance (Y) component is separated from the 16-bit YCbCr pixel data.
The CLAHEHDLAlgorithm subsystem performs the three steps of CLAHE: tiling, histogram equalization, and bilinear interpolation. In the first step, the input frame is divided into an 8-by-8 grid of tiles. In the second step, the histogram of each tile is calculated, and then performs distribution, redistribution, and CDF calculations. The calculated CDF values are stored in a buffer for further processing. The third step calculates the output pixel intensities by using a bilinear interpolation of the CDF values. The pixel intensities of the input frame are used as the address to the buffer that stores the CDF values. These pixel intensities are read from the external memory that stores the original input frame.
Because the data read back from the external memory is in burst mode, it cannot be used directly for bilinear interpolation. The cache buffer stores the burst of lines read from the external memory. The depth of the cache is enough to store a number of lines equal to tileHeight. The
rdValid signal from the CLAHEHDLAlgorithm subsystem generates the
rd_addr signal to read the data from the cache. The data read from the cache (
pixValue) is then returned to the CLAHEHDLAlgorithm subsystem to complete the bilinear interpolation to calculate the output pixel intensity.
The SoC Builder tool builds, loads, and executes the model on the FPGA board. The hardware board used in this example is the Xilinx® Zynq® ZC706 evaluation kit. To build, load, and execute the design on the hardware, follow these steps.
Set up the Vivado® tool for synthesis, implementation, and generation of the FPGA bitstream.
The example model runs in
Accelerator mode by default to speed up the simulation. However, the SoC Builder tool requires
Normal simulation mode. In Simulink Configuration Parameters, set Simulation mode to
Launch the SoC Builder tool by clicking Configure, Build, & Deploy in the Simulink Toolstrip.
Select Build model, and then review the memory map in the Review Memory Map pane.
In the Select Project Folder pane, specify your project folder. In the Select Build Action pane, select Build, load, and run.
In the Validate Model pane, click Validate to check the compatibility of the model for implementation. Then click Build to begin building the model. When FPGA synthesis starts ,an external shell opens.
When the bitstream generation is complete, in the Connect Hardware pane, select Test Connection to test the connection between the host computer and the hardware board. Load the bitstream on the hardware by clicking Load.
This figure shows the final SoC Builder results after these steps are complete.
This example uses an input video of size 480-by-640 pixels. This size is configured in the HDMI Rx block. For the Xilinx Zynq ZC706 evaluation kit, the PL DDR controller is configured with a 64-bit AXI4-Slave interface running at 200 MHz. The resulting bandwidth is 1600 MB/s. This example has two AXI masters connected to the DDR controller. These AXI masters are the DUT AXI4 read and write interfaces. The YCbCr 4:2:2 video format requires 2 bytes per pixel. For the DUT AXI4 read and write interfaces, each pixel is zero-padded to 4 bytes. In this case, the read and write interfaces have a throughput requirement of 2*4*480*640*60 = 147.456 MB/s.
This figure shows the performance plot of the Memory Controller block. To view the performance plot, first open the Memory Controller block. Then, on the Performance tab, click View performance plots. Select all masters under Bandwidth, and then click Update. After the DUT starts writing and reading data into external memory, the throughput remains around 154 MB/s, which is within the required throughput of 147.456 MB/s.
The signals in the example model are logged during simulation. View these signals by using the Logic Analyzer app. This figure shows the logged data of input and output frames.
This figure shows the input and output frames from the model. The result shows the improved contrast in the output image.
 Zuiderveld, Karel. "Contrast Limited Adaptive Histogram Equalization." In Graphics Gems IV, edited by Paul S. Heckbert, 474-485. AP Professional, 1994.