Histogram Equalization Using Video Frame Buffer
Video processing applications often store a full frame of video data to process the frame and modify the next frame. In such designs, video frames are stored in external memory while FPGA resources are used to process the same data. This example shows how to design a video application with HDMI input and output that performs histogram equalization, using external memory for video frame buffering.
Supported hardware platform
Xilinx® Zynq® ZC706 evaluation kit + FMC-HDMI-CAM mezzanine card
Design Task and System Requirements
Consider an application involving continuous streaming of video data through the FPGA. In the top model soc_histogram_equalization_top, the FPGA subsystem calculates the histogram of the incoming video stream while streaming the same video stream to external memory for storage. Once the histogram has been calculated and accumulated across the entire video frame, a synchronization signal is toggled to trigger the read back of the stored frame from external memory. The accumulated histogram vector is then applied to the video stream read back from external memory to perform the equalization algorithm. The external memory frame buffer is modeled using the AXI4 Video Frame Buffer block.
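The two-phase flow described above (accumulate the histogram while the frame is written to memory, then apply it to the frame read back) can be sketched in software. This is a minimal Python model for illustration only, not the logic inside the FPGA subsystem. The cumulative-histogram mapping formula, the function names, and the 8-sample test frame are all assumptions, since the example does not spell out the exact mapping it implements.

```python
def accumulate_histogram(frame, levels=256):
    """Phase 1: count each intensity while the frame streams to memory."""
    hist = [0] * levels
    for pixel in frame:
        hist[pixel] += 1
    return hist


def build_equalization_lut(hist):
    """Convert the accumulated histogram into a lookup table (assumed CDF mapping)."""
    levels = len(hist)
    total = sum(hist)
    # The first non-zero bin count equals the first non-zero cumulative count.
    cdf_min = next(count for count in hist if count > 0)
    if total == cdf_min:
        # Single-intensity frame: nothing to stretch, so pass pixels through.
        return list(range(levels))
    denom = total - cdf_min
    lut = []
    running = 0
    for count in hist:
        running += count
        # Integer round-half-up scaling to the full output range.
        scaled = ((running - cdf_min) * (levels - 1) + denom // 2) // denom
        lut.append(max(0, scaled))
    return lut


def equalize_frame(stored_frame, lut):
    """Phase 2: apply the lookup table to the frame read back from memory."""
    return [lut[pixel] for pixel in stored_frame]


# Hypothetical 8-pixel luma "frame" standing in for a full video frame.
frame = [52, 52, 55, 60, 60, 60, 70, 70]
lut = build_equalization_lut(accumulate_histogram(frame))
equalized = equalize_frame(frame, lut)
print(equalized)  # [0, 0, 43, 170, 170, 170, 255, 255]
```

The lookup table can only be built after the last pixel of the frame has been counted, which is why the frame must be buffered before the mapping is applied.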
The HDMI Input block reads a video file and provides video data and control signals to downstream FPGA processing blocks. Video data is in YCbCr 4:2:2 format, and the control signals are in the pixel control bus format. The HDMI Output block reads video data and control signals, in the same format as output by the HDMI Input block, and provides a visual output using the Video Display block.
The Push Button block enables bypassing of the histogram equalization algorithm, routing the unprocessed output from the external memory frame buffer to the output.
There are a number of requirements to consider when designing an application that interfaces with external memory:
Throughput: At what rate must you transfer data to and from memory to satisfy the requirements of your algorithm? For vision applications specifically, what frame size and frame rate must you be able to sustain?
Latency: What is the maximum amount of time that your algorithm can tolerate between requesting and receiving data? For vision applications, do you need a continuous stream of data, without gaps? Are you able to buffer samples internal to your algorithm in order to prevent data loss when access to the memory is blocked?
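The buffering question in the latency item can be made concrete with a rough sizing rule: an internal FIFO must hold at least the data that arrives while memory access is blocked. The sketch below is a back-of-the-envelope Python calculation, not part of the example model. The function name, the stream rate, and the 10 microsecond stall are hypothetical values chosen for illustration.

```python
def fifo_depth_bytes(stream_rate_bytes_per_s, max_stall_ns):
    """Smallest buffer (in bytes) that absorbs one stall without losing data."""
    # Ceiling division in integer arithmetic avoids floating-point rounding.
    return -(-stream_rate_bytes_per_s * max_stall_ns // 10**9)


# Hypothetical inputs: a 248,832,000 byte/s stream blocked for 10 microseconds.
depth = fifo_depth_bytes(248_832_000, 10_000)
print(depth)  # 2489
```

A real design would add margin for burst alignment and for back-to-back stalls, which this rule ignores.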
For this histogram equalization example, we have defined the following requirements:
Throughput must be sufficient to maintain a 1920x1080p video stream at 60 frames-per-second.
Latency must be sufficiently low so as not to drop frames.
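With the above throughput requirement, we can calculate the memory bandwidth that the frame buffer requires:
1920 × 1080 pixels-per-frame × 60 frames-per-second = 124,416,000 pixels-per-second
As the video format is YCbCr 4:2:2, we require 2 bytes-per-pixel (BPP). This equates to a throughput requirement of
124,416,000 pixels-per-second × 2 BPP = 248,832,000 bytes-per-second (approximately 248.8 MB/s)
Because the algorithm must both write and read the video data to and from the external memory, this throughput requirement must be doubled, for a total throughput requirement of
2 × 248,832,000 bytes-per-second = 497,664,000 bytes-per-second (approximately 497.7 MB/s)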
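The throughput calculation for the frame buffer can be reproduced with a few lines of Python, which also makes it easy to try other resolutions or pixel formats. The function name is hypothetical, and 1 MB is taken as 10^6 bytes, an assumption that matches the 248.8 MB/s per-master figure reported later in this example.

```python
def frame_buffer_throughput(width, height, fps, bytes_per_pixel, directions=2):
    """Return (one-way, total) memory throughput in bytes per second."""
    pixels_per_second = width * height * fps
    one_way = pixels_per_second * bytes_per_pixel
    # The frame is written once and read once, so both directions count.
    return one_way, one_way * directions


# 1080p at 60 frames per second, YCbCr 4:2:2 at 2 bytes per pixel.
one_way, total = frame_buffer_throughput(1920, 1080, 60, 2)
print(f"{one_way / 1e6:.3f} MB/s one way, {total / 1e6:.3f} MB/s total")
# 248.832 MB/s one way, 497.664 MB/s total
```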
Design Using SoC Blockset
In general, your algorithm will be part of a larger SoC application. In such applications, it is likely that other algorithms will also require access to external memory. In this scenario, you must consider the impact of the other algorithms' memory accesses on the performance and requirements of your algorithm. Assuming that your algorithm shares the memory channel with other components, you should consider the following:
What is the total available memory bandwidth in the SoC system?
How will your algorithm adapt to shared memory bandwidth?
Can your algorithm tolerate an increased read/write latency?
By appropriate modeling of additional memory consumers in the overall application, you can systematically design your algorithm to meet your requirements in situations where access to the memory is not exclusive to your algorithm.
To avoid modeling of all memory readers and writers in the overall system, you can use Memory Traffic Generator blocks to consume read/write bandwidth in your system by creating access requests. In this way, you can simulate additional memory accesses within your system without explicit modeling.
Modeling Additional Memory Consumers
Simulate the system without additional memory consumers and view the memory performance plot. Open the Frame Buffer block, on the Performance tab, under Memory Controller, click View performance plots.
Here, the memory masters are as follows:
Master 1: Frame Buffer write
Master 2: Frame Buffer read
Master 3: Contention (Memory Traffic Generator) (commented out)
Note that the two active masters each consume 248.8 MB/s of memory bandwidth.
More Memory Consumers: Consider that your algorithm is part of a larger system, and a secondary algorithm is being developed by a colleague or third party. In this scenario, the secondary algorithm is developed separately in the interest of time and division of work. Rather than combine the two algorithms into a single simulation, you can model the memory access of the secondary algorithm using a Memory Traffic Generator, and simulate the impact, if any, that it will have on your algorithm.
For example, assume that you are provided with the following memory requirements for the secondary algorithm:
Throughput: 1150 MB/s
Given that the primary algorithm consumes ~500 MB/s of the memory bandwidth, and the total available memory bandwidth is 1600 MB/s, we know that the total bandwidth requirement for our system exceeds the total available bandwidth by ~50 MB/s.
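The bandwidth budget described above can be checked with a short Python calculation before running any simulation. This is only a static sum of the stated requirements, using the exact per-master figure of 248.832 MB/s rather than the rounded ~500 MB/s total. The function and dictionary names are hypothetical, and the calculation says nothing about how the memory controller arbitrates between masters.

```python
def bandwidth_budget(available_mb_s, demands_mb_s):
    """Return (total requested bandwidth, amount by which it exceeds the budget)."""
    requested = sum(demands_mb_s.values())
    return requested, requested - available_mb_s


demands = {
    "Frame Buffer write": 248.832,
    "Frame Buffer read": 248.832,
    "Contention (secondary algorithm)": 1150.0,
}
requested, overshoot = bandwidth_budget(1600.0, demands)
print(f"requested {requested:.1f} MB/s, over budget by {overshoot:.1f} MB/s")
# requested 1647.7 MB/s, over budget by 47.7 MB/s
```

A positive overshoot means at least one master cannot reach its required throughput. Which master loses bandwidth depends on the arbitration, so that question is left to the simulation.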
To enable the modeling of the secondary algorithm memory access, uncomment the Contention Memory Traffic Generator block. The block mask settings are shown below.
Simulating the system with the secondary algorithm's memory accesses results in the following Memory Bandwidth Usage plot.
As you can see, at around 0.03 s, when the secondary algorithm begins issuing memory access requests, the other masters no longer achieve their required throughput. In the logic analyzer waveform, this appears as dropped buffers for the Frame Buffer write master and an idle state for the Frame Buffer read master.
Implement and Run on Hardware
The following products are required for this section:
SoC Blockset Support Package for Xilinx Devices. For more information about the support package, see SoC Blockset Supported Hardware.
To implement the model on a supported SoC board, use the SoC Builder tool. Open the mask of the FPGA subsystem and set Model variant to Pixel based processing.
Comment out the Contention block.
To open SoC Builder, in the Simulink® toolstrip, on the System on Chip tab, click Configure, Build, & Deploy.
On the Setup screen, select Build model. Click Next.
On the Select Build Action screen, select Build, load, and run. Click Next.
On the Select Project Folder screen, specify the project folder. Click Next.
On the Review Memory Map screen, view the memory map by clicking View/Edit. Click Next.
On the Validate Model screen, check the compatibility of the model for implementation by clicking Validate. Click Next.
On the Build Model screen, begin building the model by clicking Build. An external shell opens when FPGA synthesis begins. Click Next.
On the Load Bitstream screen, click Next.
The FPGA synthesis can take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following these steps:
Close the external shell to terminate synthesis.
Copy the pre-generated bitstream to your project folder by running this command.
copyfile(fullfile(matlabshared.supportpkg.getSupportPackageRoot,'toolbox','soc',...
    'supportpackages','xilinxsoc','xilinxsocexamples','bitstreams',...
    'soc_histogram_equalization_top-zc706.bit'), './soc_prj');
Load the pre-generated bitstream and run the model on the SoC board by clicking Load and Run.
Now the model is running on hardware. To get the memory bandwidth usage in hardware, execute the following axi-manager test bench for soc_histogram_equalization_top_aximaster.
The following figure shows the Memory Bandwidth usage when the application is deployed on hardware.
You designed a video application with real-time HDMI I/O and frame buffering in external memory. You explored the effects of other memory consumers on overall bandwidth. You used SoC Builder to implement the model on hardware and verify the design.