Video processing applications often store a full frame of video data in order to process that frame and modify the next. In such designs, video frames are stored in external memory while FPGA resources are used to process the same data. This example shows how to design a video application with HDMI input and output that performs histogram equalization, using external memory for video frame buffering.
Supported hardware platform
Xilinx® Zynq® ZC706 evaluation kit + FMC-HDMI-CAM mezzanine card
Consider an application involving continuous streaming of video data through the FPGA. In the top model, soc_histogram_equalization_top, the FPGA calculates the histogram of the incoming video stream in the 'FPGA' subsystem while streaming the same video stream to external memory for storage. Once the histogram has been calculated and accumulated across the entire video frame, a synchronization signal is toggled to trigger the read back of the stored frame from external memory. The accumulated histogram vector is then applied to the video stream read back from external memory to perform the equalization algorithm. The external memory frame buffer is modeled using the 'Memory Channel' block in AXI4-Stream Video Frame Buffer mode.
The 'HDMI Input' block reads a video file and provides video data and control signals to downstream FPGA processing blocks. Video data is in YCbCr 4:2:2 format, and the control signals are in the pixel control bus format. The 'HDMI Output' block reads video data and control signals, in the same format as output by the 'HDMI Input' block, and provides a visual output using the Video Display block.
The Push Button block enables bypassing of the histogram equalization algorithm, routing the unprocessed output from the external memory frame buffer to the output.
There are a number of requirements to consider when designing an application that interfaces with external memory:
Throughput: What is the rate at which you need to transfer data to/from memory to satisfy the requirements of your algorithm? Specifically, for vision applications, what are the frame size and frame rate that you must be able to maintain?
Latency: What is the maximum amount of time that your algorithm can tolerate between requesting and receiving data? For vision applications, do you need a continuous stream of data, without gaps? Are you able to buffer samples internal to your algorithm in order to prevent data loss when access to the memory is blocked?
For this histogram equalization example, we have defined the following requirements:
Throughput must be sufficient to maintain a 1920x1080p video stream at 60 frames-per-second.
Latency must be sufficiently low so as not to drop frames.
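The latency requirement can be made concrete with a quick back-of-the-envelope check. This is a sketch in Python; the 60 frames-per-second figure comes from the requirements above:

```python
# Time budget per frame at 60 fps: a stall on the memory channel longer than
# the slack your internal buffering can absorb within this period will
# eventually surface as a dropped frame.
frames_per_second = 60
frame_period_ms = 1e3 / frames_per_second

print(f"Frame period: {frame_period_ms:.2f} ms")  # 16.67 ms per frame
```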
With the above throughput requirement, we can calculate the memory bandwidth that is required for the frame buffer:
As the video format is YCbCr 4:2:2, each pixel requires 2 bytes-per-pixel (BPP). This equates to a throughput requirement of 1920 pixels x 1080 lines x 2 bytes/pixel x 60 frames/s ≈ 248.8 MB/s.
Because the algorithm must both write and read the video data to/from the external memory, this throughput requirement must be doubled, for a total throughput requirement of approximately 497.7 MB/s.
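These figures can be reproduced with a short script. This is a Python sketch using the frame size, pixel format, and frame rate stated above; MB here means 10^6 bytes:

```python
# Frame-buffer throughput for 1920x1080 YCbCr 4:2:2 video at 60 fps.
width, height = 1920, 1080
bytes_per_pixel = 2            # YCbCr 4:2:2
frames_per_second = 60

write_mbps = width * height * bytes_per_pixel * frames_per_second / 1e6
total_mbps = 2 * write_mbps    # each frame is both written and read back

print(f"Write (or read) stream: {write_mbps:.1f} MB/s")  # 248.8 MB/s
print(f"Write + read total:     {total_mbps:.1f} MB/s")  # 497.7 MB/s
```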
In general, your algorithm will be a part of a larger SoC application. In such applications, it is likely that other algorithms will also require access to external memory. In this scenario, you must consider the impact of the other algorithms' memory accesses on the performance and requirements of your algorithm. Assuming that your algorithm shares the memory channel with other components, you should consider the following:
What is the total available memory bandwidth in the SoC system?
How will your algorithm adapt to shared memory bandwidth?
Can your algorithm tolerate an increased read/write latency?
By appropriate modeling of additional memory consumers in the overall application, you can systematically design your algorithm to meet your requirements in situations where access to the memory is not exclusive to your algorithm.
To avoid modeling of all memory readers and writers in the overall system, you can use 'Memory Traffic Generator' blocks to consume read/write bandwidth in your system by creating access requests. In this way, you can simulate additional memory accesses within your system without explicit modeling.
Simulate the system without additional memory consumers and view the memory performance plot from the 'Memory Controller' block.
Here, the memory masters are as follows:
Master 1: Frame Buffer write
Master 2: Frame Buffer read
Master 3: Contention (Memory Traffic Generator) (commented out)
Note that both active masters are each consuming 248.8 MB/s of memory bandwidth.
More Memory Consumers: Consider that your algorithm is part of a larger system, and a secondary algorithm is being developed by a colleague or third party. In this scenario, the secondary algorithm is developed separately in the interest of time and division of work. Rather than combine the two algorithms into a single simulation, you can model the memory access of the secondary algorithm using a Memory Traffic Generator, and simulate the impact, if any, that it will have on your algorithm.
For example, assume that you are provided with the following memory requirements for the secondary algorithm:
Throughput: 1150 MB/s
Given that the primary algorithm consumes ~500 MB/s of the memory bandwidth, and the total available memory bandwidth is 1600 MB/s, we know that the total bandwidth requirement for our system exceeds the total available bandwidth by ~50 MB/s.
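The budget arithmetic can be checked the same way. This is a sketch; the 1600 MB/s available bandwidth and 1150 MB/s secondary requirement are the figures given above:

```python
# Does the combined demand fit within the available memory bandwidth?
primary_mbps = 2 * 1920 * 1080 * 2 * 60 / 1e6  # frame-buffer write + read
secondary_mbps = 1150.0                         # secondary algorithm requirement
available_mbps = 1600.0                         # total available bandwidth

demand_mbps = primary_mbps + secondary_mbps
shortfall_mbps = demand_mbps - available_mbps   # ~50 MB/s over budget

print(f"Demand: {demand_mbps:.1f} MB/s, over budget by {shortfall_mbps:.1f} MB/s")
```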
To enable the modeling of the secondary algorithm's memory access, uncomment the Contention Memory Traffic Generator block. The block mask settings are shown below.
Simulating the system with the secondary algorithm's memory accesses results in the following Memory Bandwidth Usage plot.
As you can see, at around 0.03 s, when the secondary algorithm begins issuing memory access requests, the other masters no longer achieve their required throughput. In the logic analyzer waveform, this manifests as dropped buffers for the Frame Buffer write master and an idle state for the Frame Buffer read master.
The following products are required for this section:
SoC Blockset Support Package for Xilinx Devices. For more information about the support package, see SoC Blockset Supported Hardware.
To implement the model on a supported SoC board, use the SoC Builder application. Open the mask of the 'FPGA' subsystem and set the model variant to 'Pixel based processing'.
Comment out the 'Contention' block.
Click the 'Configure, Build, & Deploy' button in the toolstrip to open SoC Builder.
Select 'Build Model' on the 'Setup' screen. Click 'Next'.
Click 'View/Edit Memory Map' to view the memory map on the 'Review Memory Map' screen. Click 'Next'.
Specify the project folder on the 'Select Project Folder' screen. Click 'Next'.
Select 'Build, load and run' on the 'Select Build Action' screen. Click 'Next'.
Click 'Validate' to check the compatibility of the model for implementation on the 'Validate Model' screen. Click 'Next'.
Click 'Build' to begin building the model on the 'Build Model' screen. An external shell opens when FPGA synthesis begins. Click 'Next'.
Click 'Next' to go to the 'Load Bitstream' screen.
The FPGA synthesis may take more than 30 minutes to complete. To save time, you may want to use the provided pre-generated bitstream by following these steps:
Close the external shell to terminate synthesis.
Copy the pre-generated bitstream to your project folder by running the command below, and then
Click the 'Load and Run' button to load the pre-generated bitstream and run the model on the SoC board.
copyfile(fullfile(matlabshared.supportpkg.getSupportPackageRoot,'toolbox','soc',...
    'supportpackages','xilinxsoc','xilinxsocexamples','bitstreams',...
    'soc_histogram_equalization_top-zc706.bit'), './soc_prj');
Now the model is running on hardware. To measure the memory bandwidth usage in hardware, run the AXI master test bench soc_histogram_equalization_top_aximaster.
The following figure shows the Memory Bandwidth usage when the application is deployed on hardware.
You designed a video application with real-time HDMI I/O and frame buffering in external memory. You explored the effects of other memory consumers on the overall bandwidth. You used SoC Builder to implement the model on hardware and verify the design.