# Performing Large Matrix Operation on FPGA using External Memory

This example shows how to:

- Generate an HDL IP core with an AXI4 Master interface.
- Access large matrices from the external DDR3 memory on the Xilinx Zynq ZC706 board using the AXI4 Master interface.
- Perform matrix vector multiplication in the HDL IP core and write the output result back to the DDR3 memory using the AXI4 Master interface.

## Before You Begin

To run this example, you must have the following software and hardware installed and set up:

- Xilinx Vivado Design Suite, with supported version listed in the HDL Coder documentation
- Xilinx Zynq ZC706 Evaluation Kit
- HDL Coder Support Package for Xilinx Zynq Platform
- HDL Verifier Support Package for Xilinx Zynq Platform

## Introduction

This example models a matrix vector multiplication algorithm and implements the algorithm on the Xilinx Zynq FPGA board. Large matrices may not map efficiently to Block RAMs on the FPGA fabric. Instead, we can store the matrices in the external DDR3 memory on the FPGA board. The AXI4 Master interface can access the data by communicating with vendor-provided memory interface IP cores that interface with the DDR3 memory. This capability enables you to model algorithms that process large amounts of data and require high-throughput DDR access, such as matrix operations and computer vision algorithms.

The matrix vector multiplication module supports fixed-point matrix vector multiplication, with a configurable matrix size ranging from 2 to 4000. The matrix size is configurable at run time through an AXI4-accessible register.

```
modelname = 'hdlcoder_external_memory';
open_system(modelname);
```

## Model Algorithm Using AXI4 Master Protocol

This example model includes the FPGA implementable `DUT` (Design under test) block, the `DDR` functional behavior block and the test environment to drive inputs and verify the expected outputs.

The `DUT` subsystem contains the AXI4 Master read/write controller along with the matrix vector multiplication module. Using the AXI4 Master interface, the `DUT` subsystem reads data from the external DDR3 memory, feeds the data into the `Matrix_Vector_Multiplication` module, and then writes the output data to the external DDR3 memory. The `DUT` module also has several parameter ports. These ports are mapped to AXI4-Lite accessible registers, so you can adjust these parameters from MATLAB even after you implement the design on the FPGA.

The `DDR` module represents the external DDR memory in the simulation environment. The interface between the `DUT` and `DDR` modules is the simplified AXI4 Master protocol.

One of the parameter ports, `matrix_mul_on`, controls whether to run the `Matrix_Vector_Multiplication` module. When the input to `matrix_mul_on` is true, the `DUT` subsystem performs matrix vector multiplication as described above. When the input to `matrix_mul_on` is false, the `DUT` subsystem runs in data loop back mode. In this mode, the `DUT` subsystem reads data from the external DDR3 memory, writes it into the `Internal_Memory` module, and then writes the same data back to the external DDR3 memory. The data loop back mode is a simple way to verify the functionality of the AXI4 Master external DDR3 memory access.

```
open_system('hdlcoder_external_memory/DUT');
```

Inside the `DUT` subsystem, the `DDR_Access` module models the simplified AXI4 Master protocol and uses it to read data from and write data to the DDR. During the IP Core Generation workflow, HDL Coder generates the translator between the simplified AXI4 Master protocol and the actual AXI4 Master protocol in the generated HDL IP core. For more information on the simplified AXI4 Master protocol, refer to the Model Design for AXI4 Master Interface Generation documentation.

Also inside the `DUT` subsystem, the `Matrix_Vector_Multiplication` module uses a multiply-add block to implement a streaming dot-product computation for the inner-product of the matrix vector multiplication.

Let `A` be a matrix of size NxN and `B` a vector of size Nx1.

Then the matrix vector multiplication output is `Z` = `A` * `B`, of size Nx1.

The first N values read from the DDR are treated as the Nx1 vector, followed by the NxN matrix data. The first N values (the vector data) are stored in a RAM. From value N+1 onward, the data is streamed directly as matrix data, while the vector data is read from the `Vector_RAM` in parallel. Both the matrix and vector inputs are fed into the `Matrix_mul_top` subsystem. The first output value is available after N clock cycles and is stored in the output RAM. The vector RAM read address is then reset to 0, and the same vector data is read again for the next matrix row. This operation repeats for every row of the matrix.
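The data flow described above can be sketched as a behavioral reference in plain MATLAB. This is a minimal sketch, not the shipped model: it assumes double-precision arithmetic instead of fixed point, assumes the matrix is stored row by row after the vector, and uses illustrative variable names (`ddr`, `vectorRam`, `outputRam`, `acc`, `rdAddr`) that do not come from the example.

```matlab
% Behavioral reference for the streaming matrix vector multiplication.
% Assumptions (not from the shipped model): double arithmetic instead of
% fixed point, and matrix data stored row by row after the vector.
N = 4;                               % small size for illustration
A = reshape(1:N*N, N, N).';          % A(i,j) = (i-1)*N + j
B = (1:N).';                         % N-by-1 vector

% DDR image: the first N values are the vector, then N*N matrix values.
ddr = [B; reshape(A.', [], 1)];

vectorRam = ddr(1:N);                % first N values are stored in a RAM
outputRam = zeros(N, 1);             % one result per matrix row
acc = 0;                             % multiply-add accumulator
rdAddr = 1;                          % vector RAM read address (1-based here)

for k = N+1:numel(ddr)               % values N+1 onward stream in as matrix data
    acc = acc + ddr(k) * vectorRam(rdAddr);
    if rdAddr == N                   % end of a row: store result, restart vector read
        row = (k - N) / N;
        outputRam(row) = acc;
        acc = 0;
        rdAddr = 1;
    else
        rdAddr = rdAddr + 1;
    end
end

Z = outputRam;
assert(isequal(Z, A * B));           % streaming result equals the direct product
```

Each pass of the loop corresponds to one streamed matrix value, so a row result is produced every N iterations, which mirrors the N clock cycles per output mentioned above.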

The following diagram shows the architecture of the `Matrix_Vector_Multiplication` module.

## Functional Simulation in Simulink

You can simulate this example model and verify the simulation result by running the following script in MATLAB:

hdlcoder_external_memory_simulation;

PASSED: DDR initialization data matches.

PASSED: Matrix vector multiplication output matches with the expected data

This script first initializes parameters such as `Matrix_Size`. By default, `Matrix_Size` is 64, which means a 64x64 matrix. The default `Matrix_Size` is kept small so that the simulation runs faster. After the DUT is implemented on the FPGA board, you can use a larger `Matrix_Size`, because the FPGA calculation is much faster. You can also adjust these parameters in the script.

The script then simulates the model, and verifies the result by comparing the logged simulation result with the expected value.

By default, `Matrix_Multiplication_On` is true, and the script verifies the matrix vector multiplication result.

When `Matrix_Multiplication_On` is false, the script verifies the loop back mode, in which the DUT reads `Burst_Length` values from DDR and writes the same data back to DDR.
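The loop back check amounts to comparing the data written back with the data that was read. The following is a minimal behavioral sketch of that comparison, not the contents of the example script: the `Burst_Length` value, the test data, the variable names `ddrIn`, `internalMemory`, and `ddrOut`, and the printed messages are all hypothetical.

```matlab
% Behavioral sketch of the loop back mode check.
% Burst_Length, the test data, and the messages are illustrative assumptions.
Burst_Length = 16;
ddrIn = (0:Burst_Length-1).' * 3 + 7;    % arbitrary input data in "DDR"
internalMemory = zeros(Burst_Length, 1); % stands in for Internal_Memory
ddrOut = zeros(Burst_Length, 1);         % output region in "DDR"

for k = 1:Burst_Length                   % read burst: DDR -> internal memory
    internalMemory(k) = ddrIn(k);
end
for k = 1:Burst_Length                   % write burst: internal memory -> DDR
    ddrOut(k) = internalMemory(k);
end

if isequal(ddrOut, ddrIn)
    disp('Loop back data matches.');
else
    disp('Loop back data mismatch.');
end
```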

If you have a DSP System Toolbox license, you can view the model signals over time using the Logic Analyzer.

## Generate HDL IP core with AXI4 Master Interface

Next, we start the HDL Workflow Advisor and use the IP Core Generation workflow to deploy this design on the Zynq hardware. For a more detailed step-by-step guide, you can refer to the Getting Started with HW/SW Codesign Workflow for Xilinx Zynq Platform example.

**1.** Set up the Xilinx Vivado synthesis tool path using the following command in the MATLAB command window. Use your own Vivado installation path when you run the command.

hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2016.4\bin\vivado.bat')

**2.** Start the HDL Workflow Advisor from the DUT subsystem, `hdlcoder_external_memory/DUT`. The target interface settings are saved on the model. Notice that **Target workflow** is `IP Core Generation`, **Target platform** is `Xilinx Zynq ZC706 evaluation kit`, **Reference Design** is `Default System with External DDR3 memory access`, and **Target platform interface table** settings are as shown below.

In this example, input parameter ports such as `matrix_mul_on`, `matrix_size`, `burst_len`, `burst_from_ddr` and `burst start` are mapped to the `AXI4-Lite` interface. HDL Coder generates AXI4-accessible registers for these ports. Later, you can use MATLAB to tune these parameters at run time while the design is running on the FPGA board.

The AXI4 Master interface has separate read and write channels. Read channel ports such as `axim_rd_data`, `axim_rd_s2m`, and `axim_rd_m2s` are mapped to the `AXI4 Master Read` interface. Write channel ports such as `axim_wr_data`, `axim_wr_s2m`, and `axim_wr_m2s` are mapped to the `AXI4 Master Write` interface.

**3.** Right-click Task 3.2, **Generate RTL Code and IP Core**, and select **Run to Selected Task** to generate the IP core. You can find the register address mapping and other documentation for the IP core in the generated IP Core Report.

**4.** Right-click Task 4.2, **Build FPGA Bitstream**, and select **Run to Selected Task** to generate the Vivado project and then build the FPGA bitstream.

During project creation, the generated DUT IP core is integrated into the `Default System with External DDR3 Memory Access` reference design. This reference design includes a Xilinx Memory Interface Generator IP to communicate with the on-board external DDR3 memory on the ZC706 platform. The MATLAB as AXI Master IP is also added to enable MATLAB to control the DUT IP, and to initialize and verify the DDR memory content.

You can click the link in the result window in Task 4.1 "Create Project" to view the generated Vivado project. If you open the Vivado block design, the generated reference design project looks similar to this architecture diagram.

## Run FPGA Implementation on Zynq Hardware

After the FPGA bitstream is generated, you can run Task 4.3 **Program Target Device** to program the FPGA board through a JTAG cable.

You can then run the FPGA implementation and verify the hardware result by running the following script in MATLAB:

hdlcoder_external_memory_hw_run

This script first initializes the `Matrix_Size` to 500, which means a 500x500 matrix. You can adjust the `Matrix_Size` up to 4000.

The AXI4 Master Read and Write channel base addresses are then configured. These addresses define where in the external DDR memory the DUT reads from and writes to. In this script, the DUT reads from base address '40000000' and writes to base address '50000000'.
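To see how much memory the input data occupies relative to these two base addresses, the arithmetic can be sketched as follows. This is a rough sketch under stated assumptions: the addresses are treated as hexadecimal, the data layout is the vector followed by the matrix as described earlier, and `bytesPerWord = 4` (32-bit data) is an assumption that should be checked against the generated IP Core Report. All variable names other than `Matrix_Size` are illustrative.

```matlab
% Address arithmetic sketch for the DDR regions.
% bytesPerWord = 4 is an assumption (32-bit data), not taken from the example.
rdBase = hex2dec('40000000');        % AXI4 Master read base address
wrBase = hex2dec('50000000');        % AXI4 Master write base address
bytesPerWord = 4;
Matrix_Size = 500;

numInputWords  = Matrix_Size + Matrix_Size^2;   % vector first, then matrix
numOutputWords = Matrix_Size;                   % one result per matrix row

matrixStart = rdBase + Matrix_Size * bytesPerWord;        % first matrix word
inputEnd    = rdBase + numInputWords * bytesPerWord - 1;  % last input byte
outputEnd   = wrBase + numOutputWords * bytesPerWord - 1; % last output byte

fprintf('Matrix data starts at 0x%s\n', dec2hex(matrixStart));
fprintf('Input region ends at  0x%s\n', dec2hex(inputEnd));
fprintf('Output region ends at 0x%s\n', dec2hex(outputEnd));

% The input region must not run into the output region.
assert(inputEnd < wrBase);
```

Under the same 4-byte assumption, the maximum `Matrix_Size` of 4000 needs (4000 + 4000^2) * 4 = 64,016,000 bytes of input, which still fits within the 268,435,456 bytes between the two base addresses.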

Next, the MATLAB as AXI Master feature is used to initialize the external DDR3 memory with the input vector and matrix data, and to clear the output DDR memory location.

The DUT calculation is then started by controlling the AXI4-Lite accessible registers. The DUT IP core first reads the input data from the DDR memory, performs the matrix vector multiplication, and then writes the result back to the DDR memory.

Finally, the output result is read back to MATLAB, and compared with the expected value. In this way, the hardware results are verified in MATLAB.