# Simulation Acceleration using System Objects, MATLAB Coder and Parallel Computing Toolbox

This example shows three ways to accelerate the simulation of communications algorithms in MATLAB®. In particular, it showcases the effects that using System objects, MATLAB to C code generation, and parallel processing runs (using the MATLAB `parfor` function) have on simulation speed.

The combined effect of using these methods may speed up a typical simulation time by an order of magnitude. The difference is tantamount to running the simulation overnight or within just a few hours.

The System objects this example features are accessible in the Communications System Toolbox™ product. In order to run the MATLAB to C code generation section of this example, you must have a MATLAB Coder™ license. In order to run the parallel processing section of this example, you must have a Parallel Computing Toolbox™ license.

## Contents

- Introduction
- Structure of the Example
- Start with a Function-based Algorithm as your Baseline
- Use the MATLAB Profiler GUI to Identify Speed Bottlenecks
- Improve the Simulation Time using System objects
- Accelerate the Simulation with MATLAB to C Code Generation
- Achieve Even Faster Simulation using Parallel Processing Runs
- Summary
- Further Exploration
- Note on Numerical Performance
- Appendix
- Selected References

## Introduction

This example examines various implementations of the following transceiver system:

This system is composed of a transmitter, a channel model, and a receiver. The transmitter processes the input bit stream with a convolutional encoder, an interleaver, a modulator, and a MIMO space-time block encoder [ 1 ], [ 2 ]. The transmitted signal is then processed by a 2x2 MIMO channel and an additive white gaussian noise (AWGN) channel. The receiver processes its input signal with a 2x2 MIMO space-time block decoder, a demodulator, a deinterleaver, and a Viterbi decoder to recover the best estimate of the input bit stream at the receiver.

## Structure of the Example

The workflow of this example is as follows:

- Start with a function-based algorithm as your baseline
- Use the MATLAB Profiler GUI to identify speed bottlenecks
- Improve the simulation time using System objects
- Accelerate the simulation with MATLAB to C code generation
- Achieve even faster simulation using parallel processing runs

## Start with a Function-based Algorithm as your Baseline

Start with a function that represents the first version or baseline implementation of this algorithm. The inputs to the commaccelerationbaseline function are the Eb/No value of the current frame (EbNo) and the number of bits the algorithm processes (MaxNumBits). Eb/No is the ratio of energy per bit to noise power spectral density. The function output is the bit error rate (BER) of the algorithm. Note that the baseline algorithm implementation uses functions and data objects from the Communication System Toolbox, along with some user-defined functions, such as localmimoencoder, localmimodecoder and localmimochannel.

```
type commaccelerationbaseline
```

function ber = commaccelerationbaseline(EbNo, maxNumBits) % COMMACCELERATIONBASELINE Simulate a communications link with functions. % BER = COMMACCELERATIONBASELINE(EBNO, MAXNUMBITS) returns the bit error rate % (BER) of a communications link that includes convolutional coding, % interleaving, PSK modulation, an Alamouti space-time block code, and a MIMO % channel with AWGN. EBNO is the Eb/No of the AWGN channel in dB, and % MAXNUMBITS is the number of bits simulated before the simulation stops. The % simulation uses functions to implement the above operations. % Copyright 2011-2014 The MathWorks, Inc. rng default FRM = 2000; % Input Bit Frame M = 4; % Modulation Order k = log2(M); % Bits per Symbol codeRate = 1/2; % Coding Rate adjSNR = EbNo - 10*log10(1/codeRate) + 10*log10(k); tblen = 32; trellis = poly2trellis(7,[171 133]); LEN=4012; permvec=[1:3:LEN 2:3:LEN 3:3:LEN]'; %% Run Simulation totErrs = 0;totBits = 0; while (totBits <= maxNumBits) data = [randi([0 1],FRM,1);false(6,1)]; u1 = convenc(data,trellis); % Convolutional Encoder u2 = intrlv(double(u1),permvec'); % Interleaver u3 = bi2de(reshape(u2,FRM+6,2)); % Convert to M-ary symbols u4 = pskmod(u3,M,pi/4,'gray'); % QPSK Modulator u5 = localmimoencoder(2,u4); % Alamouti Space-Time Block Encoder [u6, u7] = localmimochannel(u5,2); % 2x2 MIMO Channel u8 = awgn(u6, adjSNR, 'measured', [], 'dB'); % AWGN channel u9 = localmimodecoder(u8,u7,2); % Alamouti Space-Time Block Decoder uA = pskdemod(u9,M,pi/4,'gray'); % QPSK Demodulator uB = reshape(de2bi(uA),LEN,1); % Convert to binary uC = deintrlv(uB,permvec'); % Deinterleaver uD = vitdec(uC(:),trellis,tblen, ... % Viterbi Decoder 'term','hard'); bitErrs = biterr(uD(1:FRM),data(1:FRM)); totErrs = totErrs + bitErrs; totBits = totBits + FRM; end ber = totErrs/totBits;

As a starting point, measure the time it takes to run this baseline algorithm in MATLAB. Use the MATLAB timing functions (tic and toc) to record the elapsed time for running this function, as it is called within a for-loop that iterates over Eb/No values from 0 to 7 dB.

MaxSNRdB=7;EbNo=1;MaxNumBits=2e5; N=1;str='Baseline'; commaccelerationbaseline(EbNo,1e4); berBaseline=zeros(size(0:MaxSNRdB)); fprintf(1,'Processing the baseline algorithm.\n'); tic; for EbNo=0:MaxSNRdB y=commaccelerationbaseline(EbNo,MaxNumBits); berBaseline(EbNo+1)=y; end a=toc;

Processing the baseline algorithm.

The result shows the simulation time (in seconds) of the baseline algorithm. Use this measurement as a yardstick to compare with subsequent versions of the algorithm.

commaccelerationreportresults(N,a,a,str);

---------------------------------------------------------------------------------------------- Versions of the Transceiver | Elapsed Time (sec)| Acceleration Ratio 1. Baseline | 48.1736 | 1.0000 ----------------------------------------------------------------------------------------------

## Use the MATLAB Profiler GUI to Identify Speed Bottlenecks

Identify the processing bottlenecks and problem areas of the baseline algorithm using the MATLAB Profiler. Obtain the profiler information by executing the following script:

profile on y=commaccelerationbaseline(EbNo,1e5); profile off

The Profiler report presents the execution time for each function call of the algorithm in descending order. The first few functions that the Profiler window depicts represent the speed bottleneck of the algorithm. In this case, two user-defined functions (localmimodecoder and localmimochannel) and the vitdec function (Viterbi decoder function of the System Toolbox) are identified as the major speed bottlenecks.

## Improve the Simulation Time using System objects

The function commaccelerationsystemobjects implements the second version of the algorithm, which uses System objects from the Communications System Toolbox. Among ten function calls found in the baseline version of this algorithm, nine calls are replaced by corresponding calls to available System objects. The resulting commaccelerationsystemobjects function is composed of two distinct parts:

- Declaration, which creates the System objects
- Execution, which calls the step method for each System object in order to perform specific operations.

```
type commaccelerationsystemobjects
```

function ber = commaccelerationsystemobjects(EbNo, maxNumBits) % COMMACCELERATIONSYSTEMOBJECTS Simulate a comms link with System objects. % BER = COMMACCELERATIONSYSTEMOBJECTS(EBNO, MAXNUMBITS) returns the bit error rate % (BER) of a communications link that includes convolutional coding, % interleaving, PSK modulation, an Alamouti space-time block code, and a MIMO % channel with AWGN. EBNO is the Eb/No of the AWGN channel in dB, and % MAXNUMBITS is the number of bits simulated before the simulation stops. The % simulation uses System objects to implement the above operations. % Copyright 2011 The MathWorks, Inc. %#codegen coder.extrinsic('rng'); rng('default'); FRM = 2000; % Input Bit Frame M=4; % Modulation Order k=log2(M); % Bits per Symbol codeRate=1/2; % Coding Rate adjSNR=EbNo-10*log10(1/codeRate)+10*log10(k); tblen = 32; trellis = poly2trellis(7, [171 133]); LEN=4012;permvec=[1:3:LEN 2:3:LEN 3:3:LEN]'; %% Setup Algorithm Parameters persistent hConvEncoder hViterbi hInterleaver hDeinterleaver persistent hModulator hDeModulator hOSTBCComb hOSTBCEnc hAWGN if isempty(hConvEncoder) hConvEncoder=comm.ConvolutionalEncoder(trellis,... 'TerminationMethod','Terminated'); hModulator = comm.PSKModulator('ModulationOrder',4, ... 'SymbolMapping','gray', ... 'PhaseOffset',0, ... 'BitInput',true); hAWGN=comm.AWGNChannel('NoiseMethod','Variance',... 'VarianceSource','Input port'); hDeModulator = comm.PSKDemodulator('ModulationOrder',4, ... 'SymbolMapping','gray', ... 'PhaseOffset',0, ... 'BitOutput',true, ... 'DecisionMethod','Hard decision'); hViterbi=comm.ViterbiDecoder(trellis,... 'InputFormat','Hard',... 'TracebackDepth',tblen,... 'OutputDataType', 'logical',... 'TerminationMethod','Terminated'); hOSTBCEnc = comm.OSTBCEncoder('NumTransmitAntennas', 2); hOSTBCComb = comm.OSTBCCombiner('NumTransmitAntennas',2,'NumReceiveAntennas',2); hInterleaver= comm.BlockInterleaver('PermutationVector',permvec); hDeinterleaver= comm.BlockDeinterleaver('PermutationVector',permvec); end %% Run Simulation totErrs = 0; totBits = 0; while (totBits < maxNumBits) data = randn(FRM, 1)>0; u1 = step(hConvEncoder, data); % Convolutional Encoder u2 = step(hInterleaver,u1); % Interleaver u3 = step(hModulator, u2); % QPSK Modulator u5 = step(hOSTBCEnc,u3); % Alamouti Space-Time Block Encoder [u6,~,u7] = localmimochannel(u5,2); % 2x2 MIMO Channel sigDB = 10*log10(var(u6(:))); noisevar = real(10.^(0.1*(sigDB-adjSNR))); u8 = step(hAWGN,u6,noisevar); % AWGN channel u9 = step(hOSTBCComb,u8,u7); % Alamouti Space-Time Block Decoder uA = step(hDeModulator, u9); % QPSK Demodulator uB = step(hDeinterleaver,uA); % Deinterleaver uC = step(hViterbi,uB); % Viterbi Decoder bitErrs = sum(abs(uC(1:FRM)-data)); totErrs = totErrs + bitErrs; totBits = totBits + FRM; end ber = totErrs/totBits;

Measure the simulation time for this version of the algorithm. Record the elapsed time for running this function in the same for-loop as before.

N=2; str='Using System objects'; commaccelerationsystemobjects(EbNo,1e4); berSystemobject=zeros(size(berBaseline)); fprintf(1,'Processing the System object version of the algorithm.\n'); tic; for EbNo=0:MaxSNRdB y=commaccelerationsystemobjects(EbNo,MaxNumBits); berSystemobject(EbNo+1)=y; end b=toc;

Processing the System object version of the algorithm.

The result shows the simulation time of the System object version of the algorithm. Note that by replacing calls to the functions with calls to System objects, the algorithm runs faster. This behavior is expected because one of the advantages of using System objects is efficiency. Algorithms using System objects separate declaration from execution, which avoids the repeated input validation and verification routines found in function-based algorithms. The algorithm implementation that uses System objects performs parameter handling and initializations only once, outside the loop, and exploits efficient MEX implementations of the System objects, which improves overall simulation performance.

commaccelerationreportresults(N,a,b,str);

---------------------------------------------------------------------------------------------- Versions of the Transceiver | Elapsed Time (sec)| Acceleration Ratio 1. Baseline | 48.1736 | 1.0000 2. Using System objects | 12.8705 | 3.7429 ----------------------------------------------------------------------------------------------

## Accelerate the Simulation with MATLAB to C Code Generation

MATLAB Coder generates portable and readable C code from algorithms that are part of the MATLAB code generation subset. The function commaccelerationsystemobjects, the second version of the algorithm, uses System objects that support code generation with MATLAB Coder. Therefore, this algorithm can be compiled and linked into MATLAB as a MEX (MATLAB executable) function. Use the MATLAB Coder codegen command to compile the System objects version of the algorithm into a MEX function (called commaccelerationsystemobjects_mex).

Note: You must have a MATLAB Coder license to run this section of the example.

codegen('commaccelerationsystemobjects.m','-args',{EbNo,MaxNumBits})

Measure the simulation time for the MEX version of the algorithm. Record the elapsed time for running this function in the same for-loop as before.

N=3; str='MATLAB to C code generation'; commaccelerationsystemobjects_mex(EbNo,1e4); berCodegen=zeros(size(berBaseline)); fprintf(1,'Processing the MEX function of the second version of the algorithm.\n'); tic; for EbNo=0:MaxSNRdB y=commaccelerationsystemobjects_mex(EbNo,MaxNumBits); berCodegen(EbNo+1)=y; end c=toc;

Processing the MEX function of the second version of the algorithm.

The result shows the simulation time of the MEX version of the algorithm. Note that by compiling the System object algorithm into a MEX function, the MEX version of the algorithm runs faster than both the second and the baseline versions of the algorithm. This behavior is expected because one of the advantages of using MATLAB to C code generation is simulation acceleration. Although the algorithm that uses System objects is highly optimized, code generation can accelerate simulation by locking-down the sizes and data-types of variables inside the function. This process makes the execution more efficient because it removes the overhead of the interpreted language that checks for size and data-type in every line of the code.

commaccelerationreportresults(N,a,c,str);

---------------------------------------------------------------------------------------------- Versions of the Transceiver | Elapsed Time (sec)| Acceleration Ratio 1. Baseline | 48.1736 | 1.0000 2. Using System objects | 12.8705 | 3.7429 3. MATLAB to C code generation | 5.2268 | 9.2167 ----------------------------------------------------------------------------------------------

## Achieve Even Faster Simulation using Parallel Processing Runs

Utilize multiple cores to increase simulation acceleration by running tasks in parallel. Use parallel processing runs (`parfor` loops) in MATLAB to perform the work on the number of available workers. Parallel Computing Toolbox enables you to run different iterations of the simulation in parallel. Use the `parpool` function to reserve a number of MATLAB workers for executing a subsequent `parfor`-loop. In this example, two workers run locally on a MATLAB client machine.

Note: You must have a Parallel Computing Toolbox license to run this section of the example.

if isempty(gcp('nocreate')), parpool; end

Starting parallel pool (parpool) using the 'local' profile ... connected to 8 workers.

Measure the simulation time for the MEX version of the algorithm that executes within a `parfor`-loop rather than a for-loop used in the previous cases.

N=4; str='Parallel simulation runs with parfor'; commaccelerationsystemobjects_mex(EbNo,1e4); berPct=zeros(size(berBaseline)); fprintf(1,'Processing the MEX function of the second version of the algorithm within a parfor-loop.\n'); tic; parfor EbNo=0:MaxSNRdB y=commaccelerationsystemobjects_mex(EbNo,MaxNumBits); berPct(EbNo+1)=y; end d=toc;

Processing the MEX function of the second version of the algorithm within a parfor-loop.

The result shows the simulation time of the MEX version of the second algorithm executing within a `parfor`-loop. Note that by running this version within a `parfor`-loop we get the fastest simulation performance. The basic concept of a `parfor`-loop is the same as the standard MATLAB for-loop. The difference is that `parfor` divides the loop iterations into groups so that each worker executes some portion of the total number of iterations. Because several MATLAB workers can be computing concurrently on the same loop, a `parfor`-loop provides significantly better performance than its analogous for-loop.

commaccelerationreportresults(N,a,d,str);

---------------------------------------------------------------------------------------------- Versions of the Transceiver | Elapsed Time (sec)| Acceleration Ratio 1. Baseline | 48.1736 | 1.0000 2. Using System objects | 12.8705 | 3.7429 3. MATLAB to C code generation | 5.2268 | 9.2167 4. Parallel simulation runs with parfor | 2.8600 | 16.8440 ----------------------------------------------------------------------------------------------

## Summary

The combined effects of

- System objects,
- MATLAB to C code generation and
- Parallel processing runs

can significantly speed up simulations of your communications algorithms.

- System objects help improve simulation speed by performing parameter handling and initializations only once outside the execution loop, by exploiting efficient MEX implementations of the System objects, and by avoiding repeated input validation and verification routines found in function-based algorithms.

- MATLAB to C code generation accelerates simulation by locking-down data-types and sizes of every variable and by reducing the overhead of the interpreted language that checks for the size and data-type of variables in every line of the code.

- Parallel processing runs can substantially accelerate simulation by computing different iterations of your algorithm concurrently on a multitude of available MATLAB workers.

## Further Exploration

In this example we use the function `parpool` to reserve a number of MATLAB workers that run locally on our MATLAB client machine. By modifying the parallel configurations, you can accelerate the simulation even further by running the algorithm on a larger cluster of workers that are not on your MATLAB client machine. For a description of how to manage and use parallel configurations, see Programming with User Configurations page of Parallel Computing Toolbox User's Guide.

## Note on Numerical Performance

In this example, the baseline and System object versions of the algorithm are not numerically identical. The user-defined function localmimodecoder used in the baseline algorithm implements a slightly different version of the MIMO space-time block decoding operation than the one used in the comm.OSTBCCombiner System object of the second version of the algorithm. Nevertheless, when you run each of the first three versions of the algorithm (the baseline version, the System object version, and the MEX function of the second version) with a sufficiently large number of input bits, you will get very similar BER curves indicating similar numerical performance for different versions of the algorithm. Here you find the BER curves obtained by running each of the three versions of the algorithm with the number of input bits set to ten millions (i.e. MaxNumBits=1e7).

## Appendix

The following functions are used in this example.

## Selected References

- S. M. Alamouti, "A simple transmit diversity technique for wireless communications,"
*IEEE® Journal on Selected Areas in Communications*, vol. 16, no. 8, pp. 1451-1458, Oct. 1998. - V. Tarokh, H. Jafarkhami, and A. R. Calderbank, "Space-time block codes from orthogonal designs,"
*IEEE Transactions on Information Theory*, vol. 45, no. 5, pp. 1456-1467, Jul. 1999.