Improve Execution Efficiency by Reordering Block Operations in the Generated Code

To improve execution efficiency, the code generator can change the block execution order. In the Configuration Parameters dialog box, when you set the Optimize block order in the generated code parameter to Improved Execution Speed, the code generator can change the block operation order to implement these optimizations:

  • Eliminate data copies for blocks that perform inplace operations (that is, use the same input and output variable) and contain algorithm code with unnecessary data copies.
  • Combine more for loops by executing blocks that have the same output size together.
  • Reuse the same variable for the input, output, and state of a Unit Delay block by executing the Unit Delay block before upstream blocks.

These optimizations improve execution speed and reduce RAM and ROM consumption.

Example Model

Open the model rtwdemo_optimizeblockorder. This model contains three subsystems that demonstrate how reordering block operations improves execution efficiency.

for Loop Fusion

The subsystem LoopFusionScheduling shows how the code generator reorders block operations so that blocks that have the same output size execute together. This reordering enables for loop fusion. Set the Optimize block order in the generated code parameter to Off.

In your system's temporary folder, create a folder for the build and inspection process and build the model.

### Starting build procedure for model: rtwdemo_optimizeblockorder
### Successful completion of build procedure for model: rtwdemo_optimizeblockorder

View the generated code without the optimization. Code for the LoopFusionScheduling subsystem:

/* Output and update for atomic system: '<Root>/LoopFusionScheduling' */
static void LoopFusionScheduling(const real_T rtu_In1[6], const real_T rtu_In2[6],
  const real_T rtu_In3[6], const real_T rtu_In4[6], real_T rty_Out1[6], real_T
  rty_Out2[9], real_T rty_Out3[6], real_T rty_Out4[9])
{
  int32_T i;
  int32_T i_0;
  int32_T tmp;

  /* Bias: '<S2>/Bias' incorporates:
   *  Gain: '<S2>/Gain'
   */
  for (i = 0; i < 6; i++) {
    rty_Out1[i] = -0.3 * rtu_In1[i] + 0.5;
  }

  /* End of Bias: '<S2>/Bias' */

  /* Product: '<S2>/Product' */
  for (i = 0; i < 3; i++) {
    for (i_0 = 0; i_0 < 3; i_0++) {
      tmp = i_0 + 3 * i;
      rty_Out2[tmp] = 0.0;
      rty_Out2[tmp] = rty_Out2[3 * i + i_0] + rtu_In2[i << 1] * rtu_In1[i_0];
      rty_Out2[tmp] = rtu_In2[(i << 1) + 1] * rtu_In1[i_0 + 3] + rty_Out2[3 * i
        + i_0];
    }
  }

  /* End of Product: '<S2>/Product' */

  /* Bias: '<S2>/Bias1' incorporates:
   *  Gain: '<S2>/Gain1'
   */
  for (i = 0; i < 6; i++) {
    rty_Out3[i] = -0.3 * rtu_In3[i] + 0.5;
  }

  /* End of Bias: '<S2>/Bias1' */

  /* Product: '<S2>/Product1' */
  for (i = 0; i < 3; i++) {
    for (i_0 = 0; i_0 < 3; i_0++) {
      tmp = i_0 + 3 * i;
      rty_Out4[tmp] = 0.0;
      rty_Out4[tmp] = rty_Out4[3 * i + i_0] + rtu_In4[i << 1] * rtu_In3[i_0];
      rty_Out4[tmp] = rtu_In4[(i << 1) + 1] * rtu_In3[i_0 + 3] + rty_Out4[3 * i
        + i_0];
    }
  }

  /* End of Product: '<S2>/Product1' */
}

With the default execution order, the blocks execute from left to right and from top to bottom. As a result, there are separate for loops for the two combinations of Gain and Bias blocks and the Product blocks.
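The index arithmetic in the nested Product loops above is consistent with a column-major matrix product. As a reading aid, the following minimal sketch assumes that rtu_In1 is a 3-by-2 matrix and rtu_In2 is a 2-by-3 matrix, both stored column by column. Those shapes are inferred from the listing rather than stated by the model, and the function name is hypothetical.

```c
/* Hypothetical helper: c = a * b with column-major storage.
 * Assumed shapes (inferred from the listing's indexing):
 *   a is 3-by-2, b is 2-by-3, c is 3-by-3. */
void matmul_3x2_2x3(const double a[6], const double b[6], double c[9])
{
  int row, col, k;

  for (col = 0; col < 3; col++) {
    for (row = 0; row < 3; row++) {
      double acc = 0.0;

      /* Inner product of row `row` of a with column `col` of b. */
      for (k = 0; k < 2; k++) {
        acc += a[row + 3 * k] * b[k + 2 * col];
      }

      /* Same linear index as tmp = i_0 + 3 * i in the listing. */
      c[row + 3 * col] = acc;
    }
  }
}
```

The generated code unrolls the two-term inner sum and writes the index k + 2 * col as (i << 1) and (i << 1) + 1.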

Generate code with the optimization. Set the Optimize block order in the generated code parameter to Improved Execution Speed and build the model.

### Starting build procedure for model: rtwdemo_optimizeblockorder
### Successful completion of build procedure for model: rtwdemo_optimizeblockorder

View the generated code with the optimization.

/* Output and update for atomic system: '<Root>/LoopFusionScheduling' */
static void LoopFusionScheduling(const real_T rtu_In1[6], const real_T rtu_In2[6],
  const real_T rtu_In3[6], const real_T rtu_In4[6], real_T rty_Out1[6], real_T
  rty_Out2[9], real_T rty_Out3[6], real_T rty_Out4[9])
{
  int32_T i;
  int32_T i_0;
  int32_T tmp;
  int32_T tmp_0;
  for (i = 0; i < 3; i++) {
    for (i_0 = 0; i_0 < 3; i_0++) {
      /* Product: '<S2>/Product' incorporates:
       *  Product: '<S2>/Product1'
       */
      tmp = i_0 + 3 * i;
      rty_Out2[tmp] = 0.0;

      /* Product: '<S2>/Product1' */
      rty_Out4[tmp] = 0.0;

      /* Product: '<S2>/Product' incorporates:
       *  Product: '<S2>/Product1'
       */
      tmp_0 = 3 * i + i_0;
      rty_Out2[tmp] = rty_Out2[tmp_0] + rtu_In2[i << 1] * rtu_In1[i_0];

      /* Product: '<S2>/Product1' */
      rty_Out4[tmp] = rty_Out4[tmp_0] + rtu_In4[i << 1] * rtu_In3[i_0];

      /* Product: '<S2>/Product' */
      rty_Out2[tmp] = rtu_In2[(i << 1) + 1] * rtu_In1[i_0 + 3] + rty_Out2[3 * i
        + i_0];

      /* Product: '<S2>/Product1' */
      rty_Out4[tmp] = rtu_In4[(i << 1) + 1] * rtu_In3[i_0 + 3] + rty_Out4[3 * i
        + i_0];
    }
  }

  for (i = 0; i < 6; i++) {
    /* Bias: '<S2>/Bias' incorporates:
     *  Gain: '<S2>/Gain'
     */
    rty_Out1[i] = -0.3 * rtu_In1[i] + 0.5;

    /* Bias: '<S2>/Bias1' incorporates:
     *  Gain: '<S2>/Gain1'
     */
    rty_Out3[i] = -0.3 * rtu_In3[i] + 0.5;
  }
}

In the optimized code, blocks with the same output size execute together. The two sets of Gain and Bias blocks each produce an output of 6 elements, so they execute together in one loop. The two Product blocks each produce an output of 9 elements, so they execute together. Fusing the for loops also lets the code generator store the value of the expression 3 * i + i_0 in the temporary variable tmp_0 and use it in both Product computations. This reuse also improves execution efficiency.
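To isolate the loop fusion idea from the rest of the listing, here is a minimal hand-written sketch of the two Gain and Bias pairs. It reuses the constants -0.3 and 0.5 from the generated code. The function names and the macro N are hypothetical and do not come from the code generator.

```c
#define N 6  /* element count of each Gain/Bias output in the listing */

/* Default order: each Gain+Bias pair runs in its own loop. */
void gain_bias_separate(const double in1[N], const double in3[N],
                        double out1[N], double out3[N])
{
  int i;

  for (i = 0; i < N; i++) {
    out1[i] = -0.3 * in1[i] + 0.5;
  }

  for (i = 0; i < N; i++) {
    out3[i] = -0.3 * in3[i] + 0.5;
  }
}

/* Reordered: the pairs execute back to back, so their loops, which
 * have the same trip count, can be fused into one. */
void gain_bias_fused(const double in1[N], const double in3[N],
                     double out1[N], double out3[N])
{
  int i;

  for (i = 0; i < N; i++) {
    out1[i] = -0.3 * in1[i] + 0.5;
    out3[i] = -0.3 * in3[i] + 0.5;
  }
}
```

Both functions write the same values to out1 and out3. The fused version pays the loop counter and branch overhead once instead of twice.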

Buffer Reuse for the Input, Output, and State of Unit Delay Blocks

The subsystem RegionScheduling shows how the code generator reorders block operations to enable buffer reuse for the input, output, and state of Unit Delay blocks. When computation is part of separate regions that connect only through Unit Delay blocks, the code generator can change the block execution order so that the downstream regions execute before the upstream regions. This execution order enables maximum reuse of Unit Delay block states and input and output variables. Set the Optimize block order in the generated code parameter to Off and build the model.

### Starting build procedure for model: rtwdemo_optimizeblockorder
### Successful completion of build procedure for model: rtwdemo_optimizeblockorder

View the generated code without the optimization. Code for the RegionScheduling subsystem:

/* Output and update for atomic system: '<Root>/RegionScheduling' */
static void RegionScheduling(const real_T rtu_In1[6], const real_T rtu_In2[6],
  real_T rty_Out1[6], rtDW_RegionScheduling *localDW)
{
  int32_T i;
  real_T rtb_Sum;
  for (i = 0; i < 6; i++) {
    /* Sum: '<S3>/Sum' incorporates:
     *  UnitDelay: '<S3>/Delay'
     *  UnitDelay: '<S3>/UnitDelay'
     */
    rtb_Sum = localDW->Delay_DSTATE[i] + localDW->UnitDelay_DSTATE[i];

    /* UnitDelay: '<S3>/UnitDelay2' */
    rty_Out1[i] = localDW->UnitDelay2_DSTATE[i];

    /* Update for UnitDelay: '<S3>/Delay' incorporates:
     *  Bias: '<S3>/Bias'
     */
    localDW->Delay_DSTATE[i] = rtu_In1[i] + 3.0;

    /* Update for UnitDelay: '<S3>/UnitDelay' incorporates:
     *  Gain: '<S3>/Gain'
     */
    localDW->UnitDelay_DSTATE[i] = 2.0 * rtu_In2[i];

    /* Update for UnitDelay: '<S3>/UnitDelay2' */
    localDW->UnitDelay2_DSTATE[i] = rtb_Sum;
  }
}

With the default execution order, the generated code contains the extra, temporary variable rtb_Sum and a data copy.

Generate code with the optimization. Set the Optimize block order in the generated code parameter to Improved Execution Speed and build the model.

### Starting build procedure for model: rtwdemo_optimizeblockorder
### Successful completion of build procedure for model: rtwdemo_optimizeblockorder

View the generated code with the optimization.

/* Output and update for atomic system: '<Root>/RegionScheduling' */
static void RegionScheduling(const real_T rtu_In1[6], const real_T rtu_In2[6],
  real_T rty_Out1[6], rtDW_RegionScheduling *localDW)
{
  int32_T i;
  for (i = 0; i < 6; i++) {
    /* UnitDelay: '<S3>/UnitDelay2' */
    rty_Out1[i] = localDW->UnitDelay2_DSTATE[i];

    /* Update for UnitDelay: '<S3>/UnitDelay2' incorporates:
     *  Sum: '<S3>/Sum'
     *  UnitDelay: '<S3>/Delay'
     *  UnitDelay: '<S3>/UnitDelay'
     */
    localDW->UnitDelay2_DSTATE[i] = localDW->Delay_DSTATE[i] +
      localDW->UnitDelay_DSTATE[i];

    /* Update for UnitDelay: '<S3>/Delay' incorporates:
     *  Bias: '<S3>/Bias'
     */
    localDW->Delay_DSTATE[i] = rtu_In1[i] + 3.0;

    /* Update for UnitDelay: '<S3>/UnitDelay' incorporates:
     *  Gain: '<S3>/Gain'
     */
    localDW->UnitDelay_DSTATE[i] = 2.0 * rtu_In2[i];
  }
}

In the optimized code, the blocks in Regions 3, 2, and 1 execute in that order. With that execution order, the generated code does not contain the temporary variable rtb_Sum and the corresponding data copy.
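The effect of running the downstream region first can be sketched with a scalar stand-in for the subsystem. This is a simplified illustration, not generated code. The struct, its field names, and both function names are hypothetical, and each state holds one value instead of six.

```c
/* Hypothetical scalar stand-in for the subsystem's three delay states. */
typedef struct {
  double delay;        /* fed by In1 + 3.0     */
  double unit_delay;   /* fed by 2.0 * In2     */
  double unit_delay2;  /* fed by the Sum block */
} DelayStates;

/* Default order: Sum runs first, so its result waits in a temporary
 * until the downstream delay state can be updated. */
double step_default(DelayStates *s, double in1, double in2)
{
  double sum = s->delay + s->unit_delay;  /* extra temporary */
  double out = s->unit_delay2;

  s->delay = in1 + 3.0;
  s->unit_delay = 2.0 * in2;
  s->unit_delay2 = sum;                   /* extra data copy */
  return out;
}

/* Reordered: the downstream delay is read and refilled before the
 * upstream states are overwritten, so no temporary is needed. */
double step_reordered(DelayStates *s, double in1, double in2)
{
  double out = s->unit_delay2;

  s->unit_delay2 = s->delay + s->unit_delay;
  s->delay = in1 + 3.0;
  s->unit_delay = 2.0 * in2;
  return out;
}
```

Starting from identical states and fed identical inputs, the two step functions return the same output at every step. Only the order of the state updates differs.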

Eliminate Data Copies for Blocks That Perform Inplace Operations

The subsystem InplaceScheduling shows how the code generator reorders block operations to eliminate data copies for blocks that perform inplace operations. In the Configuration Parameters dialog box, set the Optimize block order in the generated code parameter to Off and build the model.

### Starting build procedure for model: rtwdemo_optimizeblockorder
### Successful completion of build procedure for model: rtwdemo_optimizeblockorder

View the generated code without the optimization. Code for the InplaceScheduling subsystem:

/* Output and update for atomic system: '<Root>/InplaceScheduling' */
static void InplaceScheduling(const real_T rtu_In1[6], real_T rtu_In3, const
  real_T rtu_In2[6], real_T rty_Out2[6], real_T rty_Out1[6],
  rtDW_InplaceScheduling *localDW)
{
  int32_T idx1;
  int32_T idx2;
  real_T acc;
  int32_T k;
  real_T rtb_Max[6];
  for (idx2 = 0; idx2 < 6; idx2++) {
    /* Sum: '<S1>/Sum2x3' incorporates:
     *  UnitDelay: '<S1>/Unit Delay'
     */
    localDW->UnitDelay_DSTATE[idx2] += rtu_In1[idx2];

    /* MinMax: '<S1>/Max' */
    if (2.0 > localDW->UnitDelay_DSTATE[idx2]) {
      rtb_Max[idx2] = 2.0;
    } else {
      rtb_Max[idx2] = localDW->UnitDelay_DSTATE[idx2];
    }

    /* End of MinMax: '<S1>/Max' */
  }

  /* S-Function (sdsp2norm2): '<S1>/Normalization' */
  idx2 = 0;
  acc = rtb_Max[0] * rtb_Max[0];
  idx1 = 1;
  for (k = 0; k < 5; k++) {
    acc += rtb_Max[idx1] * rtb_Max[idx1];
    idx1++;
  }

  acc = 1.0 / (sqrt(acc) + 1.0E-10);
  for (k = 0; k < 6; k++) {
    rty_Out1[idx2] = rtb_Max[idx2] * acc;
    idx2++;

    /* Product: '<S1>/Product' incorporates:
     *  Bias: '<S1>/Bias'
     */
    rty_Out2[k] = (rtu_In3 + 1.0) * localDW->UnitDelay_DSTATE[k];

    /* Switch: '<S1>/Switch' */
    if (rtu_In2[k] > 0.0) {
      /* Update for UnitDelay: '<S1>/Unit Delay' */
      localDW->UnitDelay_DSTATE[k] = 0.0;
    } else {
      /* Update for UnitDelay: '<S1>/Unit Delay' */
      localDW->UnitDelay_DSTATE[k] = rtb_Max[k];
    }

    /* End of Switch: '<S1>/Switch' */
  }

  /* End of S-Function (sdsp2norm2): '<S1>/Normalization' */
}

With the default execution order, the Max block executes before the Product block. To hold the Sum block output, the generated code contains two variables, UnitDelay_DSTATE and rtb_Max.

Generate code with the optimization. Set the Optimize block order in the generated code parameter to Improved Execution Speed and build the model.

### Starting build procedure for model: rtwdemo_optimizeblockorder
### Successful completion of build procedure for model: rtwdemo_optimizeblockorder

View the generated code with the optimization.

/* Output and update for atomic system: '<Root>/InplaceScheduling' */
static void InplaceScheduling(const real_T rtu_In1[6], real_T rtu_In3, const
  real_T rtu_In2[6], real_T rty_Out2[6], real_T rty_Out1[6],
  rtDW_InplaceScheduling *localDW)
{
  int32_T idx1;
  real_T acc;
  int32_T k;
  for (k = 0; k < 6; k++) {
    /* Sum: '<S1>/Sum2x3' incorporates:
     *  UnitDelay: '<S1>/Unit Delay'
     */
    localDW->UnitDelay_DSTATE[k] += rtu_In1[k];

    /* Product: '<S1>/Product' incorporates:
     *  Bias: '<S1>/Bias'
     */
    rty_Out2[k] = (rtu_In3 + 1.0) * localDW->UnitDelay_DSTATE[k];

    /* MinMax: '<S1>/Max' */
    if (2.0 > localDW->UnitDelay_DSTATE[k]) {
      localDW->UnitDelay_DSTATE[k] = 2.0;
    }

    /* End of MinMax: '<S1>/Max' */
  }

  /* S-Function (sdsp2norm2): '<S1>/Normalization' */
  acc = localDW->UnitDelay_DSTATE[0] * localDW->UnitDelay_DSTATE[0];
  idx1 = 1;
  for (k = 0; k < 5; k++) {
    acc += localDW->UnitDelay_DSTATE[idx1] * localDW->UnitDelay_DSTATE[idx1];
    idx1++;
  }

  acc = 1.0 / (sqrt(acc) + 1.0E-10);
  rty_Out1[0] = localDW->UnitDelay_DSTATE[0] * acc;
  idx1 = 1;
  for (k = 0; k < 5; k++) {
    rty_Out1[idx1] = localDW->UnitDelay_DSTATE[idx1] * acc;
    idx1++;
  }

  /* End of S-Function (sdsp2norm2): '<S1>/Normalization' */

  /* Update for UnitDelay: '<S1>/Unit Delay' */
  for (k = 0; k < 6; k++) {
    /* Switch: '<S1>/Switch' */
    if (rtu_In2[k] > 0.0) {
      localDW->UnitDelay_DSTATE[k] = 0.0;
    }

    /* End of Switch: '<S1>/Switch' */
  }

  /* End of Update for UnitDelay: '<S1>/Unit Delay' */
}

The optimized code does not contain the variable rtb_Max or the data copy. The generated code contains one variable, UnitDelay_DSTATE, for holding the Sum block output. The Product block reads from UnitDelay_DSTATE and the Max block reads from and writes to UnitDelay_DSTATE.
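The reason the execution order matters here can be shown with a reduced sketch that keeps only the Max and Product operations. It assumes the state buffer already holds the Sum block output, and the parameter scale stands in for rtu_In3 + 1.0. The Normalization and Switch logic is omitted, and all names are hypothetical.

```c
#define WIDTH 6  /* element count of the signals in the listing */

/* Default order: Max runs before Product. Product still needs the
 * unclamped values in state[], so Max must write to a second buffer. */
void max_then_product(const double state[WIDTH], double scale,
                      double clamped[WIDTH], double product[WIDTH])
{
  int k;

  for (k = 0; k < WIDTH; k++) {
    clamped[k] = (2.0 > state[k]) ? 2.0 : state[k];
  }

  for (k = 0; k < WIDTH; k++) {
    product[k] = scale * state[k];
  }
}

/* Reordered: Product consumes state[k] first. Max is then the last
 * reader of state[k] and can overwrite it in place. */
void product_then_max_inplace(double state[WIDTH], double scale,
                              double product[WIDTH])
{
  int k;

  for (k = 0; k < WIDTH; k++) {
    product[k] = scale * state[k];
    if (2.0 > state[k]) {
      state[k] = 2.0;
    }
  }
}
```

After the reordered version runs, state holds the same values that the default version writes to clamped, and the product outputs match. The second buffer and the copy into it are no longer needed.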

When it implements buffer reuse, the code generator does not violate user-specified block priorities.