# Analyze Data Using MDF Datastore and Tall Arrays

This example shows how to work with a big data set using tall arrays and the MDF datastore feature. Tall arrays are commonly used to perform calculations on different types of data that do not fit in memory.

This example first operates on a small subset of data and then scales up to analyze the entire data set. Although the data set used here might not represent the actual size in real-world applications, the same analysis technique can scale up further to work on data sets so large that they cannot be read into memory.

To learn more about tall arrays, see the example Analyze Big Data in MATLAB Using Tall Arrays.

### Introduction to Tall Arrays

Tall arrays and tall tables are used to work with out-of-memory data that has any number of rows. Using tall arrays and tables, you can work with large data sets in a manner similar to in-memory MATLAB arrays.

The difference is that tall arrays typically remain unevaluated until you request that the calculations be performed. This deferred evaluation enables MATLAB to combine the queued calculations where possible and take the minimum number of passes through the data.
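The deferred evaluation described above can be illustrated with a minimal sketch. It assumes a small in-memory vector as a stand-in for out-of-memory data; the variable names are illustrative only and are not part of this example's data set.

```matlab
% Small in-memory vector used as a stand-in for out-of-memory data (assumption).
x = tall((1:10)');

% These calls only queue work; the results are unevaluated tall scalars.
total = sum(x);
largest = max(x);
stillDeferred = istall(total);   % true while the result is unevaluated

% gather evaluates both queued operations together and returns in-memory values.
[total, largest] = gather(total, largest);
```

Requesting both results in a single `gather` call gives MATLAB the chance to share passes over the data between the two calculations.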

### Create an MDF Datastore

An MDF datastore can be used to read and process homogeneous data stored in multiple MDF-files as a single entity. If the data set is too large to fit in memory, a datastore also makes it possible to work with the data set in smaller blocks that individually fit in memory. Tall arrays extend this capability further: they let you work with out-of-memory data backed by a datastore using common functions.
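Block-wise reading, as described above, might look like the following sketch. To keep it runnable without an MDF-file, it uses `arrayDatastore` as a stand-in for an MDF datastore; this substitution, the block size of 4, and the variable names are all assumptions made for illustration.

```matlab
% arrayDatastore stands in for mdfDatastore here (assumption), so that the
% sketch runs without an MDF-file or additional toolboxes.
data = (1:10)';
ds = arrayDatastore(data, "ReadSize", 4, "OutputType", "same");

runningSum = 0;
numBlocks = 0;
while hasdata(ds)
    block = read(ds);                      % at most 4 rows per read
    runningSum = runningSum + sum(block);  % process one block at a time
    numBlocks = numBlocks + 1;
end
```

Only one block is held in memory at a time, which is the property that tall arrays build on when they are backed by a datastore.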

Create an MDF datastore using the `mdfDatastore` function by selecting the MDF-file `EngineData_MDF_TallArray.mf4` in the current working directory. This file contains time-stamped data logged from a Simulink model representing an engine plant and controller connected to a dynamometer.

`mds = mdfDatastore("EngineData_MDF_TallArray.mf4")`
```
mds =
  MDFDatastore with properties:

   DataStore Details
                       Files: {
                              ' ...\Documents\MATLAB\Examples\vnt-ex08773747\EngineData_MDF_TallArray.mf4'
                              }
               ChannelGroups:
                              ChannelGroupNumber    AcquisitionName     Comment      ... and 4 more columns
                              __________________    _______________    __________

                                      1               {1×1 cell}       {1×1 cell}

                    Channels:
                              ChannelGroupNumber       ChannelName       DisplayName    ... and 10 more columns
                              __________________    _________________    ___________

                                      1             {'EngineSpeed'  }        ''
                                      1             {'TorqueCommand'}        ''
                                      1             {'EngineTorque' }        ''
                              ... and 1 more rows

   Options
        SelectedChannelNames: {
                              'EngineSpeed';
                              'TorqueCommand';
                              'EngineTorque'
                               ... and 1 more
                              }
  SelectedChannelGroupNumber: 1
                    ReadSize: 'file'
                  Conversion: Numeric
```

You can further configure the MDF datastore to control which data is read from the MDF-file and how it is read. By default, the first channel group is selected and all channels from the group are read.

`mds.SelectedChannelGroupNumber`
```
ans = 1
```
`mds.SelectedChannelNames`
```
ans = 4×1 string
    "EngineSpeed"
    "TorqueCommand"
    "EngineTorque"
    "t"
```

Configure the MDF datastore to select only three variables of interest: `EngineSpeed`, `TorqueCommand`, and `EngineTorque`.

`mds.SelectedChannelNames = ["EngineSpeed", "TorqueCommand", "EngineTorque"]`
```
mds =
  MDFDatastore with properties:

   DataStore Details
                       Files: {
                              ' ...\Documents\MATLAB\Examples\vnt-ex08773747\EngineData_MDF_TallArray.mf4'
                              }
               ChannelGroups:
                              ChannelGroupNumber    AcquisitionName     Comment      ... and 4 more columns
                              __________________    _______________    __________

                                      1               {1×1 cell}       {1×1 cell}

                    Channels:
                              ChannelGroupNumber       ChannelName       DisplayName    ... and 10 more columns
                              __________________    _________________    ___________

                                      1             {'EngineSpeed'  }        ''
                                      1             {'TorqueCommand'}        ''
                                      1             {'EngineTorque' }        ''
                              ... and 1 more rows

   Options
        SelectedChannelNames: {
                              'EngineSpeed';
                              'TorqueCommand';
                              'EngineTorque'
                              }
  SelectedChannelGroupNumber: 1
                    ReadSize: 'file'
                  Conversion: Numeric
```

Preview the selected data using the `preview` function.

`preview(mds)`
```
ans=8×3 timetable
         Time         EngineSpeed    TorqueCommand    EngineTorque
    ______________    ___________    _____________    ____________

    0 sec                       0                0          47.153
    0 sec                2.37e-26                0          47.153
    1.47e-05 sec          0.11056           47.158          47.158
    8.85e-05 sec          0.66312           48.708          48.708
    0.00010107 sec        0.75762            49.77           49.77
    0.00010107 sec        0.75762            49.77           49.77
    0.0001405 sec           1.053           39.967          39.967
    0.00017993 sec         1.3482           23.143          23.143
```

### Create Tall Array

Tall arrays are similar to in-memory MATLAB arrays, except that they can have any number of rows. Because the MDF datastore `mds` contains time-stamped tabular data, the `tall` function returns a tall timetable containing data from the datastore.

`tt = tall(mds)`
```
Starting parallel pool (parpool) using the 'local' profile ...
Connected to the parallel pool (number of workers: 6).

tt =

  M×3 tall timetable

         Time         EngineSpeed    TorqueCommand    EngineTorque
    ______________    ___________    _____________    ____________

    0 sec                       0                0          47.153
    0 sec                2.37e-26                0          47.153
    1.47e-05 sec          0.11056           47.158          47.158
    8.85e-05 sec          0.66312           48.708          48.708
    0.00010107 sec        0.75762            49.77           49.77
    0.00010107 sec        0.75762            49.77           49.77
    0.0001405 sec           1.053           39.967          39.967
    0.00017993 sec         1.3482           23.143          23.143
          :                :               :               :
          :                :               :               :
```

The display includes the first several rows of data. The timetable size may display as `M×3` to indicate that the number of rows is not yet known to MATLAB.

### Perform Calculations on Tall Array

You can work with tall arrays and tall tables in much the same way as in-memory MATLAB arrays and tables. However, MATLAB does not perform most operations on tall arrays immediately. Instead, it defers the actual computations until the output is requested.

It is common to work with unevaluated tall arrays and request output only when required. MATLAB does not know the content or size of an unevaluated tall array until you request that it be evaluated and displayed.

Calculate median, minimum, and maximum values of the `TorqueCommand` variable. Note that the results are not immediately evaluated.

`medianTorqueCommand = median(tt.TorqueCommand)`
```
medianTorqueCommand =

  tall double

    ?

Preview deferred. Learn more.
```
`minTorqueCommand = min(tt.TorqueCommand)`
```
minTorqueCommand =

  tall double

    ?

Preview deferred. Learn more.
```
`maxTorqueCommand = max(tt.TorqueCommand)`
```
maxTorqueCommand =

  tall double

    ?

Preview deferred. Learn more.
```

### Gather Results into Workspace

The `gather` function forces evaluation of all queued operations and brings the resulting output back into memory.

Perform the queued operations, `median`, `min`, `max`, and evaluate the answers. If the calculation requires several passes through the data, MATLAB determines the minimum number of passes to save execution time and displays this information at the command line.

`[medianTorqueCommand, minTorqueCommand, maxTorqueCommand] = gather(medianTorqueCommand, minTorqueCommand, maxTorqueCommand)`
```
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 4: Completed in 6.7 sec
- Pass 2 of 4: Completed in 0.73 sec
- Pass 3 of 4: Completed in 1.3 sec
- Pass 4 of 4: Completed in 0.62 sec
Evaluation completed in 12 sec
```
```
medianTorqueCommand = 116.2799
```
```
minTorqueCommand = 0
```
```
maxTorqueCommand = 232.9807
```

### Select Subset of Tall Array

Use `head` to select a subset of 10,000 rows from the data for prototyping code before scaling to the full data set.

`ttSubset = head(tt, 10000)`
```
ttSubset =

  10,000×3 tall timetable

         Time         EngineSpeed    TorqueCommand    EngineTorque
    ______________    ___________    _____________    ____________

    0 sec                       0                0          47.153
    0 sec                2.37e-26                0          47.153
    1.47e-05 sec          0.11056           47.158          47.158
    8.85e-05 sec          0.66312           48.708          48.708
    0.00010107 sec        0.75762            49.77           49.77
    0.00010107 sec        0.75762            49.77           49.77
    0.0001405 sec           1.053           39.967          39.967
    0.00017993 sec         1.3482           23.143          23.143
          :                :               :               :
          :                :               :               :
```
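The prototyping pattern with `head` can be tried on synthetic data. The sketch below assumes an in-memory column vector in place of the engine data; the names are illustrative.

```matlab
% Synthetic tall column vector standing in for the full data set (assumption).
x = tall((1:1000)');

% head keeps only the first k rows; the result is still a tall array.
firstRows = head(x, 5);
isStillTall = istall(firstRows);

% Prototype a calculation on the subset, then gather it into memory.
subsetMean = gather(mean(firstRows));   % mean of the values 1 through 5
```

Because the subset is itself a tall array, code written against it can later be pointed at the full tall array without changes.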

### Remove Duplicate Rows in Tall Array

Timetable rows are duplicates if they have the same row times and the same data values. Use the `unique` function to remove duplicate rows from the subset tall timetable.

`ttSubset = unique(ttSubset)`
```
ttSubset =

  9,968×3 tall timetable

         Time         EngineSpeed    TorqueCommand    EngineTorque
    ______________    ___________    _____________    ____________

    0 sec                       0                0          47.153
    0 sec                2.37e-26                0          47.153
    1.47e-05 sec          0.11056           47.158          47.158
    8.85e-05 sec          0.66312           48.708          48.708
    0.00010107 sec        0.75762            49.77           49.77
    0.0001405 sec           1.053           39.967          39.967
    0.00017993 sec         1.3482           23.143          23.143
    0.00037708 sec         2.8228           23.143       -0.021071
          :                :               :               :
          :                :               :               :
```
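The duplicate rule stated above (same row time and same data values) can be checked on a small in-memory timetable. The data below is hypothetical and chosen only to show which rows `unique` keeps.

```matlab
% Hypothetical four-row timetable: rows 1 and 2 are exact duplicates,
% while rows 3 and 4 share a row time but differ in value.
Time = seconds([0; 0; 1; 1]);
Speed = [10; 10; 20; 25];
T = timetable(Time, Speed);

% unique drops only the exact duplicate; rows that merely share a
% row time are both kept.
U = unique(T);
numRows = height(U);
```

This matches the displayed data in this example, where the two rows at `0 sec` both survive because their `EngineSpeed` values differ.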

### Calculate Engine Power

Calculate engine power in kilowatts (kW) with `EngineSpeed` and `EngineTorque` using the formula $P\,[\mathrm{kW}] = \frac{\pi \cdot N\,[\mathrm{rpm}] \cdot T\,[\mathrm{Nm}]}{30 \cdot 1000}$. Save the results to a new variable named `EnginePower` in the tall timetable.

`ttSubset.EnginePower = (pi * ttSubset.EngineSpeed .* ttSubset.EngineTorque) / (30 * 1000)`
```
ttSubset =

  9,968×4 tall timetable

         Time         EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    ______________    ___________    _____________    ____________    ___________

    0 sec                       0                0          47.153              0
    0 sec                2.37e-26                0          47.153     1.1703e-28
    1.47e-05 sec          0.11056           47.158          47.158     0.00054599
    8.85e-05 sec          0.66312           48.708          48.708      0.0033824
    0.00010107 sec        0.75762            49.77           49.77      0.0039487
    0.0001405 sec           1.053           39.967          39.967      0.0044072
    0.00017993 sec         1.3482           23.143          23.143      0.0032675
    0.00037708 sec         2.8228           23.143       -0.021071    -6.2287e-06
          :                :               :               :              :
          :                :               :               :              :
```
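As a brief check on the power formula, it can be derived from the standard relation between mechanical power, torque, and angular speed. The symbol $\omega$ for angular speed in rad/s is introduced here for illustration and does not appear in the original formula.

$$
\omega = \frac{2\pi N}{60}, \qquad
P\,[\mathrm{kW}] = \frac{T \cdot \omega}{1000} = \frac{T \cdot 2\pi N / 60}{1000} = \frac{\pi \cdot N \cdot T}{30 \cdot 1000}
$$

Applying it to one row reported in this example's results, with $N = 750$ rpm and $T = 78.052$ Nm:

$$
P = \frac{\pi \cdot 750 \cdot 78.052}{30 \cdot 1000} \approx 6.130\ \mathrm{kW}
$$

This agrees with the `EnginePower` value of 6.1302 shown for that row.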

The `topkrows` function for tall arrays returns the top `k` rows in sorted order. Obtain the top 20 rows with maximum `EnginePower` values.

`maxEnginePower = topkrows(ttSubset, 20, "EnginePower")`
```
maxEnginePower =

  20×4 tall timetable

      Time       EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    _________    ___________    _____________    ____________    ___________

    15.17 sec            750           78.052          78.052         6.1302
    15.16 sec            750           77.841          77.841         6.1136
    15.15 sec            750           77.556          77.556         6.0912
    15.14 sec            750           77.326          77.326         6.0732
    15.18 sec            750           77.277          77.277         6.0693
    15.13 sec            750           77.157          77.157         6.0599
    15.12 sec            750           77.082          77.082          6.054
    15.11 sec            750           77.067          77.075         6.0534
        :                 :               :               :              :
        :                 :               :               :              :
```
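The behavior of `topkrows` can be seen on a small in-memory table. The table contents below are hypothetical, and the sketch uses a regular table rather than a tall timetable so that it runs on its own.

```matlab
% Hypothetical in-memory table of power readings.
Sample = (1:5)';
Power = [3.2; 6.1; 0.5; 5.9; 4.4];
T = table(Sample, Power);

% Top 2 rows ranked by Power; the default sort direction is descending,
% so the largest values come first.
top2 = topkrows(T, 2, "Power");
```

The returned rows are already sorted, which is why the results in this example list the highest `EnginePower` value first.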

Call the `gather` function to execute all queued operations and collect the results into memory.

`[ttSubset, maxEnginePower] = gather(ttSubset, maxEnginePower)`
```
ttSubset=9968×4 timetable
         Time         EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    ______________    ___________    _____________    ____________    ___________

    0 sec                       0                0          47.153              0
    0 sec                2.37e-26                0          47.153     1.1703e-28
    1.47e-05 sec          0.11056           47.158          47.158     0.00054599
    8.85e-05 sec          0.66312           48.708          48.708      0.0033824
    0.00010107 sec        0.75762            49.77           49.77      0.0039487
    0.0001405 sec           1.053           39.967          39.967      0.0044072
    0.00017993 sec         1.3482           23.143          23.143      0.0032675
    0.00037708 sec         2.8228           23.143       -0.021071    -6.2287e-06
    0.00076951 sec         5.7492               15       -0.042938    -2.5851e-05
    0.0014014 sec          10.437               15       -0.078013    -8.5265e-05
    0.0023449 sec          17.382               15        -0.13009    -0.00023679
    0.0036773 sec          27.079               15        -0.20304    -0.00057575
    0.0054808 sec              40               15        -0.30067     -0.0012595
    0.0072843 sec          52.691               15        -0.39703     -0.0021907
    0.01 sec               71.373               15        -0.53973     -0.0040341
    0.013562 sec           95.119               15          51.176        0.50976
      ⋮
```
```
maxEnginePower=20×4 timetable
      Time       EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    _________    ___________    _____________    ____________    ___________

    15.17 sec            750           78.052          78.052         6.1302
    15.16 sec            750           77.841          77.841         6.1136
    15.15 sec            750           77.556          77.556         6.0912
    15.14 sec            750           77.326          77.326         6.0732
    15.18 sec            750           77.277          77.277         6.0693
    15.13 sec            750           77.157          77.157         6.0599
    15.12 sec            750           77.082          77.082          6.054
    15.11 sec            750           77.067          77.075         6.0534
    15.1 sec             750           77.067          77.067         6.0528
    15.09 sec            750           77.059          77.059         6.0522
    15.08 sec            750           77.051          77.051         6.0516
    15.07 sec            750           77.042          77.042         6.0509
    15.06 sec            750           77.034          77.034         6.0502
    15.05 sec            750           77.025          77.025         6.0495
    15.04 sec            750           77.016          77.016         6.0488
    15.03 sec            750           77.006          77.006         6.0481
      ⋮
```

### Visualize Data in Tall Array

Visualize the `EngineTorque` and `EnginePower` signals over time in a plot with two y-axes.

```
figure
yyaxis left
plot(ttSubset.Time, ttSubset.EngineTorque)
title("Engine Torque and Engine Power Over Time")
xlabel("Time")
ylabel("Engine Torque [Nm]")
yyaxis right
plot(ttSubset.Time, ttSubset.EnginePower)
ylabel("Engine Power [kW]")
```

### Scale to Entire Data Set

Instead of using the smaller subset returned by `head`, scale up and apply the same steps to the entire data set by using the complete tall timetable.

`tt = tall(mds)`
```
tt =

  M×3 tall timetable

         Time         EngineSpeed    TorqueCommand    EngineTorque
    ______________    ___________    _____________    ____________

    0 sec                       0                0          47.153
    0 sec                2.37e-26                0          47.153
    1.47e-05 sec          0.11056           47.158          47.158
    8.85e-05 sec          0.66312           48.708          48.708
    0.00010107 sec        0.75762            49.77           49.77
    0.00010107 sec        0.75762            49.77           49.77
    0.0001405 sec           1.053           39.967          39.967
    0.00017993 sec         1.3482           23.143          23.143
          :                :               :               :
          :                :               :               :
```

First, remove duplicate rows from the tall timetable.

`tt = unique(tt)`
```
tt =

  M×3 tall timetable

    Time    EngineSpeed    TorqueCommand    EngineTorque
    ____    ___________    _____________    ____________

     ?           ?               ?               ?
     ?           ?               ?               ?
     ?           ?               ?               ?
     :           :               :               :
     :           :               :               :

Preview deferred. Learn more.
```

Next, calculate engine power and obtain the top 20 rows with maximum `EnginePower` values.

`tt.EnginePower = (pi * tt.EngineSpeed .* tt.EngineTorque) / (30 * 1000)`
```
tt =

  M×4 tall timetable

    Time    EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    ____    ___________    _____________    ____________    ___________

     ?           ?               ?               ?               ?
     ?           ?               ?               ?               ?
     ?           ?               ?               ?               ?
     :           :               :               :               :
     :           :               :               :               :

Preview deferred. Learn more.
```
`maxEnginePower = topkrows(tt, 20, "EnginePower")`
```
maxEnginePower =

  M×4 tall timetable

    Time    EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    ____    ___________    _____________    ____________    ___________

     ?           ?               ?               ?               ?
     ?           ?               ?               ?               ?
     ?           ?               ?               ?               ?
     :           :               :               :               :
     :           :               :               :               :

Preview deferred. Learn more.
```
`[tt, maxEnginePower] = gather(tt, maxEnginePower)`
```
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: 0% complete
Evaluation 0% complete
- Pass 1 of 1: Completed in 1.3 sec
Evaluation completed in 1.9 sec
```
```
tt=359326×4 timetable
         Time         EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    ______________    ___________    _____________    ____________    ___________

    0 sec                       0                0          47.153              0
    0 sec                2.37e-26                0          47.153     1.1703e-28
    1.47e-05 sec          0.11056           47.158          47.158     0.00054599
    8.85e-05 sec          0.66312           48.708          48.708      0.0033824
    0.00010107 sec        0.75762            49.77           49.77      0.0039487
    0.0001405 sec           1.053           39.967          39.967      0.0044072
    0.00017993 sec         1.3482           23.143          23.143      0.0032675
    0.00037708 sec         2.8228           23.143       -0.021071    -6.2287e-06
    0.00076951 sec         5.7492               15       -0.042938    -2.5851e-05
    0.0014014 sec          10.437               15       -0.078013    -8.5265e-05
    0.0023449 sec          17.382               15        -0.13009    -0.00023679
    0.0036773 sec          27.079               15        -0.20304    -0.00057575
    0.0054808 sec              40               15        -0.30067     -0.0012595
    0.0072843 sec          52.691               15        -0.39703     -0.0021907
    0.01 sec               71.373               15        -0.53973     -0.0040341
    0.013562 sec           95.119               15          51.176        0.50976
      ⋮
```
```
maxEnginePower=20×4 timetable
       Time       EngineSpeed    TorqueCommand    EngineTorque    EnginePower
    __________    ___________    _____________    ____________    ___________

    3819.8 sec           5000           217.53          217.53          113.9
    3819.8 sec           5000           217.53          217.53          113.9
    3819.8 sec           5000           217.53          217.53          113.9
    3819.8 sec           5000           217.53          217.53          113.9
    3819.8 sec           5000           217.53          217.53          113.9
    3819.9 sec           5000           217.53          217.53          113.9
    3819.9 sec           5000           217.53          217.53          113.9
    3819.9 sec           5000           217.53          217.53          113.9
    3819.9 sec           5000           217.52          217.52         113.89
    3819.9 sec           5000           217.52          217.52         113.89
    3820 sec             5000           217.52          217.52         113.89
    3820.1 sec           5000           217.52          217.52         113.89
    3820.2 sec           5000           217.52          217.52         113.89
    3820.3 sec           5000           217.52          217.52         113.89
    3820.4 sec           5000           217.52          217.52         113.89
    3820.5 sec           5000           217.52          217.52         113.89
      ⋮
```

Finally, visualize the `EngineTorque` and `EnginePower` signals over time in a plot with two y-axes.

```
figure
yyaxis left
plot(tt.Time, tt.EngineTorque)
title("Engine Torque and Engine Power Over Time")
xlabel("Time")
ylabel("Engine Torque [Nm]")
yyaxis right
plot(tt.Time, tt.EnginePower)
ylabel("Engine Power [kW]")
```

### Close MDF-File

Close access to the MDF-file by clearing the MDF datastore variable from the workspace.

`clear mds`

