Tall arrays are used to work with out-of-memory data that is backed
by a `datastore`. Datastores enable you to work with large data sets
in small chunks that individually fit in memory, instead of loading
the entire data set into memory at once. Tall arrays extend this
capability to enable you to work with out-of-memory data using common
functions.

Since the data is not loaded into memory all at once, tall arrays
can be arbitrarily large in the first dimension (that is, they can
have any number of rows). Instead of writing special code that takes
into account the huge size of the data, such as with techniques like
MapReduce, tall arrays let you work with large data sets in an intuitive
manner that is similar to the way you would work with in-memory MATLAB® arrays.
Many core operators and functions work the same with tall arrays as
they do with in-memory arrays. MATLAB works with small chunks
of the data at a time, handling all of the data chunking and processing
in the background, so that common expressions, such as `A+B`,
work with big data sets.
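
As a minimal sketch of this idea, the following uses two small in-memory vectors as stand-ins for out-of-memory data. The variable names and values are illustrative assumptions, not part of the airline example used later on this page.

```matlab
% Small in-memory vectors stand in for large data (illustrative values).
A = tall((1:5)');
B = tall((10:10:50)');

% The familiar + operator works on tall arrays; the result is also tall
% and is not computed yet.
C = A + B;

% gather evaluates the expression and returns an in-memory array.
c = gather(C);
```

The expression `A + B` is written exactly as it would be for in-memory arrays; only the construction with `tall` and the final `gather` differ.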

Unlike in-memory arrays, tall arrays typically remain unevaluated
until you request that the calculations be performed using the `gather` function.
This *deferred evaluation* allows you to work quickly
with large data sets. When you eventually request output using `gather`, MATLAB combines
the queued calculations where possible and takes the minimum number
of passes through the data. The number of passes through the data
greatly affects execution time, so it is recommended that you request
output only when necessary.

Since `gather` returns results as in-memory MATLAB arrays,
standard memory considerations apply. MATLAB might run out of
memory if the result returned by `gather` is too large.
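
One way to keep gathered results small is to reduce the tall array before calling `gather`. The sketch below assumes that a short preview or a scalar summary is enough for the task at hand; the data and variable names are illustrative.

```matlab
% Illustrative tall column vector; imagine it has far more rows.
t = tall((1:1000)');

% Gather only the first few rows instead of the whole array.
firstRows = gather(head(t, 3));

% Gather a scalar summary instead of the underlying data.
total = gather(sum(t));
```

Both results are small regardless of how many rows `t` has, so they fit in memory even when the full array would not.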

Tall tables are like in-memory MATLAB tables, except that
they can have any number of rows. To create a tall table from a large
data set, you first need to create a `datastore` for
the data. If the datastore `ds` contains tabular
data, then `tall(ds)` returns a tall table containing
the data. See Datastore for more information about creating datastores.

Create a tabular text datastore that points to a tabular file
of airline flight data. For folders that contain a collection of files,
you can specify the entire folder location, or use a wildcard pattern,
such as `'*.csv'`, to include multiple files with the same file extension
in the datastore. Clean the data by treating `'NA'` values as missing
data so that `datastore` replaces them with `NaN` values.
Also, set the format of a few text variables to `%s` so
that `datastore` reads them as cell arrays of character
vectors.

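    ds = datastore('airlinesmall.csv');
    % Treat 'NA' entries as missing so numeric variables read as NaN.
    ds.TreatAsMissing = 'NA';
    % Read these two text variables as cell arrays of character vectors.
    ds.SelectedFormats{strcmp(ds.SelectedVariableNames,'TailNum')} = '%s';
    ds.SelectedFormats{strcmp(ds.SelectedVariableNames,'CancellationCode')} = '%s';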

Create a tall table from the datastore. When you perform calculations on this tall table, the underlying datastore reads chunks of data and passes them to the tall table to process. Neither the datastore nor the tall table retains any of the underlying data.

tt = tall(ds)

tt = M×29 tall table Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay ____ _____ __________ _________ _______ __________ _______ __________ _____________ _________ _______ _________________ ______________ _______ ________ ________ ______ _____ ________ ______ _______ _________ ________________ ________ ____________ ____________ ________ _____________ _________________ 1987 10 21 3 642 630 735 727 'PS' 1503 'NA' 53 57 NaN 8 12 'LAX' 'SJC' 308 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN 1987 10 26 1 1021 1020 1124 1116 'PS' 1550 'NA' 63 56 NaN 8 1 'SJC' 'BUR' 296 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN 1987 10 23 5 2055 2035 2218 2157 'PS' 1589 'NA' 83 82 NaN 21 20 'SAN' 'SMF' 480 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN 1987 10 23 5 1332 1320 1431 1418 'PS' 1655 'NA' 59 58 NaN 13 12 'BUR' 'SJC' 296 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN 1987 10 22 4 629 630 746 742 'PS' 1702 'NA' 77 72 NaN 4 -1 'SMF' 'LAX' 373 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN 1987 10 28 3 1446 1343 1547 1448 'PS' 1729 'NA' 61 65 NaN 59 63 'LAX' 'SJC' 308 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN 1987 10 8 4 928 930 1052 1049 'PS' 1763 'NA' 84 79 NaN 3 -2 'SAN' 'SFO' 447 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN 1987 10 10 6 859 900 1134 1123 'PS' 1800 'NA' 155 143 NaN 11 -1 'SEA' 'LAX' 954 NaN NaN 0 'NA' 0 NaN NaN NaN NaN NaN : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

The display indicates that the number of rows, `M`,
is currently unknown. MATLAB displays some of the rows, and the
vertical ellipses `:` indicate that more rows exist
in the tall table that are not currently being displayed.

If the data you are working with has a time associated with
each row of data, then you can convert the tall table into a tall
timetable. You can use `table2timetable` to
convert an entire tall table, or construct the new tall timetable
from specific table variables with the `timetable` function.

In this case, the tall table `tt` has times
associated with each row, but they are broken down into several table
variables such as `Year`, `Month`, `DayofMonth`,
and so on. Combine all of these pieces of datetime information into
a single new tall datetime variable `Dates`, which
is based on the departure times `DepTime`. Create
a tall timetable using `Dates` as the row times.
Since `Dates` is the only datetime variable in the
table, the `table2timetable` function automatically
uses it for the row times.

    % DepTime is stored as HHMM; split it into hours and minutes.
    hrs = (tt.DepTime - mod(tt.DepTime,100))/100;
    mins = mod(tt.DepTime,100);
    tt.Dates = datetime(tt.Year, tt.Month, tt.DayofMonth, hrs, mins, 0);
    % Remove the first eight variables, which Dates now replaces.
    tt(:,1:8) = [];
    TT = table2timetable(tt)

    TT =

      M×21 tall timetable

        Dates                   UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn    TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
        ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    ______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________
        21-Oct-1987 06:42:00    'PS'             1503         'NA'       53                   57                NaN        8           12          'LAX'     'SJC'    308         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        26-Oct-1987 10:21:00    'PS'             1550         'NA'       63                   56                NaN        8           1           'SJC'     'BUR'    296         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        23-Oct-1987 20:55:00    'PS'             1589         'NA'       83                   82                NaN        21          20          'SAN'     'SMF'    480         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        23-Oct-1987 13:32:00    'PS'             1655         'NA'       59                   58                NaN        13          12          'BUR'     'SJC'    296         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        22-Oct-1987 06:29:00    'PS'             1702         'NA'       77                   72                NaN        4           -1          'SMF'     'LAX'    373         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        28-Oct-1987 14:46:00    'PS'             1729         'NA'       61                   65                NaN        59          63          'LAX'     'SJC'    308         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        08-Oct-1987 09:28:00    'PS'             1763         'NA'       84                   79                NaN        3           -2          'SAN'     'SFO'    447         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        10-Oct-1987 08:59:00    'PS'             1800         'NA'       155                  143               NaN        11          -1          'SEA'     'LAX'    954         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN
        :                       :                :            :          :                    :                 :          :           :           :         :        :           :         :          :            :                   :           :               :               :           :                :
        :                       :                :            :          :                    :                 :          :           :           :         :        :           :         :          :            :                   :           :               :               :           :                :
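
The HHMM decoding used to build `Dates` can be checked on a few in-memory values. This sketch assumes the first three `DepTime` values from the display (642, 1021, and 2055); the variable name `depTime` is illustrative.

```matlab
% Departure times encoded as HHMM (first three rows of the display).
depTime = [642; 1021; 2055];

% The last two digits are the minutes; the remaining digits are the hours.
mins = mod(depTime, 100);
hrs = (depTime - mins) / 100;
```

The resulting hours and minutes match the times 06:42, 10:21, and 20:55 shown in the `Dates` column.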

When you extract a variable from a tall table or tall timetable,
the result is a tall array of the appropriate underlying data type.
A tall array can be a numeric, logical, datetime, duration, calendar
duration, categorical, string, or cell array. Also, you can convert
an in-memory array `A` into a tall array with `tA = tall(A)`.
The in-memory array `A` must be one of the supported data types.

Extract the arrival delay `ArrDelay` from the
tall timetable `TT`. This creates a new tall array
variable with underlying data type double.

    a = TT.ArrDelay

    a =

      M×1 tall double column vector

         8
         8
        21
        13
         4
        59
         3
        11
         :
         :

The `classUnderlying` and `isaUnderlying` functions are useful to
determine the underlying data type of a tall array.
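
A short sketch of both functions follows, using a small illustrative matrix converted with `tall`. Because these functions can return unevaluated tall results, the sketch passes them through `gather` before using them; the variable names are assumptions made for this example.

```matlab
% Convert a small in-memory double matrix to a tall array (illustrative).
tA = tall([1.5 2.5; 3.5 4.5]);

% Query the underlying data type of the tall array.
cls = classUnderlying(tA);
isDbl = isaUnderlying(tA, 'double');

% Bring both answers into memory in a single gather call.
[cls, isDbl] = gather(cls, isDbl);
```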

One important aspect of tall arrays is that as you work with
them, most operations are not performed immediately. These operations
appear to execute quickly, because the actual computation is deferred
until you specifically request that the calculations be performed.
You can trigger evaluation of a tall array with either the `gather` function (to bring the result
into memory) or the `write` function
(to write the result to disk). This deferred evaluation is important
because even a simple command like `size(X)` executed
on a tall array with a billion rows is not a quick calculation.

As you work with tall arrays, MATLAB keeps track of all
of the operations to be carried out. This information is then used
to optimize the number of passes through the data that will be required
when you request output with the `gather` function.
Thus, it is normal to work with unevaluated tall arrays and request
output only when you require it. For more information, see Deferred Evaluation of Tall Arrays.

Calculate the mean and standard deviation of the arrival delay. Use these values to construct the upper and lower thresholds for delays that are within one standard deviation of the mean. Notice that the result of each operation indicates that the array has not been calculated yet.

    m = mean(a,'omitnan')

    m =

      tall double

        ?

    s = std(a,'omitnan')

    s =

      tall array

        ?

    one_sigma_bounds = [m-s m m+s]

    one_sigma_bounds =

      M×N×... tall array

        ?    ?    ?    ...
        ?    ?    ?    ...
        ?    ?    ?    ...
        :    :    :
        :    :    :

The benefit of delayed evaluation is that when the time comes for MATLAB to perform the calculations, it is often possible to combine the operations in such a way that the number of passes through the data is minimized. So even if you perform many operations, MATLAB only makes extra passes through the data when absolutely necessary.

The `gather` function forces evaluation of all queued operations and brings the
resulting output into memory. For this reason, you can think of `gather`
as a bridge between tall arrays and in-memory arrays. For example, you cannot control
`if` or `while` loops using a tall logical array, but
once the array is evaluated with `gather` it becomes an in-memory logical
array that you can use in these contexts.
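
The following sketch shows this bridging role in an `if` statement. The data and the variable names `t`, `anyNegative`, and `msg` are illustrative assumptions.

```matlab
% Illustrative tall column vector.
t = tall([3; -1; 7]);

% The comparison and reduction produce an unevaluated tall logical.
anyNegative = any(t < 0);

% gather returns an in-memory logical that can control the branch.
if gather(anyNegative)
    msg = 'contains negative values';
else
    msg = 'all values are nonnegative';
end
```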

Since `gather` returns the entire result in MATLAB, you should make sure that the result will fit in memory.

Use `gather` to calculate `one_sigma_bounds` and
bring the result into memory. In this case, `one_sigma_bounds` requires
several operations to calculate, but MATLAB combines the operations
into one pass through the data. Since the data in this example is
small, `gather` executes quickly. However, the
elimination of passes through the data becomes more valuable as the
size of your data increases.

    sig1 = gather(one_sigma_bounds)

    Evaluating tall expression using the Local MATLAB Session:
    - Pass 1 of 1: Completed in 1 sec
    Evaluation completed in 1 sec

    sig1 =

      -23.4572    7.1201   37.6975

You can specify multiple inputs and outputs to `gather` if
you want to evaluate several tall arrays at once. This technique is
faster than calling `gather` multiple times. For
example, calculate the minimum and maximum arrival delay. Computed
separately, each value requires a pass through the data to calculate
for a total of two passes. However, computing both values simultaneously
requires only one pass through the data.

    [max_delay, min_delay] = gather(max(a),min(a))

    Evaluating tall expression using the Local MATLAB Session:
    - Pass 1 of 1: Completed in 1 sec
    Evaluation completed in 1 sec

    max_delay =

            1014

    min_delay =

       -64

These results indicate that, on average, flights arrive about 7 minutes late. A flight that arrives up to 37 minutes late or 23 minutes early is still within one standard deviation of the mean. The earliest flight in the data set arrived about an hour early (64 minutes), and the latest flight was delayed by nearly 17 hours (1014 minutes).

The `save` function saves
the *state* of a tall array, but does not copy
any of the data. The resulting `.mat` file is typically
small. However, the original data files must be available in the same
location in order to subsequently use `load`.

The `write` function makes
a copy of the data and saves the copy as a collection of binary files,
which can consume a large amount of disk space. `write` executes
all pending operations on the tall array to calculate the values prior
to writing. Once `write` copies the data, it is
independent of the original raw data. Therefore, you can recreate
the tall array from the written files even if the original raw data
is no longer available.

You can recreate the tall array from the written binary files
by creating a new datastore that points to the location where the
files were written. This functionality enables you to create *checkpoints* or *snapshots* of
tall array data. Creating a checkpoint is a good way to save the results
of preprocessing your data, so that the data is in a form that is
more efficient to load.

If you have a tall array `TA`, then you can
write it to the folder `location` with the command:

write(location,TA);

Later, to reconstruct `TA` from the written
files, use the commands:

    ds = datastore(location);
    TA = tall(ds);

Additionally, you can use the `write` function
to trigger evaluation of a tall array and write the results to disk.
This use of `write` is similar to `gather`;
however, `write` does not bring any results into
memory.
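
As a rough sketch of a checkpoint round trip, the following writes a small tall array with a pending operation to a scratch folder and then reads it back. The folder returned by `tempname` and the variable names are assumptions made for illustration; in practice `location` would point to your own storage.

```matlab
% Hypothetical scratch folder; tempname returns a unique path that
% does not exist yet.
location = tempname;

% Illustrative tall array with a pending (unevaluated) operation.
tA = tall((1:6)');
tSq = tA .^ 2;

% write evaluates the pending operation and saves the result to disk
% without returning the values to the workspace.
write(location, tSq);

% Rebuild a tall array from the written files and bring it into memory.
ds = datastore(location);
tB = tall(ds);
b = gather(tB);
```

After the call to `write`, the files in `location` no longer depend on `tA`, so the reconstruction step can run in a later session.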

Tall arrays are supported by several toolboxes, enabling you to do things like write machine learning algorithms, deploy standalone apps, and run calculations in parallel or on a cluster. For more information, see Extend Tall Arrays with Other Products.

`datastore` | `gather` | `mapreducer` | `table` | `tall`
