MATLAB Examples

Simple Data Subsetting Using MapReduce

This example shows how to extract a subset of a large data set.

There are two aspects of subsetting, or performing a query. One is selecting a subset of the variables (columns) in the data set. The other is selecting a subset of the observations, or rows.

In this example, the selection of variables takes place in the definition of the datastore. (The map function could perform a further sub-selection of variables, but that is not within the scope of this example). In this example, the role of the map function is to perform the selection of observations. The role of the reduce function is to concatenate the subsetted records extracted by each call to the map function. This approach assumes that the data set can fit in memory after the Map phase.

Contents

Prepare Data

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. This example uses 15 variables out of the 29 variables available in the data.

ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = ds.VariableNames([1 2 5 9 12 13 15 16 17 ...
    18 20 21 25 26 27]);
ds.SelectedVariableNames
ans =

  1x15 cell array

  Columns 1 through 4

    {'Year'}    {'Month'}    {'DepTime'}    {'UniqueCarrier'}

  Columns 5 through 8

    {'ActualElapsedTime'}    {'CRSElapsedTime'}    {'ArrDelay'}    {'DepDelay'}

  Columns 9 through 13

    {'Origin'}    {'Dest'}    {'TaxiIn'}    {'TaxiOut'}    {'CarrierDelay'}

  Columns 14 through 15

    {'WeatherDelay'}    {'NASDelay'}

The datastore treats 'NA' values as missing, and replaces the missing values with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.

preview(ds)
ans =

  8x15 table

    Year    Month    DepTime    UniqueCarrier    ActualElapsedTime    CRSElapsedTime    ArrDelay    DepDelay    Origin    Dest     TaxiIn    TaxiOut    CarrierDelay    WeatherDelay    NASDelay
    ____    _____    _______    _____________    _________________    ______________    ________    ________    ______    _____    ______    _______    ____________    ____________    ________

    1987     10        642          'PS'                 53                 57              8          12       'LAX'     'SJC'     NaN        NaN          NaN             NaN           NaN   
    1987     10       1021          'PS'                 63                 56              8           1       'SJC'     'BUR'     NaN        NaN          NaN             NaN           NaN   
    1987     10       2055          'PS'                 83                 82             21          20       'SAN'     'SMF'     NaN        NaN          NaN             NaN           NaN   
    1987     10       1332          'PS'                 59                 58             13          12       'BUR'     'SJC'     NaN        NaN          NaN             NaN           NaN   
    1987     10        629          'PS'                 77                 72              4          -1       'SMF'     'LAX'     NaN        NaN          NaN             NaN           NaN   
    1987     10       1446          'PS'                 61                 65             59          63       'LAX'     'SJC'     NaN        NaN          NaN             NaN           NaN   
    1987     10        928          'PS'                 84                 79              3          -2       'SAN'     'SFO'     NaN        NaN          NaN             NaN           NaN   
    1987     10        859          'PS'                155                143             11          -1       'SEA'     'LAX'     NaN        NaN          NaN             NaN           NaN   

Run MapReduce

The mapreduce function requires a map function and a reduce function as inputs. The mapper receives chunks of data and outputs intermediate results. The reducer reads the intermediate results and produces a final result.

In this example, the mapper receives a table with the variables described by the SelectedVariableNames property in the datastore. Then, the mapper extracts flights that had a high amount of delay after pushback from the gate. Specifically, it identifies flights with a duration exceeding 2.5 times the length of the scheduled duration. The mapper ignores flights prior to 1995, because some of the variables of interest for this example were not collected before that year.

Display the map function file.

function subsettingMapper(data, ~, intermKVStore)
% Select flights from 1995 and later that had exceptionally long
% elapsed flight times (including both time on the tarmac and time in 
% the air).

% Copyright 2014 The MathWorks, Inc.

idx = data.Year > 1994 & (data.ActualElapsedTime - data.CRSElapsedTime)...
    > 1.50 * data.CRSElapsedTime;
intermVal = data(idx,:);

add(intermKVStore,'Null',intermVal);

The reducer receives the subsetted observations obtained from the mapper and simply concatenates them into a single table. The reducer returns one key (which is relatively meaningless) and one value (the concatenated table).

Display the reduce function file.

function subsettingReducer(~, intermValList, outKVStore)
% Reducer function for the SubsettingMapReduceExample 

% Copyright 2014 The MathWorks, Inc.

% get all intermediate results from the list
outVal = {};

while hasnext(intermValList)
    outVal = [outVal; getnext(intermValList)];
end
% Note that this approach assumes the concatenated intermediate values (the
% subset of the whole data) fit in memory.
    
add(outKVStore, 'Null', outVal);

Use mapreduce to apply the map and reduce functions to the datastore, ds.

result = mapreduce(ds, @subsettingMapper, @subsettingReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%

mapreduce returns an output datastore, result, with files in the current folder.

Display Results

Look for patterns in the first 10 variables that were pulled from the data set. These variables identify the airline, the destination, and the arrival airports, as well as some basic delay information.

r = readall(result);
tbl = r.Value{1};
tbl(:,1:10)
ans =

  37x10 table

    Year    Month    DepTime    UniqueCarrier    ActualElapsedTime    CRSElapsedTime    ArrDelay    DepDelay    Origin    Dest 
    ____    _____    _______    _____________    _________________    ______________    ________    ________    ______    _____

    1995      6       1601          'US'                162                 58            118           14      'BWI'     'PIT'
    1996      6       1834          'CO'                241                 75            220           54      'IAD'     'EWR'
    1997      1        730          'DL'                110                 43            137           70      'ATL'     'GSP'
    1997      4       1715          'UA'                152                 57            243          148      'IND'     'ORD'
    1997      9       2232          'NW'                143                 50            115           22      'DTW'     'CMH'
    1997     10       1419          'CO'                196                 58            157           19      'DFW'     'IAH'
    1998      3       2156          'DL'                139                 49            146           56      'TYS'     'ATL'
    1998     10       1803          'NW'                291                 81            213            3      'MSP'     'ORD'
    2000      5        830          'WN'                140                 55             85            0      'DAL'     'HOU'
    2000      8       1630          'CO'                357                123            244           10      'EWR'     'CLT'
    2002      6       1759          'US'                260                 67            192           -1      'LGA'     'BOS'
    2003      3       1214          'XE'                214                 84            124           -6      'GPT'     'IAH'
    2003      3        604          'XE'                175                 60            114           -1      'LFT'     'IAH'
    2003      4       1556          'MQ'                142                 52            182           92      'PIA'     'ORD'
    2003      5       1954          'US'                127                 48             78           -1      'RDU'     'CLT'
    2003      7       1250          'FL'                261                 95            166            0      'ATL'     'IAD'
    2003      8       2010          'AA'                339                115            406          182      'BHM'     'DFW'
    2004      3       1238          'MQ'                184                 69            115            0      'AMA'     'DFW'
    2004      7       1730          'DL'                241                 68            173            0      'DCA'     'LGA'
    2004      8       1330          'XE'                204                 80            124            0      'HRL'     'IAH'
    2005      7       1951          'MQ'                251                 97            345          191      'RDU'     'JFK'
    2005     10        916          'MQ'                343                 77            266            0      'LIT'     'DFW'
    2006      2        324          'B6'               1650                199            415        -1036      'BOS'     'FLL'
    2006      5       1444          'CO'                167                 60            131           24      'IAH'     'SAT'
    2006      5       1250          'DL'                148                 59            109           20      'DCA'     'LGA'
    2006      7       1030          'WN'                211                 80            226           95      'PIT'     'MDW'
    2006      7       1424          'MQ'                254                 69            259           74      'LGA'     'DCA'
    2006     11       2147          'UA'                222                 77            160           15      'DEN'     'ICT'
    2006     11       1307          'AA'                175                 60            132           17      'DFW'     'AUS'
    2007     10       1141          'OO'                137                 54            107           24      'PIA'     'ORD'
    2008      1       1027          'MQ'                139                 55             96           12      'MLI'     'ORD'
    2008      1       2049          'MQ'                151                 60            175           84      'AZO'     'ORD'
    2008      2        818          'WN'                280                 95            198           13      'MHT'     'BWI'
    2008      4       1014          'CO'                151                 58             92           -1      'SAT'     'IAH'
    2008      6       2000          'OH'                263                104            204           45      'JFK'     'BOS'
    2008      6       1715          'AA'                271                 90            201           20      'RDU'     'LGA'
    2008     11       1603          'XE'                183                 73            124           14      'BTR'     'IAH'

Looking at the first record, a U.S. Air flight departed the gate 14 minutes after its scheduled departure time and arrived 118 minutes late. The flight experienced a delay of 104 minutes after pushback from the gate which is the difference between ActualElapsedTime and CRSElapsedTime.

There is one anomalous record. In February of 2006, a JetBlue flight had a departure time of 3:24 a.m. and an elapsed flight time of 1650 minutes, but an arrival delay of only 415 minutes. This might be a data entry error.

Otherwise, there are no clear cut patterns concerning when and where these exceptionally delayed flights occur. No airline, time of year, time of day, or single airport dominates. Some intuitive patterns, such as O'Hare (ORD) in the winter months, are certainly present.

Delay Patterns

Beginning in 1995, the airline system performance data began including measurements of how much delay took place in the taxi phases of a flight. Then, in 2003, the data also began to include certain causes of delay.

Examine these two variables in closer detail.

tbl(:,[1,7,8,11:end])
ans =

  37x8 table

    Year    ArrDelay    DepDelay    TaxiIn    TaxiOut    CarrierDelay    WeatherDelay    NASDelay
    ____    ________    ________    ______    _______    ____________    ____________    ________

    1995      118           14         7        101          NaN             NaN           NaN   
    1996      220           54        12        180          NaN             NaN           NaN   
    1997      137           70         2         12          NaN             NaN           NaN   
    1997      243          148         4         38          NaN             NaN           NaN   
    1997      115           22         4         98          NaN             NaN           NaN   
    1997      157           19         6         95          NaN             NaN           NaN   
    1998      146           56         9         47          NaN             NaN           NaN   
    1998      213            3        11        205          NaN             NaN           NaN   
    2000       85            0         5         51          NaN             NaN           NaN   
    2000      244           10         4        273          NaN             NaN           NaN   
    2002      192           -1         6        217          NaN             NaN           NaN   
    2003      124           -6        13        131          NaN             NaN           NaN   
    2003      114           -1         8        106          NaN             NaN           NaN   
    2003      182           92         9        106          NaN             NaN           NaN   
    2003       78           -1         5         90          NaN             NaN           NaN   
    2003      166            0        11        170            0               0           166   
    2003      406          182       242         10            0               0           224   
    2004      115            0         6         61            0               0           115   
    2004      173            0         5        161            0               0           173   
    2004      124            0         9        102            0               0           124   
    2005      345          191        54        125            0               0           345   
    2005      266            0        13        183            0               0           266   
    2006      415        -1036         4         12           14               0            11   
    2006      131           24         7        118            0               6           107   
    2006      109           20         4        105           20               0            89   
    2006      226           95         5        130            0               0           226   
    2006      259           74         6        208           39               0           185   
    2006      160           15         3        158           15               0           145   
    2006      132           17         4        127            0              17           115   
    2007      107           24         7        100            0               0           107   
    2008       96           12        25         72            0               0            96   
    2008      175           84        12        107            0               0            91   
    2008      198           13         4        190            0               0           185   
    2008       92           -1         9         93            0               0            92   
    2008      204           45        12        212            0              45           159   
    2008      201           20         4        193            0               0           201   
    2008      124           14        12         93            0               0           110   

For these exceptionally delayed flights, the great majority of delay occurs during taxi out, on the tarmac. Moreover, the major cause of the delay is NASDelay. NAS delays are holds imposed by the national aviation authorities on departures headed for an airport that is forecast to be unable to handle all scheduled arrivals at the time the flight is scheduled to arrive. NAS delay programs in effect at any given time are posted at https://www.fly.faa.gov/ois/.

Preferably, when NAS delays are imposed, boarding of the aircraft is simply delayed. Such a delay would show up as a departure delay. However, for most of the flights selected for this example, the delays took place largely after departure from the gate, leading to a taxi delay.

Rerun MapReduce

The previous map function had the subsetting criteria hard-wired in the function file. A new map function would have to be written for any new query, such as flights departing San Francisco on a given day.

A generic mapper can be more adaptive by separating out the subsetting criteria from the map function definition and using an anonymous function to configure the mapper for each query. This generic mapper uses a fourth input argument that supplies the desired query variable.

Display the generic map function file.

function subsettingMapperGeneric(data, ~, intermKVStore, subsetter)

intermKey = 'Null';

intermVal = data(subsetter(data), :);

add(intermKVStore,intermKey,intermVal);

Create an anonymous function that performs the same selection of rows that is hard-coded in subsettingMapper.m.

inFlightDelay150percent = ...
   @(data) data.Year > 1994 & ...
   (data.ActualElapsedTime-data.CRSElapsedTime) > 1.50*data.CRSElapsedTime;

Since the mapreduce function requires the map and reduce functions to accept exactly three inputs, use another anonymous function to specify the fourth input to the mapper, subsettingMapperGeneric.m. Subsequently, you can use this anonymous function to call subsettingMapperGeneric.m using only three arguments (the fourth is implicit).

configuredMapper = ...
    @(data, info, intermKVStore) subsettingMapperGeneric(data, info, ...
    intermKVStore, inFlightDelay150percent);

Use mapreduce to apply the generic map function to the input datastore.

result2 = mapreduce(ds, configuredMapper, @subsettingReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%

mapreduce returns an output datastore, result2, with files in the current folder.

Verify Results

Confirm that the generic mapper gets the same result as with the hard-wired subsetting logic.

r2 = readall(result2);
tbl2 = r2.Value{1};

if isequaln(tbl, tbl2)
    disp('Same results with the configurable mapper.')
else
    disp('Oops, back to the drawing board.')
end
Same results with the configurable mapper.