MATLAB Examples

Simple Data Subsetting Using MapReduce

This example shows how to extract a subset of a large data set.

There are two aspects of subsetting, or performing a query. One is selecting a subset of the variables (columns) in the data set. The other is selecting a subset of the observations, or rows.

In this example, the selection of variables takes place in the definition of the datastore. (The map function could perform a further sub-selection of variables, but that is not within the scope of this example). In this example, the role of the map function is to perform the selection of observations. The role of the reduce function is to concatenate the subsetted records extracted by each call to the map function. This approach assumes that the data set can fit in memory after the Map phase.

Contents

Prepare Data

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. This example uses 15 variables out of the 29 variables available in the data.

ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = ds.VariableNames([1 2 5 9 12 13 15 16 17 ...
    18 20 21 25 26 27]);
ds.SelectedVariableNames
ans =

  1x15 cell array

  Columns 1 through 4

    {'Year'}    {'Month'}    {'DepTime'}    {'UniqueCarrier'}

  Columns 5 through 8

    {'ActualElapsedTime'}    {'CRSElapsedTime'}    {'ArrDelay'}    {'DepDelay'}

  Columns 9 through 13

    {'Origin'}    {'Dest'}    {'TaxiIn'}    {'TaxiOut'}    {'CarrierDelay'}

  Columns 14 through 15

    {'WeatherDelay'}    {'NASDelay'}

The datastore treats 'NA' values as missing, and replaces the missing values with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.

preview(ds)
ans =

  8x15 table

    Year    Month    DepTime    UniqueCarrier    ActualElapsedTime    CRSElapsedTime    ArrDelay    DepDelay    Origin    Dest     TaxiIn    TaxiOut    CarrierDelay    WeatherDelay    NASDelay
    ____    _____    _______    _____________    _________________    ______________    ________    ________    ______    _____    ______    _______    ____________    ____________    ________

    1987    10        642       'PS'              53                   57                8          12          'LAX'     'SJC'    NaN       NaN        NaN             NaN             NaN     
    1987    10       1021       'PS'              63                   56                8           1          'SJC'     'BUR'    NaN       NaN        NaN             NaN             NaN     
    1987    10       2055       'PS'              83                   82               21          20          'SAN'     'SMF'    NaN       NaN        NaN             NaN             NaN     
    1987    10       1332       'PS'              59                   58               13          12          'BUR'     'SJC'    NaN       NaN        NaN             NaN             NaN     
    1987    10        629       'PS'              77                   72                4          -1          'SMF'     'LAX'    NaN       NaN        NaN             NaN             NaN     
    1987    10       1446       'PS'              61                   65               59          63          'LAX'     'SJC'    NaN       NaN        NaN             NaN             NaN     
    1987    10        928       'PS'              84                   79                3          -2          'SAN'     'SFO'    NaN       NaN        NaN             NaN             NaN     
    1987    10        859       'PS'             155                  143               11          -1          'SEA'     'LAX'    NaN       NaN        NaN             NaN             NaN     

Run MapReduce

The mapreduce function requires a map function and a reduce function as inputs. The mapper receives chunks of data and outputs intermediate results. The reducer reads the intermediate results and produces a final result.

In this example, the mapper receives a table with the variables described by the SelectedVariableNames property in the datastore. Then, the mapper extracts flights that had a high amount of delay after pushback from the gate. Specifically, it identifies flights with a duration exceeding 2.5 times the length of the scheduled duration. The mapper ignores flights prior to 1995, because some of the variables of interest for this example were not collected before that year.

Display the map function file.

function subsettingMapper(data, ~, intermKVStore)
% Select flights from 1995 and later that had exceptionally long
% elapsed flight times (including both time on the tarmac and time in 
% the air).

% Copyright 2014 The MathWorks, Inc.

idx = data.Year > 1994 & (data.ActualElapsedTime - data.CRSElapsedTime)...
    > 1.50 * data.CRSElapsedTime;
intermVal = data(idx,:);

add(intermKVStore,'Null',intermVal);

The reducer receives the subsetted observations obtained from the mapper and simply concatenates them into a single table. The reducer returns one key (which is relatively meaningless) and one value (the concatenated table).

Display the reduce function file.

function subsettingReducer(~, intermValList, outKVStore)
% Reducer function for the SubsettingMapReduceExample 

% Copyright 2014 The MathWorks, Inc.

% get all intermediate results from the list
outVal = {};

while hasnext(intermValList)
    outVal = [outVal; getnext(intermValList)];
end
% Note that this approach assumes the concatenated intermediate values (the
% subset of the whole data) fit in memory.
    
add(outKVStore, 'Null', outVal);

Use mapreduce to apply the map and reduce functions to the datastore, ds.

result = mapreduce(ds, @subsettingMapper, @subsettingReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%

mapreduce returns an output datastore, result, with files in the current folder.

Display Results

Look for patterns in the first 10 variables that were pulled from the data set. These variables identify the airline, the destination, and the arrival airports, as well as some basic delay information.

r = readall(result);
tbl = r.Value{1};
tbl(:,1:10)
ans =

  37x10 table

    Year    Month    DepTime    UniqueCarrier    ActualElapsedTime    CRSElapsedTime    ArrDelay    DepDelay    Origin    Dest 
    ____    _____    _______    _____________    _________________    ______________    ________    ________    ______    _____

    1995     6       1601       'US'              162                  58               118            14       'BWI'     'PIT'
    1996     6       1834       'CO'              241                  75               220            54       'IAD'     'EWR'
    1997     1        730       'DL'              110                  43               137            70       'ATL'     'GSP'
    1997     4       1715       'UA'              152                  57               243           148       'IND'     'ORD'
    1997     9       2232       'NW'              143                  50               115            22       'DTW'     'CMH'
    1997    10       1419       'CO'              196                  58               157            19       'DFW'     'IAH'
    1998     3       2156       'DL'              139                  49               146            56       'TYS'     'ATL'
    1998    10       1803       'NW'              291                  81               213             3       'MSP'     'ORD'
    2000     5        830       'WN'              140                  55                85             0       'DAL'     'HOU'
    2000     8       1630       'CO'              357                 123               244            10       'EWR'     'CLT'
    2002     6       1759       'US'              260                  67               192            -1       'LGA'     'BOS'
    2003     3       1214       'XE'              214                  84               124            -6       'GPT'     'IAH'
    2003     3        604       'XE'              175                  60               114            -1       'LFT'     'IAH'
    2003     4       1556       'MQ'              142                  52               182            92       'PIA'     'ORD'
    2003     5       1954       'US'              127                  48                78            -1       'RDU'     'CLT'
    2003     7       1250       'FL'              261                  95               166             0       'ATL'     'IAD'
    2003     8       2010       'AA'              339                 115               406           182       'BHM'     'DFW'
    2004     3       1238       'MQ'              184                  69               115             0       'AMA'     'DFW'
    2004     7       1730       'DL'              241                  68               173             0       'DCA'     'LGA'
    2004     8       1330       'XE'              204                  80               124             0       'HRL'     'IAH'
    2005     7       1951       'MQ'              251                  97               345           191       'RDU'     'JFK'
    2005    10        916       'MQ'              343                  77               266             0       'LIT'     'DFW'
    2006     2        324       'B6'             1650                 199               415         -1036       'BOS'     'FLL'
    2006     5       1444       'CO'              167                  60               131            24       'IAH'     'SAT'
    2006     5       1250       'DL'              148                  59               109            20       'DCA'     'LGA'
    2006     7       1030       'WN'              211                  80               226            95       'PIT'     'MDW'
    2006     7       1424       'MQ'              254                  69               259            74       'LGA'     'DCA'
    2006    11       2147       'UA'              222                  77               160            15       'DEN'     'ICT'
    2006    11       1307       'AA'              175                  60               132            17       'DFW'     'AUS'
    2007    10       1141       'OO'              137                  54               107            24       'PIA'     'ORD'
    2008     1       1027       'MQ'              139                  55                96            12       'MLI'     'ORD'
    2008     1       2049       'MQ'              151                  60               175            84       'AZO'     'ORD'
    2008     2        818       'WN'              280                  95               198            13       'MHT'     'BWI'
    2008     4       1014       'CO'              151                  58                92            -1       'SAT'     'IAH'
    2008     6       2000       'OH'              263                 104               204            45       'JFK'     'BOS'
    2008     6       1715       'AA'              271                  90               201            20       'RDU'     'LGA'
    2008    11       1603       'XE'              183                  73               124            14       'BTR'     'IAH'

Looking at the first record, a U.S. Air flight departed the gate 14 minutes after its scheduled departure time and arrived 118 minutes late. The flight experienced a delay of 104 minutes after pushback from the gate which is the difference between ActualElapsedTime and CRSElapsedTime.

There is one anomalous record. In February of 2006, a JetBlue flight had a departure time of 3:24 a.m. and an elapsed flight time of 1650 minutes, but an arrival delay of only 415 minutes. This might be a data entry error.

Otherwise, there are no clear cut patterns concerning when and where these exceptionally delayed flights occur. No airline, time of year, time of day, or single airport dominates. Some intuitive patterns, such as O'Hare (ORD) in the winter months, are certainly present.

Delay Patterns

Beginning in 1995, the airline system performance data began including measurements of how much delay took place in the taxi phases of a flight. Then, in 2003, the data also began to include certain causes of delay.

Examine these two variables in closer detail.

tbl(:,[1,7,8,11:end])
ans =

  37x8 table

    Year    ArrDelay    DepDelay    TaxiIn    TaxiOut    CarrierDelay    WeatherDelay    NASDelay
    ____    ________    ________    ______    _______    ____________    ____________    ________

    1995    118            14         7       101        NaN             NaN             NaN     
    1996    220            54        12       180        NaN             NaN             NaN     
    1997    137            70         2        12        NaN             NaN             NaN     
    1997    243           148         4        38        NaN             NaN             NaN     
    1997    115            22         4        98        NaN             NaN             NaN     
    1997    157            19         6        95        NaN             NaN             NaN     
    1998    146            56         9        47        NaN             NaN             NaN     
    1998    213             3        11       205        NaN             NaN             NaN     
    2000     85             0         5        51        NaN             NaN             NaN     
    2000    244            10         4       273        NaN             NaN             NaN     
    2002    192            -1         6       217        NaN             NaN             NaN     
    2003    124            -6        13       131        NaN             NaN             NaN     
    2003    114            -1         8       106        NaN             NaN             NaN     
    2003    182            92         9       106        NaN             NaN             NaN     
    2003     78            -1         5        90        NaN             NaN             NaN     
    2003    166             0        11       170          0               0             166     
    2003    406           182       242        10          0               0             224     
    2004    115             0         6        61          0               0             115     
    2004    173             0         5       161          0               0             173     
    2004    124             0         9       102          0               0             124     
    2005    345           191        54       125          0               0             345     
    2005    266             0        13       183          0               0             266     
    2006    415         -1036         4        12         14               0              11     
    2006    131            24         7       118          0               6             107     
    2006    109            20         4       105         20               0              89     
    2006    226            95         5       130          0               0             226     
    2006    259            74         6       208         39               0             185     
    2006    160            15         3       158         15               0             145     
    2006    132            17         4       127          0              17             115     
    2007    107            24         7       100          0               0             107     
    2008     96            12        25        72          0               0              96     
    2008    175            84        12       107          0               0              91     
    2008    198            13         4       190          0               0             185     
    2008     92            -1         9        93          0               0              92     
    2008    204            45        12       212          0              45             159     
    2008    201            20         4       193          0               0             201     
    2008    124            14        12        93          0               0             110     

For these exceptionally delayed flights, the great majority of delay occurs during taxi out, on the tarmac. Moreover, the major cause of the delay is NASDelay. NAS delays are holds imposed by the national aviation authorities on departures headed for an airport that is forecast to be unable to handle all scheduled arrivals at the time the flight is scheduled to arrive. NAS delay programs in effect at any given time are posted at http://www.fly.faa.gov/ois/.

Preferably, when NAS delays are imposed, boarding of the aircraft is simply delayed. Such a delay would show up as a departure delay. However, for most of the flights selected for this example, the delays took place largely after departure from the gate, leading to a taxi delay.

Rerun MapReduce

The previous map function had the subsetting criteria hard-wired in the function file. A new map function would have to be written for any new query, such as flights departing San Francisco on a given day.

A generic mapper can be more adaptive by separating out the subsetting criteria from the map function definition and using an anonymous function to configure the mapper for each query. This generic mapper uses a fourth input argument that supplies the desired query variable.

Display the generic map function file.

function subsettingMapperGeneric(data, ~, intermKVStore, subsetter)

intermKey = 'Null';

intermVal = data(subsetter(data), :);

add(intermKVStore,intermKey,intermVal);

Create an anonymous function that performs the same selection of rows that is hard-coded in subsettingMapper.m.

inFlightDelay150percent = ...
   @(data) data.Year > 1994 & ...
   (data.ActualElapsedTime-data.CRSElapsedTime) > 1.50*data.CRSElapsedTime;

Since the mapreduce function requires the map and reduce functions to accept exactly three inputs, use another anonymous function to specify the fourth input to the mapper, subsettingMapperGeneric.m. Subsequently, you can use this anonymous function to call subsettingMapperGeneric.m using only three arguments (the fourth is implicit).

configuredMapper = ...
    @(data, info, intermKVStore) subsettingMapperGeneric(data, info, ...
    intermKVStore, inFlightDelay150percent);

Use mapreduce to apply the generic map function to the input datastore.

result2 = mapreduce(ds, configuredMapper, @subsettingReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%

mapreduce returns an output datastore, result2, with files in the current folder.

Verify Results

Confirm that the generic mapper gets the same result as with the hard-wired subsetting logic.

r2 = readall(result2);
tbl2 = r2.Value{1};

if isequaln(tbl, tbl2)
    disp('Same results with the configurable mapper.')
else
    disp('Oops, back to the drawing board.')
end
Same results with the configurable mapper.