Compute Summary Statistics by Group Using MapReduce

This example shows how to compute summary statistics organized by group using mapreduce. It demonstrates the use of an anonymous function to pass an extra grouping parameter to a parameterized map function. This parameterization allows you to quickly recalculate the statistics using a different grouping variable.

Prepare Data

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. For this example, select Month, UniqueCarrier (airline carrier ID), and ArrDelay (flight arrival delay) as the variables of interest.

ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'Month', 'UniqueCarrier', 'ArrDelay'};

The datastore treats 'NA' values as missing, and replaces the missing values with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the selected variables of interest, which you can verify using preview.

preview(ds)
ans =

  8x3 table

    Month    UniqueCarrier    ArrDelay
    _____    _____________    ________

    10       'PS'              8      
    10       'PS'              8      
    10       'PS'             21      
    10       'PS'             13      
    10       'PS'              4      
    10       'PS'             59      
    10       'PS'              3      
    10       'PS'             11      

Run MapReduce

The mapreduce function requires a map function and a reduce function as inputs. The mapper receives chunks of data and outputs intermediate results. The reducer reads the intermediate results and produces a final result.

In this example, the mapper computes the grouped statistics for each chunk of data and stores the statistics as intermediate key-value pairs. Each intermediate key-value pair has a key for the group level and a cell array of values with the corresponding statistics.

This map function accepts four input arguments, but mapreduce requires a map function that accepts exactly three. The call to mapreduce later in this example shows how to pass in the extra parameter.
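As a minimal sketch of the idea, an anonymous function can fix the fourth argument so that the resulting handle takes three inputs. The names describeCall and threeArgMapper are hypothetical stand-ins for illustration and are not part of the example files.

```matlab
% Hypothetical four-argument function standing in for a parameterized mapper.
% It returns a description instead of writing to a key-value store.
describeCall = @(data, info, kvs, groupVarName) ...
    sprintf('%d rows grouped by %s', numel(data), groupVarName);

% Fix the fourth argument; the resulting handle takes exactly three inputs.
threeArgMapper = @(data, info, kvs) describeCall(data, info, kvs, 'UniqueCarrier');

numInputs = nargin(threeArgMapper);           % number of declared inputs
message = threeArgMapper([8 21 13], [], []);  % call it with three arguments
```

The value 'UniqueCarrier' is captured when the anonymous function is created, so the three-argument handle can be passed wherever a three-input function is expected.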

Display the map function file.

function statsByGroupMapper(data, ~, intermKVStore, groupVarName)
% Mapper function for the StatisticsByGroupMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

% data is an n-by-3 table. Remove missing values first.
delays = data.ArrDelay;
groups = data.(groupVarName);
notNaN = ~isnan(delays);
groups = groups(notNaN);
delays = delays(notNaN);

% find the unique group levels in this chunk
[intermKeys,~,idx] = unique(groups, 'stable');

% group delays by idx and apply @grpstatsfun function to each group
intermVals = accumarray(idx,delays,size(intermKeys),@grpstatsfun);
addmulti(intermKVStore,intermKeys,intermVals);

function out = grpstatsfun(x)
n = length(x); % count
m = sum(x)/n; % mean
v = sum((x-m).^2)/n; % variance
s = sum((x-m).^3)/n; % skewness without normalization
k = sum((x-m).^4)/n; % kurtosis without normalization
out = {[n, m, v, s, k]};
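To see how the mapper's grouping step behaves, the following sketch applies unique and accumarray to a small made-up chunk. The group labels and delay values are invented for illustration, and the per-group function is reduced to a count and a mean.

```matlab
% Made-up chunk: five flights from two carriers.
groups = {'AA'; 'UA'; 'AA'; 'UA'; 'AA'};
delays = [10; 4; 20; 6; 30];

% Unique group levels in order of first appearance, plus an index
% that maps each row to its group.
[keys, ~, idx] = unique(groups, 'stable');

% Apply a function to the delays of each group. Returning a cell lets
% accumarray collect one statistics vector per group.
stats = accumarray(idx, delays, size(keys), @(x) {[numel(x), mean(x)]});
```

Here keys is {'AA'; 'UA'}, stats{1} is [3 20], and stats{2} is [2 5], so each key lines up with one vector of statistics, which is the shape addmulti expects.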

After the Map phase, mapreduce groups the intermediate key-value pairs by unique key (in this case, the airline carrier ID), so each call to the reduce function works on the values associated with one airline. The reducer receives a list of the intermediate statistics for the airline specified by the input key (intermKey) and combines the statistics into separate vectors: n, m, v, s, and k. Then, the reducer uses these vectors to calculate the count, mean, variance, skewness, and kurtosis for a single airline. The final key is the airline carrier code, and the associated values are stored in a structure with five fields.
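The combination step in the reducer can be motivated as follows. This is a sketch of the reasoning, not text from the example files. Let chunk i contribute count n_i, mean m_i, and central moments v_i, s_i, k_i; let N be the total count, M the pooled mean, and d_i = m_i - M. Writing x - M = (x - m_i) + d_i and using the fact that deviations from the chunk mean sum to zero gives:

```latex
\begin{aligned}
N &= \sum_i n_i, \qquad M = \frac{1}{N}\sum_i n_i m_i, \qquad d_i = m_i - M \\
\sum_{x \in i} (x - M)^2 &= n_i v_i + n_i d_i^2 \\
\sum_{x \in i} (x - M)^3 &= n_i s_i + n_i d_i \left(3 v_i + d_i^2\right) \\
\sum_{x \in i} (x - M)^4 &= n_i k_i + n_i d_i \left(4 s_i + 6 v_i d_i + d_i^3\right)
\end{aligned}
```

Summing each line over the chunks and dividing by N gives the pooled central moments. The skewness is then the third moment divided by the variance raised to the power 3/2, and the kurtosis is the fourth moment divided by the squared variance.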

Display the reduce function file.

function statsByGroupReducer(intermKey, intermValIter, outKVStore)
% Reducer function for the StatisticsByGroupMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

n = [];
m = [];
v = [];
s = [];
k = [];

% get all sets of intermediate statistics
while hasnext(intermValIter)
    value = getnext(intermValIter);
    n = [n; value(1)];
    m = [m; value(2)];
    v = [v; value(3)];
    s = [s; value(4)];
    k = [k; value(5)];
end
% Note that this approach assumes the concatenated intermediate values fit
% in memory. Refer to the reducer function, covarianceReducer, of the
% CovarianceMapReduceExample for an alternative pairwise reduction approach.

% combine the intermediate results
count = sum(n);
meanVal = sum(n.*m)/count;
d = m - meanVal;
variance = (sum(n.*v) + sum(n.*d.^2))/count;
skewnessVal = (sum(n.*s) + sum(n.*d.*(3*v + d.^2)))./(count*variance^(1.5));
kurtosisVal = (sum(n.*k) + sum(n.*d.*(4*s + 6*v.*d + d.^3)))./(count*variance^2);

outValue = struct('Count',count, 'Mean',meanVal, 'Variance',variance,...
                 'Skewness',skewnessVal, 'Kurtosis',kurtosisVal);

% add results to the output datastore
add(outKVStore,intermKey,outValue);
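One way to gain confidence in the combination formulas is to compare them with a direct computation on a small data set. The sketch below splits made-up delay values into two chunks, computes per-chunk statistics in the same [n, m, v, s, k] layout as the mapper, combines them with the reducer's formulas, and computes the same quantities from all values at once. The variable names chunks, chunkStats, and the direct* variables are hypothetical.

```matlab
% Two made-up chunks of delay values belonging to one group.
chunks = {[8; 21; 13], [4; 59; 3; 11]};

% Per-chunk statistics, laid out as [n, m, v, s, k] like the mapper output.
chunkStats = @(x) [numel(x), mean(x), mean((x - mean(x)).^2), ...
                   mean((x - mean(x)).^3), mean((x - mean(x)).^4)];
stats = cell2mat(cellfun(chunkStats, chunks(:), 'UniformOutput', false));
n = stats(:,1); m = stats(:,2); v = stats(:,3); s = stats(:,4); k = stats(:,5);

% Combine the chunk statistics with the same formulas as the reducer.
count = sum(n);
meanVal = sum(n.*m)/count;
d = m - meanVal;
variance = (sum(n.*v) + sum(n.*d.^2))/count;
skewnessVal = (sum(n.*s) + sum(n.*d.*(3*v + d.^2)))/(count*variance^1.5);
kurtosisVal = (sum(n.*k) + sum(n.*d.*(4*s + 6*v.*d + d.^3)))/(count*variance^2);

% Direct computation on all values at once, for comparison.
x = vertcat(chunks{:});
c = x - mean(x);
directVariance = mean(c.^2);
directSkewness = mean(c.^3)/directVariance^1.5;
directKurtosis = mean(c.^4)/directVariance^2;
```

The combined and direct values should agree up to floating-point rounding, which is why the reducer never needs the raw delay values.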

Use mapreduce to apply the map and reduce functions to the datastore, ds. Because the parameterized map function accepts four inputs, use an anonymous function to pass in the name of the grouping variable, 'UniqueCarrier', as the fourth input.

outds1 = mapreduce(ds, ...
    @(data,info,kvs)statsByGroupMapper(data,info,kvs,'UniqueCarrier'), ...
    @statsByGroupReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce  10%
Map 100% Reduce  21%
Map 100% Reduce  31%
Map 100% Reduce  41%
Map 100% Reduce  52%
Map 100% Reduce  62%
Map 100% Reduce  72%
Map 100% Reduce  83%
Map 100% Reduce  93%
Map 100% Reduce 100%

mapreduce returns a KeyValueDatastore, outds1, that points to output files written to the current folder.

Read the final results from the output datastore.

r1 = readall(outds1)
r1 =

  29x2 table

      Key          Value    
    ________    ____________

    'PS'        [1x1 struct]
    'TW'        [1x1 struct]
    'UA'        [1x1 struct]
    'WN'        [1x1 struct]
    'EA'        [1x1 struct]
    'HP'        [1x1 struct]
    'NW'        [1x1 struct]
    'PA (1)'    [1x1 struct]
    'PI'        [1x1 struct]
    'CO'        [1x1 struct]
    'DL'        [1x1 struct]
    'AA'        [1x1 struct]
    'US'        [1x1 struct]
    'AS'        [1x1 struct]
    'ML (1)'    [1x1 struct]
    'AQ'        [1x1 struct]
    'MQ'        [1x1 struct]
    'OO'        [1x1 struct]
    'XE'        [1x1 struct]
    'TZ'        [1x1 struct]
    'EV'        [1x1 struct]
    'FL'        [1x1 struct]
    'B6'        [1x1 struct]
    'DH'        [1x1 struct]
    'HA'        [1x1 struct]
    'OH'        [1x1 struct]
    'F9'        [1x1 struct]
    'YV'        [1x1 struct]
    '9E'        [1x1 struct]

Organize Results

To organize the results better, convert the structure containing the statistics into a table and use the carrier IDs as the row names. mapreduce returns the key-value pairs in the same order as they were added by the reduce function, so sort the table by carrier ID.

statsByCarrier = struct2table(cell2mat(r1.Value), 'RowNames', r1.Key);
statsByCarrier = sortrows(statsByCarrier, 'RowNames')
statsByCarrier =

  29x5 table

              Count     Mean      Variance    Skewness    Kurtosis
              _____    _______    ________    ________    ________

    9E          507     5.3669    1889.5      6.2676      61.706  
    AA        14578     6.9598      1123      6.0321      93.085  
    AQ          153     1.0065    230.02      3.9905      28.383  
    AS         2826     8.0771       717      3.6547      24.083  
    B6          793     11.936    2087.4      4.0072       27.45  
    CO         7999      7.048    1053.8      4.6601      41.038  
    DH          673      7.575    1491.7      2.9929      15.461  
    DL        16284     7.4971    697.48      4.4746      41.115  
    EA          875     8.2434    1221.3      5.2955      43.518  
    EV         1655     10.028    1325.4      2.9347      14.878  
    F9          332     8.4849    1138.6      4.2983      30.742  
    FL         1248     9.5144    1360.4      3.6277      21.866  
    HA          271    -1.5387    323.27      8.4245      109.63  
    HP         3597     7.5897    744.51      5.2534      50.004  
    ML (1)       69    0.15942    169.32      2.8354      16.559  
    MQ         3805     8.8591    1530.5       7.054      105.51  
    NW        10097     5.4265    977.64       8.616      172.87  
    OH         1414     7.7617      1224        3.57       24.52  
    OO         3010     5.8618    1010.4      4.4263      32.783  
    PA (1)      313     5.3738    692.19      3.2061      20.747  
    PI          861     11.252    1121.1      14.751      315.59  
    PS           82     5.3902    454.51      2.9682      14.383  
    TW         3718      7.411    830.76       4.139       30.67  
    TZ          215      1.907    814.63      2.8269      13.758  
    UA        12955     8.3939    1046.6      3.9742      28.187  
    US        13666     6.8027    760.83      4.6905      47.975  
    WN        15749     5.4581    562.49      4.0439      30.403  
    XE         2294     8.8082    1410.1      3.7114      23.235  
    YV          827     12.376    2192.6      3.9315      26.446  
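The conversion step can be tried on a tiny stand-in for the readall output. In this sketch the keys and structure values are made up, and each structure has only two fields to keep it short.

```matlab
% Made-up stand-in for the Key and Value columns returned by readall.
keys = {'UA'; 'AA'};
vals = {struct('Count', 2, 'Mean', 5); struct('Count', 3, 'Mean', 20)};

% cell2mat turns the cell array of scalar structures into a struct array,
% which struct2table converts to one table row per structure.
t = struct2table(cell2mat(vals), 'RowNames', keys);

% Sort the rows alphabetically by row name.
t = sortrows(t, 'RowNames');
```

After sorting, the row names are {'AA'; 'UA'} and t.Count is [3; 2], so each row's statistics travel with its key.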

Change Grouping Parameter

The use of an anonymous function to pass in the grouping variable allows you to quickly recalculate the statistics with a different grouping.

For this example, recalculate the statistics grouped by Month instead of by carrier ID. To do so, pass 'Month' as the grouping variable name in the anonymous function.

outds2 = mapreduce(ds, ...
    @(data,info,kvs)statsByGroupMapper(data,info,kvs,'Month'), ...
    @statsByGroupReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce  17%
Map 100% Reduce  33%
Map 100% Reduce  50%
Map 100% Reduce  67%
Map 100% Reduce  83%
Map 100% Reduce 100%
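Switching the grouping works because the mapper looks up the grouping column by name with data.(groupVarName). The sketch below shows the same lookup on a small made-up table. The helper groupMean is hypothetical, and it uses findgroups and splitapply in place of the mapper's unique and accumarray calls to keep the example short.

```matlab
% Made-up table with the same variable names as the datastore.
data = table([10; 10; 11; 11; 11], {'PS'; 'TW'; 'PS'; 'PS'; 'TW'}, ...
    [8; 21; 4; 13; 3], 'VariableNames', {'Month', 'UniqueCarrier', 'ArrDelay'});

% Hypothetical helper: mean delay per group, with the grouping column
% selected by name through a dynamic variable reference.
groupMean = @(tbl, groupVarName) ...
    splitapply(@mean, tbl.ArrDelay, findgroups(tbl.(groupVarName)));

byMonth = groupMean(data, 'Month');            % groups ordered 10, 11
byCarrier = groupMean(data, 'UniqueCarrier');  % groups ordered 'PS', 'TW'
```

Only the name passed in changes between the two calls, which mirrors how the two mapreduce calls in this example differ only in the fourth argument to statsByGroupMapper.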

Read the final results and organize them into a table.

r2 = readall(outds2);
r2 = sortrows(r2,'Key');
statsByMonth = struct2table(cell2mat(r2.Value));
mon = {'Jan','Feb','Mar','Apr','May','Jun', ...
       'Jul','Aug','Sep','Oct','Nov','Dec'};
statsByMonth.Properties.RowNames = mon
statsByMonth =

  12x5 table

           Count     Mean     Variance    Skewness    Kurtosis
           _____    ______    ________    ________    ________

    Jan     9870    8.5954    973.69      4.1142      35.152  
    Feb     9160    7.3275    911.14      4.7241       45.03  
    Mar    10219    7.5536    976.34      5.1678      63.155  
    Apr     9949    6.0081    1077.4      8.9506      170.52  
    May    10180    5.2949    737.09      4.0535      30.069  
    Jun    10045    10.264    1266.1      4.8777        43.5  
    Jul    10340    8.7797    1069.7      5.1428      64.896  
    Aug    10470    7.4522    908.64      4.1959       29.66  
    Sep     9691    3.6308    664.22      4.6573      38.964  
    Oct    10590    4.6059    684.94      5.6407      74.805  
    Nov    10071    5.2835    808.65      8.0297      186.68  
    Dec    10281    10.571    1087.6      3.8564      28.823  
