This example shows how to find the maximum value of a single variable in a data set using mapreduce. It demonstrates the simplest use of mapreduce, since there is only one key and minimal computation.
Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, select ArrDelay (flight arrival delay) as the variable of interest.
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = 'ArrDelay';
The datastore treats 'NA' values as missing and replaces the missing values with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the selected variable of interest, which you can verify using preview.
preview(ds)
ans = 8x1 table
    ArrDelay
    ________
        8
        8
       21
       13
        4
       59
        3
       11
The mapreduce function requires a map function and a reduce function as inputs. The mapper receives chunks of data and outputs intermediate results. The reducer reads the intermediate results and produces a final result.
In this example, the mapper finds the maximum arrival delay in each chunk of data. The mapper then stores these maximum values as the intermediate values associated with the key 'PartialMaxArrivalDelay'.
Display the map function file.
function maxArrivalDelayMapper(data, info, intermKVStore)
% Mapper function for the MaxMapreduceExample.

% Copyright 1984-2014 The MathWorks, Inc.

% Data is an n-by-1 table of the ArrDelay. As the data source is tabular,
% the return of read is a table object.
partMax = max(data.ArrDelay);
add(intermKVStore, 'PartialMaxArrivalDelay', partMax);
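The mapper does not filter out the NaN values that the datastore substitutes for 'NA', because max skips NaN by default. A minimal sketch of that behavior, using made-up delay values rather than the real data:

```matlab
% Made-up chunk of arrival delays; NaN stands in for a missing 'NA' entry.
chunk = [8; NaN; 21; 13];

% max ignores NaN by default, so the missing entry does not affect the result.
partMax = max(chunk);

% If every value in a chunk is missing, max returns NaN for that chunk.
allMissing = max([NaN; NaN]);

% The reducer's running comparison also ignores a NaN partial maximum.
runningMax = max(allMissing, -inf);
```

Here partMax is 21, allMissing is NaN, and runningMax stays at -Inf, so a chunk with no valid delays leaves the overall maximum unchanged.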
The reducer receives a list of the maximum arrival delays for each chunk and finds the overall maximum arrival delay from the list of values. mapreduce only calls this reducer once, since the mapper only adds a single unique key. The reducer uses add to add a final key-value pair to the output.
Display the reduce function file.
function maxArrivalDelayReducer(intermKey, intermValIter, outKVStore)
% Reducer function for the MaxMapreduceExample.

% Copyright 2014 The MathWorks, Inc.

% intermKey is 'PartialMaxArrivalDelay'. intermValIter is an iterator of
% all values that have the key 'PartialMaxArrivalDelay'.
maxVal = -inf;
while hasnext(intermValIter)
    maxVal = max(getnext(intermValIter), maxVal);
end

% The key-value pair added to outKVStore will become the output of mapreduce
add(outKVStore, 'MaxArrivalDelay', maxVal);
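To see what the mapper and reducer compute together, the same calculation can be sketched without mapreduce by reading the datastore one chunk at a time. This is only an illustration of the data flow, not how mapreduce is implemented; the variable names partialMax and overallMax are introduced here for the sketch.

```matlab
% Same datastore setup as in the example.
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = 'ArrDelay';

% "Map" step: one partial maximum per chunk returned by read.
partialMax = [];
reset(ds);
while hasdata(ds)
    chunk = read(ds);                          % table with one variable, ArrDelay
    partialMax(end+1) = max(chunk.ArrDelay);   %#ok<SAGROW>
end

% "Reduce" step: combine the partial maxima into a single value.
overallMax = max(partialMax);
```

The loop plays the role of the mapper calls, and the final max plays the role of the single reducer call for the key 'PartialMaxArrivalDelay'.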
Use mapreduce to apply the map and reduce functions to the datastore, ds.
maxDelay = mapreduce(ds, @maxArrivalDelayMapper, @maxArrivalDelayReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  16% Reduce   0%
Map  32% Reduce   0%
Map  48% Reduce   0%
Map  65% Reduce   0%
Map  81% Reduce   0%
Map  97% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%
mapreduce returns a datastore, maxDelay, with files in the current folder.
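If writing result files to the current folder is not desirable, mapreduce accepts name-value arguments to redirect them. The sketch below assumes the 'OutputFolder' and 'Display' arguments behave as documented for your MATLAB release, and that the two function files shown earlier are on the path; the temporary folder is a hypothetical choice.

```matlab
% Assumes maxArrivalDelayMapper.m and maxArrivalDelayReducer.m are on the path.
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = 'ArrDelay';

% Hypothetical output location; any writable folder can be used instead.
outFolder = tempname;
mkdir(outFolder);

% Write the result files to outFolder and suppress the progress display.
maxDelay = mapreduce(ds, @maxArrivalDelayMapper, @maxArrivalDelayReducer, ...
    'OutputFolder', outFolder, 'Display', 'off');
```

The returned datastore then points at the files in outFolder instead of the current folder.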
Read the final result from the output datastore, maxDelay.
ans = 1x2 table
           Key            Value
    _________________    ______
    'MaxArrivalDelay'
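One way to obtain this table and extract the numeric value is with readall. The sketch assumes maxDelay is the output datastore returned by the mapreduce call above; the names result and maxArrDelay are introduced here for illustration.

```matlab
% Assumes maxDelay is the datastore returned by mapreduce in this example.
result = readall(maxDelay);       % table with the variables Key and Value

% Value is stored as a cell array, so index with braces to get the number.
maxArrDelay = result.Value{1};
fprintf('Maximum arrival delay: %g\n', maxArrDelay);
```

Because the reducer adds only one key-value pair, the table has a single row, and indexing the first element of Value is enough.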