MATLAB Examples

# Tall Skinny QR (TSQR) Matrix Factorization Using MapReduce

This example shows how to compute a tall skinny QR (TSQR) factorization using mapreduce. It demonstrates how to chain mapreduce calls to perform multiple iterations of factorizations, and uses the info argument of the map function to compute numeric keys.

## Prepare Data

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, the variables of interest are ArrDelay (flight arrival delay), DepDelay (flight departure delay) and Distance (total flight distance).

ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA'); ds.ReadSize = 1000; ds.SelectedVariableNames = {'ArrDelay', 'DepDelay', 'Distance'}; 

The datastore treats 'NA' values as missing and replaces the missing values with NaN values by default. The ReadSize property lets you specify how to partition the data into chunks. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.

preview(ds) 
ans = 8x3 table ArrDelay DepDelay Distance ________ ________ ________ 8 12 308 8 1 296 21 20 480 13 12 296 4 -1 373 59 63 308 3 -2 447 11 -1 954 

## Chain MapReduce Calls

The implementation of the multi-iteration TSQR algorithm needs to chain consecutive mapreduce calls. To demonstrate the general chaining design pattern, this example uses two mapreduce iterations. The output from the map function calls is passed into a large set of reducers, and then the output of these reducers becomes the input for the next mapreduce iteration.

## First MapReduce Iteration

In the first iteration, the map function, tsqrMapper, receives one chunk (the ith) of data, which is a table of size . The mapper computes the matrix of this chunk of data and stores it as an intermediate result. Then, mapreduce aggregates the intermediate results by unique key before sending them to the reduce function. Thus, mapreduce sends all intermediate matrices with the same key to the same reducer.

Since the reducer uses qr, which is an in-memory MATLAB function, it's best to first make sure that the matrices fit in memory. This example divides the dataset into eight partitions. The mapreduce function reads the data in chunks and passes the data along with some meta information to the map function. The info input argument is the second input to the map function and it contains the read offset and file size information that are necessary to generate the key,

  key = ceil(offset/fileSize/numPartitions).

Display the map function file.

function tsqrMapper(data, info, intermKVStore)
% Mapper function for the TSQRMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

x = data{:,:};
x(any(isnan(x),2),:) = [];% Remove missing values

[~, r] = qr(x,0);

% intermKey = randi(4); % random integer key for partitioning intermediate results
intermKey = computeKey(info, 8);

function key = computeKey(info, numPartitions)
% Helper function to generate a key for the tsqrMapper function.

fileSize = info.FileSize; % total size of the underlying data file
partitionSize = fileSize/numPartitions; % size in bytes of each partition
offset = info.Offset; % offset in bytes of the current read

key = ceil(offset/partitionSize);



The reduce function receives a list of the intermediate matrices, vertically concatenates them, and computes the matrix of the concatenated matrix.

Display the reduce function file.

function tsqrReducer(intermKey, intermValIter, outKVStore)
% Reducer function for the TSQRMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

x = [];

while (intermValIter.hasnext)
x = [x;intermValIter.getnext];
end
% Note that this approach assumes the concatenated intermediate values fit
% in memory. Consider increasing the number of reduce tasks (increasing the
% number of partitions in the tsqrMapper) and adding more iterations if it
% does not fit in memory.

[~, r] =qr(x,0);



Use mapreduce to apply the map and reduce functions to the datastore, ds.

outds1 = mapreduce(ds, @tsqrMapper, @tsqrReducer); 
******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 10% Reduce 0% Map 20% Reduce 0% Map 30% Reduce 0% Map 40% Reduce 0% Map 50% Reduce 0% Map 60% Reduce 0% Map 70% Reduce 0% Map 80% Reduce 0% Map 90% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 11% Map 100% Reduce 22% Map 100% Reduce 33% Map 100% Reduce 44% Map 100% Reduce 56% Map 100% Reduce 67% Map 100% Reduce 78% Map 100% Reduce 89% Map 100% Reduce 100% 

mapreduce returns an output datastore, outds1, with files in the current folder.

## Second MapReduce Iteration

The second iteration uses the output of the first iteration, outds1, as its input. This iteration uses an identity mapper, identityMapper, which simply copies over the data using a single key, 'Identity'.

Display the identity mapper file.

function identityMapper(data, info, intermKVStore)
% Mapper function for the MapReduce TSQR example.
%
% This mapper function simply copies the data and add them to the
% intermKVStore as intermediate values.

% Copyright 2014 The MathWorks, Inc.

x = data.Value{:,:};



The reducer function is the same in both iterations. The use of a single key by the map function means that mapreduce only calls the reduce function once in the second iteration.

Display the reduce function file.

function tsqrReducer(intermKey, intermValIter, outKVStore)
% Reducer function for the TSQRMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

x = [];

while (intermValIter.hasnext)
x = [x;intermValIter.getnext];
end
% Note that this approach assumes the concatenated intermediate values fit
% in memory. Consider increasing the number of reduce tasks (increasing the
% number of partitions in the tsqrMapper) and adding more iterations if it
% does not fit in memory.

[~, r] =qr(x,0);



Use mapreduce to apply the identity mapper and the same reducer to the output from the first mapreduce call.

outds2 = mapreduce(outds1, @identityMapper, @tsqrReducer); 
******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 12% Reduce 0% Map 25% Reduce 0% Map 37% Reduce 0% Map 50% Reduce 0% Map 62% Reduce 0% Map 75% Reduce 0% Map 87% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 100% 

## View Results

Read the final results from the output datastore.

r = readall(outds2); r.Value{:} 
ans = 1.0e+05 * 0.1091 0.0893 0.5564 0 -0.0478 -0.4890 0 0 3.0130 

## Reference

1. Paul G. Constantine and David F. Gleich. 2011. Tall and skinny QR factorizations in MapReduce architectures. In Proceedings of the Second International Workshop on MapReduce and Its Applications (MapReduce '11). ACM, New York, NY, USA, 43-50. DOI=10.1145/1996092.1996103 http://doi.acm.org/10.1145/1996092.1996103