MATLAB Examples

Using MapReduce to Fit a Logistic Regression Model

This example shows how to use mapreduce to carry out simple logistic regression using a single predictor. It demonstrates chaining multiple mapreduce calls to carry out an iterative algorithm. Since each iteration requires a separate pass through the data, an anonymous function passes information from one iteration to the next to supply information directly to the mapper.

Contents

Prepare Data

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, the variables of interest are ArrDelay (flight arrival delay) and Distance (total flight distance).

ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.SelectedVariableNames = {'ArrDelay', 'Distance'};

The datastore treats 'NA' values as missing, and replaces the missing values with NaN values by default. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.

preview(ds)
ans =

  8x2 table

    ArrDelay    Distance
    ________    ________

     8          308     
     8          296     
    21          480     
    13          296     
     4          373     
    59          308     
     3          447     
    11          954     

Perform Logistic Regression

Logistic regression is a way to model the probability of an event as a function of another variable. In this example, logistic regression models the probability of a flight being more than 20 minutes late as a function of the flight distance, in thousands of miles.

To accomplish this logistic regression, the map and reduce functions must collectively perform a weighted least-squares regression based on the current coefficient values. The mapper computes a weighted sum of squares and cross product for each chunk of input data.

Display the map function file.

function logitMapper(b,t,~,intermKVStore)
%logitMapper Mapper function for mapreduce to perform logistic regression.

% Copyright 2014 The MathWorks, Inc.

% Get data input table and remove any rows with missing values
y = t.ArrDelay;
x = t.Distance;
t = ~isnan(x) & ~isnan(y);
y = y(t)>20;                 % late by more than 20 min
x = x(t)/1000;               % distance in thousands of miles

% Compute the linear combination of the predictors, and the estimated mean
% probabilities, based on the coefficients from the previous iteration
if ~isempty(b)
    % Compute xb as the linear combination using the current coefficient
    % values, and derive mean probabilities mu from them
    xb = b(1)+b(2)*x;
    mu = 1./(1+exp(-xb));
else
    % This is the first iteration. Compute starting values for mu that are
    % 1/4 if y=0 and 3/4 if y=1. Derive xb values from them.
    mu = (y+.5)/2;
    xb = log(mu./(1-mu)); 
end

% We want to perform weighted least squares. We do this by computing a sum
% of squares and cross products matrix
%      (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
% where X = X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
%
% Here in the mapper we receive one chunk at a time, so we compute one of
% the terms on the right hand side. The reducer will add them up to get the
% quantity on the left hand side, and then peform the regression.
w = (mu.*(1-mu));                  % weights
z = xb + (y - mu) .* 1./w;         % adjusted response

X = [ones(size(x)),x,z];           % matrix of unweighted data
wss = X' * bsxfun(@times,w,X);     % weighted cross-products X1'*W1*X1

% Store the results for this part of the data.
add(intermKVStore, 'key', wss);

The reducer computes the regression coefficient estimates from the sums of squares and cross products.

Display the reduce function file.

function logitReducer(~,intermValIter,outKVStore)
%logitReducer Reducer function for mapreduce to perform logistic regression

% Copyright 2014 The MathWorks, Inc.

% We will operate over chunks of the data, updating the count, mean, and
% covariance each time we add a new chunk
old = 0;

% We want to perform weighted least squares. We do this by computing a sum
% of squares and cross products matrix
%      M = (X'*W*X) = (X1'*W1*X1) + (X2'*W2*X2) + ... + (Xn'*Wn*Xn)
% where X = X1;X2;...;Xn]  and  W = [W1;W2;...;Wn].
%
% The mapper has computed the terms on the right hand side. Here in the
% reducer we just add them up.

while hasnext(intermValIter)
    new = getnext(intermValIter);
    old = old+new;
end
M = old;  % the value on the left hand side

% Compute coefficients estimates from M. M is a matrix of sums of squares
% and cross products for [X Y] where X is the design matrix including a
% constant term and Y is the adjusted response for this iteration. In other
% words, Y has been included as an additional column of X. First we
% separate them by extracting the X'*W*X part and the X'*W*Y part.
XtWX = M(1:end-1,1:end-1);
XtWY = M(1:end-1,end);

% Solve the normal equations.
b = XtWX\XtWY;

% Return the vector of coefficient estimates.
add(outKVStore, 'key', b);

Run MapReduce

Run mapreduce iteratively by enclosing the calls to mapreduce in a loop. The loop runs until the convergence criteria are met, with a maximum of five iterations.

% Define the coefficient vector, starting as empty for the first iteration.
b = [];

for iteration = 1:5
    b_old = b;
    iteration

    % Here we will use an anonymous function as our mapper. This function
    % definition includes the value of b computed in the previous
    % iteration.
    mapper = @(t,ignore,intermKVStore) logitMapper(b,t,ignore,intermKVStore);
    result = mapreduce(ds, mapper, @logitReducer, 'Display', 'off');

    tbl = readall(result);
    b = tbl.Value{1}

    % Stop iterating if we have converged.
    if ~isempty(b_old) && ...
       ~any(abs(b-b_old) > 1e-6 * abs(b_old))
       break
    end
end
iteration =

     1


b =

   -1.7674
    0.1209


iteration =

     2


b =

   -1.8327
    0.1807


iteration =

     3


b =

   -1.8331
    0.1806


iteration =

     4


b =

   -1.8331
    0.1806

View Results

Use the resulting regression coefficient estimates to plot a probability curve. This curve shows the probability of a flight being more than 20 minutes late as a function of the flight distance.

xx = linspace(0,4000);
yy = 1./(1+exp(-b(1)-b(2)*(xx/1000)));
plot(xx,yy);
xlabel('Distance');
ylabel('Prob[Delay>20]')