Documentation

This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English verison of the page.

Note: This page has been translated by MathWorks. Please click here
To view all translated materals including this page, select Japan from the country navigator on the bottom of this page.

Analyze Large Data in Database Using MapReduce

This example determines the mean arrival delay of a large set of flight data that is stored in a database using MapReduce. You can access large data sets using a DatabaseDatastore object with Database Toolbox™. After creating a DatabaseDatastore object, you can use a tall array to run algorithms on large data sets. Alternatively, you can write a MapReduce algorithm that defines the chunking and reduction of the data.

The DatabaseDatastore object does not support using a parallel pool with Parallel Computing Toolbox™ installed. To analyze data using tall arrays or run MapReduce algorithms, set the global execution environment to be the local MATLAB® session.

Create DatabaseDatastore Object

Set the global execution environment to be the local MATLAB® session.

mapreducer(0);

The file airlinesmall.csv contains a large set of flight data. Load this file into a Microsoft® SQL Server® database table airlinesmall. This table contains 123,523 records.

Using a JDBC driver, create a database connection to a Microsoft® SQL Server® database with Windows® authentication. Specify a blank user name and password. This code assumes that you are connecting to database toy_store, database server dbtb04, and port number 54317.

conn = database('toy_store','','','Vendor','Microsoft SQL Server', ...
    'Server','dbtb04','PortNumber',54317,'AuthType','Windows');

Create a DatabaseDatastore object using the database connection and SQL query to retrieve arrival-delay data from the table.

sqlquery = 'select ArrDelay from airlinesmall';

dbds = databaseDatastore(conn,sqlquery);

Define Mapper and Reducer Functions

To process large data sets in chunks, you can write your own mapper function. Here use meanArrivalDelayMapper.m to:

  • Read arrival-delay data from the DatabaseDatastore object.

  • Determine the number of delays and the total delay in the chunk.

  • Store both values in KeyValueDatastore.

The meanArrivalDelayMapper.m file contains this code.

function meanArrivalDelayMapper (data, info, intermKVStore)
% Mapper function for the MeanMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

% Data is an n-by-1 table of the ArrDelay. Remove missing value first:
data(isnan(data.ArrDelay),:) = [];

% Record the partial counts and sums and the reducer will accumulate them.
partCountSum = [length(data.ArrDelay), sum(data.ArrDelay)];
add(intermKVStore, 'PartialCountSumDelay',partCountSum);

You also can write your own reducer function. Here use meanArrivalDelayReducer.m to read intermediate values for the number of delays and the total arrival delay. Then, determine the overall mean arrival delay. mapreduce calls this reducer function only once since the mapper function adds just one key to KeyValueStore. The meanArrivalDelayMapper.m file contains this code.

function meanArrivalDelayReducer(intermKey, intermValIter, outKVStore)
% Reducer function for the MeanMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

% intermKey is 'PartialCountSumDelay'
count = 0;
sum = 0;
while hasnext(intermValIter)
   countSum = getnext(intermValIter);
   count = count + countSum(1);
   sum = sum + countSum(2);
end

meanDelay = sum/count;

% The key-value pair added to outKVStore will become the output of mapreduce 
add(outKVStore,'MeanArrivalDelay',meanDelay);

Run MapReduce Using Mapper and Reducer Functions

To determine the mean arrival delay in the flight data, run MapReduce with the DatabaseDatastore object, mapper function, and reducer function.

outds = mapreduce(dbds,@meanArrivalDelayMapper,@meanArrivalDelayReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%

Display Output from MapReduce

Read the table from the output datastore using readall.

outtab = readall(outds)
outtab = 

           Key             Value  
    __________________    ________

    'MeanArrivalDelay'    [7.1201]

The table has only one row containing one key-value pair.

Display the mean arrival delay from the table.

meanArrDelay = outtab.Value{1}
meanArrDelay =

    7.1201

Close DatabaseDatastore Object and Database Connection

close(dbds)

See Also

| | | | | |

Related Topics

External Websites

Was this topic helpful?