
matlab.io.datastore.HadoopFileBased class

Package: matlab.io.datastore

Add Hadoop file support to datastore

Description

matlab.io.datastore.HadoopFileBased is an abstract mixin class that adds Hadoop® support to your custom datastore.

To use this mixin class, you must subclass from the matlab.io.datastore.HadoopFileBased class in addition to subclassing from the matlab.io.Datastore base class. Type the following syntax as the first line of your class definition file:

classdef MyDatastore < matlab.io.Datastore & ...
                             matlab.io.datastore.HadoopFileBased 
    ...
end

To add support for Hadoop to your custom datastore, you must implement these methods:

  • getLocation

  • initializeDatastore

  • isfullfile

For more details and steps to create your custom datastore with support for Hadoop, see Develop Custom Datastore.

Methods

matlab.io.datastore.HadoopFileBased.getLocation         Location of files in Hadoop
matlab.io.datastore.HadoopFileBased.initializeDatastore Initialize datastore with information from Hadoop
matlab.io.datastore.HadoopFileBased.isfullfile          Check if datastore reads full files
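As a minimal sketch, the three methods take the following shape in a subclass. The class name and FileSet property here are placeholders, and the core matlab.io.Datastore methods (read, hasdata, reset, progress) are omitted for brevity:

```matlab
classdef MyHadoopDatastore < matlab.io.Datastore & ...
        matlab.io.datastore.HadoopFileBased
    properties
        FileSet matlab.io.datastore.DsFileSet
    end
    methods
        % Hadoop calls this on each worker with split information;
        % rebuild the internal file set from hadoopInfo here.
        function initializeDatastore(myds,hadoopInfo)
            myds.FileSet = matlab.io.datastore.DsFileSet(hadoopInfo);
            reset(myds);
        end

        % Return the location of the files, for example a DsFileSet object.
        function loc = getLocation(myds)
            loc = myds.FileSet;
        end

        % Return true if the datastore reads whole files,
        % false if it reads chunks (splits) of files.
        function tf = isfullfile(~)
            tf = false;
        end
    end
end
```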

Attributes

Sealed    false

To learn about attributes of classes, see Class Attributes.

Examples

Build Datastore with Hadoop Support

Implement a datastore with parallel processing and Hadoop support, and use it to bring your data from the Hadoop server into MATLAB®. Then use the tall and gather functions on this data.

Implement your custom datastore in your working folder or in a folder that is on the MATLAB path. Then create a new script, MyDatastoreHadoop.m, that contains the code implementing your custom datastore. The name of the script file must be the same as the name of your object constructor function. For example, if you want your constructor function to have the name MyDatastoreHadoop, then the name of the script file must be MyDatastoreHadoop.m. The script must contain these steps:

  • Step 1: Inherit from the datastore classes.

  • Step 2: Define the constructor and the required methods.

  • Step 3: Define your custom file reading function.

This code shows the three steps in a sample implementation of a custom datastore that can read binary files from a Hadoop server.

%% STEP 1: INHERIT FROM DATASTORE CLASSES
classdef MyDatastoreHadoop < matlab.io.Datastore & ...
        matlab.io.datastore.Partitionable & ...
        matlab.io.datastore.HadoopFileBased
    
    properties(Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end
        
%% STEP 2: DEFINE THE CONSTRUCTOR AND THE REQUIRED METHODS
    methods
        % Define your datastore constructor
        function myds = MyDatastoreHadoop(location)
            myds.FileSet = matlab.io.datastore.DsFileSet(location,...
                'FileExtensions','.bin', ...
                'FileSplitSize',8*1024);
            myds.CurrentFileIndex = 1;
            reset(myds);
        end
        
        % Define the hasdata method
        function tf = hasdata(myds)
            % Return true if more data is available
            tf = hasfile(myds.FileSet);
        end
        
        % Define the read method
        function [data,info] = read(myds)
            % Read data and information about the extracted data
            % See also: MyFileReader()
            if ~hasdata(myds)
                error('No more data');
            end
            
            fileInfoTbl = nextfile(myds.FileSet);
            data = MyFileReader(fileInfoTbl);
            info.Size = size(data);
            info.FileName = fileInfoTbl.FileName;
            info.Offset = fileInfoTbl.Offset;
            
            % Update CurrentFileIndex for tracking progress
            if fileInfoTbl.Offset + fileInfoTbl.SplitSize >= ...
                    fileInfoTbl.FileSize
                myds.CurrentFileIndex = myds.CurrentFileIndex + 1 ;
            end
        end
        
        % Define the reset method
        function reset(myds)
            % Reset to the start of the data
            reset(myds.FileSet);
            myds.CurrentFileIndex = 1;
        end
        
        % Define the progress method
        function frac = progress(myds)
            % Determine percentage of data that you have read
            % from a datastore
            frac = (myds.CurrentFileIndex-1)/myds.FileSet.NumFiles;
        end
        
        % Define the partition method
        function subds = partition(myds,n,ii)
            subds = copy(myds);
            subds.FileSet = partition(myds.FileSet,n,ii);
            reset(subds);
        end
        
        % Define the initializeDatastore method
        function initializeDatastore(myds,hadoopInfo)
            import matlab.io.datastore.DsFileSet;
            myds.FileSet = DsFileSet(hadoopInfo,...
                'FileSplitSize',myds.FileSet.FileSplitSize,...
                'IncludeSubfolders',true, ...
                'FileExtensions','.bin');
            reset(myds);
        end
        
        % Define the getLocation method
        function loc = getLocation(myds)
            loc = myds.FileSet;
        end
        
        % Define the isfullfile method
        function tf = isfullfile(~)
            tf = false;
        end
    end
        
    methods(Access = protected)
        % If you use the FileSet property in the datastore,
        % then you must define the copyElement method. The
        % copyElement method allows methods such as readall
        % and preview to remain stateless.
        function dscopy = copyElement(ds)
            dscopy = copyElement@matlab.mixin.Copyable(ds);
            dscopy.FileSet = copy(ds.FileSet);
        end
        
        % Define the maxpartitions method
        function n = maxpartitions(myds)
            n = maxpartitions(myds.FileSet);
        end
    end
end

%% STEP 3: IMPLEMENT YOUR CUSTOM FILE READING FUNCTION
function data = MyFileReader(fileInfoTbl)
% Create a reader object using the FileName
reader = matlab.io.datastore.DsFileReader(fileInfoTbl.FileName);

% Seek to the offset
seek(reader,fileInfoTbl.Offset,'Origin','start-of-file');

% Read fileInfoTbl.SplitSize amount of data
data = read(reader,fileInfoTbl.SplitSize);
end
This completes the implementation step of your custom datastore.

Next, create a datastore object using your custom datastore constructor. If your data is located at hdfs:///path_to_files, then you can use this code.

setenv('HADOOP_HOME','/path/to/hadoop/install');
ds = MyDatastoreHadoop('hdfs:///path_to_files');
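
Before moving to the cluster, you can sanity-check the datastore locally. This sketch assumes some .bin files exist at the location so that read returns data:

```matlab
% Step through the datastore and inspect what each read returns.
while hasdata(ds)
    [data,info] = read(ds);
    fprintf('%s at offset %d: %d rows\n', ...
        info.FileName, info.Offset, info.Size(1));
end
reset(ds);   % rewind to the start of the data
```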

To use tall arrays and the gather function on Apache Spark™ with a parallel cluster configuration, set mapreducer and attach MyDatastoreHadoop.m to the cluster.

mr = mapreducer(cluster);
mr.Cluster.AttachedFiles = 'MyDatastoreHadoop.m';

Create a tall array from the datastore.

t = tall(ds);

Gather the head of the tall array.

hd = gather(head(t));
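
Other deferred computations follow the same pattern. For example, assuming t is the tall array created above, you can count its rows:

```matlab
% Operations on tall arrays are deferred; gather triggers evaluation.
numRows = gather(size(t,1));
```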

Introduced in R2017b
