
Tall Array Support, Usage Notes, and Limitations

Descriptive Statistics and Visualization

Function | Notes or Limitations
geomean 
harmmean 
kurtosis 
range 
skewness 
zscore 
corr

Only 'Pearson' type is supported.

tabulate 
crosstab

The fourth output, labels, is returned as a cell array containing M unevaluated tall cell arrays, where M is the number of input grouping variables. Each unevaluated tall cell array, labels{j}, contains the labels for one grouping variable.
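As a sketch of how the labels output behaves (assuming tall grouping variables g1 and g2 already exist), each cell of labels must be gathered before its contents can be used in memory:

```matlab
% Sketch: cross-tabulate two tall grouping variables (g1 and g2 assumed tall).
[tbl,chi2,p,labels] = crosstab(g1,g2);
% labels{j} is an unevaluated tall cell array; gather it to bring the
% group labels for the j-th grouping variable into memory.
g1Labels = gather(labels{1});
```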

grpstats
  • If the input data is a tall array, then all grouping variables must also be tall and have the same number of rows as the data.

  • The 'whichstats' option cannot be specified as a function handle. In addition to the built-in options available for in-memory data, 'whichstats' can also be:

    • 'Count' — Number of non-NaNs.

    • 'NNZ' — Number of nonzeros and non-NaNs.

    • 'Kurtosis' — Compute kurtosis.

    • 'Skewness' — Compute skewness.

    • 'all-stats' — Compute all summary statistics.

  • Group order is not guaranteed to be the same as the in-memory grpstats computation.

  • Summary statistics for nonnumeric variables return NaNs.

  • grpstats always operates on the first dimension.

  • If the input is a tall table, then the output is also a tall table. However, rather than including row names, the output tall table contains an extra variable GroupLabel that contains the same information.
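As a sketch of these behaviors (assuming a tall table tt with a numeric variable Y and a grouping variable G), the tall-only statistics can be requested alongside the built-in ones, and the result gathered for inspection:

```matlab
% Sketch: grouped statistics on a tall table (tt, with variables Y and G,
% is assumed to exist). 'Count' and 'NNZ' are tall-only whichstats options.
stats = grpstats(tt,'G',{'mean','Count','NNZ'},'DataVars','Y');
% The output is a tall table; gather evaluates it. The GroupLabel variable
% carries the information that row names would hold for in-memory data.
stats = gather(stats);
```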

binScatterPlot

This function is specifically designed for visualizing large data sets. Instead of plotting millions of individual data points, which is infeasible at scale, binScatterPlot summarizes the data points into bins. This "scatter plot of bins" reveals high-level trends in the data.
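A minimal sketch, assuming tx and ty are tall numeric vectors of the same height:

```matlab
% Sketch: visualize the joint distribution of two tall vectors by binning
% rather than plotting every point (tx and ty assumed to exist).
binScatterPlot(tx,ty)
```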

ksdensity
  • Some options that require extra passes or sorting of the input data are not supported:

    • 'BoundaryCorrection'

    • 'Censoring'

    • 'Support' (support is always unbounded).

  • Uses standard deviation (instead of median absolute deviation) to compute the bandwidth.

Probability Distributions

Function | Notes or Limitations
datasample
  • datasample is useful as a precursor to plotting a random subset of a very large data set. Sampling a large data set preserves trends in the data without requiring that you plot all the data points.

  • datasample supports sampling only along the first dimension of the data.

  • For tall arrays, datasample does not support sampling with replacement. You must specify 'Replace',false. For example: datasample(data,k,'Replace',false).

  • The value of 'Weights' must be a numeric tall array of the same height as data.

  • For the syntax [Y,idx] = datasample(___), the output idx is a tall logical vector of the same height as data. The vector indicates whether each data point is included in the sample.

  • If you specify a random number stream, then the underlying generator must support multiple streams and substreams. If you do not specify a random number stream, then datasample uses the stream controlled by tallrng.
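The notes above can be sketched together (assuming tX is a tall numeric matrix with at least two columns): seed the tall stream for reproducibility, sample without replacement, and plot the gathered subset.

```matlab
% Sketch: reproducibly sample 1000 rows from a tall matrix tX (assumed
% to exist). Tall arrays require 'Replace',false.
tallrng('default')                            % seed the tall random stream
[Y,idx] = datasample(tX,1000,'Replace',false);
% idx is a tall logical vector the same height as tX marking sampled rows.
scatter(gather(Y(:,1)),gather(Y(:,2)))        % plot the in-memory subset
```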

Cluster Analysis

Function | Notes or Limitations
kmeans

Initialization is limited to the 'Start' methods listed below. Supported syntaxes:

  • idx = kmeans(X,k) performs classic k-means clustering.

  • [idx,C] = kmeans(X,k) also returns the k cluster centroid locations.

  • [idx,C,sumd] = kmeans(X,k) additionally returns the k within-cluster sums of point-to-centroid distances.

  • [___] = kmeans(___,Name,Value) specifies additional name-value pair options using any of the other syntaxes. Valid options are:

    • 'Start' — Method used to choose the initial cluster centroid positions. Value can be:

      • 'plus' (default) — Select k observations from X using a variant of the kmeans++ algorithm adapted for tall data.

      • 'sample' — Select k observations from X at random.

      • Numeric matrix — A k-by-p matrix to explicitly specify starting locations.

    • 'Options' — An options structure created using the statset function. For tall arrays, kmeans uses the fields listed here and ignores all other fields in the options structure:

      • 'Display' — Level of display. Choices are 'iter' (default), 'off', and 'final'.

      • 'MaxIter' — Maximum number of iterations. Default is 100.

      • 'TolFun' — Convergence tolerance for the within-cluster sums of point-to-centroid distances. Default is 1e-4. This option field only works with tall arrays.
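A minimal sketch of these options (assuming a tall numeric matrix tX):

```matlab
% Sketch: k-means on a tall matrix tX (assumed to exist), using the
% statset fields that tall kmeans honors; other fields are ignored.
opts = statset('Display','off','MaxIter',50,'TolFun',1e-4);
[idx,C,sumd] = kmeans(tX,3,'Start','plus','Options',opts);
% idx is a tall vector of cluster assignments for each row of tX.
```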

Regression

Function | Notes or Limitations

The loss and predict methods of these regression classes support tall arrays:

  • You can use models trained on either in-memory or tall data with these methods.

  • The loss method of CompactRegressionTree only supports one output argument.

cvpartition
  • For tall arrays, only stratified holdout partitions are supported.

  • c = cvpartition(group,'HoldOut',p) randomly partitions observations into a training set and a test set with stratification, using the class information in group. p is a scalar such that 0 < p < 1.

  • To obtain nonstratified partitions, set a uniform grouping variable from the data samples. For example, assuming X is a tall numeric array, you can use

    groups = X(:,1).*0;
    c = cvpartition(groups,'HoldOut',p)

fitlm
  • If any input argument to fitlm is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the 'Weights' and 'Exclude' name-value pairs.

  • The 'RobustOpts' name-value pair is not supported with tall arrays.

  • For tall data, fitlm returns a CompactLinearModel object that contains most of the same properties as a LinearModel object. The main difference is that the compact object is more memory efficient: it does not include properties that contain the data, or that contain an array of the same size as the data. The compact object does not contain these LinearModel properties:

    • Diagnostics

    • Fitted

    • ObservationInfo

    • ObservationNames

    • Residuals

    • Steps

    • Variables

    You can compute the residuals directly from the compact object returned by LM = fitlm(X,Y) using

    RES = Y - predict(LM,X);
    S = LM.RMSE;
    histogram(RES,linspace(-3*S,3*S,51))
    
  • If the CompactLinearModel object is missing lower-order terms that include categorical factors:

    • The plotEffects and plotInteraction methods are not supported.

    • The anova method with the 'components' option is not supported.

fitglm
  • If any input argument to fitglm is a tall array, then all of the other inputs must be tall arrays as well. This includes nonempty variables supplied with the 'Weights', 'Exclude', 'Offset', and 'BinomialSize' name-value pairs.

  • The default number of iterations is 5. You can change the number of iterations using the 'Options' name-value pair to pass in an options structure. Create an options structure using statset to specify a different value for MaxIter.

  • For tall data, fitglm returns a CompactGeneralizedLinearModel object that contains most of the same properties as a GeneralizedLinearModel object. The main difference is that the compact object is more memory efficient: it does not include properties that contain the data, or that contain an array of the same size as the data. The compact object does not contain these GeneralizedLinearModel properties:

    • Diagnostics

    • Fitted

    • Offset

    • ObservationInfo

    • ObservationNames

    • Residuals

    • Steps

    • Variables

    You can compute the residuals directly from the compact object returned by GLM = fitglm(X,Y) using

    RES = Y - predict(GLM,X);
    S = sqrt(GLM.SSE/GLM.DFE);
    histogram(RES,linspace(-3*S,3*S,51))
    
lasso
  • With tall arrays, lasso uses an algorithm based on ADMM (alternating direction method of multipliers).

  • No elastic-net support. The 'Alpha' parameter is always 1.

  • No cross-validation ('CV' parameter) support, which includes the related parameter 'MCReps'.

  • The second output FitInfo does not contain the additional fields: 'SE', 'LambdaMinMSE', 'Lambda1SE', 'IndexMinMSE', and 'Index1SE'.

  • The 'Options' parameter is not supported, since it does not contain options that apply to the ADMM algorithm. You can tune the ADMM algorithm using name-value pairs.

  • Supported name-value pairs are:

    • 'Lambda'

    • 'LambdaRatio'

    • 'NumLambda'

    • 'Standardize'

    • 'PredictorNames'

    • 'RelTol'

    • 'Weights'

  • Additional name-value pairs to control the ADMM algorithm are:

    • 'Rho' — Augmented Lagrangian parameter, ρ. Default value is automatic selection.

    • 'AbsTol' — Absolute tolerance used to determine convergence. Default value is 1e-4.

    • 'MaxIter' — Maximum number of iterations. Default value is 1e4.

    • 'B0' — Initial values for the coefficients x. Default value is a vector of zeros.

    • 'U0' — Initial values of scaled dual variable u. Default value is a vector of zeros.
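As a sketch of the supported interface (assuming tall tX and tY), the ADMM controls are passed directly as name-value pairs rather than through 'Options':

```matlab
% Sketch: lasso on tall data (tX and tY assumed to exist). The tall
% implementation always uses ADMM with 'Alpha' fixed at 1; the last three
% pairs shown are the ADMM tuning parameters with their documented defaults.
[B,FitInfo] = lasso(tX,tY,'Lambda',1e-2,'Standardize',true, ...
    'AbsTol',1e-4,'MaxIter',1e4,'RelTol',1e-4);
```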

fitrlinear
  • Supported syntaxes for tall arrays X and Y are:

    • obj = fitrlinear(X,Y)

    • obj = fitrlinear(X,Y,Name,Value)

  • Some name-value pairs have different defaults and values compared to the in-memory fitrlinear function. Supported name-value pairs, and any differences, are:

    • 'Epsilon'

    • 'ObservationsIn' — Supports only 'rows'.

    • 'Lambda' — Can be 'auto' (default) or a scalar.

    • 'Learner'

    • 'Regularization' — Supports only 'ridge'.

    • 'Solver' — Supports only 'lbfgs'.

    • 'Verbose' — Default value is 1.

    • 'Beta'

    • 'Bias'

    • 'FitBias' — Supports only true.

    • 'Weights' — Value must be a tall array.

    • 'HessianHistorySize'

    • 'BetaTolerance' — Default value is relaxed to 1e-3.

    • 'GradientTolerance' — Default value is relaxed to 1e-3.

    • 'IterationLimit' — Default value is relaxed to 20.

  • For tall arrays fitrlinear implements LBFGS by distributing the calculation of the loss and the gradient among different parts of the tall array at each iteration. Other solvers are not available for tall arrays.

    When initial values for Beta and Bias are not given, fitrlinear first refines the initial estimates of the parameters by fitting the model locally to parts of the data and combining the coefficients by averaging.
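A minimal sketch of a tall fit (assuming tall tX and tY), spelling out the relaxed tall defaults; only the ridge/LBFGS path is available:

```matlab
% Sketch: linear regression on tall data (tX and tY assumed to exist).
% 'Regularization' and 'Solver' shown are the only supported values; the
% tolerance and iteration settings repeat the relaxed tall defaults.
mdl = fitrlinear(tX,tY,'Learner','leastsquares','Lambda','auto', ...
    'Regularization','ridge','Solver','lbfgs', ...
    'BetaTolerance',1e-3,'GradientTolerance',1e-3,'IterationLimit',20);
```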

Classification

Function | Notes or Limitations

The predict, loss, margin, and edge methods of these classification classes support tall arrays:

  • You can use models trained on either in-memory or tall data with these methods.

  • The loss method of CompactClassificationTree only supports one output argument.

The resume method of ClassificationKernel supports tall arrays.

  • You can use models trained on either in-memory or tall data.

  • The 'IterationLimit' name-value pair argument has a different default compared to the in-memory resume function. The default value is relaxed to 20.

  • resume uses a block-wise strategy. For details, see Algorithms of fitckernel.

fitcdiscr
  • Supported name-value pairs are:

    • 'ClassNames'

    • 'Cost'

    • 'DiscrimType'

    • 'PredictorNames'

    • 'Prior'

    • 'ResponseName'

    • 'ScoreTransform'

    • 'Weights'

  • For tall arrays and tall tables, fitcdiscr returns a CompactClassificationDiscriminant object, which contains most of the same properties as a ClassificationDiscriminant object. The main difference is that the compact object is more memory efficient: it does not include properties that contain the data, or that contain an array of the same size as the data. The compact object does not contain these ClassificationDiscriminant properties:

    • ModelParameters

    • NumObservations

    • ParameterOptimizationResults

    • RowsUsed

    • XCentered

    • W

    • X

    • Y

    Additionally, the compact object does not support these ClassificationDiscriminant methods:

    • compact

    • crossval

    • cvshrink

    • resubEdge

    • resubLoss

    • resubMargin

    • resubPredict

fitckernel

  • Some name-value pairs have different defaults compared to the in-memory fitckernel function.

    • 'Verbose' — Default value is 1.

    • 'BetaTolerance' — Default value is relaxed to 1e-3.

    • 'GradientTolerance' — Default value is relaxed to 1e-5.

    • 'IterationLimit' — Default value is relaxed to 20.

  • If 'KernelScale' is 'auto', then fitckernel uses the random stream controlled by tallrng for subsampling. For reproducibility, you must set a random number seed for both the global stream and the random stream controlled by tallrng.

  • If 'Lambda' is 'auto', then fitckernel might take an extra pass through the data to calculate the number of observations in X.

  • fitckernel uses a block-wise strategy. For details, see Algorithms.
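The reproducibility note above can be sketched as follows (assuming tall tX and tY): both streams must be seeded before the fit.

```matlab
% Sketch: kernel classification on tall data (tX and tY assumed to exist).
% With 'KernelScale','auto', reproducibility requires seeding both streams.
rng(1)       % global stream
tallrng(1)   % tall stream, used for subsampling
mdl = fitckernel(tX,tY,'KernelScale','auto','Lambda','auto');
```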

fitclinear
  • Supported syntaxes for tall arrays X and Y are:

    • obj = fitclinear(X,Y)

    • obj = fitclinear(___,Name,Value)

  • Some name-value pairs have different defaults compared to the in-memory fitclinear function. Supported name-value pairs, and any differences, are:

    • 'ObservationsIn' — Supports only 'rows'.

    • 'Lambda' — Can be 'auto' (default) or a scalar.

    • 'Learner'

    • 'Regularization' — Supports only 'ridge'.

    • 'Solver' — Supports only 'lbfgs'.

    • 'FitBias' — Supports only true.

    • 'Verbose' — Default value is 1.

    • 'Beta'

    • 'Bias'

    • 'ClassNames'

    • 'Cost'

    • 'Prior'

    • 'Weights' — Value must be a tall array.

    • 'HessianHistorySize'

    • 'BetaTolerance' — Default value is relaxed to 1e-3.

    • 'GradientTolerance' — Default value is relaxed to 1e-3.

    • 'IterationLimit' — Default value is relaxed to 20.

  • For tall arrays, fitclinear implements LBFGS by distributing the calculation of the loss and gradient among different parts of the tall array at each iteration. Other solvers are not available for tall arrays.

    When initial values for Beta and Bias are not given, fitclinear refines the initial estimates of the parameters by fitting the model locally to parts of the data and combining the coefficients by averaging.

fitcnb
  • Supported syntaxes are:

    • discr = fitcnb(Tbl,Y)

    • discr = fitcnb(X,Y)

    • discr = fitcnb(___,Name,Value)

  • Options related to kernel densities, cross-validation, and hyperparameter optimization are not supported. The supported name-value pairs are:

    • 'DistributionNames' — The 'kernel' value is not supported.

    • 'CategoricalPredictors'

    • 'PredictorNames'

    • 'ResponseName'

    • 'ScoreTransform'

    • 'Weights' — Value must be a tall array.

fitctree
  • Supported syntaxes for tall arrays are:

    • tree = fitctree(Tbl,Y)

    • tree = fitctree(X,Y)

    • tree = fitctree(___,Name,Value)

  • Supported name-value pairs are:

    • 'AlgorithmForCategorical'

    • 'CategoricalPredictors'

    • 'ClassNames'

    • 'MaxNumCategories'

    • 'MaxNumSplits'

    • 'MergeLeaves'

    • 'MinLeafSize'

    • 'MinParentSize'

    • 'NumVariablesToSample'

    • 'PredictorNames'

    • 'ResponseName'

    • 'ScoreTransform'

    • 'SplitCriterion'

    • 'Weights'

  • There is an additional name-value pair specific to tall arrays:

    • 'MaxDepth' — A positive integer specifying the maximum depth of the output tree. Specify a value for this parameter to return a tree with fewer levels that requires fewer passes through the tall array to compute. Generally, the algorithm of fitctree takes one pass through the data and an additional pass for each tree level. There is no maximum tree depth by default.
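As a sketch of the tall-specific option (assuming tall tX and tY), capping the depth bounds the number of passes through the data, since each additional tree level costs roughly one more pass:

```matlab
% Sketch: classification tree on tall data (tX and tY assumed to exist).
% 'MaxDepth' is tall-only; a depth of 6 here limits the fit to roughly
% seven passes through the data (one pass plus one per level).
tree = fitctree(tX,tY,'MaxDepth',6,'MinLeafSize',100);
```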

TreeBagger
  • Supported syntaxes for tall X, Y, Tbl are:

    • B = TreeBagger(NumTrees,Tbl,Y)

    • B = TreeBagger(NumTrees,X,Y)

    • B = TreeBagger(___,Name,Value)

  • For tall arrays, TreeBagger supports classification. Regression is not supported.

  • Supported name-value pairs are:

    • 'NumPredictorsToSample' — Default value is the square root of the number of variables for classification.

    • 'MinLeafSize' — Default value is 1 if the number of observations is less than 50,000. If the number of observations is larger than 50,000, then the default value is max(1,min(5,floor(0.01*NobsChunk))).

    • 'ChunkSize' (only for tall arrays) — Default value is 50000.

    In addition, TreeBagger supports these optional arguments of fitctree:

    • 'AlgorithmForCategorical'

    • 'CategoricalPredictors'

    • 'MaxNumCategories'

    • 'MergeLeaves'

    • 'PredictorNames'

    • 'PredictorSelection'

    • 'Prune'

    • 'PruneCriterion'

    • 'Surrogate'

    • 'MaxNumSplits'

    • 'SplitCriterion'

  • For tall data, TreeBagger returns a CompactTreeBagger object that contains most of the same properties as a full TreeBagger object. The main difference is that the compact object is more memory efficient. The compact object does not include properties that include the data, or that include an array of the same size as the data.

  • Supported CompactTreeBagger methods are:

    • combine

    • error

    • margin

    • meanMargin

    • predict

    • setDefaultYfit

    The error, margin, meanMargin, and predict methods do not support the name-value pairs 'Trees', 'TreeWeights', or 'UseInstanceForTree'. The error and meanMargin methods additionally do not support 'Weights'.

  • TreeBagger creates a random forest by generating trees on disjoint chunks of the data. When more data is available than is required to create the random forest, the data is subsampled. For a similar example, see Random Forests for Big Data (Genuer, Poggi, Tuleau-Malot, Villa-Vialaneix 2015).

    Depending on how the data is stored, it is possible that some chunks of data contain observations from only a few classes out of all the classes. In this case, TreeBagger might produce inferior results compared to the case where each chunk of data contains observations from most of the classes.
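A minimal sketch of a tall ensemble (assuming tall tX and a tall class label vector tY; regression is not available for tall data):

```matlab
% Sketch: bagged classification trees on tall data (tX and tY assumed
% to exist). 'ChunkSize' is the tall-only option; 50000 is its default.
B = TreeBagger(20,tX,tY,'ChunkSize',50000,'MinLeafSize',5);
% For tall data, B is a CompactTreeBagger; predict is among its
% supported methods.
labels = predict(B,tX);
```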

Dimensionality Reduction

Function | Notes or Limitations
pcacov, factoran

pcacov and factoran do not work directly on tall arrays. Instead, use C = gather(cov(X)) to compute the covariance matrix of a tall array. Then, you can use pcacov or factoran on the in-memory covariance matrix. Alternatively, you can use pca directly on a tall array.
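The workaround described above can be sketched as follows (assuming tX is a tall numeric matrix):

```matlab
% Sketch: principal components of a tall matrix tX (assumed to exist)
% via the in-memory pcacov. cov(tX) is tall; gather brings it into memory.
C = gather(cov(tX));
[coeff,latent,explained] = pcacov(C);
```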

pca
  • pca works directly with tall arrays by computing the covariance matrix and using the in-memory pcacov function to compute the principal components.

  • Supported syntaxes are:

    • coeff = pca(X)

    • [coeff,score,latent] = pca(X)

    • [coeff,score,latent,explained] = pca(X)

    • [coeff,score,latent,tsquared] = pca(X)

    • [coeff,score,latent,tsquared,explained] = pca(X)

  • Name-value pair arguments are not supported.
