Extend Tall Arrays with Other Products

Products Used: Statistics and Machine Learning Toolbox™, Database Toolbox™, Parallel Computing Toolbox™, MATLAB® Parallel Server™, MATLAB Compiler™

Several toolboxes enhance the capabilities of tall arrays. These enhancements include writing machine learning algorithms, integrating with big data systems, and deploying standalone apps.

Statistics and Machine Learning

Statistics and Machine Learning Toolbox enables you to perform advanced statistical calculations on tall arrays. Capabilities include:

K-means clustering
Linear regression fitting
Grouped statistics
Classification

See Analysis of Big Data with Tall Arrays (Statistics and Machine Learning Toolbox) for more information.
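
As an illustration of the linear regression capability listed above, the following minimal sketch fits a model to a tall table. It assumes Statistics and Machine Learning Toolbox is installed. The synthetic data and the variable names x1, x2, and y are made up for this example, and a small in-memory table stands in for a real out-of-memory data set.

```matlab
% Sketch: linear regression on a tall table (synthetic, hypothetical data).
% Requires Statistics and Machine Learning Toolbox.
mapreducer(0);                       % run in the local MATLAB session

rng(0);                              % reproducible random numbers
x1 = randn(1000,1);
x2 = randn(1000,1);
y  = 3*x1 - 2*x2 + 0.1*randn(1000,1);

tt  = tall(table(x1, x2, y));        % convert the in-memory table to a tall table
mdl = fitlm(tt, 'y ~ x1 + x2');      % fit the model on the tall data

coeffs = mdl.Coefficients.Estimate;  % order: intercept, x1, x2
disp(coeffs)
```

With real data, the tall table would typically come from a datastore rather than from an in-memory table; the call to fitlm stays the same.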

Control Where Your Code Runs

When you execute calculations on tall arrays, the default execution environment is the local MATLAB session, or a local parallel pool if you have Parallel Computing Toolbox. Use the mapreducer function to change the execution environment of tall arrays when using Parallel Computing Toolbox, MATLAB Parallel Server, or MATLAB Compiler:

Parallel Computing Toolbox: Run calculations in parallel using local or cluster workers to speed up large tall array calculations. See Use Tall Arrays on a Parallel Pool (Parallel Computing Toolbox) or Process Big Data in the Cloud (Parallel Computing Toolbox) for more information.
MATLAB Parallel Server: Run tall array calculations on a cluster, including Apache® Spark™ enabled Hadoop® clusters. This can significantly reduce the execution time of very large calculations. See Use Tall Arrays on a Spark Cluster (Parallel Computing Toolbox) for more information.
MATLAB Compiler: Deploy MATLAB applications containing tall arrays as standalone apps on Apache Spark. See Spark Applications (MATLAB Compiler) for more information.

One of the benefits of developing your algorithms with tall arrays is that you only need to write the code once. You can develop your code locally, then use mapreducer to scale up and take advantage of the capabilities offered by Parallel Computing Toolbox, MATLAB Parallel Server, or MATLAB Compiler, without needing to rewrite your algorithm.
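
The write-once workflow can be sketched as follows. This is a minimal example, not a definitive pattern: the file name tall_demo.csv, the variable name Value, and the helper meanValue are hypothetical and exist only for this illustration.

```matlab
% Write a small CSV file so the sketch is self-contained (hypothetical data).
csvFile = fullfile(tempdir, 'tall_demo.csv');
writetable(table((1:100)', 'VariableNames', {'Value'}), csvFile);

% Define the algorithm once, independent of where it runs.
meanValue = @(t) gather(mean(t.Value));

% Develop and test in the local MATLAB session.
mapreducer(0);
ds = tabularTextDatastore(csvFile);
tt = tall(ds);
m  = meanValue(tt);
disp(m)

% To scale up with Parallel Computing Toolbox, select a parallel pool and
% rebuild the tall array before calling the same function:
%   mapreducer(gcp);
%   tt = tall(ds);
%   m  = meanValue(tt);
```

Only the mapreducer call and the construction of the tall array change between environments; the function that holds the algorithm is untouched.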

Note

Each tall array is bound to a single execution environment when it is constructed using tall(ds). If that execution environment is later modified or deleted, then the tall array becomes invalid.

For this reason, each time you change the execution environment you must reconstruct the tall array.
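
A minimal sketch of this rebuild step is shown below. The file name and the variable name Value are hypothetical, and the sketch assumes that canUseParallelPool is available (R2020b or later) to check whether a parallel pool can be used before switching environments.

```matlab
% Hypothetical data file, written here so the sketch is self-contained.
csvFile = fullfile(tempdir, 'tall_rebuild_demo.csv');
writetable(table((1:100)', 'VariableNames', {'Value'}), csvFile);
ds = tabularTextDatastore(csvFile);

mapreducer(0);                        % environment 1: local MATLAB session
tt = tall(ds);                        % tt is bound to this environment
total1 = gather(sum(tt.Value));

% Switch environments only if a parallel pool can be used.
if canUseParallelPool
    mapreducer(gcp);                  % environment 2: parallel pool
    tt = tall(ds);                    % rebuild the tall array for the new environment
    total2 = gather(sum(tt.Value));
else
    total2 = total1;                  % no pool available; nothing to switch
end
disp([total1 total2])
```

The datastore ds is reused across environments; only the tall array built from it has to be constructed again.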

Work with Databases

Database Toolbox enables you to create a tall table from a DatabaseDatastore that is backed by data in a database. For more information, see Analyze Large Data in Database Using Tall Arrays (Database Toolbox).

Note

DatabaseDatastore has these limitations:

DatabaseDatastore must use the local MATLAB session as the execution environment. Set this environment using the command mapreducer(0).
Standalone applications containing tall arrays that use DatabaseDatastore cannot be deployed against Apache Spark using MATLAB Compiler.
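
The following sketch shows one way these pieces might fit together. It is not runnable as-is: the data source name, credentials, SQL query, and the column name ArrDelay are all hypothetical placeholders and must be replaced with values for a real database.

```matlab
% Sketch only: data source, credentials, query, and column name are hypothetical.
mapreducer(0);                                   % DatabaseDatastore requires the local MATLAB session

conn = database('MyDataSource', 'username', 'password');
dbds = databaseDatastore(conn, 'SELECT * FROM flights');
tt   = tall(dbds);                               % tall table backed by the query results

avgDelay = gather(mean(tt.ArrDelay, 'omitnan')); % evaluate the deferred calculation

close(dbds);                                     % release the datastore
close(conn);                                     % close the database connection
```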