How to work with huge and fast data sets
Big data refers to the dramatic increase in the amount and rate of data being created and made available for analysis.
A primary driver of this trend is the ever-increasing digitization of information. The number and types of acquisition devices and other data generation mechanisms are growing all the time.
Big data sources include streaming data from instrumentation sensors, satellite and medical imagery, video from security cameras, as well as data derived from financial markets and retail operations. Big data sets from these sources can contain gigabytes or terabytes of data, and may grow on the order of megabytes or gigabytes per day.
Big data represents an opportunity for analysts and data scientists to gain greater insight and to make more informed decisions, but it also presents a number of challenges. Big data sets may not fit into available memory, may take too long to process, or may stream too quickly to store. Standard algorithms are usually not designed to process big data sets in reasonable amounts of time or memory. There is no single approach to big data. Therefore, MATLAB provides a number of tools to tackle these challenges.
Working with Big Data in MATLAB
- 64-bit Computing. The 64-bit version of MATLAB drastically increases the amount of data you can hold in memory, typically up to 2000 times more than any 32-bit program. While 32-bit programs limit you to addressing only 2 GB of memory, 64-bit MATLAB lets you address up to the physical memory limits of the OS. For example, 64-bit Windows 8 supports up to 512 GB in its Pro and Enterprise editions, and Windows Server 2012 supports up to 4 TB.
- Memory Mapped Variables. The memmapfile function in MATLAB lets you map a file, or a portion of a file, to a MATLAB variable in memory. This allows you to efficiently access big data sets on disk that are too large to hold in memory or that take too long to load.
- Disk Variables. The matfile function lets you access MATLAB variables directly from MAT-files on disk, using MATLAB indexing commands, without loading the full variables into memory. This allows you to do block processing on big data sets that are otherwise too large to fit in memory.
- Datastore. Use the datastore function to access data that doesn’t fit into memory. This includes data from files, collections of files or, in conjunction with Database Toolbox, database tables. The datastore function allows you to define the data you want to import from your files or database tables, define the format to apply to your imported data, and manage the incremental import of your data, providing a means to iterate over big data sets using only a while loop.
- Intrinsic Multicore Math. Many of the built-in mathematical functions in MATLAB, such as eig, are multithreaded. By running in parallel, these functions take full advantage of the multiple cores of your computer, providing high-performance computation on big data sets.
- GPU Computing. If you’re working with GPUs, GPU-optimized mathematical functions in Parallel Computing Toolbox provide even higher performance for big data sets.
- Parallel Computing. Parallel Computing Toolbox provides a parallel for-loop (parfor) that runs your MATLAB code and algorithms in parallel on multicore computers. If you use MATLAB Distributed Computing Server, you can execute in parallel on clusters of machines that can scale up to thousands of computers.
- Cloud Computing. You can run MATLAB computations in parallel using MATLAB Distributed Computing Server on Amazon Elastic Compute Cloud (EC2) for on-demand parallel processing on hundreds or thousands of computers. Cloud computing lets you process big data without having to buy or maintain your own cluster or data center.
- Distributed Arrays. Using Parallel Computing Toolbox and MATLAB Distributed Computing Server, you can work with matrices and multidimensional arrays that are distributed across the memory of a cluster of computers. Using this approach, you can store and perform computations on big data sets that are too large to fit in a single computer’s memory.
- MapReduce. Use the MapReduce functionality built into MATLAB to analyze data that does not fit into memory. MapReduce is a powerful and established programming technique that you can use to analyze data on your desktop, as well as to run MATLAB analytics on the big data platform Hadoop.
- Streaming Algorithms. Using System objects, you can perform stream processing on incoming streams of data that are too large or too fast to hold in memory. In addition, you can generate embedded C/C++ code from your MATLAB algorithms using MATLAB Coder, and run the resulting code on high-performance real-time systems.
- Image Block Processing. The blockproc function in Image Processing Toolbox lets you work with very large images by processing them efficiently one block at a time. Computations run in parallel on multiple cores and GPUs when used with Parallel Computing Toolbox.
- Machine Learning. Machine learning is helpful for extracting insights and developing predictive models with big data sets. A wide variety of machine learning algorithms are available in Statistics Toolbox and Neural Network Toolbox, including boosted and bagged decision trees, K-means and hierarchical clustering, K-nearest neighbor search, Gaussian mixtures, the expectation maximization algorithm, hidden Markov models, and neural networks.
- Hadoop. With the MapReduce and Datastore functionality built into MATLAB, you can develop algorithms on your desktop and directly execute them on Hadoop. To get started, access a portion of your big data stored in HDFS with the MATLAB datastore function, and use this data to develop MapReduce-based algorithms in MATLAB on your desktop. Then use MATLAB Distributed Computing Server to execute your algorithms within the Hadoop MapReduce framework against the full data set stored in HDFS. To integrate MATLAB analytics with production Hadoop systems, use MATLAB Compiler to create applications or libraries from MATLAB MapReduce-based algorithms.
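As a rough illustration of the Datastore item in the list above, the following minimal sketch iterates over a file in chunks using only a while loop. The temporary CSV file, the column name value, and the chunk size of 1000 rows are assumptions made for the example, not part of any particular data set; the file is generated on the fly so the sketch can run on its own.

```matlab
% Build a small CSV file so the sketch is self-contained (hypothetical data).
csvFile = [tempname '.csv'];
writetable(table((1:5000)', 'VariableNames', {'value'}), csvFile);

% Create a datastore; no data is read into memory at this point.
ds = datastore(csvFile);
ds.SelectedVariableNames = {'value'};  % import only this column
ds.ReadSize = 1000;                    % rows returned by each call to read

% Iterate over the file in chunks, keeping only running totals in memory.
total = 0;
count = 0;
while hasdata(ds)
    chunk = read(ds);                  % a table with at most ReadSize rows
    total = total + sum(chunk.value);
    count = count + height(chunk);
end
fprintf('Mean of %d values: %g\n', count, total / count);

delete(csvFile);                       % remove the temporary file
```

Because only one chunk is held in memory at a time, the same loop should carry over to files far larger than available memory; for a real data set, the file location, selected variables, and ReadSize would need to be adjusted.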
See also: HDF5 files, large data import (in Database Toolbox), MATLAB MapReduce and Hadoop