Cross-Validation

Assess and improve predictive performance of models

Cross-validation is a model assessment technique used to evaluate a machine learning algorithm’s performance in making predictions on new datasets that it has not been trained on. This is done by partitioning a dataset and using a subset to train the algorithm and the remaining data for testing. Because the model is always evaluated on data that was not used to fit it, cross-validation is commonly used to detect overfitting and to guard against it when choosing between models.

Each round of cross-validation involves randomly partitioning the original dataset into a training set and a testing set. The training set is then used to train a supervised learning algorithm and the testing set is used to evaluate its performance. This process is repeated several times and the average cross-validation error is used as a performance indicator.
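The round-by-round procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a toolbox implementation: the function names (holdout_split, fit_mean, cross_validate), the toy "model" that always predicts the training mean, and the choice of mean squared error are all assumptions made for the example.

```python
import random
from statistics import mean

def holdout_split(n, test_fraction, rng):
    """Randomly split the indices 0..n-1 into (train, test) lists."""
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

def fit_mean(y_train):
    """Hypothetical toy model: always predicts the mean training response."""
    return mean(y_train)

def mse(prediction, y_test):
    """Mean squared error of a constant prediction on the test responses."""
    return mean((y - prediction) ** 2 for y in y_test)

def cross_validate(y, rounds=5, test_fraction=0.25, seed=0):
    """Repeat random partitioning and return (average error, per-round errors)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(rounds):
        train, test = holdout_split(len(y), test_fraction, rng)
        model = fit_mean([y[i] for i in train])          # train on the training set
        errors.append(mse(model, [y[i] for i in test]))  # evaluate on the testing set
    return mean(errors), errors

# Illustrative data only.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
avg_error, per_round = cross_validate(y)
```

Here avg_error plays the role of the average cross-validation error; a real application would replace fit_mean and mse with the learning algorithm and loss of interest.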

Common cross-validation techniques include:

  • k-fold: Partitions data into k randomly chosen subsets (or folds) of roughly equal size. One subset is used to validate the model trained using the remaining subsets. This process is repeated k times such that each subset is used exactly once for validation.
  • Holdout: Partitions data into exactly two subsets (or folds) of specified ratio for training and validation.
  • Leaveout: Partitions data using the k-fold approach where k is equal to the total number of observations in the data. Also known as leave-one-out cross-validation.
  • Repeated random sub-sampling: Performs Monte Carlo repetitions of randomly partitioning data and aggregating results over all the runs.
  • Stratify: Partitions data such that both training and test sets have roughly the same class proportions in the response or target.
  • Resubstitution: Does not partition the data; uses the training data for validation. Often produces overly optimistic estimates of performance and should be avoided whenever there is enough data to hold some out for validation.
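As a rough sketch of how the k-fold, leave-one-out, and stratified partitions in this list differ, the following Python code builds the index sets for each. The function names are hypothetical and chosen for this example; they do not correspond to any toolbox function, and the round-robin assignment used for stratification is only one simple way to keep class proportions roughly equal.

```python
import random
from collections import defaultdict

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of roughly equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def kfold_splits(n, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = kfold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def stratified_kfold_indices(labels, k, seed=0):
    """Deal each class round-robin into k folds to keep class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

# k-fold with k = 5 on 10 observations.
splits = list(kfold_splits(10, 5))

# Leave-one-out is k-fold with k equal to the number of observations.
leave_one_out = list(kfold_splits(4, 4))

# Stratified folds on an imbalanced two-class response (illustrative labels).
labels = ["a"] * 6 + ["b"] * 3
stratified = stratified_kfold_indices(labels, 3)
```

With these illustrative labels, each of the three stratified folds receives two observations of class "a" and one of class "b", matching the 2:1 proportion in the full data.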

Cross-validation can be computationally intensive because training and validation are repeated several times. Because each round of training and validation is independent of the others, the rounds can be run in parallel to speed up the process.
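One way to picture this independence is to hand each fold to a separate worker. The sketch below uses Python's standard concurrent.futures module with a thread pool purely to show the structure; the function names, the toy mean-predicting model, and the fixed folds are assumptions for the example. For CPU-bound training in Python, a process pool or a dedicated parallel computing tool would normally be needed to see an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def evaluate_fold(args):
    """Train on everything outside the fold, then score on the fold itself."""
    y, test_idx = args
    test_set = set(test_idx)
    train = [value for i, value in enumerate(y) if i not in test_set]
    prediction = mean(train)  # hypothetical toy model: predict the training mean
    return mean((y[i] - prediction) ** 2 for i in test_idx)

def parallel_cv(y, folds, max_workers=4):
    """Evaluate every fold concurrently; return (average error, per-fold errors)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the folds in its results.
        errors = list(pool.map(evaluate_fold, [(y, fold) for fold in folds]))
    return mean(errors), errors

# Illustrative data and a fixed 3-fold partition.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
folds = [[0, 3], [1, 4], [2, 5]]
avg_error, fold_errors = parallel_cv(y, folds)
```

No fold reads or writes another fold's results, so the per-fold errors are the same whether the folds are evaluated one after another or concurrently.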

For more information on using cross-validation with machine learning problems, see Statistics Toolbox™ and Neural Network Toolbox™.

See also: Statistics Toolbox, machine learning, supervised learning, feature selection, regularization, linear model