Cross-validation is a model assessment technique used to evaluate how well a machine learning algorithm's predictions generalize to new data it has not been trained on. This is done by partitioning a dataset, using one subset to train the algorithm, and reserving the remaining data for testing. Because performance is always measured on data held out from training, cross-validation is a commonly used method for guarding against overfitting during training.
Each round of cross-validation involves randomly partitioning the original dataset into a training set and a testing set. The training set is then used to train a supervised learning algorithm and the testing set is used to evaluate its performance. This process is repeated several times and the average cross-validation error is used as a performance indicator.
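The following is a minimal sketch of that procedure, assuming Python with NumPy and scikit-learn; the dataset, the k-nearest-neighbors model, the number of rounds, and the split ratio are placeholders chosen for illustration, not part of the original text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

errors = []
for round_idx in range(5):  # several rounds of cross-validation
    # Randomly partition the data into a training set and a testing set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=round_idx)

    # Train a supervised learning algorithm on the training set.
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

    # Evaluate its performance on the testing set (misclassification error).
    errors.append(1.0 - model.score(X_test, y_test))

# The average cross-validation error serves as the performance indicator.
print("average cross-validation error:", np.mean(errors))
```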
Common cross-validation techniques, each illustrated in the code sketch after this list, include:
- k-fold: Partitions data into k randomly chosen subsets (or folds) of roughly equal size. One subset is used to validate the model trained using the remaining subsets. This process is repeated k times such that each subset is used exactly once for validation.
- Holdout: Partitions data into exactly two subsets (or folds) with a specified ratio, one for training and one for validation.
- Leaveout: Partitions data using the k-fold approach where k is equal to the total number of observations in the data. Also known as leave-one-out cross-validation.
- Repeated random sub-sampling: Performs Monte Carlo repetitions of randomly partitioning data and aggregating results over all the runs.
- Stratify: Partitions data such that both training and test sets have roughly the same class proportions in the response or target.
- Resubstitution: Does not partition the data; uses the training data for validation. Often produces overly optimistic performance estimates and should be avoided if there is sufficient data.
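As a rough sketch, the partitioning schemes above map onto scikit-learn splitter classes; the library choice and the iris dataset are assumptions for illustration, since the original text names neither.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    KFold, LeaveOneOut, ShuffleSplit, StratifiedKFold, train_test_split)

X, y = load_iris(return_X_y=True)

# k-fold: k subsets of roughly equal size, each used exactly once for validation.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Holdout: a single train/validation split with a specified ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Leaveout: k-fold with k equal to the number of observations.
loo = LeaveOneOut()

# Repeated random sub-sampling: Monte Carlo repetitions of random splits.
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

# Stratify: class proportions in the target are preserved in every fold.
stratified = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Resubstitution has no splitter: the model is scored on its own training data.

# Each splitter yields index arrays for the training and validation subsets.
for train_idx, val_idx in kfold.split(X, y):
    pass  # fit on X[train_idx], y[train_idx]; validate on X[val_idx], y[val_idx]
```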
Cross-validation can be a computationally intensive operation since training and validation are performed several times. Because the partitions are independent of one another, this analysis can be performed in parallel to speed up the process.
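For example, a convenience routine such as scikit-learn's cross_val_score can distribute the independent folds across CPU cores; the model and dataset below are again placeholder assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Each of the 10 folds is trained and validated independently, so the work
# can be spread over all available cores with n_jobs=-1.
scores = cross_val_score(model, X, y, cv=10, n_jobs=-1)
print("mean accuracy:", scores.mean())
```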