Auckland University of Technology and University of Auckland Researchers Analyze Dairy Processing Data with Machine Learning
Challenge: Ensure the consistent production of high-quality milk powder in New Zealand’s milk processing plants
Solution: Use MATLAB to preprocess and align data from multiple plants, analyze and visualize the data, and develop machine learning models capable of predicting the powder’s functional properties
Results:
- Key process flaws identified and corrected
- Multiple machine learning classifiers evaluated in hours
- Large datasets easily handled; manual processes automated
The Industrial Information and Control Centre (I2C2) is a joint research institute between Auckland University of Technology (AUT) and the University of Auckland. It was established to improve process simulation and control in New Zealand’s dairy and other export industries.
Among the institute’s industrial partners is Fonterra, the largest producer of milk powder in the country. In a recent project, I2C2 researchers developed machine learning models that are helping Fonterra to optimize product quality and streamline production processes.
Using MATLAB® and Statistics and Machine Learning Toolbox™, the researchers analyzed data collected from a number of production facilities across New Zealand to predict the functional properties of milk powder based on process conditions.
“The breadth of MATLAB is unmatched by other environments we’ve used for statistical analysis,” says David Wilson, co-director of I2C2 and associate professor in the Department of Electrical and Electronic Engineering at AUT. “With MATLAB, we work with huge amounts of information within a single environment without needing to move large datasets from one tool to another.”
Milk powder quality is assessed by its chemical composition, such as fat and protein content, and by its physical and functional properties, such as bulk density and solubility. Although chemical composition is relatively well regulated by existing industrial processes, ensuring consistent functional properties has proved to be more challenging. The plants that produce the powder vary widely in design and age, and often use vastly different process settings. As a result, when a batch of powder is produced with variable quality, determining what went wrong and exactly when can be problematic.
Motivated in part by the U.S. Food and Drug Administration’s Quality by Design and Process Analytical Technology initiatives, I2C2 researchers set out to analyze millions of rows of time-series data from three different processing plants over a six-year period. The data included temperatures and other logged process variables, as well as measured values of physical and functional properties. As collected, the raw data was inconsistent and not well aligned: there was no common reference between the process measurements and the product values, recording errors and instrument failures had occasionally resulted in missing data, and the time stamps for different datasets were in disparate formats.
Nevertheless, the team needed to use this data to determine the conditions under which a plant was operating when a particular sample was produced. They then needed to determine which abnormal conditions contributed to milk powder of varying quality, and recommend procedures for correcting those conditions. Ideally, the corrections had to be made while the plant was in operation rather than hours or days later when the relevant lab test results became available.
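One way to picture the alignment task is as an "as-of" lookup: for each lab sample, find the most recent process reading logged at or before the sample's time stamp. The following is a minimal Python sketch for illustration only (the team worked in MATLAB, and their actual method is not described in detail). The function name, the `max_age` tolerance, and the readings are all hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def conditions_at(sample_time, log_times, log_values,
                  max_age=timedelta(minutes=10)):
    """Return the latest logged value at or before sample_time.

    log_times must be sorted in ascending order. Returns None when no
    reading exists within max_age of the sample (for example, during an
    instrument outage), so stale values are never silently reused.
    """
    i = bisect_right(log_times, sample_time) - 1
    if i < 0 or sample_time - log_times[i] > max_age:
        return None
    return log_values[i]

# Hypothetical process log: one temperature reading (deg C) every 5 minutes.
log_times = [datetime(2015, 3, 1, 8, 0),
             datetime(2015, 3, 1, 8, 5),
             datetime(2015, 3, 1, 8, 10)]
log_values = [78.2, 79.0, 81.4]

# A lab sample taken at 08:07 is matched to the 08:05 reading.
temp = conditions_at(datetime(2015, 3, 1, 8, 7), log_times, log_values)
```

Returning `None` for samples with no sufficiently recent reading keeps gaps visible instead of pairing a sample with process conditions that no longer applied.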
I2C2 used MATLAB to preprocess and align the data from milk processing plants, analyze and visualize the data, and develop machine learning models capable of predicting the milk powder’s functional properties.
Working in MATLAB, I2C2 researchers loaded process data extracted from Fonterra’s databases. Cleaning and aligning the data involved estimating values for missing data using interpolation, and aligning disparate datasets by interpreting time stamps generated in multiple formats.
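The two cleaning steps described above, interpreting time stamps in several formats and interpolating over missing values, can be sketched roughly as follows. This is an illustrative Python sketch, not the researchers' MATLAB code. The list of formats is an assumption (the source does not say which formats were encountered), and the helper names and readings are hypothetical.

```python
from datetime import datetime

# Assumed formats. Slash-separated dates are treated as day-first here,
# which is itself an assumption that real data would need to confirm.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y%m%dT%H%M%S")

def parse_timestamp(text):
    """Try each known format in turn and return the first successful parse."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized time stamp: {text!r}")

def fill_missing(times, values):
    """Linearly interpolate None entries between their known neighbors.

    Interpolation is weighted by elapsed time, so uneven logging intervals
    are handled. Leading and trailing gaps stay None (no extrapolation).
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for lo, hi in zip(known, known[1:]):
        span = (times[hi] - times[lo]).total_seconds()
        for i in range(lo + 1, hi):
            frac = (times[i] - times[lo]).total_seconds() / span
            filled[i] = values[lo] + frac * (values[hi] - values[lo])
    return filled

# Hypothetical readings with mixed time stamp formats and two missing values.
raw = [("2015-03-01 08:00:00", 78.0),
       ("01/03/2015 08:05", None),
       ("20150301T081000", None),
       ("2015-03-01 08:20:00", 82.0)]
times = [parse_timestamp(t) for t, _ in raw]
filled = fill_missing(times, [v for _, v in raw])
```

Linear interpolation is only reasonable over short gaps in slowly varying signals; longer outages are usually better flagged than filled.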
Once the team had a clean set of data, they used Statistics and Machine Learning Toolbox to perform statistical analyses using principal component analysis (PCA) and partial least squares (PLS) regression. The team complemented that multivariate analysis with MATLAB 3D histograms, scatter plots, and other graphs to visualize results and share their findings with Fonterra engineers.
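To give a feel for what PCA extracts, the sketch below finds only the first principal component of a small dataset using power iteration on the covariance matrix. It is an illustrative Python sketch under simplifying assumptions, not the toolbox implementation the team used: the variables are not standardized, only one component is computed, and the data is made up.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_fraction) for the leading component.

    Uses power iteration on the sample covariance matrix. Assumes the data
    has nonzero variance and that the starting vector is not orthogonal to
    the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, eigenvalue / total_variance

# Two made-up, strongly correlated process variables (roughly y = 2x).
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, explained = first_principal_component(rows)
```

Here nearly all of the variance lies along a single direction, which is the situation in which PCA lets many correlated process variables be summarized by a few components.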
Continuing in MATLAB, the I2C2 team implemented more advanced regression models using the least absolute shrinkage and selection operator (LASSO) method, and evaluated various machine learning classifiers.
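LASSO is useful in this setting because its penalty drives the coefficients of uninformative variables to exactly zero, which helps identify the process variables that matter. The sketch below shows the idea with cyclic coordinate descent in plain Python. It is an illustration under stated assumptions, not the team's implementation: the objective is taken to be (1/(2n))·‖y − b0 − Xb‖² + λ‖b‖₁, no column of X is constant, and the data is made up.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Fit intercept and coefficients by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    cols = [[row[j] - col_means[j] for row in X] for j in range(d)]
    residual = [v - y_mean for v in y]   # centered y minus current fit
    beta = [0.0] * d

    for _ in range(sweeps):
        for j in range(d):
            col = cols[j]
            sq = sum(c * c for c in col) / n
            # Correlation of column j with the residual, with column j's
            # own current contribution added back in.
            rho = sum(c * r for c, r in zip(col, residual)) / n + sq * beta[j]
            new = soft_threshold(rho, lam) / sq
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new

    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta

# Made-up data: the response depends only on the first of three variables.
X = [[1, 1, 0], [2, 0, 1], [3, 1, 1], [4, 0, 0], [5, 1, 0], [6, 0, 1]]
y = [2.0 * row[0] for row in X]
intercept, beta = lasso_coordinate_descent(X, y, lam=0.5)
```

With this penalty the two irrelevant coefficients come out as exactly zero, while the first coefficient is shrunk somewhat below its true value of 2, which is the usual trade-off LASSO makes for sparsity.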
Initially, the classifiers achieved a prediction accuracy of less than 50%. This was because the training data included only a few instances of data recorded when milk powder processing parameters varied significantly. While a low number of such instances pleases operations staff, it does not provide sufficient data for model building. To rectify this issue, the team upsampled substandard samples in the training data and downsampled the remaining samples.
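The rebalancing step can be sketched as follows: substandard samples are drawn with replacement until they reach a target count, and the remaining samples are drawn without replacement down to the same count. This is a minimal Python sketch for illustration. The equal class sizes, the function name, and the example counts are assumptions, since the source does not give the ratios the team used.

```python
import random

def rebalance(samples, labels, minority_label, target_per_class, seed=0):
    """Upsample the minority class and downsample everything else."""
    rng = random.Random(seed)
    minority = [s for s, l in zip(samples, labels) if l == minority_label]
    majority = [(s, l) for s, l in zip(samples, labels) if l != minority_label]

    # Upsample: draw with replacement, so minority samples repeat.
    up = [(rng.choice(minority), minority_label)
          for _ in range(target_per_class)]
    # Downsample: draw without replacement, so no majority sample repeats.
    down = rng.sample(majority, min(target_per_class, len(majority)))

    combined = up + down
    rng.shuffle(combined)
    return [s for s, _ in combined], [l for _, l in combined]

# Hypothetical training set: 5 substandard batches among 100.
samples = list(range(100))
labels = ["substandard" if i < 5 else "ok" for i in samples]
balanced_samples, balanced_labels = rebalance(
    samples, labels, "substandard", target_per_class=20)
```

Resampling of this kind should be applied to the training data only; a held-out test set should keep the original class proportions so that the reported accuracy reflects real operating conditions.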
To improve prediction accuracy, they used the resampled training data to assess other classifier types. With the Classification Learner app, they rapidly evaluated more than 20 classifiers, including support vector machines, k-nearest neighbors, and several decision tree variants such as boosted trees and bagged decision trees. They ultimately found that boosted trees worked best, with a prediction accuracy of almost 95%.
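The comparison itself follows a simple pattern: train each candidate on the same training data, score each on the same held-out data, and keep the best. The Python sketch below illustrates that pattern with deliberately tiny stand-in classifiers (a majority-class baseline and k-nearest neighbors). It is not the Classification Learner workflow, and the feature values, labels, and function names are hypothetical.

```python
import math
from collections import Counter

def make_knn(k):
    """Return a fit function for a k-nearest-neighbors classifier."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
            return Counter(label for _, label in nearest).most_common(1)[0][0]
        return predict
    return fit

def fit_majority(train):
    """Baseline that always predicts the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: majority

def holdout_accuracy(fit, train, test):
    """Train on one set and report the fraction correct on another."""
    predict = fit(train)
    return sum(predict(x) == label for x, label in test) / len(test)

# Hypothetical (temperature, flow) readings. Features are left unscaled
# here for brevity; real use would normally scale them first.
train = [((70.0, 5.0), "ok"), ((71.0, 5.2), "ok"), ((72.0, 4.9), "ok"),
         ((73.0, 5.1), "ok"), ((80.0, 6.0), "bad"), ((81.0, 6.2), "bad"),
         ((82.0, 5.9), "bad")]
test = [((70.5, 5.0), "ok"), ((72.5, 5.0), "ok"),
        ((79.0, 6.0), "bad"), ((83.0, 6.1), "bad")]

candidates = {"majority baseline": fit_majority,
              "1-NN": make_knn(1),
              "3-NN": make_knn(3)}
scores = {name: holdout_accuracy(fit, train, test)
          for name, fit in candidates.items()}
best = max(scores, key=scores.get)
```

Including a trivial baseline in the comparison is a useful sanity check: a classifier that does not clearly beat it has learned little from the process data.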
I2C2 researchers are currently integrating automated image processing into their analysis workflow. Using Image Processing Toolbox™, the team has analyzed thousands of photos of milk powder particles, calculating particle size, convexity, circularity, and other shape factors and correlating these metrics with functional properties of the powder.
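Shape factors like these are simple ratios once a particle outline is available. The Python sketch below computes two of them for an outline given as a polygon: circularity, taken here as 4πA/P², and solidity (area divided by convex hull area), which is one common way to quantify convexity. These definitions are assumptions, since the source does not say which formulas the team used, and the outlines are made up rather than extracted from images.

```python
import math

def polygon_area(pts):
    """Area of a simple polygon by the shoelace formula."""
    pairs = zip(pts, pts[1:] + pts[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2

def polygon_perimeter(pts):
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:] + pts[:1]))

def convex_hull(pts):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(points):
        chain = []
        for p in points:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))

def shape_factors(outline):
    """Circularity is 1.0 for a circle; solidity is 1.0 for a convex shape."""
    area = polygon_area(outline)
    perimeter = polygon_perimeter(outline)
    hull_area = polygon_area(convex_hull(outline))
    return {"circularity": 4 * math.pi * area / perimeter ** 2,
            "solidity": area / hull_area}

# Made-up outlines: a convex square and a concave L-shaped particle.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
square_factors = shape_factors(square)
l_factors = shape_factors(l_shape)
```

The square scores a solidity of 1.0 because it is convex, while the L-shaped outline scores lower on both measures, which is the kind of difference that can then be correlated with functional properties.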
- Key process flaws identified and corrected. “At one of our partner’s plants, a process designed to add a key ingredient to milk powder was failing from time to time, and plant managers were unable to determine the cause of this failure,” says Nick Depree, project manager at I2C2 and postdoctoral researcher at the University of Auckland. “The step-by-step analysis we conducted in MATLAB enabled us to identify the cause of the problem, and it has now been resolved.”
- Multiple machine learning classifiers evaluated in hours. “With the Classification Learner app, in a single afternoon we were able to try support vector machines and several other classifier types to see which worked best with our data,” says Wilson. “Because we had little prior experience with machine learning, it could have taken us months otherwise.”
- Large datasets easily handled; manual processes automated. “The tools we used for multivariate analysis in the past failed to handle our larger datasets, but MATLAB had no problems with them,” says Depree. “Similarly, it would have been impossible to create the reports we share with Fonterra manually in Microsoft® Excel®. With MATLAB, we automated this process and generated hundreds of charts from data spanning multiple plants and years.”