# Accelerating Pharmaceutical Manufacturing Analysis with App-Based Machine Learning

By Ram Kumar, Akshay Hatewar, and Vaidehi Soman, Manufacturing Science and Technology Group, Cipla

Pharmaceutical companies perform rigorous testing to measure critical quality attributes of the drugs they produce. When an issue with a particular batch is uncovered, manufacturing teams must identify the root cause as soon as possible to avoid delays in delivery and shortages of critical medicines.

Accurate and timely root cause analysis is challenging because of the wide variety of raw materials, machines, and process steps involved in pharmaceutical manufacturing. In the past, teams manually entered data from raw material labels and machine printouts into spreadsheets for analysis, but this method was slow and error-prone. Additionally, we had no tool or methodology for analyzing such a large data set in a single pass.

At Cipla, we now use a web app for advanced process analytics. Built in MATLAB®, the app automates data collection, analyzes the data using machine learning models, and displays the results (Figure 1). With this app, identifying a root cause now takes a couple of days instead of several weeks. Further, we can predict potential issues with specific batches and take corrective action right away instead of waiting up to 14 days for quality control testing results on the finished product.

Figure 1. Cipla performs pharmaceutical manufacturing analysis with an app built in MATLAB.

## Collecting and Preprocessing Data

The data that pharmaceutical manufacturing teams need to analyze is highly heterogeneous and comes from disparate sources, but it can be grouped into two broad categories: critical material attributes (CMAs) and critical process parameters (CPPs). CMAs include properties of the raw materials used in manufacturing, such as the material’s density and particle size distribution as well as its vendor, age, and shelf life. A typical product is composed of about 20 raw materials, each with more than a dozen CMAs.

CPPs include time-series measurements captured during the multiple unit operations in the manufacturing process. For example, a single unit operation such as fluidized bed granulation may take 2–3 hours or more to complete. During this time, process parameters such as the temperature, humidity, and velocity of the air moving through the machine and the pressure differential across filters are recorded every minute. Other unit operations run much longer: lyophilization (freeze-drying) usually takes 48 hours or more to complete.
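To make the two categories concrete, here is a minimal sketch of how one batch's CMAs and per-minute CPP series could be flattened into a single row of input variables. It is written in Python purely for illustration and is not the data model used in the app; the `BatchRecord` class, the `to_feature_row` function, the field names, and all values are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class BatchRecord:
    """Hypothetical container for one batch; all names are illustrative."""
    batch_id: str
    cmas: Dict[str, float]  # one value per material attribute
    cpps: Dict[str, List[float]] = field(default_factory=dict)  # per-minute series


def to_feature_row(batch: BatchRecord) -> Dict[str, float]:
    """Flatten CMAs and summarized CPP series into one row of x variables."""
    row = dict(batch.cmas)
    for name, series in batch.cpps.items():
        if not series:
            continue  # skip parameters with no recorded samples
        row[f"{name}_mean"] = mean(series)
        row[f"{name}_min"] = min(series)
        row[f"{name}_max"] = max(series)
    return row


# Invented example values, not real manufacturing data.
batch = BatchRecord(
    batch_id="B001",
    cmas={"api_density": 0.52, "api_age_days": 40.0},
    cpps={"inlet_air_temp": [58.0, 60.0, 62.0]},
)
row = to_feature_row(batch)
```

Summarizing each time series with its mean, minimum, and maximum is only one possible choice; which summaries are meaningful depends on the unit operation.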

We turned to MathWorks Consulting to develop an application for collecting and structuring this data. We used Database Toolbox™ to retrieve CMAs and batch data from a Microsoft® Azure® data warehouse and other databases. With Industrial Communication Toolbox™, we were able to access additional CPP data directly from OPC servers in our facilities. The Database Explorer app was particularly helpful for connecting to various Cipla databases and visually exploring the data.

The CMA data that we accessed was relatively clean and required little preprocessing. The CPP data, particularly differential pressure measurements, was much noisier. We applied filters from Signal Processing Toolbox™ to reduce the noise and reveal trends in the data.
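As a rough illustration of this kind of smoothing, the sketch below applies a centered moving median to a short series containing one spike. The article does not say which Signal Processing Toolbox filters were used, so the choice of a median filter, the `moving_median` name, the window size, and the readings are all assumptions.

```python
from statistics import median


def moving_median(samples, window=5):
    """Centered moving median; the window shrinks at the edges of the series."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(median(samples[lo:hi]))
    return smoothed


# Invented differential-pressure readings with a single spike at index 3.
raw = [1.0, 1.1, 1.2, 9.0, 1.4, 1.5, 1.6]
smooth = moving_median(raw, window=3)
```

A median filter suppresses isolated spikes while preserving the slow upward trend; a low-pass filter would be a reasonable alternative when the noise is broadband rather than impulsive.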

## Building Machine Learning Models

Once we had a well-structured representation of the CMA and CPP data, our next task was to build machine learning models. These models would enable us to determine which of the hundreds of material properties and process parameters had the greatest effect on a particular critical quality attribute. Mathematically speaking, there is a function $$y = f(x_1, x_2, \ldots, x_n)$$ where $$y$$ is the critical quality attribute and each $$x_i$$ represents a CMA or CPP variable. We needed our models to determine to what extent each $$x_i$$ influences $$y$$.

We implemented an algorithm that applies three machine learning techniques in series: principal component analysis (PCA), partial least squares (PLS), and random forest. The x-space (PCA plot) reveals that the batches differ in raw material properties, in how they were processed, or both (Figure 2). It also shows that the off-target batches were processed in multiple ways, yet every one of those ways resulted in an off-target product. We confirmed this using the x-y space (PLS plot), in which all the off-target clusters come together to form one large off-target zone. We then applied a random forest on top of the PLS results to measure how accurately the model classifies a batch as on-target or off-target. The weights of the original variables on the latent variables then help explain why a batch is on-target or off-target.

Figure 2. Results of PCA (left) and PLS (right). The green circles are the on-target batches; the red squares are the off-target batches.
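The PCA step described above can be sketched in a few lines. The example below, in plain Python, standardizes a small invented data set and finds the first principal component by power iteration. It covers only the PCA stage (PLS and random forest are not shown), it is not the validated implementation used in the app, and the function names and data are assumptions made for illustration.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = []
    for j in range(p):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        stds.append(math.sqrt(var) or 1.0)  # constant columns stay at zero
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first component via power iteration."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    # Correlation matrix of the original variables.
    corr = [[sum(z[i][a] * z[i][b] for i in range(n)) / (n - 1)
             for b in range(p)] for a in range(p)]
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate start vector; try another")
        v = [x / norm for x in w]
    scores = [sum(z[i][j] * v[j] for j in range(p)) for i in range(n)]
    return v, scores


# Invented batches: columns are two correlated process variables and one
# weakly related material attribute. The first three rows stand in for
# on-target batches and the last three for off-target batches.
batches = [
    [60.0, 30.0, 1.0],
    [61.0, 31.0, 1.2],
    [59.0, 29.5, 0.9],
    [70.0, 38.0, 1.1],
    [71.0, 39.0, 1.0],
    [69.0, 37.5, 1.2],
]
loadings, scores = first_principal_component(batches)
```

In this toy data the two groups fall on opposite sides of zero along the first component, and the loadings indicate which variables drive that separation. Plotting the first two components for real batches gives an x-space view like the left panel of Figure 2.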

We opted for machine learning rather than deep learning so that we could fulfill a key requirement of our analysis: interpretability. We need to fully understand any manufacturing issues that we identify to address them comprehensively and avoid them in the future. Traditional machine learning enables this level of understanding, whereas deep learning generally does not.

## Packaging and Deploying a Web App

One of our key objectives was the democratization of analytics: we wanted to develop a solution that could be used across Cipla by many users, not just a small group of experts. To meet this objective, we created a simple interface with App Designer. We packaged it with the machine learning algorithms and deployed the package as a web app with MATLAB Web App Server™.

When working with the app, users begin by selecting the product that they want to analyze. The app retrieves the relevant CMA data for that specific product and builds the PCA, PLS, and random forest models. The app displays the results from the models, including the relative contribution of each variable to the critical quality attribute, and highlights important factors (Figure 3). After reviewing the results, the user may decide to build a reduced model from the highlighted factors to improve model accuracy. For example, if the initial iteration included 500 variables but a subset of 300 variables is shown to have little effect on the results, then the user may simplify the model by omitting that subset and rerunning the analysis.

Figure 3. Results from models of CMA data, including relative contribution of each variable.
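The variable-reduction step can be illustrated with a small sketch that keeps only the variables whose contribution is a meaningful share of the largest one. The `select_influential` function, the 25% cutoff, and the contribution values are hypothetical; the article does not describe how the app decides which factors to highlight.

```python
def select_influential(contributions, min_share=0.25):
    """Keep variables whose absolute contribution is at least
    min_share times the largest absolute contribution."""
    if not 0 < min_share <= 1:
        raise ValueError("min_share must be in (0, 1]")
    largest = max(abs(value) for value in contributions.values())
    kept = [name for name, value in contributions.items()
            if abs(value) >= min_share * largest]
    # Most influential first, so the result reads like a ranked list.
    return sorted(kept, key=lambda name: abs(contributions[name]), reverse=True)


# Invented contributions of each variable to the quality attribute.
contributions = {
    "inlet_air_temp_mean": 0.42,
    "api_particle_size_d50": 0.31,
    "filter_dp_max": 0.05,
    "excipient_age_days": 0.02,
    "spray_rate_mean": -0.20,
}
kept = select_influential(contributions)
```

The reduced model would then be rebuilt using only the columns named in `kept`. Absolute values are used so that variables with a strong negative effect are retained.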

## Piloting a Real-Time Version of the App

Our team is currently developing a real-time version of the application that will be piloted this year. This version will capture OPC server data from unit operations in real time, feed it into the machine learning models, and determine whether the processes are operating within established control parameters.
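As a simplified picture of the final check, the sketch below flags incoming readings that fall outside control limits derived from reference data. It is a univariate stand-in and not the model-based pipeline described above; the three-sigma rule, the function names, and the readings are assumptions.

```python
from statistics import mean, stdev


def control_limits(reference, k=3.0):
    """Lower and upper limits at k sample standard deviations from the mean."""
    center = mean(reference)
    spread = stdev(reference)
    return center - k * spread, center + k * spread


def out_of_control(readings, limits):
    """Return (index, value) pairs for readings outside the limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(readings) if not lower <= x <= upper]


# Invented inlet-air temperatures from batches known to be in control.
reference = [60.1, 59.8, 60.3, 60.0, 59.9, 60.2, 59.7, 60.0]
limits = control_limits(reference)

# Invented readings arriving from a running unit operation.
live = [60.1, 60.4, 62.5, 59.9]
alarms = out_of_control(live, limits)
```

Here only the third reading is flagged. A multivariate version would apply the same idea to model outputs, such as latent variable scores, rather than to individual sensor values.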

## Why MATLAB?

Before we decided to use MATLAB for our manufacturing analytics, we considered several alternatives. One option we evaluated was a commercial software package. The software was costly, in part because it was tailor-made for the pharma industry, and we would not have been able to fully customize it to meet our needs.

Another option was to develop our own solution using open-source libraries in Python® or a similar language. This option was not feasible because we needed to be certain that the algorithms we used to build our app had been thoroughly validated and tested. We also needed technical support to access data from a diverse set of data stores. With MATLAB and support from MathWorks Consulting Services, we were able to build a fully customized, low-cost application and share it company-wide.

Published 2022