Batch Process Optimization with MATLAB
This presentation considers the alternative construction of the design space based on experimental data and a grey-box model of the reactions. The model is subsequently used to optimize the production process by changing the process variables. You can examine the effect of uncertainty and variability of the parameters on process performance with a Monte-Carlo simulation.
- Reduce experiments required during the process characterization phase
- Improve the fundamental scientific understanding of a process by building a model
- Optimize or scale up the process by manipulating time-varying parameters
- Examine the effect of uncertainty quantification and analysis, which is mandatory for Quality by Design (QbD)
Hello, everyone. Thank you for joining us today. My name is Paul Huxel, and I'm a Senior Application Engineer with MathWorks. Since joining the team almost four years ago, I've visited and worked with many medical device and pharmaceutical sites, to help them use MATLAB to enhance their data analysis. I hope this introductory webinar on Batch Process Optimization demonstrates how easily you can get started with MATLAB, as well.
But before we get into the how, I want to first show you what will be building towards in today's webinar. These images are from the PDF we'll be creating directly from a live script in MATLAB. Live scripts are interactive documents that allow you to combine formatted text, equations, and images-- along with your MATLAB code and output of that code-- to easily and thoroughly document your algorithms and results.
I'll describe today's problem in more detail in a moment. But in short, we'll be using data from several constant temperature batch runs to build a model that describes the process occurring during those runs. We'll then use this model to determine the optimal temperature variance schedule that will maximize the yield of this process. Finally, we'll perform a quick Monte Carlo analysis to evaluate how temperature uncertainties might impact the yield.
To do this, we'll be using the typical data analysis workflow. If we're going to perform data analysis, we're going to need access to data. In today's example, we'll be loading data from spreadsheets, but this could also come from image files, other software and web applications, or directly from hardware and scientific instruments.
Once we have our data in MATLAB, we'll see how easy it is to begin exploring and discovering various analysis and modeling techniques. In particular, we'll see how a wide range of domain-specific apps will help us quickly get started before writing any code. But to get the most out of our work, we'll want to share it with colleagues or customers. As you just saw, this could be as simple as publishing our code to a PDF.
However, other options include creating standalone or web-based applications so end users can access our work without requiring a MATLAB installation or license, as well as automatic generation or cloud deployment for scalability. I'll talk a bit more about these options after the demonstration. And along the way, we'll see how we can use MATLAB to help automate this process.
But before we jump into MATLAB, I wanted to give you a quick overview of today's topic. One batch process that I enjoy every morning is brewing coffee. Like all batch processes, this involves working with a limited inventory of a product at a time, in this case, just coffee beans and water.
We can think of optimization of a batch process as essentially trying to find the recipe, or conditions, that maximize or minimize a particular parameter, subject to process constraints. So in this case, we're trying to maximize flavor by adjusting process conditions, such as the quantity of each ingredient, how fine we grind the beans, and the brewing temperature and duration. Depending on how particular you are about your coffee, you've probably never worried about performing a full parametric sweep on the various permutations of process conditions.
You can imagine that even for this simple problem, that would be quite an undertaking. And as you know, commercial applications are much more complex. Examples of commercial batch products include fermentation-based processes like beer and wine, dyes and pigments for textiles, paints and inks,
common food additives, fragrances for perfumes and essential oils, specialty chemicals, and of course, pharmaceuticals.
These are often products where developing and controlling precise process conditions is crucial to maintaining brand reputation and regulatory compliance. Many industrial processes mimic those that occur naturally. For example, fungi penicillium-- which causes food spoilage-- is also used for production of the antibiotic, penicillin.
However, manufacturing processes need to be efficient and repeatable. This is especially true in the pharmaceutical industry, where any significant deviation from an approved batch formula could result in throwing away millions of dollars. So not only are we interested in finding the best formula to optimize a product yield or potency, but also in determining how best to control this process to adhere to this formula.
Today's talk focuses on the former. But as I mentioned, we will use a Monte Carlo analysis to see how uncertainties in the process might affect the yield. Bioreactors are used to help control industrial process conditions, to facilitate the desired biochemical reactions. These stainless steel tanks provide a controlled means to add and remove things from the reactor, such as nutrients and waste gases, respectively.
Sensors are used to monitor and control conditions, such as pressure, temperature, oxygen, and pH. And aerators and agitators help mix solutions in a controlled way, so as not to damage growing cells. But in order to design and control processes, we need to know the rate of change of these processes, which could occur over seconds, hours, or days.
These changes may be physical, such as a diffusion process-- like the time it takes for water to get inside a coffee ground-- or they may be chemical, such as fermentation, where there's a biological agent that is ingesting something in the mixture. The time rate of change of the concentration of a species is usually a function of the current concentration of that species, as well as possibly the concentration of other species or other process variables, such as temperature, pressure, pH, and so on.
A simple example would be where the rate of change of a single substance, with respect to time, is a negative of some number k times the concentration itself. If k is a constant, this is just an exponential decay which can be solved analytically. Or as is the case in the example we'll look at today, the rate constant of the reaction is often a function of temperature.
These can be extremely complex, non-linear functions. So these models are usually empirically derived from data. The model is important, because if we don't have a model of the process, the only way to find the best recipe would be trial and error. And with so many variables, it could take a prohibitive amount of time to test all the different permutations, even with only a few species.
The reaction kinetics we'll consider in today's introductory demonstration have intentionally been formulated in a simple manner, so that we can focus on the overall problem set up and workflow without getting bogged down in detailed differential equations. With that in mind, we'll only be considering the concentration of two substances-- a nutrient that acts as a food source, and the biomass of an organism that grows by consuming this nutrient.
As you can see here, the nutrient is being modeled such that it decomposes over time due to environmental variables, such as temperature. In reality, it would also be a function of the concentration of biomass available to consume it. But for this simple model, we'll assume that effect is negligible by comparison.
However, we will model the change in concentration of the organism biomass a little more realistically, such that it has both a growth and death component. The organism will begin eating and reproducing based on how much nutrient is available. Since the cells have a finite lifetime, they will then die off in proportion to the current biomass concentration, which is a fairly common model for a biological system.
Hence, at some point, the food will start to run out and the death process will begin to dominate, such that eventually, the biomass concentration would also decay to zero. In this figure, I'm showing the nutrient and biomass time histories from a constant temperature run. Our objective is to determine the temperature varying process schedule that optimizes a key performance indicator, or KPI, such as product yield.
To do so, we'll select a simple KPI that represents the maximum concentration that is ever reached in the biomass curve. As such, once the controller notices the biomass is no longer increasing, the process can be stopped to obtain the maximum yield. With our problem now defined, let's jump into MATLAB and get started.
For those new to MATLAB, allow me to give you a quick tour of the four main areas you'll see when it opens. The current folder is the first place MATLAB is going to look for scripts and other files. It shows all of the files on the path and the address bar, here.
With Git or SVN integration, you can also quickly see each file source control status. The Command window allows you to immediately execute commands and run scripts. Variables created by these commands or scripts will show up in the workspace. The workspace shows the variables currently in MATLAB memory, and provides information on their dimension and data type. These variables are then available for use in other commands or scripts.
Finally, the Tools strip at the top organizes MATLAB functionality within a series of tabs. We'll talk more about these as we use them. Today, we'll be walking through the live script that was used to create the PDF you saw earlier. As you can see, I've already commented the coat with rich text formatting, and added a table of contents for easy navigation.
The gray shaded areas contain the MATLAB code. The rest is just to document our methodology and results. Recall from the Workflow, the first thing we'll need to do is access data. As you can see in the current folder, we have a data folder containing several spreadsheets with the nutrient and biomass time histories, resulting from an experimental campaign that included six different constant temperature batch runs.
If we were working with a single file, we could begin by using the interactive Import Data tool found on the Home tab. Instead, we'll use a datastore to automatically gather information about all the available spreadsheets. We'll later use this datastore to read in all the data files in a single command.
But first, let's look at the dynamics of a single run. To make this easier, I've inserted an interactive control to allow the user to quickly select a file of interest from a drop down menu. When we do so, notice that a new variable, with 31 rows and four columns, appears in the workspace.
We can then extract individual columns from this table, using dot notation. Selecting these new variables from the workspace, we can quickly visualize the nutrient dynamics using the plot step. MATLAB then displays all the relevant plotting options. We'll begin with a simple line plot.
Notice that when we did this, MATLAB automatically displayed the corresponding command in the Command window. Let's also add the biomass dynamics as a subplot, by opening the Figure Palette view
and dragging these variables to the new axes. We could then use the Property Editor to customize our plots.
For example, we'll add labels to each x- and y-axis, add grid lines, and select the marker and line color for the nutrient and the biomass. Once we're finished customizing our plot, we can generate the corresponding MATLAB code. So we do not need to repeat this interactive process every time we get new data.
I've already saved this code as a function, named Plot Dynamics, which can then be called from our live script. As expected, the nutrient dynamics look like an exponential decay. We can quickly fit a model to this data, using one of MATLAB's built-in apps.
The Apps tab offers many interactive apps to help you get started with capabilities, such as machine and deep learning, image and signal processing, test and measurement, computational biology, or in our case, curve fitting. Once we've selected the x and y data from the appropriate variables in the workspace, we can quickly try fitting various models, such as polynomials, Fourier series, a custom equation, or in this case, an exponential.
The app will then automatically compute the corresponding model coefficients with 95% confidence bounds, and display various goodness of fit metrics. As is the case with many of MATLAB's interactive tools, we'll once again generate the corresponding MATLAB code to automate this task for future use. I've already saved this code as a function, named Create Fit, that will be used in our live script to create a model that can be evaluated to produce the fitted nutrient curve at each time step.
Alternatively, instead of fitting an exponential to the data, we could have used the fitlm function to perform a linear regression on the log of the nutrient data. If you're ever unsure of how to use a function, the MATLAB documentation contains thorough descriptions, detailed references, and many examples to help you get started. Once we've created a regression model, we can then make predictions at each time step and transform the result back from the log space.
Likewise, once we transform the computer regression coefficients, you'll see we get results similar to what we got with the exponential fit. For the biomass, I've added a Live Task to allow us to interactively smooth the data. Live Task can be found on the Insert tab, to help you get started with common capabilities without needing to leave your script to refer to documentation.
Once you're happy with your selections for the task, you can display the MATLAB code to learn how to perform these tasks programmatically in the future. We can then add all of these results to the previous figure we created, using the batch run number four data. So far, we've created functions to visualize and fit data for a constant temperature run.
But our real objective is to use this constant temperature data to create a model that accounts for temperature variation. To start, we'll use the readall command to load all of the data contained in our spreadsheet datastore, and once again, extract the time, nutrient, biomass, and temperature columns. We can then explore the effect of temperature by plotting the maximum biomass for each constant temperature run.
It appears the optimal constant temperature occurs at 22 degrees Celsius. We can confirm this by examining the max value of a piecewise cubic interpolation of these data points. To do so, we'll use the interp1 function, and evaluate the result every quarter of a degree between the minimum and maximum temperatures.
We now have some confidence the 22 degrees is the optimal constant temperature schedule. But that does not imply it will produce a greater biomass yield than varying the temperature schedule. To investigate this, we'll posit and examine the simple process model we discussed earlier, written here in matrix form.
So the k1 kinetic parameter is analogous to the B coefficient we computed earlier. However, this time, we'll consider it to be temperature dependent. Recall, the k2 parameter models the growth of the biomass, while the k3 parameter models its death cycle.
To test our model, we'll arbitrarily choose three values for these kinetic parameters-- such as 1,1, 1-- and then integrate the system, using a fourth and fifth order run cutout ordinary differential equation solver. Using our plot dynamics function again, we see the resulting dynamics look very similar to what we saw previously. In particular, the nutrient displays an exponential decay, while the biomass grows until the nutrient is significantly depleted, and then begins to die out.
Now that we have a model, we can use this data to estimate the kinetic parameters. Since this is simulated data, we can then compare our kinetic parameter estimates with the true values of 1, 1, 1. With that goal in mind, I've created a function to estimate the kinetic parameters.
Since I've set this up as a live function with embedded equations, if a user asks for help on this function, they'll see richly formatted documentation similar to MATLAB's built-in functions. This function estimates the parameters by reformulating the process model and numerically estimating the derivatives of the nutrient and biomass. In this way, the only unknown quantity is the vector k, which can then be found by fitting the equation with a linear regression model using the fitlm function, similar to what we did earlier.
We can then return to our live script and compare the estimated kinetic parameters to the known values. Now that we have confidence in our approach, we'll loop through all six constant temperature trial runs to estimate all three kinetic parameters at each temperature. We'll then use these estimates in our schedule dynamics to interpolate the kinetic parameter values, as a function of temperature.
With that, we now have everything we need to compare various temperature schedules. To better understand our process, let's begin by comparing the optimal constant temperature schedule we saw earlier with an arbitrary time varying temperature schedule, such as the oscillating temperature seen here. To do so, we'll integrate our process model for each of these schedule dynamics.
As expected, the optimal constant temperature schedule results in a maximum biomass of just under 0.6, while the arbitrary time varying temperature schedule results in a maximum biomass closer to 0.7. Of course, arbitrarily choosing a temperature schedule is not an efficient way to optimize our process. But it has confirmed that we can increase our KPI by varying the temperature.
Upon closer examination, we see the lower initial temperature seems to preserve the nutrient longer without significantly impacting the initial biomass growth. While conversely, we later see that when the temperature drops again, the biomass growth quickly plummets. So there appears to be a benefit in keeping the temperature low initially to preserve nutrients when there's little biomass, and then later increasing the temperature to promote biomass growth.
With these insights, we're now ready to set up an appropriately constrained optimization problem, to find the temperature schedule that yields the maximum biomass. We'll begin by creating an optimization problem using the optimproblem command. We can then specify the constraints and objective using expressions of optimization variables.
In this case, we'll create an optimization variable for temperature, with lower and upper bounds based on the minimum and maximum temperatures used in our trial runs. We'll then use this variable to set the constraints on the magnitude and direction the temperature can change between set points in the schedule. In particular, we'll set the minimum change in temperature to be greater than or equal to 0, such that it is always increasing.
And then, we'll set the maximum change in temperature to be consistent with the thermodynamic specifications of the bioreactor. I've created a function name scheduleCost to specify the objective. This function integrates the schedule dynamics to compute the resulting KPI, which recall is the maximum biomass achieved. Since the optimization problem will attempt to minimize the cost, we'll set the cost to be the negative of the KPI, such that it will then be maximized.
Next, we'll initialize the optimization search using the optimal constant temperature schedule from earlier, and then view the problem formulation before running the solver. Executing the Solve command will yield a solution in about a minute. But in the interest of time, I'm going to load results I obtained previously. Extracting the temperature from the optimal solution, we see the temperature changes between set points satisfies our specified magnitude and direction constraints.
Finally, we'll integrate our process model dynamics using this optimal time varying temperature schedule. Comparing the resulting KPI with the optimal constant temperature schedule, we see that the new time varying temperature schedule results in a 44% increase in biomass yield. However, we know in the real world, we may not be able to precisely control the temperature to perfectly match the optimal temperature schedule.
We can set up a Monte Carlo analysis to examine how temperature control uncertainties might impact our biomass yield. For example, suppose the temperature at each set point can only be controlled to within two degrees, one sigma. We'll simulate this by perturbing our temperature schedule at each set point, by adding a normal random error with a standard deviation of 2 degrees.
To statistically quantify the impact of this control error, we'll repeat the process for 250 runs, and calculate the mean and standard deviation of the resulting KPI distribution. Since we're using a simple process model, this only takes a few seconds. However, if this became a computational burden due to increasing the dynamics complexity or the number of Monte Carlo runs, we could speed up our analysis using MATLAB's parallel computing capabilities to perform these runs in parallel on a cluster, or in the cloud.
As you might expect, if we're unable to follow the optimal temperature schedule precisely, we will see a reduction in the maximum biomass achieved. Fitting a normal distribution to the resulting histogram, we see the mean value is reduced from 0.8 to about 0.7, with a standard deviation of about 0.03. Finally, the last step in the workflow is to share our work. In this case, we will quickly document our algorithm development and analysis by saving our live script as a PDF.
I hope this example gave you a feel for just how quickly and easily you can get started using MATLAB to enhance your data analysis. As you see here, we only covered a very narrow path through this workflow. So I'd like to take a moment to highlight some of the capabilities that we missed.
MATLAB offers a single integrated platform for your entire workflow. Today, we imported Excel spreadsheets. But we also offer interactive tools and hardware support packages to make accessing data easy from other applications, databases, and scientific instruments. For analysis and development, additional capabilities include machine and deep learning, signal and image processing, biological sequence analysis, and simulation in both the time and frequency domains, just to name a few.
And as we briefly saw earlier, in each of these domains, MATLAB offers interactive tools and deep solutions to help jump start your analysis and development. But you may have also noticed there are apps to help you share and deploy your work, as well. There are two main pathways to do this-- application deployment and automatic code generation.
In both cases, you'd once again begin by developing models algorithms or applications in MATLAB. Which sharing path you choose then depends on the requirements of your system. Application deployment allows you to create graphical user interfaces and software libraries that can be integrated with enterprise and cloud systems.
With this option, the functionality of MATLAB is still effectively running behind the scenes, but the end user does not require a MATLAB license. Automatic code generation, on the other hand, allows you to run your analytics on embedded hardware, such as a medical device or manufacturing line. In this case, your MATLAB algorithms are converted to readable, portable code in the language of your specified hardware, such as a real-time target, FPGA or ASIC, PLC, or GPU.
With deployment in mind, one question we often get asked is if MATLAB is validated by the FDA or other regulatory agencies. We've created a Q&A web page and Tool Validation Kit to help guide you through this process. Here, you'll also find a summary of the FDA MathWorks Research and Collaboration Agreement, as well as contact information if you have additional questions or need consulting help.
Today's webinar gave you a quick introduction to what MATLAB has to offer. We saw how easy it is to get started with MATLAB. Thorough documentation, with applicable examples, mean you don't have to start coding from a blank page. You also have access to product and industry experts via technical support and application engineers, as well as an expansive user community via the MATLAB Answers and File Exchange sites on our web page.
We saw how apps can be used for interactive algorithm development and automatic MATLAB code generation, which boost productivity and allow for rapid prototyping. Furthermore, once you're ready to deploy your algorithms, you have many options such as creating standalone and web applications, software libraries to integrate with enterprise and cloud systems, or automatic code generation for embedded devices to save time and reduce coding errors.
You can get started today with our free self-paced Onramp tutorial. This online interactive tutorial uses hands-on exercises with automated assessment and feedback, to teach you the essentials of MATLAB in just a couple hours. From there, you can build your skills with several other free self-paced Onramps, with popular topics such as machine and deep learning, image and signal processing, and control design with Simulink.
If you'd like to dive deeper than the Onramps, we also offer a wide variety of self-paced trainings and instructor-led courses. These courses have typically been a mix of in-person and online offerings. But over the past year, we've expanded our instructor-led online trainings, which have been very well received.
These courses can also be customized to meet your specific needs. And when time is critical, we also have consulting services available around the world. Our consulting is very transparent, with the goal that your team owns and operates the resulting work.
We customize services based on your needs, to optimize your investment and ensure your success. Our consulting has helped many customers jump start their project by applying our proven best practices,
deep product knowledge, and broad technical experience. You can learn more about our products and solutions on our website, at mathworks.com. Thank you for your time and attention today.
You can also select a web site from the following list:
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.