How to find baseline intensity in noisy mass spectra?

Hi all,
I'm facing some challenges related to baseline intensity values in sets of successive mass spectra acquired in "profile" (i.e. continuous) mode during analyses performed using high-performance liquid chromatography coupled to mass spectrometry.
As you can see in the figure below, the mass spectra exhibit a certain baseline intensity. In my personal experience, for a given spectrum, the baseline intensity is constant as a function of the mass-to-charge ratio. Two additional remarks can be made. First, the baseline intensity varies (+/- 20% in my case) from one spectrum to the next. Second, the baseline noise oscillations exhibit the same period as the "true" peaks of the signal, which makes it hard to separate the baseline contribution to the signal from the contributions of the different peaks using simple signal filtering procedures.
I checked various tips already published in the community but I could not find any good solution to my problem: I want, for every spectrum, to extract the baseline intensity (a scalar constant) in order to subtract that value from the signal and to estimate baseline noise.
Is there any appropriate way (using a function or an algorithm) to achieve that goal?
Thank you in advance for your answers and suggestions.
Antoine

Accepted Answer

Antoine BUREL on 7 Dec 2021
Thanks for your answers. In my model, the signal is defined as $s = b + p + n$, where $b$ is a scalar constant denoting the baseline intensity, $p$ is the contribution of 'true' peaks to the mass spectrum signal, and $n$ is the contribution of noise. I also postulate that the noise values are normally distributed around zero.
Based on your suggestions, I developed an algorithm to extract baseline intensity and noise standard deviation using histogram analysis procedures.
For a spectrum, if I plot the cumulative distribution, I obtain the following result:
Looking at the low percentile values (below ~2800), we can see that the distribution behaves like a normal CDF. So I developed a function that calculates the sum of squared deviations (SSD) between the theoretical normal CDF values and those observed experimentally for indices below a certain value in the percentile vector. The same function also calculates the corresponding mean (baseline intensity) and standard deviation (noise) from the input signal. I then minimized the SSD by adjusting the bounding index using fminbnd.
It seems to work properly. In the present example, the bounding index was optimized to a corresponding percentile value near 3000 counts. Just below, you will see the comparison between experimental and optimized CDF values.
If I calculate baseline intensity and noise from that optimized index, I can represent the results on my spectrum:
Clearly, we have a good solution. The procedure was repeatable over the different spectra acquired during the LC-MS analysis, and I'm able to separate the instrumental noise from the chemical noise (which can be interpreted as baseline nonlinearity at low m/z values). Depending on the step used for the cumulative distribution calculation with prctile, the algorithm takes more or less time. With an appropriate step, good solutions can be obtained in less than 0.1 s.
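For readers without access to the attachment, here is a minimal sketch of one way the idea could be realized. It is not the attached code. The synthetic spectrum, the 0.5 percentile step, the restriction of the search to the upper half of the percentile vector (which assumes that at least half of the points are pure baseline) and the use of a mean rather than a sum of squared deviations (so that subsets of different sizes stay comparable) are all assumptions made for illustration.

```matlab
rng(3);                                    % reproducible synthetic data
n = 5000;
y = 100 + 5*randn(n,1);                    % assumed: baseline 100, noise sigma 5
peakIdx = randperm(n, 250);
y(peakIdx) = y(peakIdx) + 500*rand(250,1); % sparse positive "true" peaks

P = 0:0.5:100;                             % percentile step (a tuning choice)
v = prctile(y, P);                         % empirical cumulative distribution

% Normal CDF written with erfc, so no toolbox function is needed for it
normalCdf = @(x, mu, s) 0.5 * erfc(-(x - mu) ./ (s*sqrt(2)));

% Points of the signal lying at or below the k-th entry of the percentile vector
subset = @(k) y(y <= v(k));

% Mean squared deviation between the renormalized empirical CDF below v(k)
% and a normal CDF whose mean and standard deviation come from that subset
misfit = @(k) mean((P(1:k)/P(k) - ...
    normalCdf(v(1:k), mean(subset(k)), std(subset(k)))).^2);

% fminbnd works on a continuous variable, so the index is rounded inside
kOpt = round(fminbnd(@(k) misfit(round(k)), (numel(P)+1)/2, numel(P)));

baselineEst = mean(subset(kOpt));          % baseline intensity estimate
noiseEst    = std(subset(kOpt));           % noise standard deviation estimate
```

On real spectra, the lower search bound and the percentile step would need to be adapted to the fraction of points occupied by peaks.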
You will find attached the code and the example file.

More Answers (2)

John D'Errico on 5 Dec 2021
First, define baseline. Is it the lowest value the signal does not drop below? If so, then the baseline is trivial to estimate. Just use the function min.
But if it is something else, then baseline is not so trivially defined, since you appear to have outliers below the baseline, whatever it is, however you define it.
So part of your problem is in how you will define the term. Once you define it mathematically, you are closer to knowing how to estimate it. And of course, you provide no actual data, so I can only describe what I might do.
If I look at your spectra, they seem to be mainly baseline. For example, suppose you looked at a histogram of your data for one of these spectra. You would find a huge spike in the vicinity of the baseline.
A simple solution could effectively be to discard perhaps the lowest 5% of your data, as possibly containing dropouts. Then discard the highest 90% of your data in any spectrum. What remains is probably mainly baseline. Take the median of what remains. Essentially, this simple scheme would have you find the 7.5th percentile of your data.
help prctile
 PRCTILE Percentiles of a sample.
    Y = PRCTILE(X,P) returns percentiles of the values in X. P is a scalar
    or a vector of percent values. When X is a vector, Y is the same size
    as P, and Y(i) contains the P(i)-th percentile. When X is a matrix,
    the i-th row of Y contains the P(i)-th percentiles of each column of X.
    For N-D arrays, PRCTILE operates along the first non-singleton
    dimension.

    Y = PRCTILE(X,P,'all') calculates percentiles of all the elements in X.
    The smallest dimension index of Y has length LENGTH(P)

    Y = PRCTILE(X,P,DIM) calculates percentiles along dimension DIM. The
    DIM'th dimension of Y has length LENGTH(P).

    Y = PRCTILE(X,P,VECDIM) calculates percentiles of elements of X based
    on the dimensions specified in the vector VECDIM. The smallest
    dimension index specified in VECDIM has length LENGTH(P).

    Y = PRCTILE(...,'PARAM1',val1,'PARAM2',val2,...) specifies optional
    parameter name/value pairs:
      'Method' - 'exact' (default) to compute by sorting as explained below.
                 'approximate' to use an approximation algorithm based on
                 t-digests.

    Percentiles are specified using percentages, from 0 to 100. For an N
    element vector X, PRCTILE computes percentiles as follows:
      1) The sorted values in X are taken as the 100*(0.5/N), 100*(1.5/N),
         ..., 100*((N-0.5)/N) percentiles.
      2) Linear interpolation is used to compute percentiles for percent
         values between 100*(0.5/N) and 100*((N-0.5)/N)
      3) The minimum or maximum values in X are assigned to percentiles
         for percent values outside that range.

    PRCTILE treats NaNs as missing values, and removes them.

    Examples:
      y = prctile(x,50); % the median of x
      y = prctile(x,[2.5 25 50 75 97.5]); % a useful summary of x

    See also IQR, MEDIAN, NANMEDIAN, QUANTILE.

    Documentation for prctile
      doc prctile
    Other functions named prctile
      distributed/prctile    tall/prctile
You would probably find that tweaking what you discard from the tails will actually do pretty well in such a scheme. Is it optimal? No, probably not. But it is a very simple scheme that will arguably be pretty good.
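As a rough illustration of the trimming scheme, the sketch below builds a synthetic spectrum and compares the median of the retained 5 to 10% window with the 7.5th percentile computed directly. The data (baseline of 100 counts, noise sigma of 5, 5% peaks, 1% dropouts) and the variable names are assumptions for illustration only, since no data was posted at this point in the thread.

```matlab
rng(0);                                    % reproducible synthetic data
n = 5000;
y = 100 + 5*randn(n,1);                    % assumed: baseline 100, noise sigma 5
peakIdx = randperm(n, 250);
y(peakIdx) = y(peakIdx) + 500*rand(250,1); % sparse positive "true" peaks
dropIdx = randperm(n, 50);
y(dropIdx) = 0;                            % a few dropouts below the baseline

lo = 5;                                    % discard the lowest 5%
hi = 10;                                   % discard everything above the 10th percentile
lim = prctile(y, [lo hi]);
kept = y(y >= lim(1) & y <= lim(2));

baselineTrimmed = median(kept);            % median of what remains
baselinePrctile = prctile(y, (lo + hi)/2); % the 7.5th percentile directly
```

Because the 7.5th percentile lies in the lower tail of the noise distribution, this estimate sits somewhat below the center of the baseline. Moving the retained window upward (larger lo and hi) reduces that offset, at the risk of including peak points.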
Could you do better? Surely so. I might look at schemes that will let you find the peak of the histogram I suggested you plot before. And again, I lack any data, so all I can do is to make vague suggestions.
How might you estimate baseline noise? Well, once you decide what is your estimate of the baseline value, then I might go back to the histogram idea. Now find the half height locations on either side of the peak of that histogram. The width at half height could be a good way to describe the noise in the baseline.
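The histogram idea could be sketched as follows. The synthetic data, the number of bins and the restriction of the histogram to the 1st through 90th percentiles (which assumes peaks occupy well under 10% of the points) are illustrative assumptions. The final conversion from width at half height to a standard deviation additionally assumes Gaussian noise.

```matlab
rng(1);                                    % reproducible synthetic data
n = 5000;
y = 100 + 5*randn(n,1);                    % assumed: baseline 100, noise sigma 5
peakIdx = randperm(n, 250);
y(peakIdx) = y(peakIdx) + 500*rand(250,1); % sparse positive "true" peaks

% Histogram restricted to the lower part of the data so the bins stay narrow
edges   = linspace(prctile(y, 1), prctile(y, 90), 31);   % 30 bins
counts  = histcounts(y, edges);
centers = (edges(1:end-1) + edges(2:end)) / 2;

[pk, iPk]   = max(counts);
baselineEst = centers(iPk);                % location of the histogram peak

% Bins just outside the half-height level on either side of the peak
half = pk / 2;
iL = find(counts(1:iPk) < half, 1, 'last');
iR = iPk - 1 + find(counts(iPk:end) < half, 1, 'first');

% Linear interpolation of the two half-height crossings
xL = centers(iL) + (half - counts(iL)) * ...
     (centers(iL+1) - centers(iL)) / (counts(iL+1) - counts(iL));
xR = centers(iR-1) + (counts(iR-1) - half) * ...
     (centers(iR) - centers(iR-1)) / (counts(iR-1) - counts(iR));

fwhm     = xR - xL;                        % full width at half height
noiseEst = fwhm / (2*sqrt(2*log(2)));      % Gaussian relation: FWHM = 2.355*sigma
```

With coarse or noisy histograms the peak bin and the half-height crossings can shift by a bin or so, so the bin count is worth tuning against the number of points in the spectrum.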
Again, all just vague suggestions, made without seeing any data and only from the plots you show.

Star Strider on 5 Dec 2021
Edited: Star Strider on 7 Dec 2021
In the posted data, the baseline is not a constant, since there is a descending part at the left end (below about 150) before it becomes relatively flat.
What I usually do is use the islocalmin function to identify the ‘baseline’ (such as it is), while avoiding the negative peaks with the name-value pair options. Then, I use the polyfit and polyval functions to define and plot the baseline. When I am happy with the result, I subtract it from the rest of the signal to correct the baseline.
After fitting, it is relatively straightforward to extract whatever parameters are desired from it. In this instance, taking the mean (or median) of the polyval output may be the desired value for the single baseline amplitude value.
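A minimal sketch of that workflow on synthetic data might look like this. The m/z axis, the curved baseline, the peak positions and heights, the 'MinSeparation' value and the cubic polynomial order are all assumptions chosen for illustration, not values taken from the posted spectra.

```matlab
rng(2);                                         % reproducible synthetic data
x = linspace(100, 1000, 2000).';                % hypothetical m/z axis
trueBase = 100 + 30*((1000 - x)/900).^2;        % assumed baseline, higher at low m/z
centers  = [180 260 340 450 520 640 780 910];   % hypothetical peak positions
heights  = [400 900 250 700 1000 300 600 500];  % hypothetical peak heights
peaks = sum(heights .* exp(-(x - centers).^2 / (2*0.8^2)), 2);
y = trueBase + peaks + 3*randn(size(x));        % noise sigma of 3 counts (assumed)

% Local minima, thinned so that selected minima are at least 5 m/z units apart
isMin = islocalmin(y, 'MinSeparation', 5, 'SamplePoints', x);

% Cubic polynomial through the minima; mu centers and scales x for conditioning
[p, ~, mu] = polyfit(x(isMin), y(isMin), 3);
baseFit = polyval(p, x, [], mu);

yCorrected    = y - baseFit;                    % baseline-corrected signal
baselineValue = median(baseFit);                % single scalar, if one is wanted
```

Because the local minima sit in the troughs of the noise, the fitted curve follows the lower envelope of the signal and lies slightly below the mean baseline. Which islocalmin options work best depends on the data.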
EDIT — (7 Dec 2021 at 00:34)
Adding these relevant citations:
See X-ray diffraction (XRD) Base line removal - MATLAB Answers - MATLAB Central and specifically this Comment for the appropriate procedure.