How to use maximum likelihood estimation (MLE) to fit a linear regression to censored data

Dear guys,
The MATLAB code is shown below. x and y are experimental data, plotted in figure 1 as blue stars. The relationship between x and y is supposed to be linear, following the equation y=x, which is plotted in figure 1 as a blue line. But due to a limitation of the data acquisition method, the last 5 values in the y vector are censored to 15. When I use the polyfit function (least squares estimation?) to get the linear equation (shown in figure 1 as a red line), the result is slightly wrong.
Someone told me that the maximum likelihood method could more or less fix this problem. I guess I should use the MLE function. But could someone tell me how to use it?
Thank you very much!
David
%%%plot experimental measured data (somehow censored).
x=[0:1:20];
y=[-0.709 1.017 2.50 2.93 3.08 4.12 6.16 7.59 8.25 8.10 9.57 11.4 11.3 13.4 13.9 14.3 15 15 15 15 15];
figure(1);plot(x,y,'*');hold on;
%%%plot correct (original) linear regression line
z=x;
figure(1);plot(x,z);
%%plot censored linear regression line
p=polyfit(x,y,1);
figure(1);plot(x,p(2)+p(1)*x,'r');
legend('censored data','correct linear regression','wrong linear regression')

Answers (2)

Let me start by thanking you for such a well written question, with functioning code! (It could have been made slightly better if you had used the markup tools to format the code.) I have been despairing lately under a rash of really hard-to-understand questions.
POLYFIT uses the method of least squares, which results in the maximum likelihood estimates for the coefficients (under the assumption of normally distributed errors). So, I would not strictly say that what you need is maximum likelihood, because you are already doing that.
Rather, what you might need is to do a non-linear fit, where your fitting function is linear for data up to 15, then equal to 15 for larger values. If you have the Statistics Toolbox, you can do this type of fit with the NLINFIT function.
It might be easier, though, to simply self-censor the input to your function by manually identifying the y's that are equal to 15:
indexToCensoredData = (y==15);
and feeding only the non-censored data into POLYFIT.
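As a minimal sketch of that suggestion, using the data from the question (the names isCensored and pKept are only illustrative, and the test y==15 assumes that no genuine reading is exactly 15):

```matlab
% Data from the question; the last five y values are censored to 15.
x = 0:20;
y = [-0.709 1.017 2.50 2.93 3.08 4.12 6.16 7.59 8.25 8.10 9.57 ...
     11.4 11.3 13.4 13.9 14.3 15 15 15 15 15];

% Flag the censored readings (assumption: a reading of exactly 15 is censored).
isCensored = (y == 15);

% Least-squares line through the remaining points only.
pKept = polyfit(x(~isCensored), y(~isCensored), 1);
fprintf('slope = %.3f, intercept = %.3f\n', pKept(1), pKept(2));
```

This drops the censored points entirely, so it uses no information from them beyond knowing which readings to exclude.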

6 Comments

Thanks for your reply.
The relationship between x and y is not known. In my question I assumed the relationship is y=x, but in reality I need to find it. With POLYFIT (least squares) the regression line is fitted wrongly, since this method treats the censored data (=15) as exact data.
What I really want is a method that fits a linear regression to the censored data while assuming that the censored values are not exact. Hopefully it would give a relationship close to, or exactly the same as, y=x.
The MATLAB code is just an example. In reality, the data is far sparser.
Also, I think that if the data is not censored, least squares estimation and maximum likelihood estimation give the same linear regression result. But when the data is censored, the results would be different.
I am not sure I have explained everything clearly.
Thank you and look forward to your answer.
David
One more thing:
I need to find their linear relationship while assuming that the censored data is not exact. I still need these data points (I cannot just ignore them!), and I should not simply change the values of the censored data.
If you look at the help of MLE there is a short paragraph about censoring. However, I still do not know how to do it!
Thank you
I think that MLE will fit a one-dimensional distribution, but does not do a fit of x-y data, as you want.
I believe the key here is your statement, "this method assumes the censored data is not exact data". What DO you want to assume about those y values? It seemed to me that you just want to exclude them from the fit, and then do a linear fit to the rest of the data (because I thought y==15 implies a TOTALLY useless reading).
I could exclude the censored data, but sometimes the non-censored data is not enough to give the right answer, or the number of data points is so small that we need to use the censored ones too.
I have seen some figures showing MLE applied to censored data, but I am not sure whether they were made with MATLAB or other software. I think MATLAB must be able to do it! I just need to find out how.
Thank you.
David
At the risk of repeating myself: The key is to decide what you want to assume about those measurements. If you do anything other than (a) treating them like the other points, or (b) ignoring them altogether, then it seems to me you are doing a non-linear regression, and NLINFIT will do the job for you.
The censored data is due to the limitation of the equipment, e.g. sometimes the value is over the limit, and then it is recorded as the maximum value that the equipment can measure. So such data is not a good representation of the distribution. In the method I have in mind, the censored data is not thrown away or ignored. Rather, its likelihood function is redefined to account for the unknown value lying anywhere below the censoring limit (left-censored) or above it (right-censored, as in my example).
Treating the censored points like the other points could be one way of handling it. I think MLE uses the likelihood function to find the likely position of those points, though I am not 100% sure about this. But NLINFIT will treat the censored data as exact data, in which case the regression might be wrong.
Thank you and look forward to your reply.
David
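The redefined likelihood described in this comment is usually called a censored (Tobit-type) regression. Here is a minimal sketch of it, under the assumptions that the errors are normal with constant standard deviation, that the limit of 15 is known, and that every reading at or above the limit is censored. Each uncensored point contributes the normal log-density of its residual, and each censored point contributes the log-probability that the true value lies at or above the limit. The variable names are illustrative only, and the sketch uses FMINSEARCH and ERFC from base MATLAB rather than the MLE function.

```matlab
% Data from the question, as column vectors.
x = (0:20)';
y = [-0.709 1.017 2.50 2.93 3.08 4.12 6.16 7.59 8.25 8.10 9.57 ...
     11.4 11.3 13.4 13.9 14.3 15 15 15 15 15]';
limit = 15;              % assumed known upper limit of the instrument
cens  = (y >= limit);    % assumption: readings at the limit are right-censored

% Parameters: theta = [intercept; slope; log(sigma)].
% Working with log(sigma) keeps sigma positive without constraints.
mu    = @(theta, xx) theta(1) + theta(2)*xx;
sigma = @(theta) exp(theta(3));

% Uncensored points: normal log-density of the residual.
logPdf = @(theta) -log(sigma(theta)) - 0.5*log(2*pi) ...
    - 0.5*((y(~cens) - mu(theta, x(~cens)))./sigma(theta)).^2;

% Censored points: log P(true y >= limit) = log(1 - Phi(z)),
% written with erfc so no toolbox function is needed.
logSurv = @(theta) log(0.5*erfc((limit - mu(theta, x(cens))) ...
    ./(sqrt(2)*sigma(theta))));

negLogLik = @(theta) -(sum(logPdf(theta)) + sum(logSurv(theta)));

% Start from a least-squares fit to the uncensored points.
p0     = polyfit(x(~cens), y(~cens), 1);
theta0 = [p0(2); p0(1); log(std(y(~cens) - polyval(p0, x(~cens))))];

thetaHat = fminsearch(negLogLik, theta0);
fprintf('intercept = %.3f, slope = %.3f, sigma = %.3f\n', ...
    thetaHat(1), thetaHat(2), exp(thetaHat(3)));
```

The censored points are neither dropped nor treated as exact: they only tell the fit that the line should pass at or above 15 at those x values, with a probability set by sigma.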

I see several options here:
  • If you are able to identify a priori the points where the value is over the limit, and therefore know that the datum is not a good representative of the distribution you are fitting, then I would censor those manually, and fit only the known good points.
  • Alternatively, if you know the limit (say, y=15, as in your example, which with y=x is reached at about x=15), you could instead fit a nonlinear function that would be something like
f = @(b,x) b.*x .* (x < 15) + 15 * (x>=15)
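This would include the censored points, but I think it would correctly render the linear part of the function and effectively ignore the rest. Note that the breakpoint is hard-coded at x=15, which matches the limit only when the slope is close to 1.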
  • Finally, if you do not know the limit, you could still define a nonlinear function
f = @(bc,x) bc(1).*x .* (x < bc(2)) + bc(2) * (x>=bc(2))
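where "c" (the second element of the vector bc) is the unknown limit, and you could fit both the linear piece and the limit. Note that bc(2) serves here as both the breakpoint in x and the plateau in y, which again presumes a slope close to 1.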
Here is code that adds the second alternative to your plot:
x=(0:1:20)';
y=[-0.709 1.017 2.50 2.93 3.08 4.12 6.16 7.59 8.25 8.10 9.57 11.4 11.3 13.4 13.9 14.3 15 15 15 15 15]';
%%%correct (original) linear regression line
z=x;
%%plot censored linear regression line
p=polyfit(x,y,1);
f = @(b,x) b.*x .* (x < 15) + 15 .* (x>=15);
beta = nlinfit(x,y,f,0);
figure(1)
plot(x,y,'*',x,z,'b',x,p(2)+p(1)*x,'r',x,f(beta,x),'g');
legend('censored data','correct linear regression','wrong linear regression','nonlinear regression','Location','NorthWest')
For the third method, swap in:
f = @(bc,x) bc(1).*x .* (x < bc(2)) + bc(2) .* (x>=bc(2));
beta = nlinfit(x,y,f,[0 10])
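If the Statistics Toolbox is not available, a similar fit can be sketched with FMINSEARCH from base MATLAB. This variant is an assumption of mine rather than part of the code above: it places the break where the line b*x reaches the known limit, using min(b*x, limit), instead of at a fixed x. The names hinge, sse and bHat are illustrative.

```matlab
% Data from the question, as column vectors.
x = (0:20)';
y = [-0.709 1.017 2.50 2.93 3.08 4.12 6.16 7.59 8.25 8.10 9.57 ...
     11.4 11.3 13.4 13.9 14.3 15 15 15 15 15]';
limit = 15;   % assumed known upper limit of the instrument

% Line through the origin that saturates at the limit.
hinge = @(b, xx) min(b*xx, limit);

% Ordinary sum of squared residuals against the saturating model.
sse = @(b) sum((y - hinge(b, x)).^2);

% Start from the slope of the plain least-squares line through all points.
p    = polyfit(x, y, 1);
bHat = fminsearch(sse, p(1));
fprintf('slope = %.3f\n', bHat);
```

Like the NLINFIT versions, this still treats the readings of 15 as exact observations of a plateau; it does not model them as lower bounds on the true value.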

Asked on 24 Mar 2011
