how to use maximum likelihood estimation (MLE) to deal with censored data to get its linear regression

21 views (last 30 days)
Dear guys,
The matlab code is shown below. x and y are experimental data and plotted in figure1 with blue stars. The relationship between x and y is supposed to be linear following the equation y=x and it is plotted in figure1 with blue line. But due to limitation of data acquisition method, the last 5 data in y vector are censored to 15. When using polyfit function (least square estimation?) to get the linear equation (result shown in figure1 with red line), the result is slightly wrong.
Someone has told me that maximum likelihood method could more or less fixed this problem. I guess I should use function MLE. But could someone tell me how to use it?
Thank you very much!
David
%%%plot experimental measured data (somehow censored).
x=[0:1:20];
y=[-0.709 1.017 2.50 2.93 3.08 4.12 6.16 7.59 8.25 8.10 9.57 11.4 11.3 13.4 13.9 14.3 15 15 15 15 15];
figure(1);plot(x,y,'*');hold on;
%%%plot correct (original) linear regression line
z=x;
figure(1);plot(x,z);
%%plot censored linear regression line
p=polyfit(x,y,1);
figure(1);plot(x,p(2)+p(1)*x,'r');
legend('censored data','correct linear regression','wrong linear regression')

Answers (2)

the cyclist
the cyclist on 24 Mar 2011
Let me start by thanking you for such a well written question, with functioning code! (It could have been made slightly better if you had used the markup tools to format the code.) I have been despairing lately under a rash of really hard-to-understand questions.
POLYFIT uses the method of least squares, which results in the maximum likelihood estimates for the coefficients (under the assumption of normally distributed errors). So, I would not strictly say that what you need is maximum likelihood, because you are already doing that.
Rather, what you might need is to do a non-linear fit, where your fitting function is linear for data up to 15, then equal to 15 for larger values. If you have the Statistics Toolbox, you can do this type of fit with the NLINFIT function.
It might be easier, though, to simply self-censor the input to your function by manually identifying the y's that are equal to 15:
indexToCensoredData = (y==15);
and feeding only only the non-censored data into POLYFIT.
  6 Comments
the cyclist
the cyclist on 24 Mar 2011
At the risk of repeating myself: The key is to decide what you want to assume about those measurements. If you do anything other than (a) treating them like the other points, or (b) ignoring them altogether, then it seems to me you are doing a non-linear regression, and NLINFIT will do the job for you.
David
David on 25 Mar 2011
The censored data is due to the limitation of the equipment, e.g. sometimes the value is over the limit and then the value is censored to the maximum value that the equipment can measure. So such data is not a good representation of the distribution. The censored data is not thrown away or ignored. Rather their likelihood function was redefined to account for their unknown value being anywhere below (e.g. left-censored) or above (eg. my example) the censoring line.
Treating the censoring points as other points could be a way of solving it. I think MLE will use likelihood function to find out the likely position of those points, I am not 100% sure about this. But NLINFIT will treat the censored data as exact data in which the regression might be wrong.
Thank you and look forward to your reply.
David

Sign in to comment.


the cyclist
the cyclist on 25 Mar 2011
I see several options here:
  • If you are able to identify a priori the points where the value is over the limit, and therefore know that the datum is not a good representative of the distribution you are fitting, then I would censor those manually, and fit only the known good points.
  • Alternatively, if you know the limit (say, x=15, as in your example ), you could instead fit a nonlinear function that would be something like
f = @(b,x) b.*x .* (x < 15) + 15 * (x>=15)
This would include the censored points, but I think would correctly render the linear part of the function and effectively ignore the rest.
  • Finally, if you do not know the limit, you could still define a nonlinear function
f = @(bc,x) bc(1).*x .* (x < bc(2)) + bc(2) * (x>=bc(2))
where "c" (the second element of the vector) bc is the unknown limit, and you could fit both the linear piece and the limit.
Here is code that adds the second alternative to your plot:
x=(0:1:20)';
y=[-0.709 1.017 2.50 2.93 3.08 4.12 6.16 7.59 8.25 8.10 9.57 11.4 11.3 13.4 13.9 14.3 15 15 15 15 15]';
%%%correct (original) linear regression line
z=x;
%%plot censored linear regression line
p=polyfit(x,y,1);
f = @(b,x) b.*x .* (x < 15) + 15 .* (x>=15);
beta = nlinfit(x,y,f,0);
figure(1)
plot(x,y,'*',x,z,'b',x,p(2)+p(1)*x,'r',x,f(beta,x),'g');
legend('censored data','correct linear regression','wrong linear regression','nonlinear regression','Location','NorthWest')
For the third method, swap in:
f = @(bc,x) bc(1).*x .* (x < bc(2)) + bc(2) .* (x>=bc(2));
beta = nlinfit(x,y,f,[0 10])

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!