ecdf

[f,x] = ecdf(y) returns the empirical cumulative distribution function f, evaluated at x, using the data in y.

[f,x] = ecdf(y,Name,Value) specifies additional options using one or more name-value arguments. For example, 'Function','survivor' specifies the type of function for f as a survivor function.

[f,x,flo,fup] = ecdf(___) also returns the lower and upper confidence bounds for the evaluated function values, using any of the input argument combinations in the previous syntaxes. This syntax is not valid for interval-censored data.

ecdf(___) produces a stairstep graph of the evaluated function. The function visualizes interval estimates for interval-censored data using shaded rectangles. You can specify 'Bounds','on' to include the confidence bounds in the graph for fully observed, left-censored, right-censored, and double-censored data.

ecdf(ax,___) plots on the axes specified by ax instead of the current axes (gca).

Examples

Compute Empirical cdf

Compute the Kaplan-Meier estimate of the empirical cumulative distribution function (cdf) for simulated survival data.

Generate survival data from a Weibull distribution with parameters 3 and 1.

rng('default')  % For reproducibility
failuretime = random('wbl',3,1,15,1);

Compute the Kaplan-Meier estimate of the empirical cdf for survival data.

[f,x] = ecdf(failuretime);
[f,x]

ans = 16×2

         0    0.0895
    0.0667    0.0895
    0.1333    0.1072
    0.2000    0.1303
    0.2667    0.1313
    0.3333    0.2718
    0.4000    0.2968
    0.4667    0.6147
    0.5333    0.6684
    0.6000    1.3749
      ⋮

Plot the estimated empirical cdf.

ecdf(failuretime)
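
For fully observed data, the empirical cdf at a point is the proportion of observations less than or equal to that point. The following is a minimal sketch of that calculation using only base MATLAB functions; it is an illustration under that definition, not the code that ecdf runs. The sample values and the names fManual and xManual are hypothetical.

```matlab
% Hypothetical sample, chosen only for illustration (it contains a tie).
y = [2.1; 0.4; 3.3; 0.4; 1.7];
n = numel(y);

% Distinct values in sorted order, and the number of observations at each.
[ux,~,idx] = unique(y);
counts = accumarray(idx,1);

% Proportion of observations less than or equal to each distinct value.
% The leading 0 pairs with a repeated minimum so the curve starts at 0.
fManual = [0; cumsum(counts)/n];
xManual = [ux(1); ux];

disp([fManual xManual])
```

The repeated minimum in xManual mirrors the layout of the 16-by-2 output above, where the first two x values are both 0.0895.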

Compare Empirical cdf with Known cdf

Generate right-censored survival data and compare the empirical cumulative distribution function (cdf) with the known cdf.

Generate failure times from an exponential distribution with a mean failure time of 15.

rng('default')  % For reproducibility
y = exprnd(15,75,1);

Generate drop-out times from an exponential distribution with a mean drop-out time of 30.

d = exprnd(30,75,1);

Generate the observed failure times, that is, the minimum of the generated failure times and the drop-out times.

t = min(y,d);

Create a logical array that indicates which generated failure times are larger than the corresponding drop-out times. The observations for which this condition is true are censored.

censored = (y>d);

Compute the empirical cdf and confidence bounds.

[f,x,flo,fup] = ecdf(t,'Censoring',censored);

Plot the empirical cdf and confidence bounds.

ecdf(t,'Censoring',censored,'Bounds','on')
hold on

Superimpose a plot of the known population cdf.

xx = 0:.1:max(t);
yy = 1-exp(-xx/15);
plot(xx,yy,'g-','LineWidth',2)
axis([0 max(t) 0 1])
legend('Empirical cdf','Lower confidence bound', ...
    'Upper confidence bound','Known population cdf', ...
    'Location','southeast')
hold off
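
For right-censored data, the Kaplan-Meier (product-limit) estimate multiplies, at each observed failure time, the fraction of at-risk subjects that survive that time. The following is a minimal sketch of that textbook calculation on a small hypothetical data set, using only base MATLAB; it is not the code that ecdf runs, and the names tObs, isCensored, S, and F are assumptions made for illustration.

```matlab
% Hypothetical observed times and right-censoring flags (1 = censored).
tObs = [2; 3; 3; 5; 7; 8];
isCensored = [0; 0; 1; 0; 1; 0];

% Distinct times at which at least one failure was observed.
eventTimes = unique(tObs(isCensored == 0));

S = ones(numel(eventTimes),1);
surv = 1;
for k = 1:numel(eventTimes)
    % Subjects still under observation just before this time.
    atRisk = sum(tObs >= eventTimes(k));
    % Observed failures at this time (censored observations excluded).
    deaths = sum(tObs == eventTimes(k) & isCensored == 0);
    surv = surv*(1 - deaths/atRisk);
    S(k) = surv;
end

F = 1 - S;   % estimated cdf at each value in eventTimes
disp([eventTimes F])
```

In this sketch the censored observation at time 3 contributes to the at-risk count at times 2 and 3 but never counts as a failure, which is why the estimate differs from the simple proportion of observations.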

Plot Empirical Survivor Function with Confidence Bounds

Generate survival data and plot the empirical survivor function with 99% confidence bounds.

Generate lifetime data from a Weibull distribution with parameters 100 and 2.

rng('default')  % For reproducibility
R = wblrnd(100,2,100,1);

Plot the empirical survivor function for the data with 99% confidence bounds.

ecdf(R,'Function','survivor','Alpha',0.01,'Bounds','on')
hold on

Superimpose a plot of the Weibull survivor function.

x = 1:1:250;
wblsurv = 1-cdf('weibull',x,100,2);
plot(x,wblsurv,'g-','LineWidth',2)
legend('Empirical survivor function','Lower confidence bound', ...
    'Upper confidence bound','Weibull survivor function', ...
    'Location','northeast')

The Weibull survivor function based on the actual distribution is within the confidence bounds.
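
The Weibull survivor function also has a closed form, which can serve as a check on the curve computed with cdf. The sketch below assumes the scale-shape parameterization (scale 100, shape 2) used by wblrnd in this example, and relies only on base MATLAB. The names a, b, wblsurvClosed, and cumHazard are hypothetical.

```matlab
a = 100;   % scale parameter
b = 2;     % shape parameter
x = 1:1:250;

% Survivor function S(x) = exp(-(x/a)^b), which equals 1 - cdf.
wblsurvClosed = exp(-(x/a).^b);

% Corresponding cumulative hazard H(x) = -log(S(x)) = (x/a)^b.
cumHazard = -log(wblsurvClosed);

figure
plot(x,wblsurvClosed,'g-','LineWidth',2)
xlabel('x')
ylabel('S(x)')
```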

Empirical Cumulative Hazard Function of Double-Censored Data

Compute and plot the cumulative hazard function of simulated double-censored survival data.

Generate failure times from a Birnbaum-Saunders distribution.

rng('default')  % For reproducibility
failuretime = random('BirnbaumSaunders',0.3,1,[100,1]);

Assume that the study starts at time 0.1 and ends at time 0.9. The assumption implies that failure times less than 0.1 are left censored, and failure times greater than 0.9 are right censored.

Create a vector in which each element indicates the censorship status of the corresponding observation in failuretime. Use –1, 1, and 0 to indicate left-censored, right-censored, and fully observed observations, respectively.

L = 0.1;
U = 0.9;
left_censored = (failuretime<L);
right_censored = (failuretime>U);
c = right_censored - left_censored;

Plot the empirical cumulative hazard function for the data with 95% confidence bounds.

ecdf(failuretime,'Function','cumulative hazard', ...
    'Censoring',c,'Bounds','on')

Empirical cdf of Interval-Censored Data

Compute and plot the empirical cdf of interval-censored data.

Load the cities data set. The data includes ratings for nine different indicators of the quality of life in 329 US cities: climate, housing, health, crime, transportation, education, arts, recreation, and economics. For each indicator, a higher rating is better.

load cities

Select the first indicator (climate) as sample data.

Y = ratings(:,1);

Assume that the indicators in Y are the values rounded to the nearest integer. Then, you can treat values in Y as interval-censored observations. An observation y in Y indicates that the actual rating is between y–0.5 and y+0.5.

Create a matrix in which each row represents the interval surrounding each integer in Y.

intervalY = [Y-0.5, Y+0.5];

Compute the empirical cdf values.

[f,x] = ecdf(intervalY);

Plot the empirical cdf values.

figure
ecdf(intervalY)

Zoom into a smaller region to see the interval estimates.

idx_roi = 21:30;
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])

Display the corresponding x and f values.

table(idx_roi',x(idx_roi,:),f(idx_roi,:), ...
    'VariableNames',{'Index','x','Empirical cdf F(x)'})

ans=10×3 table
    Index          x           Empirical cdf F(x)
    _____    ______________    __________________

     21      377.5    378.5         0.069909     
     22      382.5    383.5         0.075988     
     23      384.5    385.5         0.079027     
     24      390.5    391.5         0.082067     
     25      395.5    396.5         0.085106     
     26      397.5    398.5         0.091185     
     27      400.5    401.5         0.094225     
     28      401.5    402.5         0.097264     
     29      403.5    404.5          0.10334     
     30      409.5    410.5          0.10638

The shaded rectangles indicate the change of empirical cdf values F(x) within the corresponding intervals. For example, the second shaded rectangle from the left in the zoomed plot corresponds to the interval (382.5,383.5], which is row 22 of the table. F(382.5) is 0.069909, F(383.5) is 0.075988, and the change from 0.069909 to 0.075988 occurs in the interval (382.5,383.5]. The exact timing of the change is uncertain.
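
The size of each jump can be read directly from the table: the probability assigned to an interval is the difference between its F(x) value and the value in the previous row. The sketch below copies rows 21 to 24 of the table into hypothetical variables xRoi and fRoi and converts each jump into an approximate number of cities out of 329. It uses only base MATLAB and the rounded values displayed above.

```matlab
% Values copied from rows 21 to 24 of the displayed table.
xRoi = [377.5 378.5; 382.5 383.5; 384.5 385.5; 390.5 391.5];
fRoi = [0.069909; 0.075988; 0.079027; 0.082067];

% Probability mass assigned to the intervals in rows 22, 23, and 24.
mass = diff(fRoi);

% Express each mass as a count of cities out of the 329 in the data set.
counts = round(mass*329);

disp(table(xRoi(2:end,:),mass,counts, ...
    'VariableNames',{'Interval','Mass','ApproxCount'}))
```

Under these values, the jump in the interval (382.5,383.5] corresponds to about two cities, and the next two intervals to about one city each.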

You can plot the interval estimates in different ways. If you assume that the probability change occurs at the start of each interval, you can plot the F(x) values using the first column of x.

figure
stairs(x(:,1),f)
title('Probability changes at the start')
xlabel('x')
ylabel('F(x)')
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])

Alternatively, you can plot the F(x) values using the second column of x with the assumption that the probability change occurs at the end of each interval.

figure
stairs(x(:,2),f)
title('Probability changes at the end')
xlabel('x')
ylabel('F(x)')
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])

Combine the previous two plots to visualize the intervals.

figure
stairs(x(:,1),f)
hold on
stairs(x(:,2),f)
title('Probability changes in the interval')
xlabel('x')
ylabel('F(x)')
xlim([x(idx_roi(1),1) x(idx_roi(end),2)])
hold off

Create Piecewise Linear Distribution Object from Empirical cdf

Compute the empirical cumulative distribution function (cdf) for data, and create a piecewise linear distribution object using an approximation to the empirical cdf.

Load the sample data. Visualize the patient weight data using a histogram.

load patients
histogram(Weight(strcmp(Gender,'Female')))
hold on
histogram(Weight(strcmp(Gender,'Male')))
legend('Female','Male')

The histogram shows that the data has two modes, one for female patients and one for male patients.

Compute the empirical cdf for the data.

[f,x] = ecdf(Weight);

Construct a piecewise linear approximation to the empirical cdf by taking a value every five points.

f = f(1:5:end);
x = x(1:5:end);

Plot the empirical cdf and the approximation.

figure
ecdf(Weight)
hold on
plot(x,f,'ko-','MarkerFaceColor','r')
legend('Empirical cdf','Piecewise linear approximation', ...
    'Location','best')

Create a piecewise linear probability distribution object using the piecewise approximation of the empirical cdf.

pd = makedist('PiecewiseLinear','x',x,'Fx',f)

pd = 
  PiecewiseLinearDistribution

F(111) = 0
F(118) = 0.05
F(124) = 0.13
F(130) = 0.25
F(135) = 0.37
F(142) = 0.5
F(163) = 0.55
F(171) = 0.61
F(178) = 0.7
F(183) = 0.82
F(189) = 0.94
F(202) = 1

Generate 100 random numbers from the distribution.

rng('default') % For reproducibility
rw = random(pd,[100,1]);

Plot the random numbers to visually compare their distribution to the original data.

figure
histogram(Weight)
hold on
histogram(rw)
legend('Original data','Generated data')

The random numbers generated from the piecewise linear distribution show a bimodal shape similar to that of the original data.
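
One common way to draw samples from a piecewise linear cdf is inverse transform sampling: generate uniform random numbers and map them through the linearly interpolated inverse of the cdf. The sketch below applies that idea to the rounded knot values displayed for pd, using only base MATLAB. It is an assumption-based illustration, not necessarily how random(pd,...) is implemented, and the names xk, Fk, u, and rwManual are hypothetical.

```matlab
% Knots of the piecewise linear cdf, copied from the display of pd.
xk = [111 118 124 130 135 142 163 171 178 183 189 202];
Fk = [0 0.05 0.13 0.25 0.37 0.5 0.55 0.61 0.7 0.82 0.94 1];

rng('default')   % For reproducibility
u = rand(100,1); % uniform random numbers in (0,1)

% Invert the cdf by interpolating x as a function of F.
% This requires the values in Fk to be strictly increasing.
rwManual = interp1(Fk,xk,u);

figure
histogram(rwManual)
```

Because the cdf rises slowly between 142 and 163, few samples fall in that range, which is what produces the gap between the two modes.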

Input Arguments

`y` — Sample data and censorship information
vector | two-column matrix

Sample data and censorship information, specified as a vector of sample data or a two-column matrix of sample data and censorship information.

You can specify the censorship information for the sample data by using either the y argument or the Censoring name-value argument. ecdf ignores the Censoring argument value if y is a two-column matrix.

Specify y as a vector or a two-column matrix depending on the censorship types of the observations in y.

- Fully observed data: Specify y as a vector of sample data.
- Data that contains fully observed, left-censored, or right-censored observations: Specify y as a vector of sample data, and specify the Censoring name-value argument as a vector that contains the censorship information for each observation. The Censoring vector can contain 0, -1, and 1, which refer to fully observed, left-censored, and right-censored observations, respectively.
- Data that includes interval-censored observations: Specify y as a two-column matrix of sample data and censorship information. Each row of y specifies the range of possible survival or failure times for each observation, and can have one of these values.
  - [t,t]: Fully observed at t
  - [-Inf,t]: Left-censored at t
  - [t,Inf]: Right-censored at t
  - [t₁,t₂]: Interval-censored between [t₁,t₂], where t₁ < t₂

ecdf ignores NaN values in y. Additionally, any NaN values in the censoring vector (Censoring) or frequency vector (Frequency) cause ecdf to ignore the corresponding rows in y.

Data Types: single | double
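
The two ways of encoding censorship for y can be related to each other. The sketch below converts a data vector and a vector of censoring codes into the two-column form, following the row patterns listed for y. It uses only base MATLAB; the sample values and the names t, c, lowerEnd, upperEnd, and yMatrix are hypothetical.

```matlab
% Hypothetical observations and censoring codes:
% 0 = fully observed, -1 = left-censored, 1 = right-censored.
t = [0.5; 0.1; 0.9; 0.7];
c = [0; -1; 1; 0];

lowerEnd = t;
upperEnd = t;
lowerEnd(c == -1) = -Inf;   % left-censored at t becomes [-Inf,t]
upperEnd(c == 1)  = Inf;    % right-censored at t becomes [t,Inf]

yMatrix = [lowerEnd upperEnd];
disp(yMatrix)
```

Rows with equal endpoints remain fully observed. The two-column form is required only when the data includes interval-censored rows, which the Censoring codes cannot express.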

`ax` — Target axes
`Axes` object

Target axes for the figure to which ecdf plots, specified as an Axes object.

For instance, if h is a target Axes object for a figure, then ecdf can plot to that figure as shown in the following example.

Example: ecdf(h,x)
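
As a slightly longer illustration of the ax argument, the following sketch plots two functions side by side in one figure. It assumes a release that provides tiledlayout and nexttile (R2019b or later) and Statistics and Machine Learning Toolbox for ecdf; the sample data and the names ax1 and ax2 are hypothetical.

```matlab
% Hypothetical sample data, used only for illustration.
y = [2.1; 0.4; 3.3; 0.4; 1.7];

figure
tiledlayout(1,2)

ax1 = nexttile;                       % left tile
ecdf(ax1,y)
title(ax1,'Empirical cdf')

ax2 = nexttile;                       % right tile
ecdf(ax2,y,'Function','survivor')
title(ax2,'Empirical survivor function')
```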

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Censoring',c,'Function','cumulative hazard','Alpha',0.025,'Bounds','on' instructs ecdf to return the cumulative hazard function and the 97.5% confidence bounds, accounting for the censored data specified by vector c.
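
The two name-value syntaxes described above are interchangeable. The following sketch calls ecdf both ways on a small hypothetical data set; it assumes R2021a or later for the Name=Value form and Statistics and Machine Learning Toolbox for ecdf.

```matlab
% Hypothetical observed times and right-censoring codes (1 = censored).
y = [2; 3; 3; 5; 7; 8];
c = [0; 0; 1; 0; 1; 0];

% Name=Value syntax (R2021a and later).
[f1,x1] = ecdf(y,Censoring=c,Function='survivor');

% Comma-separated syntax with quoted names (all releases).
[f2,x2] = ecdf(y,'Censoring',c,'Function','survivor');

% Both calls request the same computation.
sameResult = isequal(f1,f2) && isequal(x1,x2);
```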

`Function` — Type of function returned
`'cdf'` (default) | `'survivor'` | `'cumulative hazard'`

Type of function returned by ecdf, specified as one of these values.

Value	Description
`'cdf'` (default)	Cumulative distribution function (cdf)
`'survivor'`	Survivor Function
`'cumulative hazard'`	Cumulative Hazard Function

Example: 'Function','cumulative hazard'

`Censoring` — Indicator of censored data
vector of 0s (default) | vector consisting of 0, –1, and 1

Indicator of censored data, specified as a vector consisting of 0, –1, and 1, which indicate fully observed, left-censored, and right-censored observations, respectively. Each element of the Censoring value indicates the censorship status of the corresponding observation in y. The Censoring value must have the same size as y. The default is a vector of 0s, indicating all observations are fully observed.

You cannot specify interval-censored observations using this argument. If the sample data includes interval-censored observations, specify y using a two-column matrix. ecdf ignores the Censoring value if y is a two-column matrix.

ecdf ignores any NaN values in the censoring vector. Additionally, any NaN values in y or the frequency vector (Frequency) cause ecdf to ignore the corresponding values in the censoring vector.

Example: 'Censoring',censored, where censored is a vector that contains censorship information.

Data Types: logical | single | double

`Frequency` — Frequency of observations
vector of 1s (default) | vector of nonnegative integer counts

Frequency of observations, specified as a vector of nonnegative integer counts that has the same number of rows as y. The jth element of the Frequency value gives the number of times the jth row of y was observed. The default is a vector of 1s, indicating one observation per row of y.

ecdf ignores any NaN values in this frequency vector. Additionally, any NaN values in y or the censoring vector (Censoring) cause ecdf to ignore the corresponding values in the frequency vector.

Example: 'Frequency',freq, where freq is a vector that contains the observation frequencies.

Data Types: single | double
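
One way to read a frequency vector is as a compact form of a sample in which each row of y is repeated the stated number of times. The sketch below expands such a vector explicitly with the base MATLAB function repelem. The values and the names freq and yExpanded are hypothetical, and the equivalence with the Frequency argument is an interpretation of the description above rather than a statement about how ecdf is implemented.

```matlab
% Hypothetical distinct values and the number of times each was observed.
y = [1.2; 3.4; 5.6];
freq = [2; 0; 3];

% Repeat each element of y according to its count.
% A count of 0 removes the value from the expanded sample.
yExpanded = repelem(y,freq);

disp(yExpanded')
```

Under this reading, ecdf(y,'Frequency',freq) and ecdf(yExpanded) describe the same sample.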

`IterationLimit` — Maximum number of iterations
`1e7` (default) | positive integer

Maximum number of iterations, specified as a positive integer. This argument is valid only for double-censored data and interval-censored data.

Example: 'IterationLimit',1e5

Data Types: single | double

`Tolerance` — Termination tolerance on function value
`1e-7` (default) | positive scalar

Termination tolerance on the function value f, specified as a positive scalar. This argument is valid only for double-censored data and interval-censored data.

Example: 'Tolerance',1e-5

Data Types: single | double

`ICMFrequency` — Frequency of ICM step
10 (default) | positive integer

Frequency of the iterative convex minorant (ICM) step, specified as a positive integer. This argument is valid only for interval-censored data.

ecdf uses the expectation-maximization iterative convex minorant (EMICM) algorithm [5] to compute the output f for interval-censored data. The EMICM algorithm uses either the EM algorithm or the ICM algorithm at each iteration. ecdf runs the ICM step every specified number of iterations. For example, by default, ecdf iterates the EM step nine times, runs the ICM step once, and then goes back to the EM step.

Example: 'ICMFrequency',1

Data Types: single | double
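
The alternation between EM and ICM steps described for ICMFrequency can be pictured as a simple schedule. The sketch below only marks which iterations would be ICM steps for a given frequency; it does not perform either step and is not the code that ecdf runs. The names icmFrequency, numIter, and isICM are hypothetical.

```matlab
icmFrequency = 10;   % value of the ICMFrequency argument
numIter = 30;        % number of iterations to lay out

% Every icmFrequency-th iteration is an ICM step; the rest are EM steps.
isICM = mod(1:numIter,icmFrequency) == 0;

icmIterations = find(isICM);
emBeforeFirstICM = find(isICM,1) - 1;

fprintf('ICM steps at iterations: %s\n',mat2str(icmIterations))
fprintf('EM steps before the first ICM step: %d\n',emBeforeFirstICM)
```

With the default frequency of 10, the schedule has nine EM steps before each ICM step. With 'ICMFrequency',1, every iteration is an ICM step.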

`Alpha` — Significance level
0.05 (default) | scalar in the range (0,1)

Significance level for the confidence interval of the evaluated function, specified as a scalar in the range (0,1). The default is 0.05 for 95% confidence. For a given value of Alpha, the confidence level is 100(1 – Alpha)%.

This argument is not valid for interval-censored data.

Example: 'Alpha',0.01 specifies the confidence level as 99%.

Data Types: single | double
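
As a quick check of the relationship between Alpha and the confidence level, the sketch below evaluates 100(1 – Alpha) for the Alpha values that appear on this page. The variable names are hypothetical.

```matlab
% Alpha values used in the examples on this page.
alphaValues = [0.05 0.01 0.025];

% Confidence level, in percent, for each significance level.
confLevel = 100*(1 - alphaValues);

for k = 1:numel(alphaValues)
    fprintf('Alpha = %g gives %g%% confidence bounds\n', ...
        alphaValues(k),confLevel(k))
end
```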

`Bounds` — Indicator for including confidence bounds in plot
`'off'` (default) | `'on'`

Indicator for including the confidence bounds in the plot, specified as one of these values.

Value	Description
`'off'` (default)	Omit the confidence bounds.
`'on'`	Include the confidence bounds.

This argument is not valid for interval-censored data.

Note

This argument is valid only for plotting.

Example: 'Bounds','on'

Output Arguments

`f` — Function values
column vector

Function values evaluated at the points or intervals in x, returned as a column vector.

- The point estimate indicates that the function value at x(i) is f(i).
- The interval estimate indicates that the function value changes from f(i-1) to f(i) within the interval (x(i,1),x(i,2)]. The exact timing of the change is uncertain. For an example, see Empirical cdf of Interval-Censored Data.

The function type of f can be the cdf (default), Survivor Function, or Cumulative Hazard Function, as specified by the Function name-value argument.

`x` — Evaluation points or intervals
column vector | two-column matrix

Evaluation points or intervals, returned as a column vector or a two-column matrix, respectively.

- ecdf returns a column vector for fully observed, left-censored, right-censored, and double-censored data.
  - For fully observed, left-censored, and right-censored data, ecdf removes values for censored observations from y, sorts the remaining values, removes duplicate values in the sorted values, and saves the results to the output x.
  - For double-censored data, ecdf determines which values of y correspond to the event times, sorts the values, removes duplicate values in the sorted values, and saves the results to the output x.

  The output x includes the minimum value of y as its first two values. These two values are useful for plotting the outputs of ecdf using the stairs function.
- ecdf returns a two-column matrix for interval-censored data. ecdf evaluates the function values f at intervals called Turnbull intervals. For details, see Algorithms.
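
The repeated minimum in x pairs with a function value of 0, so a stairstep plot of the outputs starts on the horizontal axis and jumps at the first data point. The sketch below plots hypothetical point-estimate outputs with the base MATLAB function stairs. The values of x and f are written out by hand in the layout described above for fully observed data and are not produced by a call to ecdf.

```matlab
% Hypothetical outputs in the layout described for fully observed data:
% the minimum (0.4) appears twice, and the first function value is 0.
x = [0.4; 0.4; 1.7; 2.1; 3.3];
f = [0; 0.4; 0.6; 0.8; 1];

figure
h = stairs(x,f);
xlabel('x')
ylabel('F(x)')
ylim([0 1])
```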

`flo` — Lower confidence bound
column vector

Lower confidence bound for the evaluated function, returned as a column vector. ecdf computes the bound for each observation. flo is not a simultaneous bound for the curve.

This argument is not valid for interval-censored data.

`fup` — Upper confidence bound
column vector

Upper confidence bound for the evaluated function, returned as a column vector. ecdf computes the bound for each observation. fup is not a simultaneous bound for the curve.

This argument is not valid for interval-censored data.

More About