
## Cox Proportional Hazards Regression

Cox proportional hazards regression is a semiparametric method for adjusting survival rate estimates to quantify the effect of predictor variables. The method represents the effects of explanatory variables as a multiplier of a common baseline hazard function, h0(t). The baseline hazard function is the nonparametric part of the Cox proportional hazards regression model, whereas the effect of the predictor variables is a loglinear regression. With the baseline defined at X = 0, the model is

${h}_{X}\left(t\right)={h}_{0}\left(t\right){e}^{\sum _{i}{X}_{i}{b}_{i}},$

where hX(t) is the hazard rate at X, h0(t) is the baseline hazard rate function, and bi is the regression coefficient for the predictor variable Xi.
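As a minimal sketch of the formula above, the following Python snippet evaluates the hazard at a given predictor vector. The function names, the coefficient values, and the constant baseline hazard are all hypothetical and chosen only for illustration. In practice h0(t) is left unspecified and estimated nonparametrically.

```python
import math

def relative_hazard(x, b):
    """Return exp(sum_i x_i * b_i), the multiplier applied to h0(t)."""
    if len(x) != len(b):
        raise ValueError("x and b must have the same length")
    return math.exp(sum(xi * bi for xi, bi in zip(x, b)))

def hazard(t, x, b, baseline_hazard):
    """Return h_X(t) = h0(t) * exp(sum_i x_i * b_i)."""
    return baseline_hazard(t) * relative_hazard(x, b)

# Hypothetical values, not taken from any data set.
b = [0.5, -0.2]                      # assumed coefficient estimates

def h0(t):
    return 0.01                      # assumed constant baseline hazard

h_baseline = hazard(2.0, [0.0, 0.0], b, h0)   # at X = 0 the multiplier is 1
h_at_x = hazard(2.0, [1.0, 3.0], b, h0)       # 0.01 * exp(0.5 - 0.6)
print(h_baseline, h_at_x)
```

Because the multiplier does not involve t, the two hazards differ by the same factor at every time point.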

The Cox proportional hazards model relates the hazard rate for individuals or items at the value X to the hazard rate for individuals or items at the baseline value. It produces an estimate for the hazard ratio, HR = hX(t)/h0(t). The model is based on the assumption that the baseline hazard function depends on time, t, but the effects of the predictor variables do not. This is called the proportional hazards assumption: the hazard ratio between any two individuals is constant over time, even though the hazard rates themselves can change over time. The hazard ratio represents the relative risk of instantaneous failure for individuals or items having the predictor variable value X compared to the ones having the baseline values. For example, if the predictor variable is smoking status, where nonsmoking is the baseline category, the hazard ratio shows the relative instantaneous failure rate of smokers compared to the baseline category, that is, nonsmokers.

For a baseline relative to X* and the predictor variable value X, the hazard ratio is

$HR=\frac{{h}_{X}\left(t\right)}{{h}_{{X}^{*}}\left(t\right)}=\mathrm{exp}\left[\sum _{i}\left({X}_{i}-{X}_{i}{}^{*}\right){b}_{i}\right].$
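The hazard ratio relative to an arbitrary baseline can be sketched in a few lines of Python. The helper name `hazard_ratio`, the coefficients, and the predictor values below are hypothetical and serve only to illustrate the formula.

```python
import math

def hazard_ratio(x, x_ref, b):
    """Return exp(sum_i (x_i - x_ref_i) * b_i), the hazard ratio of x relative to x_ref."""
    if not (len(x) == len(x_ref) == len(b)):
        raise ValueError("x, x_ref, and b must have the same length")
    return math.exp(sum((xi - ri) * bi for xi, ri, bi in zip(x, x_ref, b)))

# Hypothetical coefficients for [smoker indicator, age in years].
b = [0.7, 0.03]
smoker_60 = [1, 60]
nonsmoker_60 = [0, 60]
nonsmoker_50 = [0, 50]

hr_smoking = hazard_ratio(smoker_60, nonsmoker_60, b)    # exp(0.7)
hr_age = hazard_ratio(nonsmoker_60, nonsmoker_50, b)     # exp(10 * 0.03)
print(hr_smoking, hr_age)
```

When the two predictor vectors differ by one unit in a single variable, the expression reduces to exp(b) for that variable.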

For example, if the baseline is the mean values of the predictor variables (`mean(X)`), then the hazard rate model becomes

${h}_{X}\left(t\right)={h}_{\overline{X}}\left(t\right)\mathrm{exp}\left[\sum _{i}\left({X}_{i}-{\overline{X}}_{i}\right){b}_{i}\right].$
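The choice of baseline changes the reference point but not the comparison between two individuals. The sketch below, which uses a hypothetical predictor matrix and hypothetical coefficients, computes hazard ratios relative to the column means and checks that the ratio between two individuals matches the direct calculation.

```python
import math

def column_means(rows):
    """Return the mean of each predictor (column) across all rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def hazard_ratio(x, x_ref, b):
    """Return exp(sum_i (x_i - x_ref_i) * b_i)."""
    return math.exp(sum((xi - ri) * bi for xi, ri, bi in zip(x, x_ref, b)))

# Hypothetical data: columns are [smoker indicator, age in years].
X = [[1, 60], [0, 50], [0, 70], [1, 40]]
b = [0.7, 0.03]                       # assumed coefficient estimates

x_bar = column_means(X)               # baseline at the mean of each predictor
hr_vs_mean = [hazard_ratio(row, x_bar, b) for row in X]

# Comparing individual 0 with individual 1, directly and through the mean baseline.
direct = hazard_ratio(X[0], X[1], b)
via_mean = hr_vs_mean[0] / hr_vs_mean[1]
print(x_bar, direct, via_mean)
```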

Hazard rates are related to survival rates, such that the survival rate at time t for an individual with the explanatory variable value X is

${S}_{X}\left(t\right)={S}_{0}{\left(t\right)}^{H{R}_{X}\left(t\right)},$

where S0(t) is the survivor function with the baseline hazard rate function h0(t), and HRX(t) is the hazard ratio of the predictor variable value X relative to the baseline value.
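The power relationship follows because a survivor function is the exponential of the negative cumulative hazard: multiplying the hazard by HR multiplies the cumulative hazard by HR, which raises S0(t) to the power HR. A minimal Python sketch, with a hypothetical baseline survival probability and hypothetical hazard ratios:

```python
def survival_at_x(s0_t, hr):
    """Return S_X(t) = S0(t) ** HR for a baseline survival probability s0_t."""
    if not 0.0 <= s0_t <= 1.0:
        raise ValueError("baseline survival must be a probability in [0, 1]")
    if hr < 0.0:
        raise ValueError("a hazard ratio cannot be negative")
    return s0_t ** hr

# Hypothetical baseline survival probability at some fixed time t.
s0_t = 0.8
protective = survival_at_x(s0_t, 0.5)   # HR < 1 raises survival above baseline
neutral = survival_at_x(s0_t, 1.0)      # HR = 1 leaves survival unchanged
harmful = survival_at_x(s0_t, 2.0)      # HR > 1 lowers survival below baseline
print(protective, neutral, harmful)
```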

A point estimate of the effect of each explanatory variable is the estimated hazard ratio exp(b) for a one-unit increase in that variable, where b is the coefficient estimate for that variable and all other variables are held constant. The coefficient estimates are found by maximizing the likelihood function of the model. The likelihood function for the proportional hazards regression model, known as a partial likelihood, is based on the observed order of events. It is the product of the likelihoods of failure evaluated at each failure time. If there are n failures at n distinct failure times, ordered so that t1 < t2 < ... < tn, then the likelihood is

$L=\left[\frac{h\left({t}_{1}\right)}{{\sum }_{i=1}^{n}h\left({t}_{i}\right)}\right]\times \left[\frac{h\left({t}_{2}\right)}{{\sum }_{i=2}^{n}h\left({t}_{i}\right)}\right]\times \cdots \times \left[\frac{h\left({t}_{n}\right)}{h\left({t}_{n}\right)}\right].$
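Each factor can be read as the hazard of the subject that fails at a given time divided by the summed hazards of all subjects still at risk at that time. The baseline hazard cancels in every ratio, so only the exponential terms are needed. The following Python sketch evaluates the logarithm of this product under the same setting (no censoring, no tied failure times). The function name and the data are hypothetical, and a real fit would maximize this quantity over b with a numerical optimizer.

```python
import math

def log_partial_likelihood(times, X, b):
    """Log of the product of failure likelihoods for uncensored, untied failure times.

    times: failure time of each subject
    X:     one predictor row per subject
    b:     coefficient vector
    """
    if len(set(times)) != len(times):
        raise ValueError("this sketch assumes distinct failure times")
    # Sort subjects by failure time so the risk set at the j-th failure is j..n.
    order = sorted(range(len(times)), key=lambda i: times[i])
    risk = [math.exp(sum(xi * bi for xi, bi in zip(X[i], b))) for i in order]

    loglik = 0.0
    at_risk_total = 0.0
    # Walk backward so at_risk_total is the sum over subjects still at risk.
    for r in reversed(risk):
        at_risk_total += r
        loglik += math.log(r) - math.log(at_risk_total)
    return loglik

# Hypothetical data: three subjects, one binary predictor.
times = [5.0, 2.0, 9.0]
X = [[1], [0], [1]]
ll_null = log_partial_likelihood(times, X, [0.0])   # every ordering equally likely
ll_b1 = log_partial_likelihood(times, X, [1.0])
print(ll_null, ll_b1)
```

With b = 0 every subject has the same hazard, so the value reduces to the log probability of one particular ordering of the n failures, which is -log(n!).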

You can use a likelihood ratio test to assess the significance of adding a term or terms to a model. Consider two nested models, where the first model has p predictor variables and the second model has p + r predictor variables, with maximized likelihoods L1 and L2, respectively. Under the null hypothesis that the r additional terms have no effect, the statistic -2*log(L1/L2) has an approximate chi-square distribution with r degrees of freedom (the number of terms being tested).
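A likelihood ratio test is usually computed from maximized log-likelihoods, because -2 log(L1/L2) equals -2 (log L1 - log L2). The sketch below uses only the Python standard library and computes the chi-square tail probability for integer degrees of freedom through the standard recurrence for the regularized upper incomplete gamma function. The function names and the log-likelihood values are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with df degrees of freedom > x), integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    # Closed-form starting values: Q(1/2, y) for odd df, Q(1, y) for even df.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(loglik_reduced, loglik_full, r):
    """Return the test statistic and p-value for r added terms."""
    stat = -2.0 * (loglik_reduced - loglik_full)
    return stat, chi2_sf(stat, r)

# Hypothetical maximized log-likelihoods for nested models with p and p + 2 terms.
stat, p_value = likelihood_ratio_test(-105.2, -101.9, 2)
print(stat, p_value)
```

A small p-value suggests that the larger model fits significantly better than the reduced model.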
