Survival Analysis with Stata
This is the web site for the Survival Analysis with Stata materials prepared by Professor Stephen P. Jenkins (formerly of the Institute for Social and Economic Research, now at the London School of Economics and a Visiting Professor at ISER). The materials have been used in the Survival Analysis component of the University of Essex MSc module EC968, in the Survival Analysis course taught annually at the University of Essex Summer School, and at various other short courses, for example those organised by the Centre for Microdata Methods and Practice. Please email your comments and suggestions to Stephen Jenkins. The permanent URL for these pages is http://www.iser.essex.ac.uk/survival-analysis.
These pages were first made available in January 2000, and were based on Stata version 6. This June 2008 release is based on Stata version 10.
The materials have two aims:
- To provide an introduction to the analysis of spell duration data (‘survival analysis’); and
- To show how the methods can be implemented using Stata, a program for statistics, graphics and data management.
The focus of the Lessons is on models for single-spell survival time data with no left censoring or left truncation (see the Lecture Notes for more details about these issues).
How to use these resources
These materials are a do-it-yourself learning resource. Work through the Lessons below in parallel with reading the draft book manuscript (see below). Each Lesson has material to read, followed by exercises. Stata do files (names prefixed by ‘ex’) provide code to reproduce the material shown in the Lessons and to do the exercises. You are encouraged to run the do files yourself (do filename), preferably after attempting the exercises by yourself!
You can download module materials from here. There are Lessons and related materials (pdf files), Exercises (Stata do files, i.e. plain ASCII text), and Data Sets (Stata dta files). See below. University of Essex readers: we recommend that you create a new subdirectory called ‘ec968’ in your ‘home’ directory (drive m: on the University of Essex network) and then download all the files to m:\ec968. (Change ‘ec968’ to some other name of your choosing, if you prefer.)
- Preliminaries – Introduction to Lessons and Stata (ec968st1)
- The shapes of hazard and survival functions (ec968st2)
- Preparing survival time data for analysis and estimation (ec968st3)
- Estimation of the (integrated) hazard and survivor functions: Kaplan-Meier product-limit and lifetable methods (ec968st4)
- Estimation: (i) continuous time models – parametric and Cox (ec968st5)
- Estimation: (ii) discrete time models (ec968st6)
- Unobserved heterogeneity (‘frailty’) (ec968st7)
- Competing risks models (ec968st8)
- Assorted other topics (ec968st9)
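As a taste of what Lessons 3 to 5 cover, the following is a minimal sketch of a typical workflow: declare the data as survival time data, estimate the survivor function, then fit continuous time models. It assumes the copy of cancer.dta shipped with Stata (loaded with sysuse), with variables studytime, died and age; the course copy documented in Lesson 1 may differ, and the model specifications are chosen purely for illustration.

```stata
* Load the example data shipped with Stata
* (assumption: comparable to the course's cancer.dta)
sysuse cancer, clear

* Declare survival time data: spell length and failure indicator (Lesson 3)
stset studytime, failure(died)

* Kaplan-Meier product-limit estimate of the survivor function (Lesson 4)
sts list

* Lifetable estimate, using the original time and event variables (Lesson 4)
ltable studytime died

* Continuous time models (Lesson 5): Cox, then a parametric Weibull model
stcox age
streg age, distribution(weibull)
```

The do files that accompany the Lessons remain the reference for the exact commands used in the course.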
To view the pdf files, you need the Adobe Reader. If you do not already have it, it can be downloaded for free from the Adobe Reader website.
Other related materials, including draft book manuscript
- Module materials for EC968, including topic outline, reading list, assessment, and previous examination papers
- Survival Analysis by Stephen P. Jenkins (draft book manuscript)
- ex.zip (zip file containing all the do files)
Stata data sets
There are a number of sample data sets referred to in the Lessons and Exercises:
auto.dta, cancer.dta, kva.dta, kennan.dta, duration.dta, unemp.dta, bc.dta, hmohiv.dta, dropout.dta.
The data sets are documented (and sources acknowledged) in Lesson 1.
All the data sets are contained in a single zip file: dta.zip (37Kb)
See section 7.2 of Lesson 1 above (ec968st1).
Stata programs for survival analysis written by S.P. Jenkins
pgmhaz is a program for discrete time proportional hazards regression. It estimates the models proposed by Prentice and Gloeckler (Biometrics 1978) and Meyer (Econometrica 1990), and was circulated in the Stata Technical Bulletin STB-39 (insert ‘sbe17’). pgmhaz runs with Stata version 5 or later. Users with version 8.2 or later should use pgmhaz8.
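For orientation, discrete time proportional hazards models of this kind are usually written with a complementary log-log hazard, with frailty entering as an extra additive term inside the exponent. The notation below (interval index j, covariate vector X, coefficients β, interval-specific baseline terms γ_j, frailty ε with variance σ²) is assumed for illustration and is not taken from this page; the STB article gives the exact specification that pgmhaz implements.

```latex
% Assumed notation: j indexes intervals, X is the covariate vector,
% \gamma_j are interval-specific baseline terms, \varepsilon is the frailty term.
\[
  h_j(X) = 1 - \exp\!\bigl[-\exp(\beta' X + \gamma_j)\bigr]
  \quad\Longleftrightarrow\quad
  \log\!\bigl[-\log\bigl(1 - h_j(X)\bigr)\bigr] = \beta' X + \gamma_j
\]
\[
  h_j(X \mid \varepsilon) = 1 - \exp\!\bigl[-\exp(\beta' X + \gamma_j + \log \varepsilon)\bigr],
  \qquad \varepsilon \sim \text{Gamma with mean } 1 \text{ and variance } \sigma^2
\]
```

The first line is the model without frailty; the second adds the gamma-distributed term, and it reduces to the first as σ² approaches zero.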
A pre-print of the STB article is available from here (STB-39-pgmhaz.pdf).
Get the programs by typing
    net describe sbe17, from(http://www.stata.com/stb/stb39)
or
    ssc install pgmhaz8
in an up-to-date Stata.
The program estimates by maximum likelihood (ML) two discrete time (grouped duration data) proportional hazards regression models, one of which incorporates a gamma mixture distribution to summarize unobserved individual heterogeneity (or ‘frailty’). Covariates may include regressor variables summarizing observed differences between persons (either fixed or time-varying), and variables summarizing the duration dependence of the hazard rate. With suitable definition of covariates, models with a fully non-parametric specification for duration dependence may be estimated; so too may parametric specifications. Your data must be suitably organised before using the program: see the help file after installation, the STB article, or Lesson 3. The program is used in Lesson 8.
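The data organisation mentioned above is, in the usual treatment of discrete time models, a person-period layout: one row per person per interval at risk, with a binary outcome equal to one only in the interval where the event occurs. The following is a minimal sketch under that assumption. The toy data and all variable names (id, time, dead, x, j, y, logj) are hypothetical; Lesson 3 and the help file describe the layout the programs actually require.

```stata
* Toy data, one row per person (hypothetical values):
*   time = number of intervals observed, dead = 1 if spell ended in an event
clear
input id time dead x
1 3 1 0
2 2 0 1
3 4 1 1
end

* One row per person per interval at risk
expand time

* Interval sequence number within person
bysort id: gen j = _n

* Binary outcome: 1 only in the last interval of a completed spell
bysort id: gen byte y = (dead == 1 & _n == _N)

* Example duration dependence covariate (log of elapsed time)
gen logj = ln(j)

list id j y x logj, sepby(id)

* The reorganised data could then be passed to pgmhaz8;
* see "help pgmhaz8" after installation for its syntax.
```

Interval-specific dummy variables built from j, instead of logj, would give a non-parametric specification for duration dependence.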
Note: the likelihood ratio test of whether the gamma variance is equal to zero that pgmhaz reports does not take account of the fact that the null distribution is not the usual chi-squared(d.f. = 1) but is rather a 50:50 mixture of a chi-squared(d.f. = 0) variate (which is a point mass at zero) and chi-squared(d.f. = 1). See Gutierrez et al. (2001) for more details (Gutierrez, R.G., Carter, S., and Drukker, D., ‘On boundary-value likelihood-ratio tests’, insert sg160, Stata Technical Bulletin, STB-60, StataCorp, College Station TX). In the meantime, note that the LR test statistic reported by pgmhaz is correct, but the correct p-value for the test is half the reported p-value. The correct p-value is reported by pgmhaz8.
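The halving can be checked by hand with Stata's chi2tail() function. The sketch below uses a made-up LR statistic of 3.2, chosen only for illustration; substitute the value that pgmhaz reports. The calculation applies when the statistic is strictly positive.

```stata
* Hypothetical LR statistic for H0: gamma variance = 0 (illustrative value only)
scalar LR = 3.2

* Naive p-value, treating the null distribution as chi-squared(1)
scalar p_naive = chi2tail(1, LR)

* Boundary-corrected p-value: 50:50 mixture of chi-squared(0) and chi-squared(1)
scalar p_mix = 0.5*chi2tail(1, LR)

display "naive p = " %6.4f scalar(p_naive) "   corrected p = " %6.4f scalar(p_mix)
```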
Discrete time hazard models with Normally distributed unobserved heterogeneity (rather than Gamma) can now be estimated in Stata. See also Lesson 7.
spsurv is a program for estimating ‘split population’ survival models, otherwise known in biostatistics as ‘cure’ models. Like pgmhaz, it is for discrete time (grouped duration) data. It runs with Stata version 6 or later. The data need to be organised in the same way as for pgmhaz (see above) and one may also use time-varying covariates or non-parametric duration dependence in the same way.
A copy of the presentation discussing the program, given at the 7th UK Stata Users’ Group meeting (May 2001), can be downloaded from here (UKSUG7-spsurv.pdf).
In the standard survival model, all cases are assumed to fail within finite time. The split population model generalises this to suppose that an estimable fraction of the population never fails. Thus there is a form of mover-stayer heterogeneity within the population.
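In symbols, the split population idea can be sketched as a mixture of two groups. The notation is assumed for illustration and is not taken from the program's documentation: p is the fraction of the population that never fails, and S_m(t) is the survivor function of those who eventually fail, which goes to zero as t grows.

```latex
% Assumed notation: p = fraction that never fails ("stayers"),
% S_m(t) = survivor function of those who eventually fail ("movers").
\[
  S(t) = p + (1 - p)\, S_m(t),
  \qquad
  \lim_{t \to \infty} S(t) = p
\]
```

The standard survival model is the special case p = 0, in which the population survivor function falls to zero.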
hshaz is a program for discrete time proportional hazards regression but, unlike pgmhaz8, it assumes that the mixture distribution summarizing frailty is a discrete one, following Heckman and Singer (1984). The distribution is characterised by a number of ‘mass points’ and associated probabilities. (The locations of the mass points and their probabilities are estimable parameters; the number of mass points may be chosen by the user, with two being the default.)
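A discrete mixture of this kind can be sketched as follows. The notation is assumed for illustration and is not taken from the program's documentation: K mass points m_k with probabilities π_k, a complementary log-log hazard for each latent type, and a likelihood contribution for person i that averages over the types.

```latex
% Assumed notation: m_k = location of mass point k, \pi_k = its probability,
% L_i(m_k) = likelihood contribution of person i conditional on type k.
\[
  h_{jk}(X) = 1 - \exp\!\bigl[-\exp(m_k + \beta' X + \gamma_j)\bigr],
  \qquad k = 1, \dots, K
\]
\[
  L_i = \sum_{k=1}^{K} \pi_k \, L_i(m_k),
  \qquad \sum_{k=1}^{K} \pi_k = 1
\]
```

With the default of two mass points, K = 2 and a single probability is estimated, the other being its complement.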
Get the program by typing
    ssc install hshaz
in an up-to-date Stata.