# Survival Analysis with Stata

This is the web site for the Survival Analysis with Stata materials prepared by Professor Stephen P. Jenkins (formerly of the Institute for Social and Economic Research, now at the London School of Economics and a Visiting Professor at ISER). The materials have been used in the Survival Analysis component of the University of Essex MSc module EC968, in the Survival Analysis course taught annually at the University of Essex Summer School, and at various other short courses, e.g. those organised by the Centre for Microdata Methods and Practice. Please email your comments and suggestions to Stephen Jenkins. The permanent URL for these pages is http://www.iser.essex.ac.uk/survival-analysis.

These pages were first made available in January 2000, and were then based on Stata version 6. This June 2008 release is based on Stata version 10.

## Aims

- To provide an introduction to the analysis of spell duration data (‘survival analysis’); and
- To show how the methods can be implemented using Stata, a program for statistics, graphics and data management.

The focus of the Lessons is on models for single-spell survival time data with no left censoring or left truncation (see the Lecture Notes for more details about these issues).

## How to use these resources

These materials are a do-it-yourself learning resource. Work through the Lessons below in parallel with reading the draft book manuscript (see below). Each Lesson contains material to read, followed by exercises. Stata do files (names prefixed by ‘ex’) provide code to reproduce the material shown in the Lessons and also to do the exercises. You are encouraged to run the do files yourself (`do filename`), preferably after attempting the exercises by yourself!

You can download module materials from here. There are Lessons and related materials (pdf files), Exercises (Stata do files, i.e. plain ASCII text), and Data Sets (Stata dta files). See below. University of Essex readers: we recommend that you create a new subdirectory called ‘ec968’ in your ‘home’ directory (drive m: on the University of Essex network) and then download all the files to `m:\ec968`. (Change ‘ec968’ to some other name of your choosing, if you prefer.)

## Lessons

- Preliminaries – Introduction to Lessons and Stata (ec968st1)
- The shapes of hazard and survival functions (ec968st2)
- Preparing survival time data for analysis and estimation (ec968st3)
- Estimation of the (integrated) hazard and survivor functions: Kaplan-Meier product-limit and lifetable methods (ec968st4)
- Estimation: (i) continuous time models – parametric and Cox (ec968st5)
- Estimation: (ii) discrete time models (ec968st6)
- Unobserved heterogeneity (‘frailty’) (ec968st7)
- Competing risks models (ec968st8)
- Assorted other topics (ec968st9)

To view the pdf files, you need Adobe Reader. If you do not already have it, it can be downloaded for free from the Adobe Reader website.

## Other related materials, including draft book manuscript

- Module materials for EC968, including topic outline, reading list, assessment, and previous examination papers
- Survival Analysis by Stephen P. Jenkins (draft book manuscript)

## Do files

- ex1_1.do
- ex1_2.do
- ex2_1.do
- ex3_1.do
- ex4_1.do
- ex5_1.do
- ex6_1.do
- ex7_1.do
- ex8_1.do
- ex.zip (zip file containing all the do files)

## Stata data sets

There are a number of sample data sets referred to in the Lessons and Exercises:

auto.dta, cancer.dta, kva.dta, kennan.dta, duration.dta, unemp.dta, bc.dta, hmohiv.dta, dropout.dta.

The data sets are documented (and sources acknowledged) in Lesson 1.

All the data sets are contained in a single zip file: dta.zip (37Kb)

## Stata resources

See section 7.2 of Lesson 1 above (ec968st1).

## Stata programs for survival analysis written by S.P. Jenkins

#### pgmhaz(8)

This is a program for discrete time proportional hazards regression. It estimates the models proposed by Prentice and Gloeckler (Biometrics 1978) and Meyer (Econometrica 1990), and was circulated in the Stata Technical Bulletin STB-39 (insert ‘sbe17’). **pgmhaz** runs with Stata version 5 or later. Users with version 8.2 or later should use **pgmhaz8**.

A pre-print of the STB article is available from here (STB-39-pgmhaz.pdf).

Get the programs by typing `net describe sbe17, from(http://www.stata.com/stb/stb39)` or `ssc install pgmhaz8` in an up-to-date Stata.

The program estimates by maximum likelihood (**ML**) two discrete time (grouped duration data) proportional hazards regression models, one of which incorporates a gamma mixture distribution to summarize unobserved individual heterogeneity (or ‘frailty’). Covariates may include regressor variables summarizing observed differences between persons (either fixed or time-varying), and variables summarizing the duration dependence of the hazard rate. With suitable definition of covariates, models with a fully non-parametric specification for duration dependence may be estimated; so too may parametric specifications. Your data must be suitably organised before using the program: see the help file after installation, the STB article, or Lesson 3. The program is used in Lesson 8.
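As a rough illustration of what ‘suitably organised’ means, the sketch below expands one-row-per-person data into person-period form and fits the discrete time proportional hazards model without frailty using official Stata's `cloglog` command. All variable names (`id`, `T`, `dead`, `x`, `j`, `y`, `lnj`) are hypothetical and chosen for this example only; the help file and Lesson 3 remain the authoritative description of the layout the program expects.

```stata
* Assumed input: one row per person, with hypothetical variables
*   id    person identifier
*   T     number of intervals the person was observed at risk (integer >= 1)
*   dead  1 if the spell ended in failure, 0 if it was censored
*   x     a fixed covariate

* Create one row per person per interval at risk
expand T

* Interval sequence number within person: 1, 2, ..., T
bysort id: generate j = _n

* Binary outcome: 1 only in the last interval of a completed spell
bysort id: generate byte y = (dead == 1) & (_n == _N)

* One parametric choice for duration dependence: log of elapsed intervals
generate lnj = ln(j)

* Complementary log-log regression on the person-period data
cloglog y x lnj
```

Replacing `lnj` with a set of interval-specific dummy variables would give a non-parametric baseline hazard instead, provided failures are observed in every interval that has its own dummy.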

Note: the likelihood ratio test of whether the gamma variance is equal to zero that **pgmhaz** reports does not take account of the fact that the null distribution is not the usual chi-squared(d.f. = 1) but rather a 50:50 mixture of a chi-squared(d.f. = 0) variate (a point mass at zero) and a chi-squared(d.f. = 1) variate. See Gutierrez et al. (2001) for more details: Gutierrez, R.G., Carter, S., and Drukker, D., ‘On boundary-value likelihood-ratio tests’, insert sg160, Stata Technical Bulletin, STB-60, StataCorp, College Station TX. The LR test statistic that **pgmhaz** reports is correct, but the correct p-value for the test is half the reported p-value. **pgmhaz8** reports the correct p-value.
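The halving rule follows directly from the mixture form of the null distribution. In the short derivation below, $\mathrm{LR}_{\mathrm{obs}}$ is notation introduced here for the observed test statistic; it does not appear in the program output.

```latex
\Pr\!\left(\mathrm{LR} \ge \mathrm{LR}_{\mathrm{obs}} \mid H_0\right)
  = \tfrac{1}{2}\,\Pr\!\left(\chi^2_0 \ge \mathrm{LR}_{\mathrm{obs}}\right)
  + \tfrac{1}{2}\,\Pr\!\left(\chi^2_1 \ge \mathrm{LR}_{\mathrm{obs}}\right)
  = \tfrac{1}{2}\,\Pr\!\left(\chi^2_1 \ge \mathrm{LR}_{\mathrm{obs}}\right),
  \qquad \mathrm{LR}_{\mathrm{obs}} > 0 .
```

The first term vanishes because a chi-squared(d.f. = 0) variate is a point mass at zero. For example, an observed statistic of 2.706 has a conventional chi-squared(d.f. = 1) p-value of 0.10, so the boundary-corrected p-value is 0.05. In Stata the corrected value can be computed by hand with `display 0.5*chi2tail(1, 2.706)`.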

Discrete time hazard models with Normally distributed unobserved heterogeneity (rather than Gamma) can now be estimated in Stata. See also Lesson 7.

#### spsurv

This is a program for estimating ‘split population’ survival models, otherwise known in biostatistics as ‘cure’ models. Like **pgmhaz**, **spsurv** is for discrete time (grouped duration) data. It runs with Stata version 6 or later. The data need to be organised in the same way as for **pgmhaz** (see above) and one may also use time-varying covariates or non-parametric duration dependence in the same way.

A copy of the presentation discussing the program, given at the 7th UK Stata Users’ Group meeting (May 2001), can be downloaded from here (UKSUG7-spsurv.pdf).

In the standard survival model, all cases are assumed to fail within finite time. The split population model generalises this to suppose that an estimable fraction of the population never fails. Thus there is a form of mover-stayer heterogeneity within the population.
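The idea can be written compactly for discrete time data. The notation below is introduced for this sketch and is not necessarily the parameterisation used by the program: $\delta$ is the probability of eventually failing, $h(k)$ the interval hazard among those who eventually fail, $T_i$ the number of intervals person $i$ is observed, and $d_i = 1$ for a completed spell and $0$ for a censored one.

```latex
S_f(j) = \prod_{k=1}^{j} \bigl[ 1 - h(k) \bigr],
\qquad
S(j) = (1 - \delta) + \delta \, S_f(j),

L_i = \Bigl[ \delta \, h(T_i) \, S_f(T_i - 1) \Bigr]^{d_i}
      \Bigl[ (1 - \delta) + \delta \, S_f(T_i) \Bigr]^{1 - d_i}.
```

Setting $\delta = 1$ recovers the standard model, in which everyone eventually fails. With $\delta < 1$, the population survivor function $S(j)$ levels off at $1 - \delta$ rather than falling to zero.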

#### hshaz

This is a program for discrete time proportional hazards regression but, unlike **pgmhaz8**, **hshaz** assumes that the mixture distribution summarizing frailty is a discrete one, following Heckman and Singer (1984). The distribution is characterised by a number of ‘mass points’ and associated probabilities. (The location of the mass points, and probabilities, are estimable parameters; the number of mass points may be chosen by the user, with two being the default.)
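A sketch of the corresponding likelihood may help fix ideas. The symbols are introduced here for illustration and are assumptions rather than the program's own notation: $M$ mass points $m_1, \dots, m_M$ with probabilities $\pi_1, \dots, \pi_M$, interval-specific baseline terms $\gamma_k$, covariates $X_{ik}$, spell length $T_i$ in intervals, and censoring indicator $d_i$ ($1$ for a completed spell).

```latex
h_m(k \mid X_{ik}) = 1 - \exp\!\bigl[ -\exp\!\left( m_m + X_{ik}\beta + \gamma_k \right) \bigr],

L_i^{(m)} = \left[ \frac{h_m(T_i \mid X_{iT_i})}{1 - h_m(T_i \mid X_{iT_i})} \right]^{d_i}
            \prod_{k=1}^{T_i} \bigl[ 1 - h_m(k \mid X_{ik}) \bigr],

L_i = \sum_{m=1}^{M} \pi_m \, L_i^{(m)},
\qquad \sum_{m=1}^{M} \pi_m = 1 .
```

Each mass point shifts the intercept of the complementary log-log hazard, so one mass point usually has to be normalised (for example to zero) when the model also contains a constant term.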

Get the program by typing `ssc install hshaz` in an up-to-date Stata.