***********************************************
** SC968 PANEL DATA METHODS for SOCIOLOGISTS
** DO-FILE FOR LECTURE 1 part 1
***********************************************
***********************************************
** 1.1.0 OBJECTIVES
***********************************************
***********************************************
** 1.1.1 HOW TO WORK THROUGH THESE EXERCISES
***********************************************
***********************************************
** 1.1.2 GETTING STARTED
***********************************************
** cd m:
cd "C:\Users\rrluthra\Documents\teaching SC968"
clear all
use teaching
***********************************************
** 1.1.3 START LOOKING AT THE DATA
***********************************************
describe
summarize
summarize jbhrs
** Question: Is the mean generated by this command very meaningful? If not, why not?
** Answer: No! It's contaminated by negative codes which don't mean that a person works
** -9 or -7 hours a week - they indicate missing or otherwise invalid answers.
codebook
labelbook
** Declare all the negative codes, plus 98 and 99, as missing
mvdecode jbhrs, mv(-9 -8 -7 -2 -1 98 99)
***********************************************
** 1.1.4 MEAN AND MEDIAN HOURS OF WORK
***********************************************
** First, investigate missing values and decide which to exclude
tabulate jbhrs, m
** Exclude hours less than zero or greater than 97
** (note that 97 is coded "97 or above", so it's not clear what 98 or 99 mean)
sum jbhrs if jbhrs >=0 & jbhrs <= 97 // mean for everyone, including zero hours
sum jbhrs if jbhrs > 0 & jbhrs <= 97 // mean for those reporting +ve hours of work
sum jbhrs if jbhrs >=0 & jbhrs <= 97, d // the "detail" option gives medians
sum jbhrs if jbhrs > 0 & jbhrs <= 97, d
** note that simpler code would work, since we have already declared negative values plus 98 and 99 as missing.
** The output you should be getting is:
**          Mean        Median
** >= 0     33.48281    37
** >  0     34.00491    37
** The figures don't vary a great deal between the rows because the jbhrs variable excludes anyone who doesn't have a job.
** So our adjustment only excluded the small number of people who have a job but reported zero hours.
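** One compact alternative to the four summarize commands above (a sketch, not the only option):
** inrange() is false for missing values, and tabstat reports the mean, median and
** number of cases in a single table.
tabstat jbhrs if inrange(jbhrs, 0, 97), statistics(mean median n)
tabstat jbhrs if jbhrs > 0 & jbhrs <= 97, statistics(mean median n)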
***********************************************************
** 1.1.5 START LOOKING AT MENTAL HEALTH AND GENDER
***********************************************************
tab hlghq2 sex if hlghq2 >= 0, col nof
** Question: In very basic terms, what does this table tell you about the relationship between mental health and gender?
** Answer: Women appear to have higher scores on the GHQ scale, indicating a greater likelihood of psychological problems.
gen PM = hlghq2
recode PM (0/2 = 0) (3/12 = 1) (-9 -8 -7 = .)
tab hlghq2 PM, m
** the tab command at the end is used only to check that we have done the re-coding correctly.
tab PM sex, col nof chi2
** Question: How much more likely are women than men to be at risk of minor psychiatric disorder? Is this difference statistically significant?
** Answer: 31.47% of women against 21.83% of men are at such risk. This is a difference of just under 10 percentage points.
** The chi2 option, and the P-value of 0.000 which it generates, show that this difference is statistically significant.
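** A possible follow-up (a sketch, not part of the original exercise): prtest reports the
** difference between the two proportions together with a confidence interval.
** This assumes PM is coded 0/1 and sex takes exactly two values.
prtest PM, by(sex)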
*******************************************************************
** 1.1.6 OLS REGRESSION - REGRESS LIKERT SCORE ON SEX AND AGE
*******************************************************************
gen LIKERT = hlghq1
recode LIKERT -7 -8 -9 = .
tab LIKERT
tab age,m // to check for missing codes, etc - there aren't any
gen age2 = age * age
** We include the age-squared variable to capture non-linearities in the effect of age on the dependent variable
tab sex // shows that there are no outlying values
gen female = (sex == 2)
tab female // again, just for checking!
reg LIKERT age age2 female
** Question: Calculate the turning point as (- coefficient on age) / (twice the coefficient on age squared)
display -(_b[age]) /(2* _b[age2])
** Answer: It comes out at 52.46 years of age.
** Question: What proportion of the variation in the dependent variable is explained by this regression?
** Answer: 2.5% of the variation is explained. This is indicated by the R-squared statistic.
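** A possible extension (a sketch): nlcom applies the delta method to the same formula,
** so it also gives a standard error and confidence interval for the turning point.
** It must be run while the regression above is still the most recent estimation.
** The label "turning" is just a name chosen for this example.
nlcom (turning: -_b[age] / (2 * _b[age2]))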
*******************************************
** 1.1.7 IMPROVING THE OLS SPECIFICATION
*******************************************
bysort female: reg LIKERT age age2
reg LIKERT c.age##i.female c.age2##i.female
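** A sketch of how the interacted model can be used: a joint Wald test of the two
** interaction terms asks whether the age profile differs between men and women.
** It must follow the interacted regression directly, since testparm uses the latest estimates.
testparm i.female#c.age i.female#c.age2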
** Create the explanatory variables. The "if" qualifier leaves each new variable missing
** wherever the source variable is missing or has a negative (invalid) code.
gen ed_deg = (qfedhi == 1 | qfedhi == 2) if qfedhi >= 1 & qfedhi != .
gen ed_sec = (qfedhi >= 3 & qfedhi <= 6) if qfedhi >= 1 & qfedhi != .
gen partner = (mastat == 1 | mastat == 2) if mastat >= 1 & mastat != .
gen ue_sick = (jbstat == 3 | jbstat == 8) if jbstat >= 1 & jbstat != .
gen badhealth = hlstat if hlstat >= 1 & hlstat != .
reg LIKERT age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
reg LIKERT age age2 female ed_deg ed_sec partner ue_sick badhealth nch02, cluster(pid)
** female is constant within each sex, so it is left out of the separate regressions
bysort sex: reg LIKERT age age2 ed_deg ed_sec partner ue_sick badhealth nch02, cluster(pid)
** From separate regressions, it looks as though the coefficients on "partner" and "nch02"
** might be significant for women but not men.
** To test whether these differences are significant statistically, include them in a new regression
reg LIKERT age age2 ed_deg ed_sec i.partner##i.female ue_sick badhealth c.nch02##i.female, cluster(pid)
** Both interaction terms are highly significant.
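** The individual t-tests can be supplemented with a joint test (a sketch):
** testparm tests whether both interaction terms are zero at the same time.
** Run it directly after the interacted regression above.
testparm i.partner#i.female i.female#c.nch02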
****************************************
** 1.1.8 LOGIT AND PROBIT REGRESSIONS
****************************************
logit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
probit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
** Question: Does Amemiya's scaling factor 1.6 * (Probit coeffs) = (Logit coeffs) work?
** Answer: It gives broadly comparable estimates.
** Use the "display" command in Stata to compare the coefficients on ue_sick.
** _b[] refers to the most recent estimation, which here is the probit:
dis _b[ue_sick] * 1.6
** .54666448
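** To see the two numbers side by side, one option (a sketch) is to keep the probit
** coefficient in a scalar before re-running the logit.
** b_probit is a name made up for this example.
quietly probit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
scalar b_probit = _b[ue_sick]
quietly logit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
display "1.6 * probit = " 1.6 * b_probit "   logit = " _b[ue_sick]
scalar drop b_probit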
logit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
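margins, dydx(*) atmeans // marginal effects with all covariates held at their means
margins, dydx(*) // average marginal effects over the estimation sample
margins, dydx(*) at(age=(20 30 40 50)) // at set ages; age2 is a separate variable, so it is not updated with age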
probit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
margins, dydx(*) atmeans
reg PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02 // for comparison purposes
** You'll see that while the OLS coefficients are not identical to the logit or probit marginal effects,
** they are in the same ball park.
logit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
test ue_sick = badhealth
** The test does not reject the hypothesis that the two coefficients are identical.
** However, we shouldn't jump to the conclusion that being in poor health has
** the "same effect" on mental health as being unemployed or being on sick leave.
** Question: Why?
** Answer: One is a dichotomous variable, and the other is being treated as a continuous variable
** ranging from 1 to 5.
logit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
estimates store ALL
** Mark the estimation sample so that the restricted model uses exactly the same observations
gen byte SAMPLE = e(sample)
logit PM age age2 female ed_deg ed_sec ue_sick badhealth nch02 if SAMPLE
estimates store DROP_PNR
lrtest ALL DROP_PNR
drop SAMPLE
** Question: Should the "partner" variable be dropped?
** Answer: Yes, at the 5% level. The p-value of 0.0737 on the chi-2 statistic means that we cannot reject
** the null hypothesis that the coefficient on "partner" is zero, so the restricted specification fits as well.
** (At the 10% level the hypothesis would be rejected, so the evidence is borderline.)
** In fact, we could have discovered this from the P-value on the partner coefficient in the original regression.
** An LR test is especially useful when you are testing for the joint significance of multiple variables at once,
** such as interaction effects with a categorical variable.
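** A sketch of such a joint test, using the two education dummies as the example:
** the restricted model drops ed_deg and ed_sec together, and is fitted on the same sample.
** FULL_ED and NO_ED are names made up for this illustration.
quietly logit PM age age2 female ed_deg ed_sec partner ue_sick badhealth nch02
estimates store FULL_ED
quietly logit PM age age2 female partner ue_sick badhealth nch02 if e(sample)
estimates store NO_ED
lrtest FULL_ED NO_ED
estimates drop FULL_ED NO_ED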
****************************************
** 1.1.9 AT THE END OF THE SESSION
****************************************
save SESSION1-1, replace // "replace" lets the do-file be re-run without an error
***********************************************
** SC968 PANEL DATA METHODS for SOCIOLOGISTS
** DO-FILE FOR LECTURE 2
***********************************************
***********************************************
** 1.2.0 OBJECTIVES
***********************************************
***********************************************
** 1.2.1 APPENDING FILES
***********************************************
clear
use extra1
append using extra2
save newfile1, replace
des using extra1
des using extra2
des using newfile1
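** A quick check after appending (a sketch): if the two files overlap, the same pid-wave
** pair will appear more than once. This assumes both files contain pid and wave,
** which the merge in the next section also relies on.
duplicates report pid wave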
***********************************************
** 1.2.2 MERGING FILES
***********************************************
use SESSION1-1
merge 1:1 pid wave using newfile1
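** Before _merge is dropped, it is worth looking at it: the table shows how many
** observations came from the master file only, the using file only, or were matched.
tabulate _merge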
drop _merge
save SESSION1-2, replace
use hhtenure
merge 1:m wave hid using SESSION1-2
drop _merge
save SESSION1-2, replace
**************************************************
** 1.2.3 CONVERTING BETWEEN LONG AND WIDE FORMS
**************************************************
sort pid wave
keep in 1/50
keep pid wave age sex LIKERT
reshape wide LIKERT age, i(pid) j(wave)
reshape long LIKERT age, i(pid) j(wave)
**************************************************
** 1.2.4 OLS USING A LAGGED DEPENDENT VARIABLE
**************************************************
use SESSION1-2, clear
sort pid wave
gen LIKERT_LAG = LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
sort pid wave
list pid wave LIKERT LIKERT_LAG in 1/200
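** An alternative way to build the lag, shown as a sketch: once the data are declared
** as a panel with xtset, the L. operator returns the value from the previous wave.
** This assumes pid and wave uniquely identify each row (xtset stops with an error otherwise).
** LIKERT_LAG_TS is a name made up for this check.
xtset pid wave
gen LIKERT_LAG_TS = L.LIKERT
compare LIKERT_LAG LIKERT_LAG_TS
drop LIKERT_LAG_TS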
** Question: How many missing values are there for LIKERT and LIKERT_LAG? Why are there more missing values for LIKERT_LAG?
** Answer: There are 11557 missing values for LIKERT, and 14738 missing values for LIKERT_LAG.
** There are more missing values for the lagged variable because it is missing (not zero) for every observation in wave 1,
** and for any observation whose previous wave is absent from the data.
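** One way to obtain these counts (a sketch): misstable summarize lists the number of
** missing values per variable, and count gives the same figures one at a time.
misstable summarize LIKERT LIKERT_LAG
count if missing(LIKERT)
count if missing(LIKERT_LAG)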
reg LIKERT age age2 female ed_deg ed_sec partner ue_sick badhealth nch02, cluster(pid)
reg LIKERT LIKERT_LAG age age2 female ed_deg ed_sec partner ue_sick badhealth nch02, cluster(pid)
** Question: Compare the two sets of estimates. How do they differ?
** Answer: The r-squared statistic is much larger in the second specification, and the coefficient on LIKERT_LAG is very large.
** Most other coefficients are reduced in magnitude, some by around a half.
**************************************************
** 1.2.5 MODELS OF CHANGE
**************************************************
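** Change in LIKERT since the previous wave, built from the lagged variable
gen D_LIKERT = LIKERT - LIKERT_LAG
sort pid wave
** The same difference computed directly from the previous row (D_LIKERT2 should match D_LIKERT)
gen D_LIKERT2 = LIKERT - LIKERT[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1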
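** A quick check (a sketch) that the two constructions agree: compare reports how many
** observations are equal, differ, or are missing in one variable but not the other.
compare D_LIKERT D_LIKERT2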
gen D_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen D_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
gen D_badhealth = badhealth - badhealth[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
reg D_LIKERT D_partner D_ue_sick D_badhealth, cluster(pid)
** Question: What do these results tell us?
** Answer: Getting a partner is associated with feeling better (and/or losing a partner with feeling worse)
** Losing a job or going off sick is associated with feeling worse (and/or stopping being u/e or off sick with feeling better)
** Worsening health is associated with worse mental health (and/or the reverse)
** Note that because "good" and "bad" changes have been lumped together,
** we cannot tell whether the relationship is being driven by changes in both directions, or only one.
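** A minimal sketch of how the two directions could be separated, using partnership as the example.
** gain_partner and lose_partner are names made up for this illustration, and the sketch
** assumes D_partner only takes the values -1, 0 and 1.
gen byte gain_partner = (D_partner == 1) if !missing(D_partner)
gen byte lose_partner = (D_partner == -1) if !missing(D_partner)
reg D_LIKERT gain_partner lose_partner D_ue_sick D_badhealth, cluster(pid)
** If the two coefficients are similar in size but opposite in sign, the effect is roughly symmetric.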
**************************************************
** 1.2.6 AT THE END OF THE SESSION
**************************************************
save SESSION1-2, replace