<?xml version="1.0" encoding="UTF-8"?>
<paper xmlns="http://www.w3.org/2005/Atom">
  <title>Measuring Inequality Using Censored Data: A Multiple Imputation Approach</title>
  <url>http://www.iser.essex.ac.uk/publications/working-papers/iser/2009-04</url>
  <summary>For assessing trends in the inequality of earnings or of household income in the U.S.A., the March Current Population Survey (CPS) is the premier survey data source, widely used both within and outside government. The CPS is, however, subject to an important limitation: the data are right censored (&#8216;topcoded&#8217;). To protect confidentiality and to minimize disclosure risk, income values for any of the 24 income sources in internal CPS data that are above their source-specific threshold are replaced in the public use data files by the threshold itself (the &#8216;topcode&#8217;). Thus, for example, if someone reports wage and salary earnings of a million dollars, the value researchers will see in their data set is not one million dollars, but a lower value (the topcode). Furthermore, even the internal CPS data, used by the U.S. Census Bureau to produce official income distribution statistics, are topcoded for the same reasons, albeit to a substantially lesser degree than the public use data.

Topcoded data cause problems for inequality analysis because they censor the range of incomes that are observed. Inequality is underestimated because very high incomes appear as less high incomes. This problem would matter less for analysis of inequality trends over time if the nature and extent of topcoding were constant. However, CPS topcodes have changed over time in a number of ways, leading to a potentially serious time-inconsistency problem for inequality analysis. Topcoding also affects estimates of the statistical reliability of inequality measures.

We propose a &#8216;multiple imputation&#8217; approach to estimating inequality using topcoded data, exploiting results derived by Reiter (Survey Methodology, 2003). We argue that this approach provides good estimates of inequality measures and their statistical reliability. We use the method to analyze recent trends in household income inequality in the U.S.A. over the period 1995&#8211;2004, exploiting our unprecedented access to internal CPS data. The estimates based on the multiply imputed internal data series (which we label Internal-MI) form our gold standard reference point. They are compared with two sets of estimates derived from public use CPS data. The Public-MI series arises when we apply the multiple imputation approach to public use data rather than internal data, and the Public-CM series arises when topcoded values are replaced by cell mean incomes that are provided by the U.S. Census Bureau and derived from internal CPS data.

We show that the inequality of household income in the U.S.A. did not change significantly between 1995 and 2004. We find that the cell-mean-augmented data lead to substantial under-estimates of inequality levels in every year, though the trends over time are tracked relatively well. However, the statistical precision of estimates derived from the cell-mean-augmented distributions is over-estimated and, as a result, there is a tendency for inequality trends over the period to be shown (incorrectly) as statistically significant.

More positively, we also show that a multiple imputation approach applied to topcoded public use CPS data can yield results about income inequality that in several senses lie between those derived using multiple imputation applied to internal data and those derived using cell-mean-augmented public use data. This is helpful because few researchers find it practical to go through the procedures required to access the internal CPS data and to undertake the research using the data within a U.S. Census Bureau Data Center.

Although our arguments are developed with reference to CPS data, the methods are more widely applicable to other datasets and other countries: topcoding is a relatively common feature of income data sets.</summary>
  <abstract>To measure income inequality with right censored (topcoded) data, we propose multiple imputation for censored observations using draws from Generalized Beta of the Second Kind distributions to provide partially synthetic datasets analyzed using complete data methods. Estimation and inference use Reiter&#8217;s (Survey Methodology, 2003) formulae. Using Current Population Survey (CPS) internal data, we find few statistically significant differences in income inequality for pairs of years between 1995 and 2004. We also show that using CPS public use data with cell mean imputations may lead to incorrect inferences about inequality differences. Multiply-imputed public use data provide an intermediate solution.</abstract>
  <paper_series>Working Paper</paper_series>
  <series_number>2009-04</series_number>
  <published_date>2009-02-09</published_date>
  <author>
    <firstname>Stephen</firstname>
    <familyname>Jenkins</familyname>
    <instutitue>Institute for Social and Economic Research</instutitue>
    <email>stephenj@essex.ac.uk</email>
  </author>
  <author>
    <firstname>Richard</firstname>
    <familyname>Burkhauser</familyname>
    <instutitue>Cornell University</instutitue>
    <email>rvb1@cornell.edu</email>
  </author>
  <author>
    <firstname>Shuaizhang</firstname>
    <familyname>Feng</familyname>
    <instutitue>Shanghai University of Finance and Economics</instutitue>
  </author>
  <author>
    <firstname>Jeff</firstname>
    <familyname>Larrimore</familyname>
    <instutitue>Cornell University</instutitue>
  </author>
</paper>
