Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning
This paper considers the problem of linking individuals in the Health and Retirement Study, a household-level survey, to workplaces in the Census Business Register, a population-scale administrative data set on the universe of employers. The authors handle two statistical issues associated with record linkage in this setting. First, unique identifiers are not systematically available, so linkage must be probabilistic rather than deterministic. Second, household-to-employer record linkage is challenging because the distribution of workers across employers is highly skewed. Consequently, each household survey respondent has tens of thousands of false candidate matches that must be appropriately downweighted or removed from consideration when linking records. The authors develop a novel method that uses machine learning to improve linkage prediction and multiple imputation to propagate linkage uncertainty, creating a new data set known as the CenHRS. They then use the CenHRS to re-examine previous studies’ findings that larger employers pay observationally equivalent workers higher wages compared with smaller employers.
Key Findings
- The authors’ analysis of the positive gradient between wages and employer size using the CenHRS shows that both non-classical measurement error and selective non-response in Health and Retirement Study respondents’ reports of workplace size generate upward bias in this gradient.
Implications
The paper’s findings serve as new evidence on how household survey responses about workplace characteristics are selectively misreported or not reported at all. With the CenHRS linkages to administrative information on workplaces, the authors are able to characterize measurement and non-response errors that are not observed in other household-survey data sets. Furthermore, the CenHRS could provide researchers novel ways to investigate wide-ranging questions about the roles that employer- and workplace-specific factors play in influencing wages, consumption and savings decisions, health outcomes, and retirement choices of older workers.
Abstract
This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents’ workplace characteristics.
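The general ML-MI idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that some classifier has already assigned each respondent a set of candidate employers with match probabilities. It then draws several completed link sets in proportion to those probabilities and pools the per-draw estimates with Rubin's combining rules, so that linkage uncertainty carries through to the final variance. All names (`draw_links`, `rubin_combine`, the respondent and employer identifiers, the probabilities, and the employer sizes) are hypothetical.

```python
import math
import random
from statistics import mean, variance


def draw_links(candidates, m, rng):
    """Draw m completed link sets.

    candidates maps each respondent to a list of
    (employer_id, match_probability) pairs. The probabilities are assumed
    to come from a previously fitted classifier (not shown here).
    """
    imputations = []
    for _ in range(m):
        links = {}
        for respondent, cands in candidates.items():
            ids = [employer for employer, _ in cands]
            weights = [prob for _, prob in cands]
            # Sample one employer in proportion to its match probability.
            links[respondent] = rng.choices(ids, weights=weights, k=1)[0]
        imputations.append(links)
    return imputations


def rubin_combine(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)
    between = variance(estimates) if m > 1 else 0.0
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical inputs: candidate employers per respondent, employer sizes.
candidates = {
    "r1": [("e1", 0.7), ("e2", 0.3)],
    "r2": [("e2", 0.5), ("e3", 0.5)],
    "r3": [("e1", 0.1), ("e3", 0.9)],
}
employer_size = {"e1": 12, "e2": 450, "e3": 9000}

rng = random.Random(0)
imputations = draw_links(candidates, m=20, rng=rng)

# Estimand for illustration: mean log employer size across respondents.
estimates, variances = [], []
for links in imputations:
    values = [math.log(employer_size[e]) for e in links.values()]
    estimates.append(mean(values))
    variances.append(variance(values) / len(values))

pooled, total_variance = rubin_combine(estimates, variances)
print(pooled, total_variance)
```

Sampling links rather than keeping only the single most probable candidate matters when many candidates have similar scores: the between-imputation term in the pooled variance then reflects how much the estimate depends on which link is chosen.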