Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning
This paper considers the problem of linking individuals in the Health and Retirement Study, a household-level survey, to workplaces in the Census Business Register, a population-scale administrative data set covering the universe of employers. The authors address two statistical issues that arise in record linkage in this setting. First, unique identifiers are not systematically available, so the linkage method is probabilistic rather than deterministic. Second, household-to-employer record linkage is challenging because the distribution of workers across employers is highly skewed. Consequently, there are tens of thousands of false matches for each household survey respondent that must be appropriately downweighted or removed from consideration entirely when linking records. The authors develop a novel method that uses machine learning to improve linkage prediction and multiple imputation to propagate linkage uncertainty, and they apply it to create a new data set known as the CenHRS. They then use the CenHRS to re-examine previous studies’ findings that larger employers pay observationally equivalent workers higher wages than smaller employers do.
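
To illustrate the general idea of propagating linkage uncertainty through multiple imputation, the following is a minimal sketch and not the authors' implementation. It assumes that a model has already assigned each respondent a set of candidate employers with match probabilities, draws one employer per respondent in each imputed data set, and pools per-imputation estimates with Rubin's combining rules, a standard choice that the abstract does not confirm. The names `draw_links` and `rubin_combine` and all input values are hypothetical.

```python
import random
import statistics


def draw_links(candidates, m, rng):
    """Draw m imputed respondent-to-employer links.

    candidates maps each respondent to a list of
    (employer_id, match_probability) pairs (hypothetical model output).
    """
    imputations = []
    for _ in range(m):
        links = {}
        for respondent, options in candidates.items():
            employers = [employer for employer, _ in options]
            weights = [prob for _, prob in options]
            # Sample one employer in proportion to its match probability.
            links[respondent] = rng.choices(employers, weights=weights, k=1)[0]
        imputations.append(links)
    return imputations


def rubin_combine(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    # Between-imputation variance reflects uncertainty about the links.
    between = statistics.variance(estimates) if m > 1 else 0.0
    total = within + (1 + 1 / m) * between
    return pooled, total


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative values only: two respondents, a few candidate employers.
    candidates = {
        "resp_1": [("firm_A", 0.7), ("firm_B", 0.3)],
        "resp_2": [("firm_C", 0.9), ("firm_D", 0.1)],
    }
    for links in draw_links(candidates, m=3, rng=rng):
        print(links)
```

In a sketch like this, an analysis would be run once on each imputed data set, and the resulting estimates and variances passed to `rubin_combine`, so that disagreement across imputed links widens the pooled variance.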