Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning

By John M. Abowd, Joelle Abramowitz, Margaret C. Levenstein, Kristin McCue, Dhiren Patki, Trivellore Raghunathan, Ann M. Rodgers, Matthew D. Shapiro, Nada Wasi, and Dawn Zinsser

Full Text Document (pdf)

This paper considers the problem of linking individuals in the Health and Retirement Study, a household-level survey, to workplaces in the Census Business Register, a population-scale administrative data set on the universe of employers. The authors handle two statistical issues associated with record linkage in this setting. First, unique identifiers are not systematically available, so the linkage method is probabilistic rather than deterministic. Second, household-to-employer record linkage is challenging because the distribution of workers across employers is highly skewed. Consequently, there are tens of thousands of false matches for each household survey respondent that must be appropriately downweighted or completely removed from consideration when linking records. The authors develop a novel method that relies on machine learning to improve linkage prediction and on multiple imputation to propagate linkage uncertainty, and they use it to create a new data set known as the CenHRS. They then use the CenHRS to re-examine previous studies’ findings that larger employers pay observationally equivalent workers higher wages compared with smaller employers.
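As a rough illustration of what "propagating linkage uncertainty" can mean in practice, the sketch below draws several plausible employer links per respondent, with probability proportional to a match score. It is not the authors' implementation: the function name `draw_imputed_links`, the respondent and employer labels, and the scores are all hypothetical, and the scores stand in for whatever a trained classifier would output.

```python
import random


def draw_imputed_links(candidate_scores, n_imputations, seed=0):
    """Draw n_imputations employer links per respondent.

    candidate_scores maps respondent -> {employer: non-negative match score}.
    Each draw picks one candidate employer with probability proportional
    to its score, so low-scoring candidates are rarely (but not never) chosen.
    """
    rng = random.Random(seed)
    draws = {}
    for respondent, scores in candidate_scores.items():
        employers = list(scores)
        weights = [scores[e] for e in employers]
        draws[respondent] = rng.choices(employers, weights=weights, k=n_imputations)
    return draws


# Hypothetical scores; real candidate sets would be far larger.
candidate_scores = {
    "resp_1": {"emp_A": 0.90, "emp_B": 0.08, "emp_C": 0.02},
    "resp_2": {"emp_D": 0.55, "emp_E": 0.45},
}
links = draw_imputed_links(candidate_scores, n_imputations=5)
for respondent, imputed in links.items():
    print(respondent, imputed)
```

Each position in the drawn lists can be read as one completed version of the linked data set. A respondent with one dominant candidate receives nearly the same link in every version, while an ambiguous respondent receives links that vary across versions.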

Key Findings

The authors’ analysis of the positive gradient between wages and employer size using the CenHRS shows that both non-classical measurement error and selective non-response in the Health and Retirement Study reports of workplace size generate upward bias in this gradient.
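A textbook derivation shows why the distinction between classical and non-classical measurement error matters for the direction of the bias. The notation below is an assumption introduced for illustration and does not come from the paper: $s$ is true (log) employer size, $s^{*} = s + u$ is the survey report, and wages follow $w = \alpha + \beta s + \varepsilon$ with $\operatorname{Cov}(s, \varepsilon) = 0$. This summary does not say which channel the authors find to be at work.

```latex
\operatorname{plim}\,\hat{\beta}
  = \frac{\operatorname{Cov}(w, s^{*})}{\operatorname{Var}(s^{*})}
  = \frac{\beta\,(\sigma_s^{2} + \sigma_{su}) + \sigma_{\varepsilon u}}
         {\sigma_s^{2} + 2\sigma_{su} + \sigma_u^{2}}
```

Under classical error, $\sigma_{su} = \sigma_{\varepsilon u} = 0$ and the expression reduces to $\beta\,\sigma_s^{2} / (\sigma_s^{2} + \sigma_u^{2})$, which is smaller than $\beta$ when $\beta > 0$. Upward bias therefore requires a non-classical component, for example $\sigma_{\varepsilon u} > 0$ (reporting error that is positively correlated with the wage residual) or $\sigma_{su} < -\sigma_u^{2}$ (strongly mean-reverting reports).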

Implications

The paper’s findings serve as new evidence on how household survey responses about workplace characteristics are selectively misreported or not reported at all. With the CenHRS linkages to administrative information on workplaces, the authors are able to characterize measurement and non-response errors that are not observed in other household-survey data sets. Furthermore, the CenHRS could provide researchers with novel ways to investigate wide-ranging questions about the roles that employer- and workplace-specific factors play in influencing the wages, consumption and savings decisions, health outcomes, and retirement choices of older workers.

Abstract

This paper considers the problem of record linkage between a household-level survey and an establishment-level frame in the absence of unique identifiers. Linkage between frames in this setting is challenging because the distribution of employment across establishments is highly skewed. To address these difficulties, this paper develops a probabilistic record linkage methodology that combines machine learning (ML) with multiple imputation (MI). This ML-MI methodology is applied to link survey respondents in the Health and Retirement Study to their workplaces in the Census Business Register. The linked data reveal new evidence that non-sampling errors in household survey data are correlated with respondents’ workplace characteristics.
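Once an analysis has been run on each multiply imputed data set, the standard way to pool the results is Rubin's combining rules, sketched below. This is the generic multiple-imputation formula, not code from the paper. The function name `combine_rubin` is hypothetical and the numbers are made up purely to exercise it.

```python
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Pool per-imputation results with Rubin's rules.

    estimates: point estimates, one per imputed data set (at least two).
    variances: squared standard errors, one per imputed data set.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Made-up inputs, not results from the paper.
pooled, total = combine_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled, total)
```

The between-imputation term is what carries linkage uncertainty into the final standard error: if every imputed link produced the same estimate, that term would be zero and the total variance would equal the ordinary sampling variance.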

Resources

Full Text Document (pdf)

Site Topics

Monetary Policy & Economic Research

Keywords

administrative data, machine learning, multiple imputation, probabilistic record linkage, survey data

JEL Codes

C13, C18, C81

Citation

Abowd, John M., Joelle Abramowitz, Margaret C. Levenstein, Kristin McCue, Dhiren Patki, Trivellore Raghunathan, Ann M. Rodgers, Matthew D. Shapiro, Nada Wasi, and Dawn Zinsser. 2022. “Finding Needles in Haystacks: Multiple-Imputation Record Linkage Using Machine Learning.” Federal Reserve Bank of Boston Research Department Working Papers No. 22-11. https://doi.org/10.29412/res.wp.2022.11

2022 • 22–11

Research Department Working Papers

Contributing business areas: Research