Clean-up and Preliminary Analysis for Sets of National Files


The goal of clean-up methods is to improve the quality of files so that they are suitable for economic and statistical analyses. To fill in missing data and ‘correct’ fields, we need generalized software that implements the Fellegi-Holt model (JASA 1976) to preserve joint distributions and ensure that records satisfy edits. To identify and correct duplicates within and across files, we need generalized software that implements the Fellegi-Sunter model (JASA 1969). The aim of the clean-up procedures is to reduce the error in files to at most 1%, a level not currently attainable in many situations.
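As a rough illustration of the Fellegi-Sunter decision rule mentioned above, the following minimal sketch scores a candidate record pair by summing log-likelihood ratios over compared fields and then classifies the pair with two thresholds. The field names, the m- and u-probabilities, and the thresholds are hypothetical values chosen only for illustration; in practice they would be estimated from the data, and this sketch is not the production software described in this abstract.

```python
import math

# Hypothetical (m, u) probabilities per field, for illustration only:
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
M_U = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip": (0.85, 0.10),
}


def match_weight(agreements):
    """Sum the log2 likelihood ratios over the compared fields.

    `agreements` maps a field name to True (agree) or False (disagree).
    Assumes conditional independence of fields given match status.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, lower=0.0, upper=8.0):
    """Three-way decision; the cutoffs here are assumed, not estimated."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


all_agree = {"last_name": True, "birth_year": True, "zip": True}
mixed = {"last_name": True, "birth_year": False, "zip": False}
none_agree = {"last_name": False, "birth_year": False, "zip": False}

for pair in (all_agree, mixed, none_agree):
    print(round(match_weight(pair), 2), classify(match_weight(pair)))
```

Pairs that fall between the two thresholds are the ones typically held for clerical review; moving the thresholds trades false matches against false non-matches.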

In this presentation, we cover methods of modeling/edit/imputation and record linkage that naturally morph into methods of adjusting statistical analyses of files for linkage error. The modeling/edit/imputation software has four algorithms, each of which may be 100 times as fast as algorithms in commercial or experimental university software.

The record linkage software used in the 2010 Decennial Census matches 10^17 pairs (300 million x 300 million) in 30 hours using 40 CPUs on an SGI Linux machine. It is 50 times as fast as recent parallel software from Stanford (Kawai et al. 2006) and 500 times as fast as software used in some agencies (Wright 2010).

The main parameter-estimation methods apply the EMH algorithm (Winkler 1993), which generalizes the ECM algorithm (Meng and Rubin 1993) from linear to convex constraints. Following the introduction of the two quality methods, we cover some of the research into adjusting statistical analyses for linkage error, which began with Scheuren and Winkler (1993) and remains an area needing considerable additional research. With these new methods, a skilled team can do data mining on a set of national files in 3-6 months; with methods and software that are 100-1000 times as slow, it is not clear how long the work would take.