Using Occupancy Models to Estimate the Number of Duplicate Cases in a Data System without Unique Identifiers

by Ruiguang Song, Timothy Green, Matthew McKenna, and M. Kathleen Glynn

Journal of Data Science, v.5, no.1, 53-66

Abstract

Data systems collecting information from different sources or over long periods of time can receive multiple reports from the same individual. An important example is public health surveillance systems that monitor conditions with long natural histories. Several state-level systems for surveillance of one such condition, the human immunodeficiency virus (HIV), use codes composed of combinations of non-unique personal characteristics such as birth date, soundex (a code based on last name), and sex as patient identifiers. As a result, these systems cannot distinguish between several different individuals having identical codes and a unique individual erroneously represented several times. We applied results for occupancy models to estimate the potential magnitude of duplicate case counting for AIDS cases reported to the Centers for Disease Control and Prevention with only non-unique partial personal identifiers. Occupancy models with equal and unequal occupancy probabilities are considered. Unbiased estimators for the numbers of true duplicates within and between case reporting areas are provided. Formulas to calculate estimators' variances are also provided. These results can be applied to evaluating duplicate reporting in other data systems that have no unique identifier for each individual.
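As a rough illustration of the equal-probability case, the sketch below uses the classical occupancy result: if n distinct individuals are assigned independently and uniformly to m possible codes, the expected number of distinct codes observed is m(1 - (1 - 1/m)^n). The difference between n and that value is the expected number of records that would look like duplicates purely by chance. This is a textbook calculation, not the estimators developed in the article; the function names and the numbers in the usage lines are hypothetical.

```python
def expected_occupied_codes(n: int, m: int) -> float:
    """Expected number of distinct codes seen when n individuals are
    assigned independently and uniformly at random to m codes
    (classical occupancy model with equal probabilities)."""
    if n < 0 or m < 1:
        raise ValueError("require n >= 0 and m >= 1")
    # Each code stays empty with probability (1 - 1/m)**n.
    return m * (1.0 - (1.0 - 1.0 / m) ** n)


def expected_chance_matches(n: int, m: int) -> float:
    """Expected number of records that share a code with an earlier record
    even though every record belongs to a different individual."""
    return n - expected_occupied_codes(n, m)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    n_individuals = 10_000
    n_codes = 5_000_000
    print(expected_chance_matches(n_individuals, n_codes))
```

Under this simplified model, apparent duplicates in excess of the chance-match expectation would point toward true duplicate reporting. Real identifier codes are unlikely to be equally probable, which is presumably why the unequal-probability case is also considered.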
