Published April 4, 2023 | Version v1
Journal article Open

No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile

  • 1. University of Maryland
  • 2. University of Chicago
  • 3. Arizona State University
  • 4. University of Pennsylvania
  • 5. National Bureau of Economic Research

Description

While linking records across large administrative datasets ("big data") has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms, which use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when the algorithm has access to "ground-truth" examples: matches that can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use "active learning" algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm the popular intuition that data linking can be dramatically improved when ground-truth examples are available. But critically, in many real-world applications, only a relatively small number of tactically selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers using a readily available off-the-shelf tool can approximate the performance of a supervised learning algorithm that has access to a large database of ground-truth examples.
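To make the idea of active learning for linking concrete, the following is a minimal sketch in Python. It is not the paper's method and not the off-the-shelf tool the authors refer to. It assumes uncertainty sampling (querying the candidate pair whose predicted match probability is closest to 0.5), a small hand-rolled logistic regression, and two illustrative similarity features. The records, the `oracle` function standing in for a human reviewer, and all function names are hypothetical.

```python
import math
from difflib import SequenceMatcher

def features(a, b):
    """Similarity features for one candidate pair (illustrative choice)."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    dob_match = 1.0 if a["dob"] == b["dob"] else 0.0
    return [1.0, name_sim, dob_match]  # leading 1.0 acts as the intercept

def predict(w, x):
    """Predicted probability that the pair is a match."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, epochs=500, lr=0.5):
    """Logistic regression fitted by plain stochastic gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = predict(w, x)
            w = [wi + lr * (t - p) * xi for wi, xi in zip(w, x)]
    return w

def active_learn(pairs, oracle, seed_idx, budget):
    """Label the seed pairs, then repeatedly query the most uncertain pair."""
    X = [features(a, b) for a, b in pairs]
    labeled = {i: oracle(i) for i in seed_idx}

    def refit():
        idx = sorted(labeled)
        return fit([X[i] for i in idx], [labeled[i] for i in idx])

    w = refit()
    for _ in range(budget):
        unlabeled = [i for i in range(len(pairs)) if i not in labeled]
        if not unlabeled:
            break
        # Uncertainty sampling: pick the pair closest to probability 0.5.
        query = min(unlabeled, key=lambda i: abs(predict(w, X[i]) - 0.5))
        labeled[query] = oracle(query)  # ask the reviewer for ground truth
        w = refit()
    return w, labeled

# Hypothetical candidate pairs; `truth` plays the role of the human reviewer.
pairs = [
    ({"name": "John Smith", "dob": "1980-01-02"}, {"name": "John Smith", "dob": "1980-01-02"}),
    ({"name": "John Smith", "dob": "1980-01-02"}, {"name": "Maria Lopez", "dob": "1975-06-30"}),
    ({"name": "Jon Smith", "dob": "1980-01-02"}, {"name": "John Smith", "dob": "1980-01-02"}),
    ({"name": "Ann Lee", "dob": "1990-03-03"}, {"name": "Anne Lee", "dob": "1990-03-03"}),
    ({"name": "Ann Lee", "dob": "1990-03-03"}, {"name": "Alan Reed", "dob": "1962-11-09"}),
    ({"name": "Maria Lopez", "dob": "1975-06-30"}, {"name": "Mario Lopez", "dob": "1981-02-14"}),
]
truth = [1, 0, 1, 1, 0, 0]

def oracle(i):
    return truth[i]

weights, labeled = active_learn(pairs, oracle, seed_idx=[0, 1], budget=2)
print("labeled pairs:", sorted(labeled))
for i, (a, b) in enumerate(pairs):
    print(i, round(predict(weights, features(a, b)), 3))
```

In this sketch the reviewer is consulted only for the two seed pairs plus `budget` additional pairs, which mirrors the abstract's point that a small number of well-chosen labels may be enough. A real application would use richer features, blocking to limit candidate pairs, and a held-out evaluation.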

Data availability

The dataset we used in the present study contains the universe of criminal charges filed in Oregon courts between 1990 and 2012. The dataset includes personally identifiable information of each defendant, including name (first, last), date of birth, race, and sex. The dataset is maintained by the Oregon Judicial Department. Individuals who wish to replicate the analyses in the paper can apply to access the data via a public records request (https://www.oregon.gov/cjc/about/Pages/Public-Records-Request.aspx), or through a data sharing agreement with the Oregon Judicial Department. Interested parties may contact the Oregon Judicial Department through OJCIN Online, or contact Stephanie Guerena via email.

Files

No-ground-truth-No-problem-Improving-administrative-data-linking-using-active-learning-and-a-little-bit-of-guile.pdf

Files (1.3 MB)

Name, Size, Checksum
Article, 993.6 kB, md5:09bedf031d2453df46b5b07dd5b55640
Supporting information, 271.6 kB, md5:e96c96b210deed260a13583aa8225aea

Additional details

Identifiers

DOI
10.1371/journal.pone.0283811
Other
oai:uchicago.tind.io:5700

UChicago Information

Division(s)
Social Sciences Division
Center(s) or Institute(s)
Crime Lab