Addressing discretization-induced bias in demographic prediction

Dong, Evan; Schein, Aaron; Wang, Yixin; Garg, Nikhil

2025

Abstract

Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and in outreach targeting for political campaigns. The canonical approach is to construct continuous predictions (e.g. based on name and geography) and then, often, to discretize them by selecting the most likely class (argmax), potentially with a minimum threshold (thresholding). We study how this practice produces discretization bias. For example, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of Black voters, e.g. by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a joint optimization approach, along with a tractable data-driven threshold heuristic, that can eliminate this bias with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias and show that calibrated continuous models are insufficient to eliminate it and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
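The under-count described in the abstract can be reproduced on a toy example. The sketch below is illustrative only: the population, the probabilities, and the function names are hypothetical, and the count-matching rule is a simple two-class stand-in chosen for this example, not the joint optimization or threshold heuristic from the article. It assumes perfectly calibrated probabilities, so the sum of the probabilities is the expected number of group members.

```python
def expected_count(probs):
    """Expected number of group members under calibrated probabilities."""
    return sum(probs)


def argmax_labels(probs):
    """Two-class argmax: label 1 only when the group probability exceeds 0.5."""
    return [1 if p > 0.5 else 0 for p in probs]


def count_matching_labels(probs):
    """Hypothetical alternative: label the k most likely individuals,
    where k is the expected count rounded to the nearest integer."""
    k = round(sum(probs))
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    labels = [0] * len(probs)
    for i in order[:k]:
        labels[i] = 1
    return labels


# Hypothetical population of 100 people with a calibrated probability
# of belonging to the minority group (values invented for illustration).
probs = [0.1] * 60 + [0.4] * 30 + [0.9] * 10

print(f"expected members:      {expected_count(probs):.1f}")     # 27.0
print(f"argmax-labeled:        {sum(argmax_labels(probs))}")      # 10
print(f"count-matched labeled: {sum(count_matching_labels(probs))}")  # 27
```

In this toy population argmax labels only 10 people as group members although 27 are expected, because everyone with a probability of 0.4 or lower is assigned to the other class. Matching the labeled count to the expected count removes the aggregate gap, at the cost of labeling some individuals whose most likely class is the other one.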

Details

Title

Addressing discretization-induced bias in demographic prediction

Author

Dong, Evan : Cornell University : (https://orcid.org/0009-0009-8663-9836)
Schein, Aaron : University of Chicago : (https://orcid.org/0000-0002-5507-2904)
Wang, Yixin : University of Michigan : (https://orcid.org/0000-0002-6617-4842)
Garg, Nikhil : Cornell Tech : (https://orcid.org/0000-0002-1988-792X)

Content Type

Article

Published in

PNAS Nexus

Identifier(s)

DOI: https://doi.org/10.1093/pnasnexus/pgaf027

Data availability statement

The replication dataset in SI Appendix C.1 is public at Barber and Argyle (57) and the result of work by Argyle and Barber (21). The replication dataset in SI Appendix C.4 is public at Greengard and Gelman (58) and the result of work by Greengard and Gelman (18). The specific code and Jupyter notebook used for these analyses are available at https://github.com/evan-dong/demographic-prediction-argmax-bias. A more general repository of code, with a Jupyter notebook that other researchers and practitioners can use to discretize and analyze their own model outputs, is at https://github.com/evan-dong/demographic-discretization. The commercial dataset used in our analysis is privately owned by TargetSmart, a political data and analytics company; we accessed a copy under a research license from PredictWise, a campaign analytics firm. We are unable to provide public access to this proprietary dataset. Researchers can apply for access to TargetSmart data by contacting TargetSmart at: https://targetsmart.com/contact-us/.

Publication Date

2025-01-30

Language

English

Copyright Statement

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Licensing

CC BY

Record Appears in

Centers and Institutes > Data Science Institute
Physical Sciences Division > Statistics
All

Record Created

2025-03-20
