How to Get the Most out of Your Curation Effort

Rzhetsky, Andrey; Shatkay, Hagit; Wilbur, W. John

doi:10.6082/5n6cz-a6a71

Published May 22, 2009 | Version v1

Journal article Open

1. University of Chicago
2. Queen's University
3. National Institutes of Health

Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that assigns each annotation a confidence value (i.e., the posterior probability that it is correct), computed from annotator-specific parameters estimated from data. We developed two probabilistic models for this analysis, compared them using computer simulation, and tested each model's actual performance on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach significantly increases the probability of choosing the correct annotation. Along with this publication, we make publicly available a corpus of 10,000 sentences annotated along several cardinal dimensions that we introduced in earlier work. All 10,000 sentences were annotated 3-fold by a group of eight experts, and a 1,000-sentence subset was further annotated 5-fold by five new experts. Although the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.
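
To make the idea of a per-annotation confidence value concrete, here is a minimal sketch in Python. It is not either of the two models developed in the article. It assumes a much simpler setup: annotators act independently, each has a single accuracy parameter, errors are spread evenly over the remaining labels, and the prior over labels is uniform. The function name `label_posterior`, the label set, and the accuracy values are all hypothetical and chosen only for illustration.

```python
from math import prod


def label_posterior(votes, accuracy, labels, prior=None):
    """Posterior P(true label | votes) under a simple illustrative model.

    Assumptions (not taken from the article): annotators are independent,
    annotator a picks the true label with probability accuracy[a], and
    otherwise picks one of the other labels uniformly at random.
    """
    k = len(labels)
    if k < 2:
        raise ValueError("need at least two candidate labels")
    if prior is None:
        # Uniform prior over labels (an assumption of this sketch).
        prior = {lab: 1.0 / k for lab in labels}
    unnormalized = {}
    for true_lab in labels:
        # Likelihood of the observed votes if true_lab were correct.
        likelihood = prod(
            accuracy[a] if vote == true_lab else (1.0 - accuracy[a]) / (k - 1)
            for a, vote in votes.items()
        )
        unnormalized[true_lab] = prior[true_lab] * likelihood
    total = sum(unnormalized.values())
    return {lab: p / total for lab, p in unnormalized.items()}


# Hypothetical example: three annotators who all disagree.
labels = ["low", "medium", "high"]
votes = {"ann1": "low", "ann2": "medium", "ann3": "high"}
accuracy = {"ann1": 0.9, "ann2": 0.7, "ann3": 0.6}  # invented values

posterior = label_posterior(votes, accuracy, labels)
best = max(posterior, key=posterior.get)  # label with the highest posterior
```

With these invented accuracies, the label chosen by the most reliable annotator receives a posterior of about 0.70, well above the 1/3 obtained by picking among the three conflicting votes at random. In practice the accuracy parameters would have to be estimated from data rather than supplied by hand.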

Files (5.4 MB)

Name	Size
journal.pcbi.1000391.pdf (Article; md5:348657695066bf5117f1c19195360d94)	1.1 MB
journal.pcbi.1000391.zip (md5:2704e37dc6bcff4b218006a96e27e6e2)	4.3 MB

Additional details

DOI: 10.1371/journal.pcbi.1000391
Other: oai:uchicago.tind.io:10217

National Institutes of Health: GM61372
National Science Foundation: 0438291
National Science Foundation: 0121687
Cure Autism Now Foundation
National Institutes of Health: Intramural Research Program, National Library of Medicine

Division(s): Biological Sciences Division
Department(s): Human Genetics, Medicine
Center(s) or Institute(s): Institute for Genomics and Systems Biology
