NERO: A biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Wang, Kanix; Stevens, Robert; Alachram, Halima; Li, Yu; Soldatova, Larisa; King, Ross; Ananiadou, Sophia; Schoene, Annika M.; Li, Maolin; Christopoulou, Fenia; Ambite, José Luis; Matthew, Joel; Garg, Sahil; Hermjakob, Ulf; Marcu, Daniel; Sheng, Emily; Beißbarth, Tim; Wingender, Edgar; Galstyan, Aram; Gao, Xin; Chambers, Brendan; Pan, Weidi; Khomtchouk, Bohdan B.; Evans, James A.; Rzhetsky, Andrey

doi:10.6082/fwbvp-gsq27

Published October 20, 2021 | Version v1

Journal article Open

1. University of Chicago
2. University of Manchester
3. University of Göttingen
4. King Abdullah University of Science and Technology
5. University of London
6. University of Cambridge
7. University of Southern California
8. geneXplain GmbH

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades, the most dramatic advances in MR have followed in the wake of critical corpus development. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Data availability

The datasets generated and/or analyzed during the current study are available in the GitHub repository at https://github.com/arzhetsky/Chicago_corpus. NERO in OWL format is available at https://bioportal.bioontology.org/ontologies/NERO.

We also deployed a package called NERO-nlp for researchers interested in diving deeper into our annotated corpus; the installation guides and scripts are available online at https://pypi.org/project/NERO-nlp and https://github.com/Bohdan-Khomtchouk/NERO-nlp, respectively.

Files

Files (4.6 MB)

Name	Type	MD5	Size
41540_2021_200_MOESM1_ESM.pdf	Supplementary information	e9ff3c46043c676bb5d4b75b469f46ae	1.8 MB
NERO-A-biomedical-named-entity-recognition-ontology.pdf	Article	05e2e7c64bd8c7054d834cdb38eb0ba0	2.8 MB
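
The MD5 checksums in the file listing can be used to confirm that a download is intact. The following is a minimal sketch, assuming both PDFs have been saved under their listed names in a single local directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and are not part of this record or of the NERO-nlp package.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this record.
EXPECTED_MD5 = {
    "41540_2021_200_MOESM1_ESM.pdf": "e9ff3c46043c676bb5d4b75b469f46ae",
    "NERO-A-biomedical-named-entity-recognition-ontology.pdf": "05e2e7c64bd8c7054d834cdb38eb0ba0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'missing or checksum mismatch'}")
```

A file reported as missing or mismatched is most likely absent from the directory, renamed, or incompletely downloaded.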

Additional details

DOI: 10.1038/s41540-021-00200-x
Other: oai:uchicago.tind.io:7704

Funding

ARO: DARPA Big Mechanism program
National Institutes of Health: R01HL122712
National Institutes of Health: 1P50MH094267
National Institutes of Health: K12HL143959
National Institutes of Health: U01HL108634-01
Liz and Kent Dauten
King Abdullah University of Science and Technology: FCS/1/4102-02-01
King Abdullah University of Science and Technology: FCC/1/1976-26-01
King Abdullah University of Science and Technology: REI/1/0018-01-01
King Abdullah University of Science and Technology: REI/1/4473-01-01

Division(s): Biological Sciences Division, Social Sciences Division
Department(s): Genetics, Genomics, and Systems Biology, Human Genetics, Medicine, Sociology
Center(s) or Institute(s): Institute for Genomics and Systems Biology, Knowledge Lab
