Published October 20, 2021 | Version v1
Journal article Open

NERO: A biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Description

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades, the most dramatic advances in MR have followed in the wake of critical corpus development. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems, in the same way that ImageNet was fundamental for developing machine vision techniques. This study contributes six components to an advanced named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, automated named entity recognition (NER) extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Data availability

The datasets generated and/or analyzed during the current study are available in the GitHub repository at https://github.com/arzhetsky/Chicago_corpus. NERO in OWL format is available at https://bioportal.bioontology.org/ontologies/NERO.

We also deployed a package called NERO-nlp for researchers interested in diving deeper into our annotated corpus; the installation guides and scripts are available online at https://pypi.org/project/NERO-nlp and https://github.com/Bohdan-Khomtchouk/NERO-nlp, respectively.

Files

NERO-A-biomedical-named-entity-recognition-ontology.pdf

Files (4.6 MB)

Supplementary information (1.8 MB)
md5:e9ff3c46043c676bb5d4b75b469f46ae
Article (2.8 MB)
md5:05e2e7c64bd8c7054d834cdb38eb0ba0

Additional details

Identifiers

DOI
10.1038/s41540-021-00200-x
Other
oai:uchicago.tind.io:7704

Funding

ARO
DARPA Big Mechanism program
National Institutes of Health
R01HL122712
National Institutes of Health
1P50MH094267
National Institutes of Health
K12HL143959
National Institutes of Health
U01HL108634-01
Liz and Kent Dauten
King Abdullah University of Science and Technology
FCS/1/4102-02-01
King Abdullah University of Science and Technology
FCC/1/1976-26-01
King Abdullah University of Science and Technology
REI/1/0018-01-01
King Abdullah University of Science and Technology
REI/1/4473-01-01

UChicago Information

Division(s)
Biological Sciences Division, Social Sciences Division
Department(s)
Genetics, Genomics, and Systems Biology, Human Genetics, Medicine, Sociology
Center(s) or Institute(s)
Institute for Genomics and Systems Biology, Knowledge Lab