NERO: A biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Wang, Kanix; Stevens, Robert; Alachram, Halima; Li, Yu; Soldatova, Larisa; King, Ross; Ananiadou, Sophia; Schoene, Annika M.; Li, Maolin; Christopoulou, Fenia; Ambite, José Luis; Matthew, Joel; Garg, Sahil; Hermjakob, Ulf; Marcu, Daniel; Sheng, Emily; Beißbarth, Tim; Wingender, Edgar; Galstyan, Aram; Gao, Xin; Chambers, Brendan; Pan, Weidi; Khomtchouk, Bohdan B.; Evans, James A.; Rzhetsky, Andrey

2021

Abstract

Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades, the most dramatic advances in MR have followed in the wake of critical corpus development. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.

Details

Title

NERO: A biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding

Author

Wang, Kanix : University of Chicago : (http://orcid.org/0000-0003-1355-577X)
Stevens, Robert : University of Manchester
Alachram, Halima : University of Göttingen
Li, Yu : King Abdullah University of Science and Technology : (http://orcid.org/0000-0002-3664-6722)
Soldatova, Larisa : University of London : (http://orcid.org/0000-0001-6489-3029)
King, Ross : University of Cambridge
Ananiadou, Sophia : University of Manchester
Schoene, Annika M. : University of Manchester
Li, Maolin : University of Manchester : (http://orcid.org/0000-0002-0828-2001)
Christopoulou, Fenia : University of Manchester
Ambite, José Luis : University of Southern California : (http://orcid.org/0000-0003-0087-080X)
Matthew, Joel : University of Southern California
Garg, Sahil : University of Southern California
Hermjakob, Ulf : University of Southern California
Marcu, Daniel : University of Southern California
Sheng, Emily : University of Southern California
Beißbarth, Tim : University of Göttingen
Wingender, Edgar : geneXplain GmbH
Galstyan, Aram : University of Southern California
Gao, Xin : King Abdullah University of Science and Technology : (http://orcid.org/0000-0002-7108-3574)
Chambers, Brendan : University of Chicago
Pan, Weidi : University of Chicago
Khomtchouk, Bohdan B. : University of Chicago : (http://orcid.org/0000-0001-9607-7528)
Evans, James A. : University of Chicago : (http://orcid.org/0000-0001-9838-0707)
Rzhetsky, Andrey : University of Chicago : (http://orcid.org/0000-0001-6959-7405)

Content Type

Article

Published in

npj Systems Biology and Applications

Identifier(s)

DOI: https://doi.org/10.1038/s41540-021-00200-x

Data availability statement

The datasets generated during and/or analyzed during the current study are available in the GitHub repository at https://github.com/arzhetsky/Chicago_corpus. NERO in OWL format is available at: https://bioportal.bioontology.org/ontologies/NERO

We also deployed a package called NERO-nlp for researchers interested in diving deeper into our annotated corpus; the installation guides and scripts are available online at https://pypi.org/project/NERO-nlp and https://github.com/Bohdan-Khomtchouk/NERO-nlp, respectively.

Funding Information

ARO, DARPA Big Mechanism program, W911NF1410333
National Institutes of Health, R01HL122712
National Institutes of Health, 1P50MH094267
National Institutes of Health, K12HL143959
National Institutes of Health, U01HL108634-01
Liz and Kent Dauten
King Abdullah University of Science and Technology, FCS/1/4102-02-01
King Abdullah University of Science and Technology, FCC/1/1976-26-01
King Abdullah University of Science and Technology, REI/1/0018-01-01
King Abdullah University of Science and Technology, REI/1/4473-01-01

Publication Date

2021-10-20

Language

English

Copyright Statement

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Licensing

CC BY

Record Appears in

Biological Sciences Division > Genetics, Genomics, and Systems Biology
Centers and Institutes > Institute for Genomics and Systems Biology
Biological Sciences Division > Human Genetics
Biological Sciences Division > Medicine
Social Sciences Division > Sociology
Centers and Institutes > Knowledge Lab
All

Record Created

2023-08-24
