Published October 20, 2021 | Version v1
Journal article
Open
NERO: A biomedical named-entity (recognition) ontology with a large, annotated corpus reveals meaningful associations through text embedding
Creators
- Wang, Kanix (1)
- Stevens, Robert (2)
- Alachram, Halima (3)
- Li, Yu (4)
- Soldatova, Larisa (5)
- King, Ross (6)
- Ananiadou, Sophia (2)
- Schoene, Annika M. (2)
- Li, Maolin (2)
- Christopoulou, Fenia (2)
- Ambite, José Luis (7)
- Matthew, Joel (7)
- Garg, Sahil (7)
- Hermjakob, Ulf (7)
- Marcu, Daniel (7)
- Sheng, Emily (7)
- Beißbarth, Tim (3)
- Wingender, Edgar (8)
- Galstyan, Aram (7)
- Gao, Xin (4)
- Chambers, Brendan (1)
- Pan, Weidi (1)
- Khomtchouk, Bohdan B. (1)
- Evans, James A. (1)
- Rzhetsky, Andrey (1)
- 1. University of Chicago
- 2. University of Manchester
- 3. University of Göttingen
- 4. King Abdullah University of Science and Technology
- 5. University of London
- 6. University of Cambridge
- 7. University of Southern California
- 8. geneXplain GmbH
Description
Machine reading (MR) is essential for unlocking valuable knowledge contained in millions of existing biomedical documents. Over the last two decades, the most dramatic advances in MR have followed in the wake of critical corpus development. Large, well-annotated corpora have been associated with punctuated advances in MR methodology and automated knowledge extraction systems in the same way that ImageNet was fundamental for developing machine vision techniques. This study contributes six components to an advanced, named entity analysis tool for biomedicine: (a) a new Named Entity Recognition Ontology (NERO) developed specifically for describing textual entities in biomedical texts, which accounts for diverse levels of ambiguity, bridging the scientific sublanguages of molecular biology, genetics, biochemistry, and medicine; (b) detailed guidelines for human experts annotating hundreds of named entity classes; (c) pictographs for all named entities, to simplify the burden of annotation for curators; (d) an original, annotated corpus comprising 35,865 sentences, which encapsulate 190,679 named entities and 43,438 events connecting two or more entities; (e) validated, off-the-shelf, named entity recognition (NER) automated extraction; and (f) embedding models that demonstrate the promise of biomedical associations embedded within this corpus.
Data availability
The datasets generated and/or analyzed during the current study are available in the GitHub repository at https://github.com/arzhetsky/Chicago_corpus. NERO in OWL format is available at https://bioportal.bioontology.org/ontologies/NERO.
We also deployed a package called NERO-nlp for researchers interested in diving deeper into our annotated corpus; the installation guides and scripts are available online at https://pypi.org/project/NERO-nlp and https://github.com/Bohdan-Khomtchouk/NERO-nlp, respectively.
Files
NERO-A-biomedical-named-entity-recognition-ontology.pdf
(4.6 MB)
| Name | Size |
|---|---|
| Supplementary information (md5:e9ff3c46043c676bb5d4b75b469f46ae) | 1.8 MB |
| Article (md5:05e2e7c64bd8c7054d834cdb38eb0ba0) | 2.8 MB |
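The MD5 digests listed with the files can be used to confirm that a downloaded copy is intact. The following is a minimal Python sketch using only the standard library; the dictionary keys and the helper names `md5_of_file` and `matches_record` are illustrative assumptions, not part of this record or of the NERO-nlp package.

```python
import hashlib
from pathlib import Path

# MD5 digests as listed in the file table of this record.
# The keys ("supplementary", "article") are hypothetical labels.
EXPECTED_MD5 = {
    "supplementary": "e9ff3c46043c676bb5d4b75b469f46ae",
    "article": "05e2e7c64bd8c7054d834cdb38eb0ba0",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, key):
    """Return True if the file at `path` has the MD5 listed for `key`."""
    return md5_of_file(path) == EXPECTED_MD5[key]
```

For example, `matches_record("article.pdf", "article")` would return `True` only if the local file (the path here is a placeholder) is byte-identical to the archived article.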
Additional details
Identifiers
- DOI: 10.1038/s41540-021-00200-x
- Other: oai:uchicago.tind.io:7704
Funding
- ARO: DARPA Big Mechanism program
- National Institutes of Health: R01HL122712
- National Institutes of Health: 1P50MH094267
- National Institutes of Health: K12HL143959
- National Institutes of Health: U01HL108634-01
- Liz and Kent Dauten
- King Abdullah University of Science and Technology: FCS/1/4102-02-01
- King Abdullah University of Science and Technology: FCC/1/1976-26-01
- King Abdullah University of Science and Technology: REI/1/0018-01-01
- King Abdullah University of Science and Technology: REI/1/4473-01-01