Abstract
Scientific discoveries have traditionally been communicated through written papers, but these documents are difficult for computers to interpret because natural language is inherently ambiguous and variable. Consequently, valuable knowledge and insights often remain buried within an overwhelming volume of publications, where they are hard to discover, access, and use for both humans and machines. While efforts have been made to construct scientific databases and repositories from these papers, these initiatives typically rely on laborious and error-prone manual extraction, which cannot scale to the millions of papers published annually. The inability to efficiently extract experimental data from existing literature hinders the adoption of cost-effective, safe, and data-driven simulations to inform traditional experiments across multiple disciplines.
Natural Language Processing (NLP) methods have emerged as a promising way to address this challenge and to let disciplines use data-driven simulations for cheaper, safer, and easier insights and guidance. Leveraging advanced machine learning techniques, NLP enables computers to analyze, comprehend, and derive meaningful information from the ever-expanding scientific literature.
This work explores various methodologies for automatically extracting precise and structured information from unstructured scientific text. Major contributions include a model-in-the-loop pipeline for rapid training data annotation without relying on domain experts, special features designed to adapt classical named entity recognition and relation extraction models to scientific texts, a foundational large language model for science with up to 770M parameters, a joint graph-language model for processing semi-structured data, as well as several real-world applications. By investigating these techniques, we seek to demonstrate their potential in facilitating real-world scientific research and enabling researchers to efficiently leverage the vast knowledge accumulated in literature for accelerated scientific breakthroughs.