
The growth of scientific output has outpaced the human capacity to absorb it. Submissions to biomedical publication databases such as PubMed alone exceed three articles per minute, while surveys consistently find that researchers read about 22 articles per month on average. Large Language Models (LLMs) have emerged as powerful engines for ingesting and synthesizing this growing corpus. However, their inherently probabilistic generation relies on static pretraining data that quickly becomes outdated. Moreover, the training corpora of frontier LLMs are compiled for breadth and broad applicability, not for scientific depth and domain expertise. As a result, LLMs are unreliable and prone to hallucination in knowledge-intensive scientific tasks. Retrieval-Augmented Generation (RAG) addresses these shortcomings by equipping LLMs with an external knowledge source at inference time, allowing them to draw on additional knowledge without altering the model parameters. Yet scaling RAG to handle the deluge of scientific output, on the order of millions of documents, introduces challenges at every stage of the pipeline. These include parsing dense multimodal PDFs, encoding domain-specific, terminology-rich text, evaluating models on benchmarks that are unmistakably contaminated, and orchestrating hundreds of compute nodes with thousands of GPUs simply to cope with the sheer scale of the problem. This dissertation presents the design, implementation, and evaluation of retrieval-augmented reasoning systems that operate at the scale of millions of scientific documents and leverage exascale supercomputing infrastructure to transform how scientists, human and AI alike, interface with the growing body of scientific literature. The dissertation rests on three principal contributions.
First, we present HiPerRAG, a distributed high-performance computing workflow for RAG over scientific literature that indexes over 3.6 million full-text articles using three leadership-class supercomputers. Through the Oreo multimodal parser and the ColTrast query-aware encoder fine-tuning objective, HiPerRAG equips open-source models with retrieval, enabling them to outperform proprietary frontier LLMs on scientific question-answering tasks. Second, we introduce a scalable, automated pipeline for generating multiple-choice question answering (MCQA) benchmarks that produces hundreds of thousands of provenance-tracked questions from tens of thousands of full-text articles. Through this work, we identify distillation through retrieval as a viable alternative to weight-based distillation. By treating frontier-model reasoning traces as retrievable artifacts rather than as training or fine-tuning signals, we enable small language models such as TinyLlama-1.1B to achieve a 4× improvement in domain accuracy and bring several 7–8B parameter models within striking distance of frontier performance on an expert-annotated radiation oncology examination. Third, we introduce Swarm Retrieval, a forward-looking paradigm in which documents are treated not as passive points in a vector space but as agents that can judge their own relevance to a given query in an interpretable manner. This paradigm challenges the status quo of documents as inert, static entries and reimagines them as active participants in the retrieval process. Taken together, these contributions chart a path toward a computable scientific corpus, which we define as a unified, queryable interface over the full body of published science, one that is technically feasible with current exascale infrastructure.
