Toward a Computable Scientific Corpus: Retrieval-Augmented Reasoning Systems for Scientific Discovery on Exascale Supercomputers

Gökdemir, Ozan

doi:10.6082/uchicago.17014

2026

The growth of scientific output has outpaced the human capacity to absorb it. Submissions to biomedical publication databases such as PubMed alone exceed three articles per minute, while surveys consistently find that researchers read 22 articles per month on average. Large Language Models (LLMs) have emerged as powerful engines for ingesting and synthesizing this growing corpus. However, their inherently probabilistic outputs depend on static pretraining data that grows obsolete. Moreover, the training corpora of frontier LLMs are compiled for breadth and broad applicability, not for scientific depth and domain expertise. This composition renders LLMs unreliable and prone to hallucination in knowledge-intensive scientific tasks. Retrieval-Augmented Generation (RAG) addresses these shortcomings by equipping LLMs with an external knowledge source at inference time, allowing additional knowledge to be ingested without altering the model parameters. Yet scaling RAG to handle the deluge of scientific output, on the order of millions of documents, introduces challenges at every stage of the pipeline: parsing dense multimodal PDFs, encoding domain-specific, terminology-rich text, evaluating models on unmistakably contaminated benchmarks, and orchestrating hundreds of compute nodes with thousands of GPUs simply to cope with the sheer scale of the problem. This dissertation presents the design, implementation, and evaluation of retrieval-augmented reasoning systems that operate at the scale of millions of scientific documents and leverage exascale supercomputing infrastructure to transform how scientists, human and AI alike, interface with the growing body of scientific literature. We base this dissertation on three principal contributions.
First, we present HiPerRAG, a distributed high-performance computing workflow for RAG over scientific literature that indexes over 3.6 million full-text articles using three leadership-class supercomputers. Through the Oreo multimodal parser and the ColTrast query-aware encoder fine-tuning objective, HiPerRAG equips open-source models with retrieval and enables them to outperform proprietary frontier LLMs on scientific question-answering tasks. Second, we introduce a scalable, automated pipeline for generating multiple-choice question answering (MCQA) benchmarks that produces hundreds of thousands of provenance-tracked questions from tens of thousands of full-text articles. Through this work, we identify distillation through retrieval as a viable alternative to weight-based distillation. By treating frontier-model reasoning traces as retrievable artifacts rather than training or fine-tuning signals, we enable small language models such as TinyLlama-1.1B to achieve a 4× improvement in domain accuracy and bring several 7-8B parameter models within striking distance of frontier performance on an expert-annotated radiation oncology examination. Third, we introduce Swarm Retrieval, a forward-looking paradigm in which documents are treated not as passive points in a vector space, but as agents that can judge their own relevance to a given query in an interpretable manner. This paradigm challenges the status quo of documents as inert, static entries and reimagines them as active participants in the retrieval process. Taken together, these contributions chart a path toward a computable scientific corpus, which we define as a unified, queryable interface over the full body of published science that is technically feasible with current exascale infrastructure.
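
The retrieval-augmented generation pattern summarized in the abstract (fetch relevant text at inference time and leave the model parameters unchanged) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the HiPerRAG implementation: the function names, the three-passage corpus, and the bag-of-words cosine scoring are all hypothetical stand-ins chosen for brevity, whereas the dissertation describes a fine-tuned encoder and a corpus of millions of articles.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k passages most similar to the query."""
    q = Counter(tokenize(query))
    scores = {
        doc_id: cosine(q, Counter(tokenize(text)))
        for doc_id, text in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    """Prepend retrieved passages to the question; no model weights change."""
    context = "\n".join(
        f"[{doc_id}] {corpus[doc_id]}" for doc_id in retrieve(query, corpus, k)
    )
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical corpus, invented purely for illustration.
CORPUS = {
    "doc-a": "Proton therapy delivers radiation dose to tumors with a sharp Bragg peak.",
    "doc-b": "Transformer encoders map text passages to dense vectors.",
    "doc-c": "Radiation dose fractionation schedules affect tumor control.",
}

if __name__ == "__main__":
    print(build_prompt("How does radiation dose affect tumors?", CORPUS))
```

The prompt returned by `build_prompt` would then be passed to any language model. Updating what the model can draw on only requires changing the corpus, which is the property the abstract attributes to RAG.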

Title

Toward a Computable Scientific Corpus: Retrieval-Augmented Reasoning Systems for Scientific Discovery on Exascale Supercomputers

Author

Gökdemir, Ozan : University of Chicago : (https://orcid.org/0000-0001-5299-1983)

Degree Type

Ph.D.

Content Type

Dissertation

Academic Advisor

Rick L. Stevens

Committee Member

Ian T. Foster
Arvind Ramanathan

Keywords

AI for Science; Large Language Models; Retrieval-Augmented Generation; AI Agent Swarms; Reasoning Language Models

Digital Object Identifier

https://doi.org/10.6082/uchicago.17014

Funding Information

Argonne National Laboratory, The U.S. Department of Energy, Laboratory Directed Research and Development (LDRD), DE-AC02-06CH11357
Argonne Leadership Computing Facility (ALCF), The U.S. Department of Energy, Laboratory Directed Research and Development (LDRD), DE-AC02-06CH11357
Oak Ridge Leadership Computing Facility (OLCF), The U.S. Department of Energy, Laboratory Directed Research and Development (LDRD), DE-AC05-00OR2272
Coalition for Epidemic Preparedness Innovations (CEPI), Disease X Program, US DOE LUCID (Low-Dose Understanding, Cellular Insights, and Molecular Discoveries)

Publication Date

2026-06

Language

English

Copyright Statement

Licensing

CC BY

Record Appears in

Physical Sciences Division > Computer Science
All

Record Created

2026-05-04
