Abstract
Tabular data extraction and discovery from large corpora are two long-standing challenges in data management. Traditional solutions require substantial human effort to write rules or annotate training data, and this expensive manual work must be repeated for each new domain of source corpus, making these solutions hard to scale. Can we avoid this repetitive, expensive manual work while still maintaining comparable performance across different datasets? In this dissertation, we show that we can achieve both: 1) By reducing table extraction from a large text corpus to the task of question answering over that corpus, it is possible to build a table extraction system that, once trained, generalizes to other domains, avoiding repeated manual work in collecting training data for new table domains. 2) Given any table corpus, by learning from the corpus itself with the help of a large language model, we can build a table discovery system that matches the query quality of systems trained on human-annotated data. Specifically, we build three systems/tools to demonstrate this hypothesis: 1) FabricQA-Extractor, a system that extracts tables from a text corpus using natural language questions; 2) SOLO, a self-supervised system for table discovery using natural language questions; and 3) Pneuma-Benchmark, an automatic tool to evaluate models and systems for table discovery using natural language questions. Together, these works contribute a solution to tabular data extraction and discovery that requires little human effort.