Abstract
Tabular data extraction and discovery from large corpora are two long-standing challenges in data management. Traditional solutions require substantial human effort to write rules or annotate training data, and this expensive manual work must be repeated for each new domain of source corpus, making these solutions hard to scale. Can we avoid this repetitive, expensive manual work while still maintaining comparable performance across different datasets? In this dissertation, we show that we can achieve both: 1) By reducing table extraction from a large text corpus to the task of question answering over that corpus, it is possible to build a table extraction system that, once trained, generalizes to other domains, avoiding repeated manual work in collecting training data for new table domains. 2) Given any table corpus, by learning from the corpus itself with the help of a large language model, we can build a table discovery system that matches the query quality of systems trained on human-annotated data. Specifically, we build three systems/tools to demonstrate this hypothesis: 1) FabricQA-Extractor, a system that extracts tables from a text corpus using natural language questions; 2) SOLO, a self-supervised system for table discovery using natural language questions; and 3) Pneuma-Benchmark, an automatic tool to evaluate models and systems for table discovery using natural language questions. Together, these works contribute a solution to tabular data extraction and discovery that requires little human effort.