Abstract
The explosive growth of data-driven fields such as machine learning and data science has produced vast amounts of data, along with competing systems, tools, and techniques to acquire, clean, process, prepare, curate, wrangle, and analyze it. This growth has driven the creation of data lakes: large repositories of data that often serve as a central source of truth for data-driven applications. However, the absence of lineage information in data lakes can degrade both the quality of the data they hold and the insights derived from them. Existing lineage solutions either require manual annotation or capture lineage as data is manipulated and transformed. Neither approach addresses lineage and quality for data generated in the past, a gap often cited as an impediment to the broader goal of reducing time to insight from data organized in a central data lake. This thesis proposes using similarity metrics to infer the lineage of data artifacts in data lakes. We demonstrate the feasibility of recovering the lineage of data artifacts under varying assumptions about metadata availability using RELIC. We then scale RELIC with sketching and indexing techniques, showing that a suitable index structure can answer lineage queries accurately and efficiently. We also introduce FUZZDATA, a dataframe benchmarking system that generates dataframe workflows of varying complexity using different dataframe clients.