Efficient Lossless Compression in and beyond Columnar Databases

Jiang, Hao

doi:10.6082/uchicago.3572

Jiang, Hao

2021

Download

Formats

Format
BibTeX
MARCXML
TextMARC
MARC
DataCite
DublinCore
EndNote
NLM
RefWorks
RIS

Add to Basket

Files

Abstract

Columnar databases have dominated the data analysis market for their superior performance in query processing with Big data. However, the extensive data size also brings challenges to data storage and transfer. While people often rely on lossless compression techniques to reduce storage size, database researchers overlook compression in row-wise databases. There are two primary reasons. First, available compression algorithms in row-wise databases are limited. Row-wise databases blend data fields of all types together. Byte-oriented compression algorithms such as Gzip and Snappy are the only choices. Second, Gzip-like algorithms process data in blocks and decompress an entire block before accessing a data row. The decompression is CPU intensive and has a significant impact to query performance. Lack of alternatives and implications to performance impede the applications of compression in row-wise databases. The prosperity of columnar databases changes this situation. Storing data in separated columns enable the application of compression algorithms designed for a single data type. There are also algorithms performing record-level compression, allowing the queries to skip irrelevant records and executes more efficiently. Compression in columnar databases thus reduces data storage and brings the opportunities of improving query efficiency. Besides relational databases, we also explore the benefit of compression in key-value stores. Key-value stores have wide applications, including game, IoT, Social Media, Mobile Devices, and Enterprise Applications. They could provide far better performance than the relational database in specific scenarios. This thesis proposes innovative compression algorithms and system designs to improve the storage and query efficiency in columnar databases and key-value stores. We address three challenges of lossless compression in columnar databases: better encoding algorithms, faster query on encoded data, and selecting proper encoding algorithms for data columns. We present PIDS, a novel compression approach for string columns; SBoost, a C++ library for fast queries on encoded data; and CodecDB, an encoding-aware database with a data-driven encoding selection. We explore the possibility of using compression to accelerate LSM-tree and present CoLoM. This key-value store utilizes a columnar layout and lightweight encoding to improve LSM-tree efficiency. We show that these innovations allow columnar databases and key-value stores to excel the competitors in storage efficiency and query speed.