Abstract
Large volumes of data generated by scientific simulations, genome sequencing, and other applications need to be moved among clusters for collection and analysis. Data compression techniques have proven effective at reducing data storage and transfer costs. However, users' requirements for interactively controlling both data quality and compression ratio are non-trivial to fulfill, and lossy compression methods must respect several data constraints to be useful in realistic data transfer scenarios. In this thesis, I propose a novel Compression-as-a-Service (CaaS) platform called GlobaZip with five important contributions: (1) a multi-interval/multi-region-based compression algorithm that enforces several data constraints to further limit distortion in data fidelity even though the compression is lossy; (2) a layer-by-layer compression technique that achieves much higher parallel compression throughput on HPC systems and coordinates CPU cores across multiple compute nodes to compress extremely large files without out-of-memory errors; (3) a decision-tree-based compression performance prediction model that lets users estimate compression characteristics, including compression ratio, time, and data fidelity, with very little computational overhead; (4) an optimized reference-based genome sequence compression algorithm that exceeds the performance of state-of-the-art algorithms through a finer-grained sequence alignment procedure, read reordering, a novel dominant-bitmap method for quality-score compression, and several other small optimizations; (5) a Qt5-based user-facing application that uses Globus Compute and Globus Transfer to give users a universal interface for orchestrating remote data compression and transfer. Experiments on multiple real-world datasets spanning geographically distributed machines show that GlobaZip significantly improves data transfer efficiency, with performance gains of more than 10x on computing clusters with relatively slow networks.
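
As a rough illustration of the orchestration pattern behind contribution (5), the sketch below dispatches a compression task to a remote cluster through the Globus Compute SDK. This is a minimal sketch under stated assumptions, not GlobaZip's actual code: the endpoint UUID and file path are hypothetical placeholders, and Python's standard gzip module stands in for the platform's compressor.

    from globus_compute_sdk import Executor

    def compress_file(input_path: str) -> str:
        # Runs on the remote endpoint; imports live inside the function
        # because Globus Compute serializes it and ships it for execution.
        # gzip is a stand-in for GlobaZip's compressor (hypothetical).
        import gzip, shutil
        output_path = input_path + ".gz"
        with open(input_path, "rb") as src, gzip.open(output_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return output_path

    # Hypothetical UUID of a Globus Compute endpoint on the remote cluster.
    ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"

    with Executor(endpoint_id=ENDPOINT_ID) as ex:
        future = ex.submit(compress_file, "/scratch/data/field.f32")
        print("Ready for Globus Transfer:", future.result())

Once the remote task returns the compressed file's path, a Globus Transfer request can move it to the destination cluster; this compress-then-transfer workflow is what the GlobaZip application automates behind its Qt5 interface.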