Abstract

Large volumes of data generated by scientific simulations, genome sequencing, and other applications need to be moved among clusters for data collection and analysis. Data compression techniques have effectively reduced data storage and transfer costs. However, users' requirements for interactive control over both data quality and compression ratio are non-trivial to fulfill. Lossy compression methods need to respect several data constraints to be useful in a realistic data transfer scenario. In this thesis, I propose a novel Compression-as-a-Service (CaaS) platform called GlobaZip with five important contributions: (1) a multi-interval/multi-region based compression algorithm that supports several data constraints to further limit the loss of data fidelity even though the compression is lossy; (2) a layer-by-layer compression technique that enables a much higher parallel compression rate in HPC systems and can coordinate CPU cores on multiple compute nodes to compress extremely large files without out-of-memory errors; (3) a decision tree-based compression performance prediction model that lets users estimate compression characteristics, including compression ratio, time, and data fidelity, with very limited computational overhead; (4) an optimized reference-based genome sequence compression algorithm that exceeds the performance of state-of-the-art algorithms by using a more fine-grained sequence alignment procedure, read reordering, a novel dominant bitmap method for quality score compression, and a few other small optimizations; (5) a Qt5-based user-facing app that uses Globus Compute and Globus Transfer to provide users with a universal interface for orchestrating remote data compression and transfer. Experiments on multiple real-world datasets on geographically distributed computers show that GlobaZip can significantly improve data transfer efficiency, with a performance gain of more than 10x in computing clusters with relatively slow networks.
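
To make the idea behind contribution (1) concrete, the following is a minimal sketch of one possible reading of "multi-interval" constraints: each value range gets its own absolute error bound, and values are quantized so that the bound for their range is never exceeded. This is not the GlobaZip algorithm. The names `BOUNDARIES`, `ERROR_BOUNDS`, `quantize`, and `dequantize`, as well as the specific intervals and bounds, are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical configuration (not from the thesis): two boundaries split the
# real line into three intervals, each with its own absolute error bound.
BOUNDARIES = [0.0, 10.0]          # intervals: (-inf, 0), [0, 10), [10, +inf)
ERROR_BOUNDS = [1.0, 0.01, 0.5]   # one bound per interval


def interval_index(value):
    """Return the index of the interval that contains `value`."""
    return bisect_right(BOUNDARIES, value)


def bound_for(value):
    """Return the absolute error bound that applies to `value`."""
    return ERROR_BOUNDS[interval_index(value)]


def quantize(values):
    """Map each value to (interval index, integer code).

    Uniform quantization with bin width 2 * eb keeps the reconstruction
    error within eb, where eb is the bound of the value's interval.
    """
    codes = []
    for v in values:
        idx = interval_index(v)
        eb = ERROR_BOUNDS[idx]
        codes.append((idx, round(v / (2 * eb))))
    return codes


def dequantize(codes):
    """Reconstruct approximate values from (interval index, code) pairs."""
    return [q * 2 * ERROR_BOUNDS[idx] for idx, q in codes]


if __name__ == "__main__":
    data = [-5.3, 0.004, 3.14159, 9.999, 10.2, 250.7]
    restored = dequantize(quantize(data))
    for original, approx in zip(data, restored):
        print(original, approx, abs(original - approx) <= bound_for(original) + 1e-9)
```

In a real lossy compressor the integer codes would then go through a prediction and entropy-coding stage; the sketch stops at quantization because that is the step where the per-interval constraint is enforced.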
