Extreme Acceleration and Seamless Integration of Raw Data Processing

Fang, Yuanwei

doi:10.6082/uchicago.1869

2019

Abstract

New sources of big data, such as the Internet, mobile applications, data-driven science, and large-scale sensors (IoT), are driving demand for ever-greater computing performance. Efficient, real-time analysis of data in native raw formats is increasingly important because of rapid data generation and the demand for analytics that yield insights for immediate responses. Traditional data processing systems can deliver high performance on loaded data, but transforming raw data into these loaded formats is expensive. Data transformations, rather than arithmetic operations, dominate task performance, making transformation a critical performance bottleneck of raw data processing. We propose the ACCelerated Operators for Raw Data Analysis (ACCORDA), a combined software and hardware approach to accelerating data analytics on unloaded raw data. ACCORDA enables real-time decision making and fast knowledge exploration on dirty, diverse, and ad-hoc raw data, such as fresh sensor data, web-crawled data, and business records.

The Unified Transformation Accelerator (UTA) is ACCORDA's hardware approach. It provides flexible architectural support for data transformation in analytical workloads. The Unstructured Data Processor (UDP) is a novel hardware accelerator based on the UTA approach that exploits efficient hardware customization, a scratchpad memory, and MIMD parallelism. UDP demonstrates the feasibility of the UTA approach. We propose UDP's instruction set, micro-architecture, and compiler toolchain. UDP has four unique features: multi-way dispatch, variable-size symbols, flexible-source dispatch, and flexible addressing. An extensive evaluation on data transformation kernels, ranging from compression to pattern matching, shows that UDP achieves a 20x average speedup and 1,900x higher energy efficiency when compared to an 8-thread CPU. UDP's implementation requires more than 100x less power and area than a single CPU core.

The Accelerated Transformation Operators (ATO) is ACCORDA's software approach.
ATO applies two design choices for integrating data transformation acceleration: sub-typing the operator interface with encodings, and a uniform worker model. The encoding-extended interface enables new accelerated operators to be included in a query plan, and runtime data formats can be transformed to meet the encoding requirements of accelerated operator implementations. In addition, the query optimizer can reorder encoding operators for lazy data transformation and fuse them to improve data locality and reduce transformation cost. The uniform worker model preserves existing system software architectures and provides a uniform runtime to the execution engine, empowering rule-based optimizers to drive flexible encoding-based optimization. We demonstrate that the key enablers are UTA's low-cost, high-performance design and its integration within the memory hierarchy, which allows efficient, low-overhead data sharing with CPUs. Together, they enable flexible software exploitation of hardware acceleration and worker-thread integration.

ACCORDA achieves significant acceleration on data transformation tasks, with speedups of up to 4.9x on regex matching, 2.6x on decompression, 2x on parsing, and 20x on deserialization when compared to an 8-thread CPU. We evaluate ACCORDA using end-to-end TPC-H queries on unloaded data in raw format. Hardware acceleration alone contributes a 1.1x-6.3x improvement, and software elements enabled by ATO, such as query optimization for data encoding, deliver an additional 1.2x-11.8x speedup. Combining UTA's acceleration with ATO's encoding optimization, ACCORDA achieves 3.3x-13.2x overall speedups in single-thread performance when compared to baseline Spark SQL. We further show that this performance benefit is robust across the format complexity of query predicates and across selectivity (data statistics). Finally, while computing on raw, unloaded data, ACCORDA matches or even outperforms (by up to 11.4x) prior systems that depend on caching transformed data.
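The encoding-based planning idea in ATO (operators that declare the input encoding they require, transformations placed lazily, and adjacent transformations fused) can be illustrated with a minimal Python sketch. It is not ACCORDA's implementation. It assumes a linear pipeline and uses hypothetical encodings (`gzip_csv`, `csv_text`, `columnar`) and hypothetical operator names (`Decompress`, `Parse`, `RegexFilter`, `Aggregate`). The planner inserts a transformation only immediately before the operator that needs it, then merges runs of back-to-back transformations into a single fused step.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical transformation operators, keyed by (input encoding, output encoding).
TRANSFORMS = {
    ("gzip_csv", "csv_text"): "Decompress",
    ("csv_text", "columnar"): "Parse",
}


@dataclass
class Op:
    name: str
    requires: Optional[str]  # input encoding the operator accepts (None for a source)
    produces: str            # encoding of the operator's output


def conversion_path(src: str, dst: str) -> List[str]:
    """Breadth-first search for the shortest chain of transforms from src to dst."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        enc, path = queue.popleft()
        if enc == dst:
            return path
        for (a, b), name in TRANSFORMS.items():
            if a == enc and b not in seen:
                seen.add(b)
                queue.append((b, path + [name]))
    raise ValueError(f"no transformation from {src} to {dst}")


def plan(pipeline: List[Op]) -> List[str]:
    """Place transforms lazily (just before the consuming operator), then fuse adjacent ones."""
    steps: List[Tuple[str, str]] = [("op", pipeline[0].name)]
    current = pipeline[0].produces
    for op in pipeline[1:]:
        for t in conversion_path(current, op.requires):
            steps.append(("transform", t))
        steps.append(("op", op.name))
        current = op.produces
    # Fuse each run of consecutive transforms into one step (one pass over the data).
    # Every run is followed by an operator, so no run is left over at the end.
    fused: List[str] = []
    run: List[str] = []
    for kind, name in steps:
        if kind == "transform":
            run.append(name)
            continue
        if run:
            fused.append(run[0] if len(run) == 1 else "Fused(" + "+".join(run) + ")")
            run = []
        fused.append(name)
    return fused


scan = Op("Scan", None, "gzip_csv")
regex = Op("RegexFilter", "csv_text", "csv_text")  # hypothetical operator on raw text
agg = Op("Aggregate", "columnar", "rows")

print(plan([scan, regex, agg]))
# ['Scan', 'Decompress', 'RegexFilter', 'Parse', 'Aggregate']
print(plan([scan, agg]))
# ['Scan', 'Fused(Decompress+Parse)', 'Aggregate']
```

In the first plan, parsing is deferred until after the regex filter, so only rows that pass the filter are parsed. In the second, decompression and parsing are adjacent and are fused into a single step.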