Predictive Big Data Analytics: A Study of Parkinson’s Disease Using Large, Complex, Heterogeneous, Incongruent, Multi-Source and Incomplete Observations

Dinov, Ivo D.; Heavner, Ben; Tang, Ming; Glusman, Gustavo; Chard, Kyle; Darcy, Mike; Madduri, Ravi; Pa, Judy; Spino, Cathie; Kesselman, Carl; Foster, Ian; Deutsch, Eric W.; Price, Nathan D.; Van Horn, John D.; Ames, Joseph; Clark, Kristi; Hood, Leroy; Hampstead, Benjamin M.; Dauer, William; Toga, Arthur W.

2016

Abstract

Background: The Parkinson’s Progression Markers Initiative (PPMI) collects, manages and disseminates a unique archive of Big Data on Parkinson’s disease (PD). Integrating such complex and heterogeneous Big Data from multiple sources offers unparalleled opportunities to study the early stages of prevalent neurodegenerative processes, track their progression and quickly identify the efficacies of alternative treatments. Many previous human and animal studies have examined the relationship of PD risk to trauma, genetics, environment, co-morbidities, or lifestyle. The defining characteristics of Big Data (large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources) all pose challenges to the classical techniques for data management, processing, visualization and interpretation. We propose, implement, test and validate complementary model-based and model-free approaches for PD classification and prediction. To explore PD risk using Big Data methodology, we jointly processed complex PPMI imaging, genetics, clinical and demographic data.

Methods and Findings: Collective representation of the multi-source data facilitates the aggregation and harmonization of complex data elements. This enables joint modeling of the complete data, leading to the development of Big Data analytics, predictive synthesis, and statistical validation. Using heterogeneous PPMI data, we developed a comprehensive protocol for end-to-end data characterization, manipulation, processing, cleaning, analysis and validation. Specifically, we (i) introduce methods for rebalancing imbalanced cohorts, (ii) utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions, and (iii) generate reproducible machine-learning-based classification that enables the reporting of model parameters and diagnostic forecasting based on new data. We evaluated several complementary model-based predictive approaches, which failed to generate accurate and reliable diagnostic predictions. However, the results of several machine-learning-based classification methods indicated significant power to predict Parkinson’s disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using statistical n-fold cross-validation). Clinical (e.g., Unified Parkinson’s Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetic (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.
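
For readers unfamiliar with the reported metrics, the following is a minimal Python sketch of how accuracy, sensitivity and specificity can be computed from predicted and true diagnoses, and how sample indices can be split into n folds for cross-validation. It is an illustration only, not the authors' protocol: the function names, the label coding (1 = PD, 0 = control) and the shuffling seed are all assumptions made for this example.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train, test) index lists for n-fold cross-validation.

    Hypothetical helper: indices are shuffled once, then dealt into folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


def diagnostic_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels.

    Assumed coding: `positive` marks a PD diagnosis, anything else a control.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # PD correctly predicted
            else:
                fp += 1  # control predicted as PD
        else:
            if truth == positive:
                fn += 1  # PD missed
            else:
                tn += 1  # control correctly predicted
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }
```

In a cross-validation run, a classifier would be fitted on each training split and `diagnostic_metrics` applied to its predictions on the held-out split, and the per-fold values then summarized.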

Conclusions: Model-free Big Data machine-learning-based classification methods (e.g., adaptive boosting, support vector machines) can outperform model-based techniques in terms of predictive precision and reliability (e.g., forecasting patient diagnosis). We observed that statistical rebalancing of cohort sizes yields better discrimination of group differences, specifically for predictive analytics based on heterogeneous and incomplete PPMI data. UPDRS scores play a critical role in predicting diagnosis, which is expected based on the clinical definition of Parkinson’s disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine-learning-based classification is over 80%. The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer’s, Huntington’s, amyotrophic lateral sclerosis), as well as for other predictive Big Data analytics applications.
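
The abstract does not say which rebalancing method was used. As a simple stand-in, the sketch below shows random oversampling, which resamples the smaller cohort with replacement until every class is as large as the largest one. The function name and the data layout (a list of feature rows with a parallel list of labels) are assumptions made for this example.

```python
import random
from collections import defaultdict


def oversample_minority(rows, labels, seed=0):
    """Rebalance cohorts by random oversampling (illustrative stand-in only).

    Each class smaller than the largest one is padded with randomly
    re-drawn copies of its own rows until all classes have equal size.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())
    balanced_rows, balanced_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            balanced_rows.append(row)
            balanced_labels.append(label)
    return balanced_rows, balanced_labels
```

Rebalancing of this kind is normally applied to the training split only, so that duplicated subjects do not leak into the held-out data used for validation.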

Details

Author

Dinov, Ivo D. : University of Michigan
Heavner, Ben : Institute for Systems Biology
Tang, Ming : University of Michigan
Glusman, Gustavo : Institute for Systems Biology
Chard, Kyle : University of Chicago
Darcy, Mike : University of Southern California
Madduri, Ravi : University of Chicago
Pa, Judy : University of Southern California
Spino, Cathie : University of Michigan
Kesselman, Carl : University of Southern California
Foster, Ian : University of Chicago
Deutsch, Eric W. : Institute for Systems Biology
Price, Nathan D. : Institute for Systems Biology
Van Horn, John D. : University of Southern California
Ames, Joseph : University of Southern California
Clark, Kristi : University of Southern California
Hood, Leroy : Institute for Systems Biology
Hampstead, Benjamin M. : University of Michigan
Dauer, William : University of Michigan
Toga, Arthur W. : University of Southern California

Content Type

Article

Published in

PLOS ONE

Identifier(s)

DOI: https://doi.org/10.1371/journal.pone.0157077

Data availability statement

All relevant data are within the paper, its Supporting Information files, and the resource references provided in the manuscript.

Funding Information

National Science Foundation, 1023115
National Science Foundation, 1022560
National Science Foundation, 1022636
National Science Foundation, 0089377
National Science Foundation, 9652870
National Science Foundation, 0442992
National Science Foundation, 0442630
National Science Foundation, 0333672
National Science Foundation, 0716055
National Institutes of Health, P20 NR015331
National Institutes of Health, P50 NS091856
National Institutes of Health, P30 DK089503
National Institutes of Health, U54 EB020406

Publication Date

2016-08-05

Language

English

Copyright Statement

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Licensing

CC BY

Record Appears in

Physical Sciences Division > Computer Science
All

Record Created

2023-08-07
