Published April 11, 2019 | Version v1
Journal article Open

Reproducible big data science: A case study in continuous FAIRness

Description

Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility—thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study, and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.

Data availability

Pointers (URLs) to all relevant data and analysis are within the paper. The identifiers and URLs for the data objects used as inputs and generated as outputs are listed below; they appear in the manuscript as Table 2.

DNase-Seq data
Identifier: minid:b9dt2t
Landing page: http://minid.bd2k.org/minid/landingpage/ark:/57799/b9dt2t
Data: http://s3.amazonaws.com/bdds-public/bags/bagofbags/FASTQ_ENCODE_Input_BagOfBags.zip

Aligned BAM files
Identifier: minid:b9vx04
Landing page: http://minid.bd2k.org/minid/landingpage/ark:/57799/b9vx04
Data: https://s3.amazonaws.com/bdds-public/bags/bagofbags/BAMS_BagOfBags.zip

Collection of BED files of footprints
Identifier: minid:b9496p
Landing page: http://minid.bd2k.org/minid/landingpage/ark:/57799/b9496p
Data: http://s3.amazonaws.com/bdds-public/bags/bagofbags/BagOfBags_Of_Footprints.zip

Non-redundant motifs database
Identifier: minid:b97957
Landing page: http://minid.bd2k.org/minid/landingpage/ark:/57799/b97957
Data: http://s3.amazonaws.com/bdds-public/fimo/non-redundant_fimo_motifs.meme

Motif intersected hits
Identifier: minid:b9p09p
Landing page: http://minid.bd2k.org/minid/landingpage/ark:/57799/b9p09p
Data: http://s3.amazonaws.com/bdds-public/index_dbs/2017_07_27_fimo

Transcription factor binding sites (TFBS) generated from the study
Identifier: minid:b9v398
Landing page: http://minid.bd2k.org/minid/landingpage/ark:/57799/b9v398
Data: http://s3.amazonaws.com/bdds-public/bags/bagofbags/TFBS_BagOfBags.zip

Additionally, the web resource http://fair-data.net provides pointers to instructions on how the datasets can be further used.
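Every landing page listed in this statement shares one URL prefix followed by the part of the minid after the colon. The following Python sketch illustrates that mapping. The names `LANDING_PREFIX`, `DATA_URLS`, and `landing_page` are hypothetical and introduced here only for illustration; the prefix is inferred from the URLs in this record rather than from any documented resolver API, and the sketch does not check whether the links still resolve.

```python
# Prefix inferred from the landing-page URLs listed in this record (an assumption,
# not a documented resolver API).
LANDING_PREFIX = "http://minid.bd2k.org/minid/landingpage/ark:/57799/"

# Data locations copied verbatim from the data availability statement.
DATA_URLS = {
    "minid:b9dt2t": "http://s3.amazonaws.com/bdds-public/bags/bagofbags/FASTQ_ENCODE_Input_BagOfBags.zip",
    "minid:b9vx04": "https://s3.amazonaws.com/bdds-public/bags/bagofbags/BAMS_BagOfBags.zip",
    "minid:b9496p": "http://s3.amazonaws.com/bdds-public/bags/bagofbags/BagOfBags_Of_Footprints.zip",
    "minid:b97957": "http://s3.amazonaws.com/bdds-public/fimo/non-redundant_fimo_motifs.meme",
    "minid:b9p09p": "http://s3.amazonaws.com/bdds-public/index_dbs/2017_07_27_fimo",
    "minid:b9v398": "http://s3.amazonaws.com/bdds-public/bags/bagofbags/TFBS_BagOfBags.zip",
}


def landing_page(minid: str) -> str:
    """Build the landing-page URL for an identifier written as 'minid:<suffix>'."""
    scheme, sep, suffix = minid.partition(":")
    if scheme != "minid" or not sep or not suffix:
        raise ValueError(f"not a minid: {minid!r}")
    return LANDING_PREFIX + suffix


if __name__ == "__main__":
    # Print each identifier with its landing page and data location.
    for minid, data_url in DATA_URLS.items():
        print(minid, landing_page(minid), data_url, sep="\n  ")
```

Keeping the data URLs in a lookup table rather than deriving them reflects the statement itself: the landing pages follow a single pattern, while the data locations do not.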

Files

journal.pone.0213013.pdf

Files (2.4 MB)

Name: Article
Size: 2.4 MB
md5:6ae63b66febbc1008c515c279954a969
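The MD5 checksum in the file listing can be used to confirm that a downloaded copy of the article is intact. The sketch below uses Python's standard `hashlib` module; the function names `md5_of_file` and `matches_listing` are hypothetical, and the local path passed to them is whatever name the file was saved under.

```python
import hashlib

# Checksum copied from the file listing in this record.
EXPECTED_MD5 = "6ae63b66febbc1008c515c279954a969"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """True if the file's MD5 digest equals the checksum listed in this record."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `matches_listing("journal.pone.0213013.pdf")` would check a copy saved under that name in the current directory. MD5 is suitable here only for detecting accidental corruption, not deliberate tampering.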

Additional details

Identifiers

DOI
10.1371/journal.pone.0213013
Other
oai:uchicago.tind.io:6305

Funding

National Institutes of Health
Big Data for Discovery Science Center
National Institutes of Health
A Commons Platform for Promoting Continuous Fairness
National Institutes of Health
Hardening Globus Genomics
DOE
DE-AC02-06CH11357

UChicago Information

Division(s)
Physical Sciences Division
Department(s)
Computer Science
Center(s) or Institute(s)
Becker Friedman Institute for Economics