Published March 13, 2023 | Version v1
Journal article Open

Towards self-describing and FAIR bulk formats for biomedical data

  • 1. University of Chicago

Description

We introduce a self-describing serialized format for bulk biomedical data called the Portable Format for Biomedical (PFB) data. The Portable Format for Biomedical data is based upon Avro and encapsulates a data model, a data dictionary, the data itself, and pointers to third party controlled vocabularies. In general, each data element in the data dictionary is associated with a third party controlled vocabulary to make it easier for applications to harmonize two or more PFB files. We also introduce an open source software development kit (SDK) called PyPFB for creating, exploring and modifying PFB files. We describe experimental studies showing the performance improvements when importing and exporting bulk biomedical data in the PFB format versus using JSON and SQL formats.
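To illustrate why attaching controlled-vocabulary pointers to each data element can make harmonization easier, the following is a minimal sketch using only the Python standard library. It does not use PyPFB and does not reproduce the actual PFB or Avro layout. The file structure, the `term` annotation, the vocabulary name `EXAMPLE_VOCAB`, the codes, and the helpers `make_file`, `term_index`, and `shared_fields` are all hypothetical and chosen for illustration only.

```python
import json

# Hypothetical, simplified stand-in for a self-describing file: the data
# dictionary (schema) travels together with the records. NOT the real PFB layout.
def make_file(node, fields, records):
    return {
        "schema": {"type": "record", "name": node, "fields": fields},
        "records": records,
    }

# Two files that name the same concept differently ("sex" vs "gender") but
# point at the same (made-up) controlled-vocabulary code.
file_a = make_file(
    "subject",
    [
        {"name": "sex", "type": "string",
         "term": {"vocabulary": "EXAMPLE_VOCAB", "code": "C0001"}},
        {"name": "age_years", "type": "int",
         "term": {"vocabulary": "EXAMPLE_VOCAB", "code": "C0002"}},
    ],
    [{"sex": "female", "age_years": 54}],
)
file_b = make_file(
    "participant",
    [
        {"name": "gender", "type": "string",
         "term": {"vocabulary": "EXAMPLE_VOCAB", "code": "C0001"}},
        {"name": "bmi", "type": "float",
         "term": {"vocabulary": "EXAMPLE_VOCAB", "code": "C0003"}},
    ],
    [{"gender": "male", "bmi": 27.1}],
)

def term_index(data_file):
    """Map (vocabulary, code) to the local field name in one file."""
    return {
        (f["term"]["vocabulary"], f["term"]["code"]): f["name"]
        for f in data_file["schema"]["fields"]
    }

def shared_fields(a, b):
    """Pair up fields from two files that reference the same vocabulary term."""
    index_a, index_b = term_index(a), term_index(b)
    return {key: (index_a[key], index_b[key])
            for key in index_a.keys() & index_b.keys()}

matches = shared_fields(file_a, file_b)

# Because the schema is embedded, a serialized copy carries everything a
# consumer needs to interpret the records.
round_tripped = json.loads(json.dumps(file_a))
```

Here `matches` pairs `sex` in the first file with `gender` in the second because both reference the same vocabulary code, even though their local names differ. A real workflow would read these annotations from the PFB file itself rather than from hand-built dictionaries.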

Data availability

The PyPFB software can be obtained from https://github.com/uc-cdis/pypfb. The data from the experimental studies can be obtained from: https://github.com/uc-cdis/pfb-paper-artifacts.

Files

Towards-self-describing-and-FAIR-bulk-formats-for-biomedicaldata.pdf (875.7 kB)

Additional details

Identifiers

DOI
10.1371/journal.pcbi.1010944
Other
oai:uchicago.tind.io:5670

Funding

National Heart, Lung, and Blood Institute
U2CHL138346

UChicago Information

Division(s)
Biological Sciences Division, Physical Sciences Division
Department(s)
Computer Science, Medicine
Center(s) or Institute(s)
Center for Translational Data Science