The Genomic Data Commons (GDC) is a data platform for managing, processing, analyzing, and sharing cancer genomics data. Its data processing component is the GDC Pipeline Automation System (GPAS). GPAS currently runs multiple bioinformatics pipelines on an on-premises cluster of virtual machines (VMs) and bare-metal machines. GPAS has been in production for over two years, and valuable pipeline statistics are scattered across multiple databases on the platform. This dissertation presents a platform-wide statistics collection service for GPAS and, based on the synthesized statistics, identifies and investigates several performance issues. The first performance issue is that jobs on VMs exhibit highly variable performance. In particular, there can be a very long tail, with some VMs taking significantly longer than others to execute the same jobs. Through an analysis of job statistics and traces, we find that the root cause is the virtual machine memory management layer in the VM hypervisor: when this layer is overwhelmed by intensive lookups of memory mappings from the virtual machine to the physical host, VM performance degrades. The second performance issue concerns job scheduling. Through an analysis of production statistics, we find that GPAS's overall work progress can be delayed by days even if only a small percentage of jobs fail. The dissertation lists, with evidence, several other drawbacks of the current simple job scheduling model and proposes a more sophisticated task-based scheduling model. Lastly, a thorough literature review is presented toward a vision for GPAS with further improved pipeline performance.



