Abstract
To date, over 170 types of modifications have been identified in RNA, of which around 10 have been found in mRNA. RNA modifications play important roles in transcription, mRNA stability, decay, splicing, and translation; they regulate gene expression and affect metabolism. It is therefore important to understand the abundance and distribution of RNA modifications across the transcriptome, both to learn how these modifications affect metabolism and to learn how they are regulated to carry out their functions. Next-generation sequencing (NGS) methods provide a group of strategies for mapping the transcriptome-wide distribution of RNA modifications and have resulted in meaningful biological discoveries. However, only DNA molecules can be sequenced directly by NGS methods, so all RNA modifications are detected indirectly, relying on mutations, indels, reverse transcription stops, or immunoprecipitation enrichment caused by the modified sites. Over the past decade, the development of nanopore sequencing has enabled the direct sequencing of RNA molecules, and with it the direct detection of RNA modifications. In this dissertation, I develop two machine-learning-based pipelines, NanoPsu and NanoSPA, for identifying mRNA modifications from nanopore direct RNA sequencing data. NanoPsu identifies pseudouridine modifications in the human transcriptome and reveals a correlation between interferon-induced gene expression and pseudouridylation. NanoSPA enables simultaneous mapping of mRNA m6A and pseudouridine in the human transcriptome and reveals the anti-coordination of the two modifications. Both m6A and pseudouridine are found to have a positive effect on translation, with pseudouridine having the stronger effect. In addition, others in the Pan Lab and I attempted to develop a pipeline that predicts pseudouridine at the single-read level, revealing the stoichiometry of pseudouridine and the linkages between multiple modification sites.
This study develops pipelines that facilitate the identification of modifications from nanopore direct RNA sequencing data and reveals potential roles of these modifications in the viral infection response and in translation. The methods could be applied to other species and samples to enable further biological discoveries. The pipelines are designed to be convenient for public use and could readily be extended to more RNA modifications in the future.