Rapid Development and Distribution of Statistical Tools for High-Throughput Sequencing Data

WP5 Transcript-level Expression Estimation and Comparison

RNA-Seq data provide the opportunity to identify the different transcript isoforms generated from a gene locus and to distinguish between alleles (typically two in a diploid organism). Across different experimental conditions, tissue types, etc., they allow us to ask whether isoform usage or allele-specific expression is differential, i.e. shows evidence of regulation. A complication is that current HTS technologies produce short reads, which are generally not long enough to identify the isoform of origin directly.
Probabilistic models have been introduced to estimate transcript expression levels from RNA-Seq data for genes with multiple alternative isoforms and/or allelic variants. Bayesian methods provide an attractive approach, as they can be used to quantify the uncertainty and covariation in expression estimates for closely related transcripts. However, current Bayesian approaches rely on Gibbs samplers to generate samples from the posterior distribution of transcript expression levels. Gibbs samplers are known to mix slowly when the posterior distribution is highly correlated, which is often the case for the most challenging gene models with many similar alternative transcripts. There is also a pressing need to extend existing models to the identification of unannotated transcripts and the detection of incorrect gene model annotations. Probabilistic models for this task have a very large state space, again making MCMC prohibitively slow. In this objective we will develop efficient algorithms that make inference in the most advanced read-level probabilistic models practical, given the rapidly increasing size of datasets and the increasing complexity of gene models.
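
To illustrate the kind of Gibbs sampler referred to above, the following is a minimal sketch, not the implementation used in this project or in any published tool. It assumes a simple mixture model in which each read is assigned to one transcript, the relative expression levels theta have a symmetric Dirichlet prior, and the per-read likelihoods are given. The function names, the toy data, and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def gibbs_transcript_expression(lik, n_iter=2000, burn_in=500, alpha=1.0, seed=0):
    """Sample relative transcript expression levels with a basic Gibbs sampler.

    lik[n][m] is the (assumed known) likelihood of read n given transcript m,
    and is 0 where the read is incompatible with the transcript. Every read is
    assumed to be compatible with at least one transcript.
    """
    rng = random.Random(seed)
    n_transcripts = len(lik[0])
    theta = [1.0 / n_transcripts] * n_transcripts
    samples = []
    for it in range(n_iter):
        # Step 1: sample a transcript assignment for each read given theta.
        counts = [0] * n_transcripts
        for row in lik:
            weights = [t * l for t, l in zip(theta, row)]
            z = rng.choices(range(n_transcripts), weights=weights)[0]
            counts[z] += 1
        # Step 2: sample theta given the assignments, Dirichlet(alpha + counts),
        # drawn as normalised Gamma variates.
        gammas = [rng.gammavariate(alpha + c, 1.0) for c in counts]
        total = sum(gammas)
        theta = [g / total for g in gammas]
        if it >= burn_in:
            samples.append(theta)
    return samples


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def toy_likelihoods():
    """Hypothetical toy data: transcripts 0 and 1 share most of their reads,
    transcript 2 has only uniquely mapping reads."""
    ambiguous = [[1.0, 1.0, 0.0]] * 80   # compatible with transcripts 0 and 1
    unique_0 = [[1.0, 0.0, 0.0]] * 5
    unique_1 = [[0.0, 1.0, 0.0]] * 5
    unique_2 = [[0.0, 0.0, 1.0]] * 60
    return ambiguous + unique_0 + unique_1 + unique_2


if __name__ == "__main__":
    samples = gibbs_transcript_expression(toy_likelihoods())
    theta_0 = [s[0] for s in samples]
    theta_1 = [s[1] for s in samples]
    print("posterior correlation between transcripts 0 and 1:",
          pearson(theta_0, theta_1))
```

In this toy setting the two transcripts that share most of their reads have negatively correlated posterior expression levels, since reads attributed to one cannot be attributed to the other. This is the kind of posterior covariation in which a sampler that alternates between read assignments and expression levels tends to move slowly.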

University of Manchester (Lead Partner)
University of Zurich
University of Sheffield