From Promoter Sequence to Expression: A Probabilistic Framework

E. Segal, Y. Barash, I. Simon, N. Friedman, and D. Koller

Proc. Sixth Annual Inter. Conf. on Computational Molecular Biology (RECOMB), 2002.



We present a probabilistic framework that models the process by which transcriptional binding explains the mRNA expression of different genes. Our joint probabilistic model unifies the two key components of this process: the prediction of gene regulation events from sequence motifs in the gene's promoter region, and the prediction of mRNA expression from combinations of gene regulation events in different settings. Our approach has several advantages. By learning promoter sequence motifs that are directly predictive of expression data, it can improve the identification of binding site patterns. It is also able to identify combinatorial regulation via interactions of different transcription factors. Finally, the general framework allows us to integrate additional data sources, including data from the recent binding localization assays. We demonstrate our approach on the cell cycle data of Spellman et al., combined with the binding localization information of Simon et al. We show that the learned model predicts expression from sequence, and that it identifies coherent co-regulated groups with significant transcription factor motifs. It also provides valuable biological insight into the domain via these co-regulated "modules" and the combinatorial regulation effects that govern their behavior.