Modeling Dependencies in Protein-DNA Binding Sites

Y. Barash, G. Elidan, N. Friedman, and T. Kaplan

Proc. Seventh Annual Inter. Conf. on Computational Molecular Biology (RECOMB), 2003.



The availability of whole-genome sequences and high-throughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNA-binding proteins, such as transcription factors. These methods often use position-specific score matrices (PSSMs) to represent probability distributions over possible sequences at the binding site. The PSSM model makes the strong assumption that binding site positions are independent of each other. In this work, we explore Bayesian network representations of binding sites that provide different trade-offs between complexity (number of parameters) and the richness of dependencies between positions. We develop the formal machinery for learning such models from data and for estimating the statistical significance of putative binding sites. We then evaluate the ramifications of these richer representations in characterizing binding site motifs and predicting their genomic locations. We show that richer representations improve over the PSSM model on both tasks.
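
To make the independence assumption concrete, the following minimal sketch contrasts a PSSM with a first-order chain model, in which each position depends only on the one before it. The chain is just one simple special case of a Bayesian network and is not claimed to be the model class studied in the paper; the function names, the toy aligned sites, and the pseudocount of 1 are all illustrative assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def fit_pssm(sites, pseudo=1.0):
    """Estimate one independent nucleotide distribution per position."""
    length = len(sites[0])
    total = len(sites) + pseudo * len(ALPHABET)
    model = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        model.append({a: (counts[a] + pseudo) / total for a in ALPHABET})
    return model

def pssm_log_prob(model, seq):
    """Log-probability of seq when positions are treated as independent."""
    return sum(math.log(model[i][c]) for i, c in enumerate(seq))

def fit_chain(sites, pseudo=1.0):
    """Estimate P(x_0) and P(x_i | x_{i-1}) for i >= 1 (hypothetical chain model)."""
    first = fit_pssm([s[0] for s in sites], pseudo)[0]
    cond = []
    for i in range(1, len(sites[0])):
        pair = Counter((s[i - 1], s[i]) for s in sites)
        prev = Counter(s[i - 1] for s in sites)
        table = {}
        for p in ALPHABET:
            total = prev[p] + pseudo * len(ALPHABET)
            table[p] = {a: (pair[(p, a)] + pseudo) / total for a in ALPHABET}
        cond.append(table)
    return first, cond

def chain_log_prob(model, seq):
    """Log-probability of seq under the first-order chain model."""
    first, cond = model
    lp = math.log(first[seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(cond[i - 1][seq[i - 1]][seq[i]])
    return lp

# Toy aligned sites (made up): positions 0 and 1 always agree.
sites = ["AAT", "AAT", "CCT", "CCT"]
pssm = fit_pssm(sites)
chain = fit_chain(sites)

for seq in ("AAT", "ACT"):
    print(seq, pssm_log_prob(pssm, seq), chain_log_prob(chain, seq))
```

On this toy data the PSSM cannot tell "AAT" (seen in the training sites) from "ACT" (never seen), because each position is scored separately, whereas the chain model scores "AAT" higher. The price is more parameters: 3 free parameters per position for the PSSM versus 12 for every position after the first in the chain.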