Modeling Dependencies in Protein-DNA Binding Sites

Yoseph Barash, Gal Elidan, Nir Friedman, Tommy Kaplan

School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904 Israel


Paper (Postscript, PDF) to appear in RECOMB'03.



The availability of whole-genome sequences and high-throughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNA-binding proteins, such as transcription factors. A common representation of transcription factor binding sites is a position-specific score matrix (PSSM). This representation makes the strong assumption that binding site positions are independent of each other. In this work, we explore Bayesian network representations of binding sites that provide different tradeoffs between complexity (number of parameters) and the richness of dependencies between positions. We develop the formal machinery for learning such models from data and for estimating the statistical significance of putative binding sites. We then evaluate the ramifications of these richer representations in characterizing binding site motifs and predicting their genomic locations. We show that these richer representations improve over the PSSM model in both tasks.
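As an illustration of the independence assumption mentioned above, the following is a minimal sketch of a PSSM-style model, not the implementation used in the paper. The function names (`learn_pssm`, `log_likelihood`), the pseudocount value, and the toy aligned sites are all assumptions made for this example. Because positions are treated as independent, the log-probability of a candidate site is simply a sum of per-position terms.

```python
import math

ALPHABET = "ACGT"

def learn_pssm(sites, pseudocount=1.0):
    """Estimate per-position nucleotide probabilities from aligned sites.

    `sites` is a list of equal-length strings over ACGT (hypothetical input).
    The pseudocount is an assumed smoothing choice, added to every count.
    """
    length = len(sites[0])
    pssm = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({base: counts[base] / total for base in ALPHABET})
    return pssm

def log_likelihood(pssm, seq):
    """Log-probability of `seq` under the PSSM.

    Positions are assumed independent, so the per-position log terms add.
    """
    return sum(math.log(column[base]) for column, base in zip(pssm, seq))

# Toy aligned binding sites (made up for illustration only).
toy_sites = ["ACGT", "ACGA", "ACCT"]
toy_pssm = learn_pssm(toy_sites)
```

With this toy model, `log_likelihood(toy_pssm, "ACGT")` is higher than `log_likelihood(toy_pssm, "TTTT")`, since the former matches the most frequent nucleotide at every position. A model with dependencies between positions would replace some of the per-position terms with conditional probabilities given other positions.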



A. Supplement to Section 3.2:


1. Test data performance on aligned binding sites from TRANSFAC


2. Improvement in log-loss/instance on 95 test sets


B. Supplements to Section 6:


1. Synthetic experiments


2. Test data performance on location data of yeast genes (based on Lee et al., 2002, Supplementary data).


3. Comparison to AlignAce on functional groups of genes (based on Hughes et al., 2000, Supplementary data).


4. Test data performance on functional groups of genes (based on Hughes et al., 2000, Supplementary data).


5. Test data performance on gene expression clusters (based on Tavazoie et al., 2000, Supplementary data).


C. Supplement to Section 4: Comparison of p-value computation procedures


Contact information:
Yoseph Barash <hoan at>
Gal Elidan <galel at>
Tommy Kaplan <tommy at>