Modeling Dependencies in Protein-DNA Binding Sites:
Statistical Significance of Binding Sites
In order to use the learned models in identifying putative binding sites within given sequences, we should know how to measure their statistical confidences. To score a given K-mer we use the log likelihood ratio between the binding model and the background model. We then would like to assign the statistical significance (p-value) to a given score, i.e. to evaluate the probability of getting such a score (or a better one) for a K-mer generated by the background model.

Exact p-value calculation
Ideally, we would like to use an exact method to describe the score distribution and cumulative distribution, (see for example Bejerano, RECOMB 2003). This option, however, is not efficiently practical in our case due to the high complexity of our models.

The Naive Approximation
Empirical p-values can be computed using a naive sampling approach. Using this scheme, a very large pool of K-mers is sampled from the background distribution, then each K-mer is scored according to its log-likelihood ratio. For each possible threshold, the cumulative distribution function (a CDF) is calculated by counting how many K-mers from the pool have passed it.

The main drawback of this method concerns the huge number of samples needed for the exact estimation of significant p-values, due to the spareness of high-scoring samples in the background distribution. This problem only increases as we introduce a Bonferroni Correction to the p-values, taking into account the number of trials we've done when search for putative significance sites.

Gaussian Approximation
In the cases where our model contains no hidden variables (e.g. a PSSM or a Tree network), it is possible to derive decomposable exact formulas for the score's first two moments under the background model. For other models, these statistics can be approximated easily and accurately. This opens the way to the approximation of the score distribution using a Gaussian (i.e. a Normal distribution).

Approximation using Importance Sampling
Once again, as in the Naive approach, we'll approximate the statistical significance of a K-mer using an empirical distribution function, derived from a sampled set of K-mers. However, to avoid vast amounts of samplings we can use the Importance Sampling method. This technique approximates the background score distribution, but with a higher resolution on areas of interest (i.e. those with extremely high scores). It does so by sampling from a different distribution Q( ) over the K-mers, and weighting the samples according to their likelihood ratio between the background distribution and Q( ). Formally speaking, for a sampled K-mer:
Weight(K\mbox{-mer}) = \frac{P(K\mbox{-mer}|BG)}{Q(K\mbox{-mer})}

In our case, we want Q( ) to be a mixture of models: Q(K\mbox{-mer}) = \sum_i p(Q_i) \cdot Q_i(K\mbox{-mer})
where Q0( ) is merely the background model, QI( ) is the binding model, and p(Qi) is the mixing ratio of the model Qi( ).

Figure 1: Example of Q( ) : a mixture of 5 distributions Q0 ... Q4
Q0 (bg model) Q1 Q2 Q3 Q4 (binding model)
10,000 samples 5,623 samples 3,162 samples 1,778 samples 1,000 samples

Comparison of score CDFs using 3 methods
The following two figures show cumulative distribution functions for the binding site model used for our synthetic results (See paper, section 6a). Here, 1,000,000 K-mers were sampled from a Human 3-order markov chain background model. These were assigned log-odds scores, and were used to build a cumulative distribution function (Naive approximating - red line). Another 1,000,000 K-mers were sampled from a Normal distribution whose mean and variance were calculated using the binding and background models (Normal distribution approximating - green line). Using the Importance Sampling method (blue line) 40,882 K-mers were sampled from ten different models Q0 ... Q9 (9986, 7732, 5987, 4636, 3590, 2780, 2153, 1667, 1291, 1000 samples).

Although all three CDFs look very similar, zooming in to the interesting area (of high scoring K-mers), show significant difference between the Normal approximation, and the other two CDFs.

Note that the the Importance Sampling approximation works well, resulting in a very accurate threshold for sites with significant p-values (smaller than 0.01 after Bonferroni Correction. With a typical promoter length of 500bp, checking on both strands, this equals to a p-value of 0.00001). The Normal distribution approximation, however, tends to predict a more significant p-value than the real one (up to a 20-fold fluctuation), resulting in many false-positive sites.

To summarize, this figures shows that using a small and efficient sample set, the Importance Sampling method is capable of accurately approximating the score CDF.