To use the learned models for identifying putative binding sites in given
sequences, we need a way to measure the statistical confidence of a match.
To score a given K-mer we use the log-likelihood ratio between the
binding model and the background model.
We then want to assign a statistical significance (p-value) to a given score,
i.e. to evaluate the probability of obtaining such a score (or a better one) for
a K-mer generated by the background model.
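As a concrete illustration, the log-likelihood-ratio score of a K-mer under a simple PSSM-style binding model could be sketched as follows (the base distributions here are hypothetical examples, not the models from the paper):

```python
import math

# Hypothetical example distributions; the paper's actual models are richer.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
BINDING = [  # one distribution per position (a PSSM-like model)
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]

def llr_score(kmer, binding=BINDING, background=BACKGROUND):
    """Log-likelihood ratio: log P(kmer | binding) - log P(kmer | background)."""
    return sum(math.log(binding[i][b] / background[b])
               for i, b in enumerate(kmer))
```

A K-mer that fits the binding model better than the background gets a positive score, and vice versa.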
Exact p-value calculation
Ideally, we would like to use an exact method to describe the score distribution
and its cumulative distribution (see, for example, Bejerano, RECOMB 2003).
This option, however, is not practical in our case due to the high complexity
of our models.
The Naive Approximation
Empirical p-values can be computed using a naive sampling approach.
In this scheme, a very large pool of K-mers is sampled from the
background distribution, and each K-mer is scored according to its
log-likelihood ratio.
For each possible threshold, the cumulative distribution function (CDF)
is then calculated by counting how many K-mers from the pool pass it.
The main drawback of this method is the huge number of samples needed for
an accurate estimation of significant p-values, due to the sparseness of
high-scoring samples in the background distribution. The problem only worsens
once we apply a Bonferroni correction to the p-values, which accounts for
the number of trials performed when searching for putative binding sites.
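A minimal sketch of this naive scheme, assuming for simplicity an i.i.d. background over bases rather than the Markov-chain background used in the text (`score_fn` stands for any K-mer scoring function):

```python
import random

def naive_pvalue(threshold, k, score_fn, bg_probs, n_samples=100_000, seed=0):
    """Empirical p-value: fraction of background K-mers scoring >= threshold."""
    rng = random.Random(seed)
    bases, weights = zip(*bg_probs.items())
    hits = 0
    for _ in range(n_samples):
        kmer = "".join(rng.choices(bases, weights=weights, k=k))
        if score_fn(kmer) >= threshold:
            hits += 1
    return hits / n_samples

def bonferroni(pvalue, n_tests):
    """Bonferroni correction for n_tests candidate positions."""
    return min(1.0, pvalue * n_tests)
```

The drawback described above is visible here: a p-value around 1e-5 needs millions of samples before even a handful of background K-mers cross the threshold.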
Gaussian Approximation
In cases where our model contains no hidden variables (e.g. a PSSM or a
tree network), it is possible to derive decomposable exact formulas for the
score's first two moments under the background model. For other models, these
statistics can be approximated easily and accurately.
This opens the way to approximating the score distribution with a
Gaussian (i.e. Normal) distribution.
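For a PSSM with an i.i.d. background, the per-position score contributions are independent, so the first two moments decompose over positions. A sketch of this decomposition and the resulting Normal-tail p-value (the paper's background is a Markov chain, for which these moments would be approximated rather than computed this way):

```python
import math

def gaussian_pvalue(score, binding, background):
    """Normal-tail approximation of P_bg(score' >= score) for a PSSM score.

    binding: list of per-position base distributions; background: base probs.
    The mean and variance of the LLR under the background decompose
    position by position.
    """
    mean = var = 0.0
    for pos in binding:
        terms = {b: math.log(pos[b] / background[b]) for b in background}
        m1 = sum(background[b] * terms[b] for b in background)
        m2 = sum(background[b] * terms[b] ** 2 for b in background)
        mean += m1
        var += m2 - m1 * m1
    z = (score - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of N(mean, var)
```

The approximation is cheap and sample-free, but as shown below it can misestimate the extreme tail.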
Approximation using Importance Sampling
Once again, as in the naive approach, we approximate the statistical significance
of a K-mer using an empirical distribution function derived from a
sampled set of K-mers. However, to avoid a vast number of samples
we can use the Importance Sampling method.
This technique approximates the background score distribution, but with higher
resolution in the areas of interest (i.e. those with extremely high scores). It does
so by sampling from a different distribution Q over the K-mers,
and weighting each sample by the likelihood ratio between the
background distribution and Q.
Formally speaking, a sampled K-mer s receives the weight
    w(s) = P_bg(s) / Q(s),
so that the weighted empirical CDF estimates the background score CDF.
In our case, we want Q to be a mixture of models:
    Q(s) = sum_i p(Q_i) * Q_i(s),
where Q_0 is simply the background model, Q_N is the binding model, and
p(Q_i) is the mixing ratio of model Q_i.
Figure 1: Example of Q, a mixture of 5 distributions Q_0 ... Q_4:
    Q_0 (background model): 10,000 samples
    Q_1: 5,623 samples
    Q_2: 3,162 samples
    Q_3: 1,778 samples
    Q_4 (binding model): 1,000 samples
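The scheme above can be sketched as follows. For simplicity each Q_i here is an i.i.d. base distribution (in the paper the Q_i are full sequence models); the mixing ratios p(Q_i) must sum to 1 for the mixture density to be valid:

```python
import random

def is_pvalue(threshold, k, score_fn, bg_probs, components, mix,
              n_samples=10_000, seed=0):
    """Importance-sampling estimate of P_bg(score >= threshold).

    components: per-base distributions Q_0 ... Q_N (Q_0 = background model);
    mix: mixing ratios p(Q_i), summing to 1. Each sample is weighted by
    P_bg(kmer) / Q(kmer), where Q is the mixture density.
    """
    rng = random.Random(seed)

    def density(kmer, probs):
        p = 1.0
        for b in kmer:
            p *= probs[b]
        return p

    total = 0.0
    for _ in range(n_samples):
        comp = rng.choices(components, weights=mix)[0]   # pick a component
        bases, weights = zip(*comp.items())
        kmer = "".join(rng.choices(bases, weights=weights, k=k))
        if score_fn(kmer) >= threshold:
            q = sum(w * density(kmer, c) for w, c in zip(mix, components))
            total += density(kmer, bg_probs) / q         # importance weight
    return total / n_samples
```

With only the background component (components = [bg], mix = [1.0]) this reduces to the naive estimator; adding components closer to the binding model concentrates samples in the high-scoring tail while the weights keep the estimate unbiased.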
Comparison of score CDFs using three methods
The following two figures show cumulative distribution functions for the binding-site
model used for our synthetic results (see paper, section 6a). Here,
1,000,000 K-mers were sampled from a human 3rd-order Markov chain
background model. These were assigned log-odds scores and used to build a
cumulative distribution function (naive approximation, red
line). Another 1,000,000 K-mers were sampled from a
Normal distribution whose mean and variance were calculated from the
binding and background models (Normal-distribution
approximation, green line). Using the Importance Sampling
method (blue line), 40,822 K-mers
were sampled from ten different models Q_0 ... Q_9
(9,986, 7,732, 5,987, 4,636, 3,590, 2,780, 2,153, 1,667, 1,291, and 1,000 samples, respectively).
Although all three CDFs look very similar, zooming in on the area of interest
(high-scoring K-mers) reveals a significant difference between the
Normal approximation and the other two CDFs.
Note that the Importance Sampling approximation works well,
yielding a very accurate threshold for sites with
significant p-values (smaller than 0.01 after Bonferroni
correction; with a typical promoter length of 500bp, checked on both
strands, this corresponds to a p-value of 0.00001).
The Normal-distribution approximation, however, tends to predict a more
significant p-value than the real one (off by up to 20-fold),
resulting in many false-positive sites.
To summarize, these figures show that with a small, efficiently generated sample set,
the Importance Sampling method is capable of accurately approximating the
score CDF.
