PostScript

Presentation

A central task in the study of molecular evolution is the
reconstruction of a phylogenetic tree from sequences of current-day
taxa. The most established approach to tree reconstruction is maximum
likelihood (ML) analysis. Unfortunately, searching for the maximum
likelihood phylogenetic tree is computationally prohibitive for large
data sets. In this paper, we describe a new algorithm that uses *Structural EM* learning maximum likelihood phylogenetic trees. This
algorithm is similar to the standard EM method for edge-length
estimation, except that during iterations of the Structural EM algorithm the
topology is improved as well as the edge length. Our algorithm
performs iterations of two steps. In the *E-Step*, we use the
current tree topology and edge lengths to compute expected
sufficient statistics, which summarize the data. In the *
M-Step*, we search for a topology that maximizes the likelihood
with respect to these expected sufficient statistics. We show that
searching for better topologies inside the M-step can be done
efficiently, as opposed to standard methods for topology search. We
prove that each iteration of this procedure increases the likelihood
of the topology, and thus the procedure must converge. This
convergence point, however, can be a sub-optimal one. To escape from
such ``local optima'', we further enhance our basic EM procedure by
incorporating moves in the flavor of simulated annealing. We evaluate
these new algorithms on both synthetic and real sequence data, and
show that for protein sequences even our basic algorithm finds more
plausible trees than existing methods for searching maximum likelihood
phylogenies. Furthermore, our algorithms are dramatically faster than
such methods, enabling, for the first time, phylogenetic analysis of
large protein data sets in the maximum likelihood framework.

nir@cs.huji.ac.il