Class Discovery in Gene Expression Data

A. Ben-Dor, N. Friedman, and Z. Yakhini

Proc. Fifth Annual Inter. Conf. on Computational Molecular Biology (RECOMB), 2001.



Recent studies demonstrate the discovery of putative disease subtypes from gene expression data. The underlying computational problem is to partition the set of sample tissues into statistically meaningful classes. In this paper we present a novel approach to class discovery and develop automatic analysis methods. Our approach is based on statistically scoring candidate partitions according to the overabundance of genes that separate the different classes. Indeed, in biological datasets, an overabundance of genes separating known classes is typically observed. we measure overabundance against a stochastic null model. This allows for highlighting subtle, yet meaningful, partitions that are supported on a small subset of the genes. Using simulated annealing we explore the space of all possible partitions of the set of samples, seeking partitions with statistically significant overabundance of differentially expressed genes. We demonstrate the performance of our methods on synthetic data, where we recover planted partitions. Finally, we turn to tumor expression datasets, and show that we find several highly pronounced partitions.