Overabundance Analysis and Class Discovery in Gene Expression Data

A. Ben-Dor, N. Friedman, and Z. Yakhini




Recent studies demonstrate that disease subtypes can be discovered from gene expression data. In this paper, we propose a principled and systematic approach to the computational problem of partitioning a set of sample tissues into statistically meaningful classes. We start by describing a method, called overabundance analysis, for assessing how informative a given expression dataset is with respect to a partition of the samples. As we show, several published expression datasets exhibit an overabundance of genes that separate the known classes. We then use this method as the foundation for a novel approach to class discovery, in which we search for partitions that have a statistically significant overabundance score. We evaluate the performance of our approach on synthetic data, where we show that it can recover planted partitions. Finally, we apply it to several published tumor expression datasets and show that it finds several highly pronounced partitions.
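
To make the idea of overabundance concrete, the following is a minimal sketch rather than the method developed in the paper. It assumes a two-class partition, uses the absolute difference of class means as a stand-in separation score, and estimates per-gene p-values by permuting the sample labels. The function names, the score, the threshold alpha, and the permutation count are all illustrative assumptions. The sketch counts the genes that separate the partition at significance level alpha and compares that count with the number expected if no gene were informative.

```python
import random


def separation_score(values, labels):
    """Stand-in score (an assumption): absolute difference of the two class means."""
    class0 = [v for v, lab in zip(values, labels) if lab == 0]
    class1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(sum(class0) / len(class0) - sum(class1) / len(class1))


def permutation_p_value(values, labels, n_perm, rng):
    """Estimate how often a random relabeling separates the gene at least as well."""
    observed = separation_score(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if separation_score(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def overabundance(expression, labels, alpha=0.05, n_perm=200, seed=0):
    """Return (observed, expected) counts of genes significant at level alpha.

    expression: list of genes, each a list of expression values, one per sample.
    labels: list of 0/1 class labels, one per sample (the candidate partition).
    """
    rng = random.Random(seed)
    p_values = [permutation_p_value(gene, labels, n_perm, rng) for gene in expression]
    observed = sum(1 for p in p_values if p <= alpha)
    expected = alpha * len(expression)  # count expected if no gene is informative
    return observed, expected
```

An observed count well above the expected count suggests that the partition is reflected in the expression data. A class discovery procedure in this spirit would search over candidate label vectors for those with a large excess.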