Bayesian Network Classification with Continuous Attributes: Getting the Best of Both Discretization and Parametric Fitting

N. Friedman, M. Goldszmidt, and T. J. Lee

Fifteenth International Conference on Machine Learning (ICML), 1998.

Abstract

In a recent paper, Friedman, Geiger, and Goldszmidt (1997) introduced a classifier based on Bayesian networks, called Tree Augmented Naive Bayes (TAN), that outperforms naive Bayes and performs competitively with C4.5 and other state-of-the-art methods. This classifier has several advantages including robustness and polynomial computational complexity. One limitation of the TAN classifier is that it applies only to discrete attributes, and thus, continuous attributes must be prediscretized. In this paper, we extend TAN to deal with continuous attributes directly via parametric (e.g., Gaussians) and semiparametric (e.g., mixture of Gaussians) conditional probabilities. The result is a classifier that can represent and combine both discrete and continuous attributes. In addition, we propose a new method that takes advantage of the modeling language of Bayesian networks in order to represent attributes both in discrete and continuous form simultaneously, and use both versions in the classification. This automates the process of deciding which form of the attribute is most relevant to the classification task. It also avoids the commitment to either a discretized or a (semi)parametric form, since different attributes may correlate better with one version or the other. Our empirical results show that this latter method usually achieves classification performance that is as good as or better than either the purely discrete or the purely continuous TAN models.
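
To make the idea of parametric conditional probabilities concrete, here is a minimal Python sketch of a two-attribute, TAN-style classifier with Gaussian conditionals. It is an illustration under stated assumptions, not the paper's implementation: the abstract says only that Gaussians are used, so the linear-Gaussian dependence of the child on its parent is an assumption here, the single tree edge x -> y is fixed by hand rather than learned, and all names (`fit_gaussian`, `fit_linear_gaussian`, `train`, `predict`, `ROWS`) and the toy data are hypothetical.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, 1e-6)  # variance floor keeps the density finite

def fit_linear_gaussian(parents, children):
    """Fit child ~ N(a + b * parent, var) by least squares (assumed form)."""
    n = len(parents)
    mp = sum(parents) / n
    mc = sum(children) / n
    spp = sum((p - mp) ** 2 for p in parents)
    spc = sum((p - mp) * (c - mc) for p, c in zip(parents, children))
    b = spc / spp if spp > 0 else 0.0
    a = mc - b * mp
    var = sum((c - (a + b * p)) ** 2 for p, c in zip(parents, children)) / n
    return a, b, max(var, 1e-6)

def log_normal(value, mean, var):
    """Log density of N(mean, var) at value."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)

def train(rows):
    """rows: list of (x, y, label). Fits one set of parameters per class."""
    model = {}
    for label in sorted({r[2] for r in rows}):
        xs = [x for x, _, c in rows if c == label]
        ys = [y for _, y, c in rows if c == label]
        model[label] = {
            "log_prior": math.log(len(xs) / len(rows)),
            "x": fit_gaussian(xs),                    # p(x | class)
            "y_given_x": fit_linear_gaussian(xs, ys), # p(y | x, class)
        }
    return model

def predict(model, x, y):
    """Return the class maximizing log p(c) + log p(x|c) + log p(y|x,c)."""
    def score(params):
        a, b, var = params["y_given_x"]
        return (params["log_prior"]
                + log_normal(x, *params["x"])
                + log_normal(y, a + b * x, var))
    return max(model, key=lambda label: score(model[label]))

# Toy data (hypothetical): both classes share the same marginals for x and y,
# so only the dependence of y on x separates them.
ROWS = [
    (0.0, 0.1, 0), (1.0, 0.9, 0), (2.0, 2.1, 0), (3.0, 2.9, 0),
    (0.0, 3.1, 1), (1.0, 1.9, 1), (2.0, 1.1, 1), (3.0, -0.1, 1),
]
model = train(ROWS)
```

In this toy set the per-class means of x and of y coincide, so a model that treats the attributes as independent given the class has little to work with; the extra edge from x to y is what lets `predict(model, 0.0, 0.0)` and `predict(model, 0.0, 3.0)` fall in different classes. The dual discrete and continuous representation described in the abstract is not attempted here.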


nir@cs.huji.ac.il