Discussion and Further Research Directions

A margin-based criterion for measuring the quality of a set of features has been presented. Using this criterion we derived algorithms that perform feature selection by searching for the set that maximizes it. We suggested two new methods for maximizing the margin based-measure, G-flip which does a naive local search, and Simba which performs a gradient ascent. These are just representatives of the variety of optimization techniques (search methods) which can be used. We have also showed that the well known Relief algorithm [KR92] approximates a gradient ascent algorithm that maximizes this measure. The nature of the different algorithms presented here was demonstrated on various feature selection tasks. It was shown that our new algorithm Simba, which is gradient ascent on our margin based measure, outperforms Relief on all these tasks. One of the main advantages of the margin based criterion is the high correlation that it exhibits with the features quality. This was demonstrated in figures 2 and 3.

The margin based criterion was developed using the 1-Nearest-Neighbor classifier but we expect it to work well for any distance based classifier. Additionally to the test we made with 1-NN, we also tested our algorithms with SVM-RBF classifier and showed that they compete successfully with state-of-the-art algorithm that was designed specifically for SVM and is much more computationally demanding.

Our main theoretical result is a new rigorous bound on the finite sample generalization error of the 1-Nearest Neighbor algorithm. This bound depends on the margin obtained following the feature selection.

In the experiments we have conducted, the merits of the new algorithms were demonstrated. However, our algorithms use the Euclidean norm and assume that it is meaningful as a measure of similarity in the data. When this assumption fails, our algorithms might not work. Coping with other similarity measures will be an interesting extension to the work presented here.

The user of G-flip or Simba should chose a utility function to work with. In this paper we have demonstrated three such functions: linear, zero-one and sigmoid utility functions. The linear utility function gives equal weight to all points and thus might be sensitive to outliers. The sigmoid utility function suppress the influence of such outliers. We have also experimented that G-flip with zero-one utility uses less features than G-flip with linear utility where the sigmoid utility sits in between. It is still an open problem how to adapt the right utility for the data being studied. Never the less, much like the choice of kernel for SVM, using a validation set it is possible to find a reasonable candidate. A reasonable initial value for the parameter $\beta$ of the sigmoid utility is something of the same order of magnitude as one over the average distance between training instances. As for the choosing between G-flip and Simba; G-flip is adequate when the goal is to chose the best feature subset, without a need to control its precise size. Simba is more adequate when ranking on the features is required.

Several other research directions can be further investigated. One of them is to utilize a better optimization algorithm for maximizing our margin-based evaluation function. It is also possible to use the margin based criteria and the Simba algorithm to learn distance measures.

Another interesting direction is to link the feature selection algorithms to the Learning Vector Quantization (LVQ) [Koh95] algorithm. As was shown in [CGBNT02], LVQ can be viewed as a maximization of the very same margin term. But unlike the feature selection algorithms presented here, LVQ does so by changing prototypes location and not the subset of the features. This way LVQ produces a simple but robust hypothesis. Thus, LVQ and our feature selection algorithms maximize the same margin criterion by controlling different (dual) parameters of the problem. In that sense the two algorithms are dual. One can combine the two by optimizing the set of features and prototypes location together. This may yield a winning combination.

Ran Gilad-Bachrach 2004-12-07