
A lot of the work described above - where we study machine learning and inference methods in the context of making progress in natural language - builds on our work in Learning Theory. Over the last few years we have concentrated on fours main problems in this area, some of which have already had direct impact on machine learning for natural language and other areas.
Specifically, we looked at efficiency versus convergence properties of kernels for on-line learning algorithms.
One method of increasing the expressiveness of learned hypotheses is to expand the feature set to include conjunctions of basic features. This can be done explicitly or, where possible, by using a kernel function. Focusing on the well known Perceptron and Winnow algorithms, we demonstrated a tradeoff between the computational efficiency with which the algorithm can be run over the expanded feature space and the generalization ability of the corresponding learning algorithm [(Khardon et. al., 2002), (Khardon et. al., 2005)]. In [(Cumby and Roth, 2003)] we focused on structured domains, designed a family of kernels for these, and studied the question of when, as a function of the domain size and the number of examples, is it better to use structured kernels, vs. an explicit generation of features.
We showed that the dominant ways machine learning practitioners study multi-class classification are not expressive, but that a slight modification of a commonly used algorithm is, provably, the right way to approach multi-class classification. In doing that, we developed the constraint classification framework which captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and category ranking. We presented a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension, and presented distribution independent and margin-based generalization bounds for this algorithm, as well as empirical and theoretical evidence showing the advantage of constraint classification over existing methods of multiclass classification.
The study of generalization abilities of learning algorithms and their dependence on sample complexity is one of the fundamental research efforts in learning theory. Understanding the inherent difficulty of learning problems allows one to evaluate the possibility of learning in certain situations and to estimate the degree of confidence in the predictions made and is thus crucial in understanding and analyzing learning algorithms.
Understanding generalization is even more important when learning in very high dimensional spaces, as in many natural language and computer vision applications. Specifically, can one count on the behavior of a 106 dimensional classifier that is trained on a few examples, or even a few thousands examples? In [(Garg et. al., 2002), (Garg and Roth, 2003)] we studied generalization properties of linear learning algorithms and developed a data dependent approach that is used to derive generalization bounds that depend on the margin distribution. Our method uses random projection techniques to allow the use of existing VC dimension bounds in the effective, lower, dimension of the data, resulting in bounds that are tighter than existing bounds. We showed that these results can be used as a model selection criteria and presented the Margin Distribution Optimization (MDO) learning algorithm, that directly optimizes this complexity measure.
In a series of recent works we have studied ranking problems and, specifically, generalization properties of learning ranking functions and their appropriate complexity measures.
In many learning problems, the goal is not simply to classify objects into one of a fixed number of classes; instead, a ranking of objects is desired. This is the case, for example, in information retrieval problems, where one is interested in retrieving documents from some database that are `relevant' to a given query or topic. In such problems, one wants to return to the user a list of documents that contains more relevant documents at the top and less relevant documents at the bottom; in other words, one wants a ranking of the documents such that relevant documents are ranked higher than irrelevant documents.
The problem can be shown to be different than the traditionally studied classification problem. We concentrated on the question of what is a good or ranking function? and showed that this question is also different than the analogous well studied question for classification, in the sense that a good classification function may not always translate into a good ranking function. We went on to study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the (bipartite) ranking problem.