Learning in Natural Language

We describe below some of the background, progress and impact of our work on Machine Learning in Natural Language. More details on some of the research directions that we have pursued are presented in Sections [Learning and Inference], [Intelligent Information Access], and [Knowledge Representation and Inference].

Linear Classifiers and Discriminative Learning:

Early empirical work in natural language processing was influenced by the success of statistical speech recognition and was dominated by relatively simple statistical methods. Much of this early work, from statistical part-of-speech tagging [(Church, 1988), (Church and Mercer, 1993)] to the noisy channel model for machine translation, has roots in work conducted in the speech field. Most of it can be viewed as based on generative probability models, which provide a principled way to study statistical classification. In these models, it is common to assume a generative model for the data, estimate its most likely parameters from training data, and then use Bayes' rule to obtain a classifier for this model. Naturally, estimating the most likely parameters involves making simplifying assumptions about the generating model.

Our earlier work in this area has contributed to a better understanding of the relations between probabilistic models of classification and discriminative models, and has had a significant effect on work in natural language processing [(Roth, 1998), (Roth, 1999), (Garg and Roth, 2001)].

In [(Roth, 1998), (Roth, 1999), (Roth, 1999a), (Roth, 2000)] we have shown that the decision surface of many probabilistic classifiers (and other classifiers used in earlier work in natural language) is linear over some feature space. We have used these observations to (1) develop a learning-theoretic explanation for the success of some probabilistic methods despite the clear failure of their assumptions and (2) suggest that one should keep using the same linear representation, but develop other methods of parameter estimation, driven directly by the eventual goal: to support better predictions.
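As a concrete illustration of the linearity observation, consider naive Bayes over binary features, a standard textbook case. The minimal sketch below computes the class log-odds directly from the generative model and then again as a bias plus a weighted sum of the features, and checks that the two agree. The function names and the Bernoulli parameterization are assumptions made for this example; they are not taken from the cited papers.

```python
import math
from itertools import product

def nb_log_odds(x, prior1, p1, p0):
    """Log-odds of class 1 vs. class 0, computed directly from the model.

    prior1 is P(y=1); p1[i] and p0[i] are P(x_i=1 | y=1) and P(x_i=1 | y=0).
    """
    s = math.log(prior1) - math.log(1 - prior1)
    for xi, a, b in zip(x, p1, p0):
        s += math.log(a if xi else 1 - a) - math.log(b if xi else 1 - b)
    return s

def nb_as_linear(prior1, p1, p0):
    """Rewrite the same classifier as (weights, bias) over the binary features."""
    w = [math.log(a * (1 - b)) - math.log(b * (1 - a)) for a, b in zip(p1, p0)]
    bias = math.log(prior1) - math.log(1 - prior1)
    bias += sum(math.log(1 - a) - math.log(1 - b) for a, b in zip(p1, p0))
    return w, bias

def linear_score(x, w, bias):
    """Linear decision function: predict class 1 when the score is positive."""
    return bias + sum(wi * xi for wi, xi in zip(w, x))

def max_gap(prior1, p1, p0):
    """Largest disagreement between the two computations over all inputs."""
    w, bias = nb_as_linear(prior1, p1, p0)
    return max(
        abs(nb_log_odds(x, prior1, p1, p0) - linear_score(x, w, bias))
        for x in product([0, 1], repeat=len(p1))
    )
```

Once the classifier is written as weights and a bias, the generative estimates are only one way of choosing those numbers; any other estimation method that outputs a weight vector can be used with the same representation.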

These observations, which are now considered "common knowledge" in machine learning, have influenced the SNoW learning architecture which we have developed [(Roth, 1998), (Carlson et al., 1999)] -- consisting of enhanced, regularized versions of the perceptron and winnow algorithms and of naive Bayes -- which has been downloaded by thousands of researchers around the world. More importantly, this understanding has contributed to the widespread use of discriminative approaches and a range of linear classifiers such as Boosting, SVMs, Winnow and Perceptron, all successfully applied to a broad range of natural language problems.
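For readers unfamiliar with the two mistake-driven update rules named above, the following minimal sketch contrasts the textbook Perceptron (additive) and Winnow (multiplicative) updates over sparse examples represented as sets of active feature indices. It is not SNoW's implementation and omits its enhancements and regularization; the function names, the fixed threshold `theta`, and the default `lr` and `alpha` values are assumptions made for this example.

```python
def predict(w, active, theta):
    """Predict 1 when the summed weights of the active features reach theta."""
    return 1 if sum(w[i] for i in active) >= theta else 0

def perceptron_update(w, active, label, theta, lr=1.0):
    """Additive update: on a mistake, shift active weights by +lr or -lr."""
    pred = predict(w, active, theta)
    if pred != label:
        delta = lr if label == 1 else -lr
        for i in active:
            w[i] += delta
    return pred

def winnow_update(w, active, label, theta, alpha=2.0):
    """Multiplicative update: on a mistake, scale active weights by alpha
    (promotion) or by 1/alpha (demotion)."""
    pred = predict(w, active, theta)
    if pred != label:
        factor = alpha if label == 1 else 1.0 / alpha
        for i in active:
            w[i] *= factor
    return pred
```

Both rules learn the same linear representation and differ only in how the weights are changed after an error. Winnow is conventionally started from positive weights (for example all ones), since multiplying a zero weight has no effect.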

The significance of these results goes beyond explaining the generalization and robustness properties of widely used methods. They also provide insight into possible extensions of these methods (1) to learn from more structured, knowledge-intensive observations, as part of a learning-centered approach to higher-level natural language inferences and (2) to learn more structured output, where multiple output variables represent interdependent problem components. We describe below, and in more detail in Sec. [Learning and Inference], our work in the latter direction and then, in Sec. [Knowledge Representation and Inference], our work on the former.

Learning with Structured Output:

The emphasis on discriminative methods applies not only to simple classification problems but also to machine learning work on more complex structured models. In [(Roth, 1999)] we showed that predicting the most likely state in an HMM (or other graphical probabilistic models) has a linear decision surface over features that correspond to transition probabilities and state-observation probabilities. Using dynamic programming, this observation immediately leads to a discriminative algorithm for training models with structured output. Indeed, this work has motivated our work on learning with structured output (see below) [(Punyakanok and Roth, 2001), (Roth and Yi, 2002), (Roth and Yi, 2004), (Roth and Yi, 2005), (Punyakanok et al., 2004), (Punyakanok et al., 2005)], and has influenced other models that incorporate dependencies among the variables into the learning process, and directly induce estimators to optimize an appropriate performance measure [(Collins, 2002)]. Our recent work in this line of research is described in Sections [Learning and Inference] and [Intelligent Information Access].
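To make the combination of linearity and dynamic programming concrete, the sketch below scores a state sequence as a weighted sum of transition and emission counts, decodes with Viterbi, and applies a mistake-driven update in the general style of a structured perceptron. When the weights are set to log transition and log emission probabilities, the score equals the HMM log joint probability of the sequence. This is an illustrative sketch only, not the algorithm of any of the cited papers; the feature naming scheme, the `"<s>"` start symbol, and all function names are assumptions, and `obs` is assumed to be non-empty.

```python
from collections import Counter

def features(obs, states):
    """Feature map Phi(obs, states): counts of transitions and emissions."""
    phi = Counter()
    prev = "<s>"  # hypothetical start-of-sequence symbol
    for o, s in zip(obs, states):
        phi[("T", prev, s)] += 1   # transition prev -> s
        phi[("E", s, o)] += 1      # state s emits observation o
        prev = s
    return phi

def score(w, phi):
    """Linear score w . Phi; unseen features have weight 0."""
    return sum(w.get(f, 0.0) * c for f, c in phi.items())

def viterbi(w, obs, tags):
    """Highest-scoring state sequence under the linear score (obs non-empty)."""
    # best[s] = (score of the best path ending in state s, that path)
    best = {
        s: (w.get(("T", "<s>", s), 0.0) + w.get(("E", s, obs[0]), 0.0), [s])
        for s in tags
    }
    for o in obs[1:]:
        new = {}
        for s in tags:
            p = max(tags, key=lambda q: best[q][0] + w.get(("T", q, s), 0.0))
            sc = best[p][0] + w.get(("T", p, s), 0.0) + w.get(("E", s, o), 0.0)
            new[s] = (sc, best[p][1] + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])[1]

def structured_perceptron_step(w, obs, gold, tags):
    """On a decoding mistake, move w toward Phi(gold) and away from Phi(pred)."""
    pred = viterbi(w, obs, tags)
    if pred != list(gold):
        for f, c in features(obs, gold).items():
            w[f] = w.get(f, 0.0) + c
        for f, c in features(obs, pred).items():
            w[f] = w.get(f, 0.0) - c
    return pred
```

The same decoder serves both views: with generatively estimated log-probabilities as weights it recovers the most likely HMM state sequence, and with weights trained by the update above it becomes a discriminative structured predictor.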

Software Tools and NLP Solutions:

Along with developing theoretical understanding, models and algorithms for problems in natural language processing, we have developed a number of mature tools that have been made available to the research community, have been downloaded and used by thousands of researchers, and are being used in computational linguistics classes and in industry. In addition to our basic machine learning package, SNoW, and a feature extraction language, FEX, we have made available a collection of state-of-the-art NLP tools. Besides some important general-purpose pre-processing tools, such as a sentence segmentation tool, these include one of the best part-of-speech taggers and one of the best shallow parsers available. Other tools include a named-entity recognizer, along with an applet for annotation, a question analysis and classification tool, and others. A large number of on-line demos, both for the above-mentioned packages and for others, have also been made available and are frequently used, including in NLP and Computational Linguistics classes. These include a context-sensitive speller and the best semantic parser available (see details below; the latter is a real-time version of the approach that won first place in the shared task competition of the Conference on Computational Natural Language Learning in June 2005). Tools, packages and data are all available from http://L2R.cs.uiuc.edu/~cogcomp.