Learning and Inference

Our earlier work on learning and inference has focused on the Learning to Reason (L2R) framework -- an integrated theory of learning, knowledge representation and reasoning within a unified framework which uses learning to facilitate high level reasoning tasks [(Khardon and Roth, 1997), (Roth, 1995), (Khardon and Roth, 1999), (Roth, 1996c), (Greiner et. al., 2002)]. We view our current work on learning and inference as a concrete instantiation of the earlier framework in the sense that L2R served to (1) highlight the importance of the knowledge representation as a facilitator of learning and inference processes and the lack of commitment to a specific, comprehensible one and (2) to show that it is possible to support global inference even when learning the representation isn't exact, by exploiting global properties of the task.

In the last few years, we have concentrated on Learning and Inference over structured and constrained output. Many problems in natural language processing involve assigning values to sets of variables where a complex and expressive structure can influence, or even dictate, what assignments are possible. Assigning part-of-speech tags to the words of a sentence and many phrase segmentation and parsing problems, are key examples of such tasks. Traditionally, solutions to these problems were generative, as represented by HMM models for sequence learning problems and syntactic parsing. More recently, when discriminative methods for classification became popular, structured problems were addressed this way too, by decoupling learning from the task of maintaining structured output. Only after estimators are learned for each local output variable are they used to produce global output consistent with the structural constraints.

Our first work in this direction, within a framework we called Inference with classifiers [(Roth, 2001)] was [(Punyakanok and Roth, 2001)]. We proposed and studied three models for inference with classifiers under sequential constraints (This work was a runner-up for the best paper award is NIPS, given to papers co-authored by a student). In this work we started to move away from the study of single classifiers and studied ways to use classifiers in downstream inference processes. We studied the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. Beyond developing a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies, the most promising approach we developed is an extended constraint satisfaction based formalism for combining different classifiers into a single coherent decision that respects both the classifiers' predictions and multiple constraints that their outcomes ought to satisfy. This approach led to the development of one of the best shallow parsers available [(Punyakanok and Roth, 2001), (Punyakanok and Roth, 2005), (Carreras, 2002), (Li and Roth, 2001)].

However, in many cases, the interactions between the variables of interest (e.g, variables representing the semantic role of phrases in a sentence, subject, object, temporal adjunct, etc.) cannot be expressed in a sequential manner, and thus inference to determine them cannot be done using dynamic programming techniques. This is the case when one wants to express long term dependencies among variables, number constraints (e.g., this verb cannot take more than three arguments), etc.

Over the last few years we have developed a fairly general setting to address these problems [(Roth and Yi, 2002), (Roth and Yi, 2004), (Roth and Yi, 2005), (Punyakanok et. al., 2005), (Punyakanok et. al., 2004), (Punyakanok et. al., 2005), (Punyakanok et. al., 2005a)]. We defined the problem in terms of a collection of discrete random variables representing binary relations and their arguments; we seek an optimal assignment to the variables in the presence of the constraints on the binary relations between variables and the relation types. We have formalized these problems as constrained optimization problems over variables that represent the outcome of learned classifiers, and have shown how to model Inference as Linear Programming, or Integer Linear Programming problems, yielding exact and efficient solutions.

We studied this approach theoretically and empirically and addressed multiple questions such as: (1) efficiency and modularity of the approach and issues related to integrality of solutions [(Roth and Yi, 2004)]; (2) comparing different training paradigms for learning structured output [(Punyakanok et. al., 2005)] e.g., we compared paradigms that decouple training from inference with paradigms that couple training and inference tightly, at the cost of losing efficiency and modularity [(Lafferty et. al., 2001), (Collins, 2002)]; (3) incorporating both statistical and declarative constraints within our framework [(Roth and Yi, 2005)]; and (4) the view of the framework as a generalization of sequential approaches (e.g., Viterbi), in the sense that when the constraints are less expressive, the inference reduces to simpler computation with polynomial complexity guarantees [(Roth and Yi, 2005)].

Most of this research was done in the context of developing an approach to the task of Semantic Role Labeling - a shallow semantic analysis of sentences at the level of "who did what to whom, when, where and how" [(Punyakanok et. al., 2004), (Punyakanok et. al., 2005), (Punyakanok et. al., 2005a)]. The system we developed for this task based on the aforementioned approach was the top ranked system in the shared task competition run by the Conference on Natural Language Learning in June 2005, out of 19 international groups participating. (An earlier version developed in 2004 won the second place.)

Our work on Inference as Constraint Optimization, and even the specific (Integer) Linear Programming formulation, already has had significant impact in natural language research, from information extraction work to works in natural language generation [(Marciniak and Strube, 2005)].