
Most of today's knowledge is available in textual form. This information is typically accessed in one of two ways: keyword search (via search engines) or "semantic queries" presented to a database, in the rare case that the information is available in the form of a database with a known schema. The goal of our research in this area is to apply the progress we make in our work on learning in natural language to bridge this gap and allow access to free-form text as if it were in a database with a known schema. To accomplish this goal, we attempt to recognize some level of the semantics of the free-form text and use it to develop better ways to access this information.
In the last few years we have concentrated on two problems in this direction. The first, Recognizing Entities and Relations, is the problem of identifying entities in text (e.g., identifying phrases that represent names of people, locations, organizations, etc.) as well as the relations between them -- e.g., that a sentence indicates that A is the AUTHOR of B or that C LIVES in D. The second problem, Identifying and Tracking Entities Across Documents, builds on a relatively robust identification of mentions of entities in documents and focuses on the ability to identify the entity itself, within single documents and across documents. Namely, we would like to be able to determine that the strings JFK, President Kennedy, John Kennedy, Kennedy (and other variations) all refer to the same person and, at the same time, determine from the context that John Kennedy sometimes refers to a different person (the baseball player). A third effort under the general Information Access line of research, on Question Answering, will be discussed in Sec. [Knowledge Representation and Inference] below.
Our work on Recognizing Entities and Relations is done within the Integer Linear Programming inference approach described above. The key technical problem we wanted to address when studying this information extraction approach is that of moving beyond the pipeline architecture. A pipeline approach is the typical strategy employed in solving complex natural language problems -- separating a task into several stages and solving them sequentially, where the features in stage i typically reflect a commitment to predictions made in stage i-1. For example, a named entity recognizer may be trained in advance, using some training data, and then given as a black box to a relation classifier, to be used as a feature extraction tool. This is often done at multiple levels, starting with tokenization, segmentation, part-of-speech tagging, etc. Clearly, this strategy disregards interactions across layers and lets errors made in an early stage propagate to all later stages. Our approach, therefore, aims at developing a way in which multiple stages in this pipeline can interact, reaching a simultaneous final global decision on all the variables of interest. Specifically, we studied the problem of simultaneously recognizing named entities and the relations between them, and have shown that the inference approach allows us to support bidirectional interaction between these stages, via the integer linear programming paradigm we developed [(Roth and Yi, 2002), (Roth and Yi, 2004)].
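The contrast between pipeline and joint decisions can be illustrated with a minimal sketch. It is not the formulation of (Roth and Yi, 2004): the label sets, the type constraints in RELATION_SIGNATURE, and all scores are hypothetical, and exhaustive search over label assignments stands in for an integer linear programming solver, which would optimize the same kind of constrained objective at realistic scale.

```python
from itertools import product

# Hypothetical label sets for two entity mentions and the relation between them.
ENTITY_LABELS = ["PER", "LOC", "ORG"]
RELATION_LABELS = ["LIVES_IN", "WORKS_FOR", "NONE"]

# Hypothetical type constraints: relation -> (type of argument 1, type of argument 2).
RELATION_SIGNATURE = {
    "LIVES_IN": ("PER", "LOC"),
    "WORKS_FOR": ("PER", "ORG"),
}


def consistent(e1, e2, rel):
    """A relation label is allowed only if the entity types match its signature."""
    return rel == "NONE" or RELATION_SIGNATURE[rel] == (e1, e2)


def pipeline_decode(scores_e1, scores_e2, scores_rel):
    """Commit to the best entity labels first, then pick a compatible relation."""
    e1 = max(scores_e1, key=scores_e1.get)
    e2 = max(scores_e2, key=scores_e2.get)
    allowed = [r for r in RELATION_LABELS if consistent(e1, e2, r)]
    rel = max(allowed, key=scores_rel.get)
    return e1, e2, rel


def joint_decode(scores_e1, scores_e2, scores_rel):
    """Pick the consistent assignment with the highest total score.

    Exhaustive search is used here for clarity; it replaces the ILP solver.
    """
    best, best_score = None, float("-inf")
    for e1, e2, rel in product(ENTITY_LABELS, ENTITY_LABELS, RELATION_LABELS):
        if not consistent(e1, e2, rel):
            continue
        total = scores_e1[e1] + scores_e2[e2] + scores_rel[rel]
        if total > best_score:
            best, best_score = (e1, e2, rel), total
    return best


# Made-up classifier scores: the second entity is slightly more likely ORG than
# LOC in isolation, but the relation classifier strongly prefers LIVES_IN.
SCORES_E1 = {"PER": 0.90, "LOC": 0.05, "ORG": 0.05}
SCORES_E2 = {"PER": 0.05, "LOC": 0.45, "ORG": 0.50}
SCORES_REL = {"LIVES_IN": 0.80, "WORKS_FOR": 0.05, "NONE": 0.15}

print(pipeline_decode(SCORES_E1, SCORES_E2, SCORES_REL))
print(joint_decode(SCORES_E1, SCORES_E2, SCORES_REL))
```

With these made-up scores, the pipeline commits to ORG for the second entity and is then forced to give up the LIVES_IN relation, whereas the joint decision lets the relation evidence revise the entity label to LOC. This is the bidirectional interaction between stages in miniature.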
The second focus of our work in intelligent information access in the last few years was on the problem of Identifying and Tracking Entities Across Documents. A given entity -- representing a person, a location or an organization -- may be mentioned in text in multiple, ambiguous ways. Supporting concept-based (rather than "string-based") access to information requires resolving conceptual ambiguity and, in particular, identifying whether different mentions of real-world entities, within and across documents, actually represent the same concept.
We developed several machine-learning-based approaches to this problem [(Li et al., 2004), (Li et al., 2004a), (Li et al., 2005), (Li and Roth, 2005)] that differ in the amount of supervision they require, in the efficiency of training the models, and (as it turns out) also in their robustness to parameter initialization and tuning. Along with addressing this problem, we have also begun to develop a new approach to training clustering functions, as described below.
Our first approach to the problem is a global generative model [(Li et al., 2004a)], at the heart of which is a view on how documents are generated and how names (of different entity types) are "sprinkled" into them. In its most general form, our model assumes: (1) a joint distribution over entities (e.g., a document that mentions "President Kennedy" is more likely to mention "Oswald" or "White House" than "Roger Clemens"), (2) an "author" model, which assumes that at least one mention of an entity in a document is easily identifiable, and then generates other mentions via (3) an appearance model, governing how mentions are transformed from the "representative" mention. We then developed a way to train this model in a discriminative manner [(Li et al., 2004)], which requires training a pairwise local classifier in a supervised way to determine whether two given mentions represent the same real-world entity. This is followed, potentially, by a global clustering algorithm that uses the classifier as its similarity metric. Following lessons from these two works, we have developed a new approach that is based on a new view of clustering [(Li and Roth, 2005)]. Clustering is usually treated as an optimization procedure that partitions a set of elements to optimize some criterion, based on a fixed distance metric defined between the elements. The inherent noise in the data used in the entity identification problem has motivated us to view clustering as a learning task rather than only an optimization task; we proposed a way to train a distance metric that is appropriate for the chosen clustering algorithm in the context of the given task.
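The discriminative variant, a pairwise classifier followed by a global clustering step, can be sketched as follows. This is an illustration under loud assumptions rather than the model of (Li et al., 2004): pairwise_score is a hypothetical hand-written stand-in (token overlap) for a learned classifier, the threshold is arbitrary, and the clustering step is plain single-link merging via union-find.

```python
from itertools import combinations


def pairwise_score(a, b):
    """Hypothetical stand-in for a learned pairwise classifier.

    Returns the Jaccard overlap of the lowercased tokens of two mentions.
    A trained classifier would also use context and richer features.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cluster_mentions(mentions, threshold=0.3):
    """Single-link clustering: merge any two mentions scoring above threshold."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if pairwise_score(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())


MENTIONS = ["President Kennedy", "John Kennedy", "Kennedy", "Roger Clemens", "Clemens"]
print(cluster_mentions(MENTIONS))
```

Because the stand-in score looks only at strings, it would also merge a mention of John Kennedy the baseball player into the president's cluster. The similarity function is also fixed independently of the clustering algorithm here, which is the limitation that motivates training the distance metric for the chosen clustering algorithm.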
Our work on this problem is summarized in an invited paper in AI Magazine [(Li et al., 2005)] and has resulted in significant ongoing collaboration with database researchers, with the hope of incorporating these ideas in the context of semantic integration of databases.
An additional effort within the Intelligent Information Access line of research, on Question Answering, has focused both on our machine learning work on the analysis and classification of questions [(Li and Roth, 2002), (Li et al., 2004), (Li and Roth, 2005)] and on textual entailment [(Braz et al., 2005), (Braz et al., 2005a)], and is discussed below.