NLP Based Search

Overview:

With the abundance of data available online, search has become the primary operation for accessing unstructured information. Current search techniques excel at keyword-based search, where the user's intent is described with a few words that, the user hopes, are also present in the documents of interest. However, it is still a challenge to retrieve relevant documents that do not use the same words as the query. This vocabulary mismatch is significant in many areas, such as Relation Search, Searching for Entailment, Question Answering, and Exploratory Search. Most of these are areas in which Natural Language Processing can both influence and utilize Search technologies. Our research looks at techniques and challenges in joining NLP and IR research and in extending the synergy between these two areas.

Details:

Keyword-based search techniques often fail to capture the information users need, so essential information is not retrieved even when it is, in principle, available. For example, traditional keyword-based protocols perform relatively well for queries that list entities or concepts (typically nouns), since these are likely to appear verbatim in the target documents. However, they perform badly when the search is for actions or relations, which are typically represented using verbs. This is due to the variability in expressing actions and relations: the same meaning can be expressed in multiple ways. When a user queries for an action or a relation, such as "what does Hyundai produce" or "how to treat Cancer", they are looking for all documents that express that relation, however it happens to be worded.

The goal of this project is to develop natural language processing capabilities, along with improved search protocols, that make search more effective. Specifically, we would like to support the search for relations and actions mentioned in text, as well as search via entailment: given a query (e.g., "Yahoo acquired Overture"), we want to retrieve all the documents that contain this piece of information, even when the fact is expressed in a vastly different fashion (e.g., "The leading Internet portal, Sunnyvale-based Yahoo!, has taken over the pioneer of search technology, Overture Technologies."). This approach is intended to support semantic-based search and true content-based access to information. For example, when searching for "countries visited by the President of the United States over the past year", the documents of interest may not contain the words "country", "President of the United States" or "past year", but the meaning and expected results are clear. Such queries can only be satisfied when the search system plays an active role in reformulating the query and in semantically analyzing the retrieved candidate text.
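The vocabulary mismatch in the Yahoo and Overture example can be illustrated with a small Python sketch. It is not the project's method: the paraphrase table PARAPHRASES and the functions keyword_match and expanded_match are hypothetical names introduced here, and a hand-written lookup table is only a stand-in for the query reformulation and semantic analysis described above. The sketch shows that a strict keyword match rejects the paraphrased sentence, while a match that also accepts listed paraphrases of each query term retrieves it.

```python
import re

QUERY = "Yahoo acquired Overture"
DOCUMENT = ("The leading Internet portal, Sunnyvale-based Yahoo!, has taken over "
            "the pioneer of search technology, Overture Technologies.")

# Hypothetical, hand-written paraphrase table (an assumption for illustration).
PARAPHRASES = {
    "acquired": ["bought", "purchased", "taken over"],
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_match(query, document):
    """Strict keyword search: every query term must occur in the document."""
    doc_terms = set(tokenize(document))
    return all(term in doc_terms for term in tokenize(query))

def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def expanded_match(query, document):
    """Each query term matches if it, or one of its listed paraphrases, occurs."""
    doc_tokens = tokenize(document)
    for term in tokenize(query):
        variants = [term] + PARAPHRASES.get(term, [])
        if not any(contains_phrase(doc_tokens, tokenize(v)) for v in variants):
            return False
    return True

if __name__ == "__main__":
    print("strict keyword match:", keyword_match(QUERY, DOCUMENT))
    print("match with paraphrases:", expanded_match(QUERY, DOCUMENT))
```

A table like this cannot tell who acquired whom, and it covers only the paraphrases someone thought to list, which is one reason the text calls for semantic analysis of the candidate documents rather than term substitution alone.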

Addressing such user information needs clearly depends on improving our natural language understanding capabilities. This project attempts to place advances in natural language understanding in the context of information retrieval, with the goal of expanding the scope of retrieval to support semantic queries and meaning-based retrieval of documents.

Collaborations:

Relevant Publications: