Most of the information available today is in free-form text. Current technologies (Google, Yahoo) allow us to access text only via keyword search.

We would like to enable content-based access to information. Examples include:

  • Topical and Functional categorization of documents: Find documents that deal with stem cell research, but only Call for Proposals.
  • Semantic categorization: Find documents about Columbus (the City, not the Person).
  • Retrieval of concepts and entities rather than strings in text: Find documents about JFK, the president; include those documents that mention him as "John F. Kennedy", "John Kennedy", "Congressman Kennedy" or any other possible variant; but not those that mention the baseball player John Kennedy, nor any of JFK's relatives.
  • Extraction of information based on semantic categorization: Find a list of all companies that participated in mergers in the last year. List all professors in Illinois that do research in Machine Learning.

Achieving these tasks requires that we develop programs that can, at some level, understand natural language. The collection of demos below shows some of the technologies we are developing in order to address these and related questions. Some address direct Information Extraction tasks, and some exhibit fundamental natural language technologies that we are developing in order to support better access to information. These demonstrations build on our research in Machine Learning, the fundamental research area that allows us to write programs that learn from their experience, and thus support natural language capabilities closer to those of humans.

Comma Resolution  

Decomposes sentences to make implicit relations (expressed using comma structures) explicit.

 

Context-Sensitive Verb Paraphrasing  

Lexical paraphrasing (replacing one word with another) is an inherently context-sensitive problem because a word's meaning depends on context. Most paraphrasing work finds patterns and templates that can replace other patterns or templates in some context, but we are attempting to make decisions for a specific context. We have developed a global classifier that takes a verb v and its context (the sentence that v appears in), along with a candidate verb u, and determines whether u can replace v in the given sentence while maintaining the original meaning. The classifier makes its decision by finding other contexts that both v and u appear in, and seeing how similar these are to the given context of v. We train the classifier without supervision by utilizing a large set of local classifiers, each trained to locate paraphrases of a single word. These local classifiers then generate labeled data for the global classifier.
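
The decision step described above can be illustrated with a toy sketch. It is not the classifier used in the demo: the bag-of-words cosine similarity, the fixed threshold, and the names can_replace and shared_contexts are all assumptions made for illustration, and the real system learns its decision rule rather than using a hand-set cutoff.

```python
import math
from collections import Counter

def bag_of_words(sentence, exclude=()):
    """Lowercased token counts, skipping the verbs under consideration."""
    skip = {w.lower() for w in exclude}
    return Counter(t for t in sentence.lower().split() if t not in skip)

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def can_replace(v, u, sentence, shared_contexts, threshold=0.3):
    """Guess whether verb u can replace verb v in the given sentence.

    shared_contexts: sentences in which both v and u have been observed
    (hypothetical input; gathering them is outside this sketch).
    threshold: hand-picked cutoff, an assumption of this sketch.
    """
    if not shared_contexts:
        return False
    target = bag_of_words(sentence, exclude=(v, u))
    best = max(
        cosine(target, bag_of_words(context, exclude=(v, u)))
        for context in shared_contexts
    )
    return best >= threshold

# Contexts in which both "launched" and "released" were seen (made-up data).
shared = [
    "the company ___ a new phone last year",
    "they ___ the product in europe",
]
print(can_replace("launched", "released",
                  "the company launched a new tablet last year", shared))
print(can_replace("launched", "released",
                  "she launched into a long speech", shared))
```

The first sentence closely resembles a context shared by both verbs, so the sketch accepts the replacement; the second does not, so it is rejected.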

Context-Sensitive Spelling Correction  

Spelling errors that result in valid words cannot be caught by a standard dictionary spell checker, and account for some 25% of all spelling errors.

Examples include: "please feel this form"; "I'd like a peace of cake" etc. Context-sensitive spelling correction has been shown to be extremely effective in learning to correct these errors, performing with an accuracy level greater than 95%. This demo allows users to input text as if they were using their own editor. The program will then suggest corrections for any errors it finds.
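
As a rough illustration of the idea, the sketch below chooses among commonly confused words by looking at neighbouring words. It is not the demo's method: the confusion sets, the co-occurrence counts, and the function name suggest are all hypothetical, and a real system would learn its statistics from a large corpus.

```python
# Hypothetical sets of words that are easily confused with one another.
CONFUSION_SETS = [{"peace", "piece"}, {"feel", "fill"}]

# Made-up counts of (candidate word, neighbouring word) co-occurrences.
CONTEXT_COUNTS = {
    ("piece", "of"): 50, ("piece", "cake"): 20,
    ("peace", "of"): 5, ("peace", "treaty"): 30,
    ("fill", "form"): 40, ("fill", "this"): 10,
    ("feel", "good"): 40, ("feel", "this"): 5,
}

def score(candidate, context):
    """Sum the co-occurrence counts of a candidate with its context words."""
    return sum(CONTEXT_COUNTS.get((candidate, w), 0) for w in context)

def suggest(tokens, window=2):
    """Return (position, original, suggestion) for likely real-word errors."""
    suggestions = []
    for i, token in enumerate(tokens):
        word = token.lower()
        for confusion_set in CONFUSION_SETS:
            if word not in confusion_set:
                continue
            # Up to `window` words on each side of the current word.
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            context = [t.lower() for t in neighbours]
            best = max(sorted(confusion_set), key=lambda c: score(c, context))
            # Only suggest a change when the alternative fits strictly better.
            if best != word and score(best, context) > score(word, context):
                suggestions.append((i, token, best))
    return suggestions

print(suggest("I'd like a peace of cake".split()))
print(suggest("please feel this form".split()))
```

Both example sentences from the text above are flagged, while a sentence such as "I feel good" is left alone because "feel" fits its context better than "fill".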

 

Coreference Resolution  

Dependency Parsing  

Dependency trees provide a syntactic representation that encodes functional relationships between words. They provide valuable information for analyzing sentences. We develop a dependency parsing framework that makes decisions in a pipeline model based on a bottom-up parsing algorithm.

 

Information Extraction  

Useful and important information can be extracted from large collections of unorganized documents, such as news articles and emails, and stored in databases. Then, it is relatively easy to get answers to the type of structured queries that ordinary search engines do not support. We demonstrate the technology by showing its ability to extract specific phrases of interest in two types of documents: seminar announcements and job postings.

Multilingual Named Entity Discovery  

A basic sub-task of many natural language processing problems is the identification of words or phrases of specific types (e.g. locations, people, and organizations) in text, commonly called Named Entity Recognition (NER). Most successful approaches to NER require large amounts of text with Named Entities tagged by a human annotator. However, in many (especially less common) languages such resources do not exist. We demonstrate a method to automatically generate such resources from multilingual corpora (such as multilingual news streams).

 

Name Identification and Tracing  

Understanding natural language and supporting intelligent access to textual information require identifying whether different mentions of a name, within and across documents, represent the same entity. We demonstrate a browsing tool that incorporates some of our newly developed Machine Learning based technologies in this area. It enables users to trace different mentions of the same entity, presented in different textual forms, across documents.

Named Entity Recognition  

Named entity recognition refers to the task of identifying which phrases in text represent names of People, which represent names of Locations, Organizations, etc. This is a fundamental task in information extraction since it allows some level of abstraction that is required to support the level of interaction people are comfortable with. This is a context-sensitive task, as is shown in: "Jakob Washington left for Denver to meet with John Denver who works for Washington Mutual."

 

Number Quantization  

Number Quantization refers to the task of recognizing the values of numbers written in text. This tool recognizes numerical entities whether they are written as words or numerals, and can support comparison of commensurate numerical types (e.g. dates).
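
A minimal sketch of the idea follows, covering only English integers written as numerals or as words. The function name number_value and the word tables are assumptions made for illustration; the actual tool also handles other numerical types such as dates, which this sketch does not attempt.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def number_value(text):
    """Return the integer value of a number written as numerals or words."""
    # Numerals such as "1,250" or "42" are handled first.
    try:
        return int(text.replace(",", ""))
    except ValueError:
        pass
    total, current = 0, 0
    for token in text.lower().replace("-", " ").split():
        if token == "and":
            continue
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"unrecognized number word: {token!r}")
    return total + current

# Values written in different forms become directly comparable.
print(number_value("one thousand two hundred and fifty") == number_value("1,250"))
print(number_value("two million") > number_value("1,999,999"))
```

Once both forms are mapped to the same value type, comparison reduces to ordinary numeric comparison.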

Part of Speech Tagging  

Assigning each word in a sentence the part of speech (POS) that it assumes in that sentence is important because identifying POS is one of the early stages in many natural language processes, such as speech recognition, translation, and information retrieval and extraction. See how it's done!

 

Preprocessor  

The preprocessor annotates raw text with Part-of-Speech, Shallow Parse and Named Entity information, writing out the results in column format. Phrase-level annotations are in BIO format.
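
To make the output format concrete, here is a small sketch of how phrase spans map to BIO tags (B- marks the first token of a phrase, I- a token inside it, O a token outside any phrase) and how the annotations can be written one token per line. The column order, the tab separator, the label names, and the function names are assumptions for illustration, not the preprocessor's documented layout.

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end, label) spans, end exclusive, into BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

def to_columns(tokens, pos_tags, chunk_spans, entity_spans):
    """One line per token: word, POS tag, chunk tag, entity tag."""
    chunk_tags = spans_to_bio(len(tokens), chunk_spans)
    entity_tags = spans_to_bio(len(tokens), entity_spans)
    rows = zip(tokens, pos_tags, chunk_tags, entity_tags)
    return "\n".join("\t".join(row) for row in rows)

# Hand-written annotations for a short sentence (illustrative only).
tokens = ["John", "Denver", "works", "in", "Denver"]
pos_tags = ["NNP", "NNP", "VBZ", "IN", "NNP"]
chunk_spans = [(0, 2, "NP"), (2, 3, "VP"), (3, 4, "PP"), (4, 5, "NP")]
entity_spans = [(0, 2, "PER"), (4, 5, "LOC")]

print(to_columns(tokens, pos_tags, chunk_spans, entity_spans))
```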

Question Classification  

Responding correctly to a free-form question requires the computer to have an awareness of what the question is about, and of the constraints that the question imposes on a possible answer. For instance, the answer to a question like "Who is the president of France?" needs to be the name of a person. Accurately classifying potential answers sets the stage for later selecting the correct answer from among several candidates. See how it's done.
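
The kind of constraint involved can be shown with a deliberately simple rule-based sketch that maps the opening words of a question to an expected answer type. The rules, the type labels, and the function name are assumptions made for illustration; they are not the method used in the demo, and hand-written rules like these cover only the easiest cases.

```python
# Hypothetical (question prefix, expected answer type) rules, checked in order.
RULES = [
    (("how", "many"), "NUMBER"),
    (("how", "much"), "NUMBER"),
    (("who",), "PERSON"),
    (("where",), "LOCATION"),
    (("when",), "DATE"),
]

def expected_answer_type(question):
    """Return a coarse answer type based on the question's first words."""
    words = question.lower().rstrip("?").split()
    for prefix, label in RULES:
        if tuple(words[:len(prefix)]) == prefix:
            return label
    return "OTHER"

print(expected_answer_type("Who is the president of France?"))
print(expected_answer_type("How many states are in the US?"))
```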

 

Semantic Role Labeling  

Beyond the syntactic analysis of natural language sentences is the extraction of their semantic information. Semantic role labeling is one such task: it identifies the verb and argument structure in natural language sentences, and is an important step toward natural language understanding.

Shallow Parsing  

Enabling a machine to respond to natural language input demands that the machine be equipped with the capacity to identify syntactic phrases in sentences. It is virtually impossible to manually write a comprehensive set of rules that accurately defines the appropriate solution to every task of this nature. However, the availability of annotated corpora (collections of text) and robust machine learning techniques makes it possible to employ machines to learn this task from training examples.

 

Text Analysis  

This analysis tool annotates raw text with various kinds of syntactic and semantic information, including syntactic parse trees, named entities, semantic roles and nominal relations.

Textual Entailment  

It is not hard for a human to know that the sentence "Joe Smith offers a generous gift to the university." also means "Joe Smith contributes to academia." For a machine, however, it is extremely hard. Being able to tackle this task will be an important step toward natural language understanding. This demonstration presents a system that aims to tackle this problem.

 

Word Similarity  

A word similarity metric using WordNet and other resources.
