Most of the information available today is in free-form text.
Current technologies (Google, Yahoo) allow us to access text only
via keyword search.
We would like to enable content-based access to information. Examples include:
Achieving these tasks requires that we develop programs that can,
at some level, understand natural language. The collection of
demos below shows some of the technologies we are developing in
order to address these and related questions. Some address direct
Information Extraction tasks, and some exhibit fundamental natural
language technologies that we are developing in order to support
better access to information. These demonstrations build on our
research in Machine Learning - the fundamental research area that
allows us to write programs that learn from their experience, and
thus support natural language capabilities closer to those of humans.
|
Decomposes sentences to make implicit relations (expressed using comma structures) explicit.
| |
Lexical paraphrasing (replacing one word with another) is an inherently
context-sensitive problem because a word's meaning depends on context.
Most paraphrasing work finds patterns and templates that can replace
other patterns or templates in some context, but we are attempting to make
decisions for a specific context. We have developed a global classifier
that takes a verb v and its context (the sentence that v appears in), along with
a candidate verb u, and determines whether u can replace v in
the given sentence while maintaining the original meaning.
The classifier makes its decision by finding other contexts that both v and u
appear in, and seeing how similar these are to the given context of v. We train
the classifier without supervision by utilizing a large set of local classifiers
each trained to locate paraphrases of a single word. These local classifiers then
generate labeled data for the global classifier.
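
The decision step described above (compare the given context of v with contexts in which both v and u occur) can be illustrated with a minimal sketch. This is not the trained global classifier: the hand-written corpus, the bag-of-words cosine similarity, the 0.5 threshold, and the names `context_of`, `cosine`, and `can_replace` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Tiny hand-written corpus (an assumption; a real system would use a large one).
CORPUS = [
    "the company purchased the building",
    "the company bought the building",
    "she bought a new car",
    "she purchased a new car",
    "nobody bought his excuse",
]

def context_of(sentence, verb):
    """Bag of words of the sentence with the verb itself removed."""
    return Counter(t for t in sentence.lower().split() if t != verb)

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def can_replace(v, u, sentence, corpus, threshold=0.5):
    """Guess whether u can replace v in `sentence` (illustrative heuristic)."""
    target = context_of(sentence, v)
    v_contexts = [context_of(s, v) for s in corpus if v in s.lower().split()]
    u_contexts = [context_of(s, u) for s in corpus if u in s.lower().split()]
    # Shared contexts: contexts of u that closely match some context of v.
    shared = [cu for cu in u_contexts
              if any(cosine(cu, cv) >= threshold for cv in v_contexts)]
    # Accept the replacement if the given context resembles a shared one.
    best = max((cosine(target, c) for c in shared), default=0.0)
    return best >= threshold

print(can_replace("bought", "purchased", "he bought a new car yesterday", CORPUS))
print(can_replace("bought", "purchased", "nobody bought his story", CORPUS))
```

In this toy setting the first sentence resembles a context seen with both verbs, while the figurative use in the second does not, so only the first replacement is accepted.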
|
|
Spelling errors that result in valid words cannot be caught by a
standard dictionary-based spell checker, and account for some 25% of
all spelling errors.
Examples include: "please feel this form"; "I'd like a peace of
cake", etc. Context-sensitive spelling correction has been shown to
be extremely effective in learning to correct these errors,
performing with an accuracy level greater than 95%. This demo
allows users to input text as if they were using their own editor.
The program will then suggest corrections for any errors it finds.
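
To make the idea concrete, here is a minimal sketch of context-sensitive correction over confusion sets. It is a simple co-occurrence counter standing in for the learned classifier, not the demo's actual method; the confusion sets, the tiny training corpus, the window size, and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical confusion sets: valid words commonly mistaken for one another.
CONFUSION_SETS = [{"peace", "piece"}, {"feel", "fill"}]

# Tiny made-up training corpus; a real system would learn from far more text.
TRAINING_SENTENCES = [
    "please fill this form",
    "fill out the form below",
    "i feel fine today",
    "i feel happy",
    "a piece of cake",
    "she ate a piece of pie",
    "we hope for peace on earth",
    "war and peace",
]

def context_words(tokens, index, window=2):
    """Return the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def train(sentences):
    """Count how often each context word co-occurs with each confusable word."""
    counts = {word: Counter() for group in CONFUSION_SETS for word in group}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in counts:
                counts[token].update(context_words(tokens, i))
    return counts

def suggest(sentence, counts):
    """Return (position, original, suggestion) for each likely misuse."""
    tokens = sentence.lower().split()
    suggestions = []
    for i, token in enumerate(tokens):
        for group in CONFUSION_SETS:
            if token not in group:
                continue
            context = context_words(tokens, i)
            # Score each candidate by how often it was seen with these context words.
            scores = {w: sum(counts[w][c] for c in context) for w in group}
            best = max(sorted(group), key=lambda w: scores[w])
            if best != token and scores[best] > scores[token]:
                suggestions.append((i, token, best))
    return suggestions

COUNTS = train(TRAINING_SENTENCES)
print(suggest("please feel this form", COUNTS))     # [(1, 'feel', 'fill')]
print(suggest("I'd like a peace of cake", COUNTS))  # [(3, 'peace', 'piece')]
```

Both example errors are valid dictionary words, so only the surrounding words reveal that a different member of the confusion set fits better.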
| |
|
|
Dependency trees provide a syntactic representation that encodes functional relationships between words, which is valuable information for analyzing sentences. We develop a framework for dependency parsing that makes its decisions in a pipeline model built on a bottom-up parsing algorithm.
| |
Useful and important information can be extracted from large collections of unorganized documents, such as news articles and emails, and stored in databases. It is then relatively easy to answer the kind of structured queries that ordinary search engines do not support. We demonstrate the technology by showing its ability to extract specific phrases of interest in two types of documents: seminar announcements and job postings.
|
|
A basic sub-task of many natural language processing problems is the identification
of words or phrases of specific types (e.g. locations, people, and organizations)
in text, and is commonly called Named Entity Recognition (NER). Most successful
approaches to NER require large amounts of text with Named Entities tagged by a human
annotator. However, in many (especially less common) languages such resources do not
exist. We demonstrate a method to automatically generate such resources from multilingual
corpora (such as multilingual news streams).
| |
Understanding natural language and supporting intelligent access to textual information require identifying whether different mentions of a name, within and across documents, represent the same entity. We demonstrate a browsing tool that incorporates some of our newly developed Machine Learning-based technologies in this area. It enables users to trace different mentions of the same entity, presented in different textual forms, across documents.
|
|
Named entity recognition refers to the task of identifying which
phrases in text represent names of People, which represent names of Locations, Organizations, etc. This is a fundamental task in
information extraction since it allows some level of abstraction
that is required to support the level of interaction people are
comfortable with. This is a context-sensitive task, as is shown
in: "Jakob Washington left for Denver to meet with John Denver, who works for Washington Mutual."
| |
Number Quantization refers to the task of recognizing the values of numbers written in text. This tool recognizes numerical entities whether they are written as words or numerals, and can support comparison of commensurate numerical types (e.g. dates).
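
As a rough illustration of recognizing a value whether it is written in words or numerals, the sketch below normalizes both forms to an integer so that they can be compared. It handles only English cardinal numbers, and the vocabulary tables and the names `words_to_number` and `parse_quantity` are assumptions for illustration; dates and other numerical types are not covered, and this is not the tool's implementation.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_number(text):
    """Convert an English cardinal such as 'twenty-one' to an int."""
    total = 0    # value of completed thousand/million groups
    current = 0  # value of the group being read
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100  # bare "hundred" means 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current

def parse_quantity(text):
    """Return the integer value of text written as numerals or as words."""
    cleaned = text.replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        return words_to_number(cleaned)

# Different surface forms of the same value compare as equal.
print(parse_quantity("1,250") == parse_quantity("one thousand two hundred fifty"))
print(parse_quantity("twenty-one") < parse_quantity("30"))
```

Mapping every mention to a common numeric type is what makes the comparison step straightforward once recognition is done.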
|
|
Assigning each word in a sentence the part of speech (POS) that it assumes in that sentence is important because POS identification is one of the early stages of many natural language processes, such as speech recognition, translation, and information retrieval and extraction. See how it's done!
| |
The preprocessor annotates raw text with Part-of-Speech, Shallow Parse and Named Entity information, writing out the results in column format. Phrase-level annotations are in BIO format.
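
For readers unfamiliar with BIO tagging, the following sketch shows how phrase-level spans can be turned into per-token tags and written one token per line in columns. The column order, the tab separator, the hand-assigned annotations, and the helper names `to_bio` and `to_columns` are assumptions for illustration; the preprocessor's exact output layout may differ.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)          # O = outside any phrase
    for start, end, label in spans:
        tags[start] = "B-" + label      # B = first token of a phrase
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # I = token inside the same phrase
    return tags

def to_columns(tokens, *tag_layers):
    """Write one token per line, with one tab-separated column per layer."""
    return "\n".join("\t".join(row) for row in zip(tokens, *tag_layers))

tokens = ["John", "Denver", "works", "for", "Washington", "Mutual", "."]
# Hand-assigned annotations, standing in for the output of real taggers.
pos = ["NNP", "NNP", "VBZ", "IN", "NNP", "NNP", "."]
chunks = to_bio(tokens, [(0, 2, "NP"), (2, 3, "VP"), (3, 4, "PP"), (4, 6, "NP")])
entities = to_bio(tokens, [(0, 2, "PER"), (4, 6, "ORG")])

print(to_columns(tokens, pos, chunks, entities))
```

Because every token carries its own tag, multi-word phrases such as "Washington Mutual" stay recoverable from a flat, line-oriented file.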
|
|
Responding correctly to a free-form question requires the computer to be aware of what the question is about and of the constraints that the question imposes on a possible answer. For instance, the answer to a question like "Who is the president of France?" needs to be the name of a person. Accurately classifying potential answers sets the stage for later selecting the correct answer from among several candidates. See how it's done.
| |
Beyond the syntactic analysis of natural language sentences lies the extraction of their semantic information. Semantic role labeling is one such task: it identifies the verb and argument structure in natural language sentences, and is an important step toward natural language understanding.
|
|
Enabling a machine to respond to natural language input demands that the machine be equipped with the capacity to identify syntactic phrases in sentences. It is virtually impossible to manually write a comprehensive set of rules that accurately defines the appropriate solution to every task of this nature. However, the availability of annotated corpora (collections of text) and robust machine learning techniques makes it possible to employ machines to learn this task from training examples.
| |
This analysis tool annotates raw text with several kinds of syntactic and semantic information, including syntactic parse trees, named entities, semantic roles and nominal relations.
|
|
It is not hard for a human to know that the sentence "Joe Smith offers a generous gift to the university." also means "Joe Smith contributes to academia." But it is extremely hard for a machine. Being able to tackle this task will be an important step toward natural language understanding. This demonstration presents a system that aims to tackle this problem.
| |
A word similarity metric using WordNet and other resources.
|