Reflex:
Named Entity Recognition and Transliteration for 50 Languages

Overview:

The research on methods for Named Entity Recognition (NER) is voluminous but has tended to focus on the problem in widely used languages such as English, other Western European languages, Arabic, and Asian languages such as Chinese, Japanese and Korean.

The purpose of this project is to fill the gap by providing resources and tools that will allow one to rapidly build named entity detectors for a collection of 50 languages, nearly all with speaker populations numbering in the millions, in which we have expertise. This includes nearly all languages that fall into the category of "Less Commonly Taught Languages". We will focus on the recognition of named entities falling into the categories of PERSON, ORGANIZATION and LOCATION.

Details:

The research on methods for Named Entity Recognition (NER) is voluminous but has tended to focus on the problem in widely used languages such as English, other Western European languages, Arabic, and Asian languages such as Chinese, Japanese and Korean.

The purpose of this project is to fill this gap by providing resources and tools that will allow one to rapidly build named entity detectors for a collection of 50 languages in which we have expertise, nearly all with speaker populations numbering in the millions. This includes nearly all languages that fall into the category of "Less Commonly Taught Languages" (LCTLs). We will focus on the recognition of named entities falling into the categories of PERSON, ORGANIZATION and LOCATION.

Why do we believe that it is feasible to provide resources for so many languages? Research in the last few years has shown that machine learning approaches can learn to recognize and classify named entities reliably. Recognizing named entities requires:

  1. Recognizing the boundaries of entity phrases, and
  2. Classifying the target phrases into one of several entity types (or a miscellaneous type).

The key to our technical approach is the observation that semi-supervised learning methods can be used for both stages. Given as input a lexicon of entities in a given language and language-specific features for what constitutes an entity, we will develop methods to bootstrap a named entity detector for new languages. For example, as demonstrated in (Collins and Singer, 1999), the second stage of this process can be solved reliably with a small number of initial examples.

We propose a two-year project. The first part of the project, accomplished in Year 1, will involve collecting plausible initial rule sets for the 50 languages listed below. These will include the widely spoken languages that have already received significant attention for NER (which we will include as a litmus test for our methods), as well as many other languages to which no attention has been given. For each language, we will collect, from dictionaries, grammars or online resources and with native speaker expertise, the following kinds of resources:

  1. Unambiguous personal titles (e.g. English Mr.)
  2. Unambiguous organization titles (e.g. Corporation, Incorporated)
  3. Unambiguous place names.
  4. Language-particular rules for titles that determine on which side of the title the name occurs. (E.g. Mr. occurs on the left of the name in English, but xiansheng occurs on the right of the name in Mandarin.)
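To make the four kinds of resources concrete, here is a minimal sketch of how such seed rules might be stored and applied. The TitleRule class, the SEED_RULES list and the apply_seed_rules function are hypothetical names introduced for illustration, not part of the project's tools. The sketch also assumes that the name is the single token adjacent to the title, which sidesteps the phrase boundary problem entirely, and that English organization titles such as Corporation follow the name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TitleRule:
    title: str        # an unambiguous title token
    entity_type: str  # PERSON or ORGANIZATION
    title_side: str   # "left" if the title precedes the name, "right" if it follows


# Hypothetical seed rules based on the examples in the list above.
SEED_RULES = [
    TitleRule("Mr.", "PERSON", "left"),              # English: Mr. Smith
    TitleRule("xiansheng", "PERSON", "right"),       # Mandarin: Wang xiansheng
    TitleRule("Corporation", "ORGANIZATION", "right"),   # assumed: Acme Corporation
    TitleRule("Incorporated", "ORGANIZATION", "right"),  # assumed: Acme Incorporated
]


def apply_seed_rules(tokens, rules=SEED_RULES):
    """Return (index, token, entity_type) for each token adjacent to a title.

    Simplification: the name is taken to be exactly one token long.
    """
    by_title = {rule.title: rule for rule in rules}
    found = []
    for i, token in enumerate(tokens):
        rule = by_title.get(token)
        if rule is None:
            continue
        # The name sits on the opposite side of the title.
        j = i + 1 if rule.title_side == "left" else i - 1
        if 0 <= j < len(tokens):
            found.append((j, tokens[j], rule.entity_type))
    return found


english = "Yesterday Mr. Smith visited Acme Corporation".split()
mandarin = "Wang xiansheng lai le".split()
print(apply_seed_rules(english))
print(apply_seed_rules(mandarin))
```

Names labeled this way are only a starting point: they would serve as the initial examples from which a learner generalizes, not as the detector itself.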

In parallel with the first part of the project, in the second part we will develop machine learning algorithms that will produce high-quality NE detectors from a small set of initial seed rules. We propose to use semi-supervised learning methods similar to those suggested in (Yarowsky, 1995), (Collins and Singer, 1999) and others. Our approach will develop phrase boundary detectors (Punyakanok and Roth, 2001) and classifiers for entities (Roth and Yih, 2001; Roth and Yih, 2002; Roth and Yih, 2004) and will make use of the SNoW learning architecture (Carlson et al., 1999). It will also use linguistic features specific to named entities to identify different renditions of the same entity (Li, Morie, and Roth, 2004b; Li, Morie, and Roth, 2004a), including abbreviations, and document-level inference to learn from multiple occurrences of entities in the same document.
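The general idea of growing a classifier from a few seed rules can be illustrated with a small sketch. It is loosely in the spirit of the bootstrapping methods cited above, but it is a deliberately simplified stand-in: it is not the algorithm of (Yarowsky, 1995) or (Collins and Singer, 1999), and it does not use SNoW. The bootstrap function, its thresholds and the feature strings such as "title=Mr." and "ctx=said" are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def label_all(examples, rules):
    """Label an example only when every matching rule votes for the same type."""
    labels = []
    for feats in examples:
        votes = {rules[f] for f in feats if f in rules}
        labels.append(next(iter(votes)) if len(votes) == 1 else None)
    return labels


def bootstrap(examples, seed_rules, rounds=3, min_count=2, min_precision=0.95):
    """Grow a feature -> label rule set from seed rules over unlabeled examples.

    examples:   list of sets of feature strings, one set per candidate entity
    seed_rules: dict mapping a feature string to an entity type
    """
    rules = dict(seed_rules)
    for _ in range(rounds):
        labels = label_all(examples, rules)
        # Count how often each feature co-occurs with each assigned label.
        counts = defaultdict(Counter)
        for feats, label in zip(examples, labels):
            if label is not None:
                for f in feats:
                    counts[f][label] += 1
        # Promote features that predict a single label often and reliably.
        for f, c in counts.items():
            label, n = c.most_common(1)[0]
            if f not in rules and n >= min_count and n / sum(c.values()) >= min_precision:
                rules[f] = label
    return rules, label_all(examples, rules)


# Toy unlabeled data: each candidate entity is described by a set of features.
examples = [
    {"title=Mr.", "ctx=said"},
    {"title=Mr.", "ctx=said"},
    {"ctx=said", "word=Jones"},
    {"suffix=Corporation", "ctx=shares_of"},
    {"suffix=Corporation", "ctx=shares_of"},
    {"ctx=shares_of", "word=Initech"},
]
seeds = {"title=Mr.": "PERSON", "suffix=Corporation": "ORGANIZATION"}
rules, labels = bootstrap(examples, seeds)
print(rules)
print(labels)
```

In this toy run the seed rules label four of the six examples directly. The contexts "ctx=said" and "ctx=shares_of" are then promoted to rules, which lets the remaining two examples be labeled even though they contain no seed title.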

In a third part of the project, to be addressed in Years 1 and 2, we will also concern ourselves with the problem of identifying transliteration equivalents between arbitrary languages and English. In languages that use the Roman script this will generally not be an issue, but for languages that use other scripts we would like some way to determine that a particular named entity might correspond to a well-known entity name in English. We propose to make use of previous work on automatic transliteration (e.g. (Knight and Graehl, 1997)), coupled with a document-level model that compares the distribution of names in a given non-English document with the distribution of names in similar documents in English. More details will be given in the Tasks section below.

Finally, in Year 2, we will evaluate our work. We cannot evaluate on all 50 languages, but we can take a sample of the languages for which we develop resources in Year 1 and demonstrate the performance of our methods on these. This will require acquiring (unannotated) training corpora and (annotated) testing corpora. We expect to be able to develop these corpora from online sources. Since annotation is required for the testing portion, we will limit ourselves to languages in which we have local expertise. We propose the following ten languages for evaluation: Chinese, Arabic, Hindi, Marathi, Thai, Farsi, Amharic, Indonesian, Swahili, Quechua. This list includes both widespread languages, such as Chinese, and LCTLs. For languages like Chinese, we can use corpora that are already used for NER evaluation, as a way of comparing our methods with those of others. This will in turn give us a reference point for comparison with performance on LCTLs, so that we will have some sense, from these ten languages, of the range of difficulty of NER across languages.
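The text above does not say how performance on the annotated test corpora would be scored. One common choice in NER evaluation is exact-match precision, recall and F1 over labeled entity spans, and the sketch below assumes that choice. The span_prf function and the (start, end, type) span format are hypothetical, introduced only for illustration.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, type) spans.

    A predicted entity counts as correct only if both its boundaries
    and its type match a gold annotation.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations for one document: token offsets plus entity type.
gold = [(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 11, "ORGANIZATION")]
predicted = [(0, 2, "PERSON"), (5, 6, "ORGANIZATION"), (9, 11, "ORGANIZATION")]

precision, recall, f1 = span_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Because a span must match on both boundaries and type, this score penalizes errors in either of the two stages described earlier, which makes it usable for comparing languages with very different boundary conventions.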

Collaborations:

Relevant Publications: