Transliteration

Overview:

Named entity (NE) transliteration is the process of transcribing an NE from a source language into a target language based on phonetic similarity between the two forms. Identifying transliteration pairs is an important component of many linguistic applications that must handle out-of-vocabulary words, such as machine translation and multilingual information retrieval. We have developed (almost) unsupervised discriminative techniques for learning transliteration models, as well as constrained optimization inference techniques for extracting better transliteration features.

Details:

The first line of work exploits the (weak) temporal alignment between the two sides of a bilingual corpus to derive a nearly unsupervised learning algorithm for the automatic discovery of Named Entities (NEs) in the resource-poor language. NEs have similar time distributions across such corpora, and often some of the tokens in a multi-word NE are transliterated. We develop an algorithm that exploits both observations iteratively. The algorithm makes use of a new, frequency-based metric for time distributions and a resource-free discriminative approach to transliteration. Seeded with a small number of transliteration pairs, our approach discovers multi-word NEs and takes advantage of a dictionary (if one exists) to account for translated or partially translated NEs. We evaluate the algorithm on an English-Russian corpus and show a high level of NE discovery in Russian.
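To illustrate the temporal-similarity idea, here is a minimal Python sketch. The metric itself is not specified on this page, so the sketch assumes one plausible reading of "frequency-based": each word's per-period occurrence counts are normalized, and the first few discrete Fourier transform coefficients of the two series are compared. The function names, the parameter `k`, and the toy counts are all hypothetical and should not be taken as the project's actual implementation.

```python
import cmath
import math


def normalize(counts):
    """Turn raw per-period counts into a distribution that sums to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def dft_coefficients(series, k):
    """Return the first k discrete Fourier transform coefficients."""
    n = len(series)
    return [
        sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(series))
        for f in range(k)
    ]


def time_distance(counts_a, counts_b, k=4):
    """Euclidean distance between low-frequency DFT coefficients.

    Assumption: keeping only low frequencies tolerates small
    misalignments between the two sides of the corpus.
    """
    a = dft_coefficients(normalize(counts_a), k)
    b = dft_coefficients(normalize(counts_b), k)
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(source_counts, candidates, k=4):
    """Sort candidate target-language words, most similar in time first."""
    return sorted(
        candidates,
        key=lambda name: time_distance(source_counts, candidates[name], k),
    )


# Hypothetical counts over eight time periods.
source = [0, 1, 8, 9, 2, 0, 0, 0]
candidates = {
    "aligned": [0, 0, 7, 10, 3, 0, 0, 0],  # bursts at the same time
    "flat": [3, 3, 3, 3, 3, 3, 3, 3],      # common word, no burst
    "late": [0, 0, 0, 0, 0, 1, 9, 8],      # bursts at a different time
}
ranking = rank_candidates(source, candidates)
```

In an iterative setup such as the one described above, a ranking like this would only shortlist candidates; the transliteration model would then score the shortlisted words.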

A recurring problem in many unsupervised methods is the bias introduced when training data is selected according to labeling confidence: the selected data may not be representative of the data encountered in practice. In the second line of work we investigate how to direct the sample selection process toward samples that help adapt the learning process to unseen data, possibly taken from different domains. This is done in two stages. First, we obtain and analyze a large set of unlabeled samples representative of the testing data, to be used as a reference. Second, during the sample-selection process, only samples that minimize the distributional distance between the features of the reference set and those of the sampled set are added to the training data. We show experimentally that data obtained in this manner improves performance compared to using more data that is biased toward the parts of the sample distribution for which automatic annotation is possible.
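The two-stage selection can be sketched as a greedy loop. This is only an illustration under stated assumptions: character bigrams stand in for the features, L1 distance stands in for the distributional distance, and the stopping rule (stop when no candidate brings the selected set closer to the reference) is a choice made for this sketch. None of these are confirmed by this page, and all names are hypothetical.

```python
from collections import Counter


def bigrams(word):
    """Character bigrams, used here as stand-in features."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def feature_distribution(samples):
    """Relative frequency of each feature over a set of samples."""
    counts = Counter(f for s in samples for f in bigrams(s))
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()} if total else {}


def l1_distance(p, q):
    """L1 distance between two sparse feature distributions."""
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))


def select_samples(reference, candidates, budget):
    """Greedily add the candidate that brings the selected set closest
    to the reference distribution; stop when nothing improves."""
    ref_dist = feature_distribution(reference)  # stage 1: analyze reference
    selected, remaining = [], list(candidates)
    current = float("inf")
    while remaining and len(selected) < budget:  # stage 2: guided selection
        best = min(
            remaining,
            key=lambda s: l1_distance(ref_dist, feature_distribution(selected + [s])),
        )
        best_dist = l1_distance(ref_dist, feature_distribution(selected + [best]))
        if best_dist >= current:
            break  # adding more would not reduce the distance
        selected.append(best)
        remaining.remove(best)
        current = best_dist
    return selected


# Toy data: the reference covers two patterns equally.
reference = ["abab", "cdcd"]
candidates = ["abab", "abab", "cdcd", "xyxy"]
chosen = select_samples(reference, candidates, budget=3)
```

On this toy data the loop stops after two samples even though the budget allows three, because a second copy of an already covered pattern would skew the selected set away from the reference.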

We have also explored constrained optimization inference techniques for extracting better transliteration features for a discriminative training model.

See also the multilingual NE discovery demo.

Relevant Software:

Relevant Publications: