
A given entity - representing a person, a location, or an organization - may be mentioned in text in multiple, ambiguous ways. Understanding natural language and supporting intelligent access to textual information requires identifying whether different entity mentions actually represent the same entity.
We apply newly developed machine learning techniques to enable users to trace different mentions of the same entity both within and across documents. Within a document, we use a discriminative approach to model the probability that two words or phrases, even when dissimilar on the surface, refer to the same entity. Across documents, we have developed both a generative model and a discriminative approach; these model the variety of ways that names can be written, along with the topic and author of the documents, so that a system can learn whether names in different documents refer to the same entity.
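As a rough illustration of the pairwise idea, the sketch below scores whether two mentions refer to the same entity by passing a few surface features through a logistic function. It is not the project's actual model: the feature set, the function names (`features`, `same_entity_probability`), and the hand-set `WEIGHTS` and `BIAS` are all assumptions made for this example. A real discriminative model would learn its weights from annotated data and use far richer features.

```python
import math

def features(a: str, b: str) -> list[float]:
    """Hypothetical surface features for a pair of mentions."""
    ta, tb = a.lower().split(), b.lower().split()
    sa, sb = set(ta), set(tb)
    union = sa | sb
    exact = 1.0 if ta == tb else 0.0                      # identical token sequences
    overlap = len(sa & sb) / len(union) if union else 0.0  # Jaccard token overlap
    last = 1.0 if ta and tb and ta[-1] == tb[-1] else 0.0  # same final token (e.g. surname)
    initials_a = "".join(t[0] for t in ta)
    initials_b = "".join(t[0] for t in tb)
    # One mention is a single token spelling the initials of the other.
    acronym = 1.0 if (
        (len(ta) == 1 and len(tb) > 1 and ta[0] == initials_b)
        or (len(tb) == 1 and len(ta) > 1 and tb[0] == initials_a)
    ) else 0.0
    return [exact, overlap, last, acronym]

# Illustrative, hand-set parameters (assumed, not learned from data).
WEIGHTS = [2.0, 3.0, 1.5, 3.0]
BIAS = -2.0

def same_entity_probability(a: str, b: str) -> float:
    """Logistic score: estimated probability that a and b name the same entity."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    for pair in [("John Kennedy", "Kennedy"),
                 ("International Business Machines", "IBM"),
                 ("John Kennedy", "Paris")]:
        print(pair, round(same_entity_probability(*pair), 3))
```

With these assumed weights, a shared surname or a matching acronym pushes the score above 0.5, while unrelated strings stay well below it. Surface features alone cannot resolve pronouns or role descriptions, which is why such a sketch covers only a small part of the problem.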
This task is challenging because an entity can be mentioned in a variety of potentially ambiguous ways. Across documents, names, nicknames, and abbreviations must be traced. Within documents, titles, roles, and pronouns must also be analyzed. We additionally handle references to anonymous entities, such as "the burial site," within documents.