Dataless Classification

Overview:

Today, we have to handle an increasing number of electronic documents from a growing number of sources, such as email, blogs, and news outlets. As the volume of documents grows, organizing them for easy access becomes increasingly difficult.

Traditionally, document categorization has been studied as the problem of training a classifier on labeled data. However, people can categorize documents into named categories without any explicit training because they know what the category names mean. We have introduced the Dataless Classification model, a learning protocol that uses world knowledge to induce classifiers without any labeled data. Like humans, a dataless classifier interprets a string of words as a set of semantic concepts. Within our proposed model, we show that the label name alone is often sufficient to induce classifiers.

Details:

One common solution to text categorization is to label documents with category names and use these labels to drill down to the required documents. For example, if we want to organize our emails into three categories ("university," "administrative mails," and "friends"), we have to manually label each email. This approach presents two immediate problems. First, it requires manually labeling documents and defining filters. Second, it does not scale well when labels are refined.

Further, if we want to split the category named "administrative mails" into two categories, "accounting and billing" and "meetings and events," we will have to redefine the filters or, in the worst case, manually relabel each document. People, however, perform such expansion and refinement easily; in other words, people are adept at creating ontologies of concepts. We can do so because we have an intuitive understanding of what the words mean.

We developed an approach to automatically and transparently categorize and organize electronic documents into ontologies. It is inspired by the way people can infer whether a document discusses a particular topic using only the meaning of the labels. Users provide short descriptions of the categories in the ontology, and the approach, using only those descriptions, applies ideas from machine learning to organize the documents accordingly. We show that in many cases, just the label name (such as "accounting") suffices for accurate categorization. We also show that this approach, which uses Wikipedia to develop an understanding of the meaning of words, adapts well to new genres of documents.
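
The idea of comparing documents and label names in a shared space of semantic concepts can be illustrated with a minimal Python sketch. It is not the implementation behind this project: the four-entry CONCEPTS table is a hypothetical toy stand-in for Wikipedia articles, and the TF-IDF weighting, the cosine similarity, and the function names (concept_vector, classify) are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical toy "concept" corpus standing in for Wikipedia articles.
CONCEPTS = {
    "Accounting": "accounting invoice billing payment budget expense",
    "Meeting": "meeting meetings agenda schedule event events calendar",
    "University": "university course lecture professor student exam",
    "Friendship": "friend friends party weekend dinner birthday",
}


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Word counts per concept, computed once.
CONCEPT_COUNTS = {name: Counter(tokenize(text)) for name, text in CONCEPTS.items()}


def idf(word):
    """Inverse document frequency of a word over the concept corpus."""
    df = sum(1 for counts in CONCEPT_COUNTS.values() if word in counts)
    return math.log(len(CONCEPT_COUNTS) / df) if df else 0.0


def concept_vector(text):
    """Represent a text as one weight per concept (assumed TF-IDF scheme)."""
    words = tokenize(text)
    return {
        name: sum(counts[w] * idf(w) for w in words)
        for name, counts in CONCEPT_COUNTS.items()
    }


def cosine(u, v):
    """Cosine similarity of two vectors that share the same keys."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(document, labels):
    """Return the label whose concept vector is closest to the document's."""
    doc_vec = concept_vector(document)
    return max(labels, key=lambda label: cosine(doc_vec, concept_vector(label)))


labels = ["accounting and billing", "meetings and events"]
print(classify("Please pay the attached invoice before the end of the month.", labels))
print(classify("The agenda for the Friday meeting is in the shared calendar.", labels))
```

No labeled training documents appear anywhere in this sketch: refining a category only means passing a different list of label names to classify. With a concept corpus this small, a document that shares no words with any concept scores zero for every label, so the sketch says nothing about accuracy on real data.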