

Part-of-speech (POS) tagging is the problem of assigning to each word in a sentence the part of speech that it assumes in that sentence.
Because many words are ambiguous with respect to their part of speech, the correct tag is usually identified from the context in which the word appears. Consider, for example, the sentence "Many lights will light the play room so that the light people can play." Here "light" appears as a noun ("lights"), a verb and an adjective. The word "play" appears as both a noun and a verb. Other words, such as "will" and "can", act here as modal verbs but could be tagged as nouns in a different context. This leads to many possible POS taggings of the sentence, only one of which is correct.
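To see how quickly this ambiguity compounds, the following minimal sketch counts the candidate taggings of the example sentence. The lexicon CANDIDATE_TAGS is hand-written and hypothetical (it is not taken from the tagger or from any real tag dictionary), and the tag names are informal.

```python
from math import prod

# Hypothetical candidate tags per word; for illustration only.
CANDIDATE_TAGS = {
    "many": ["adjective"],
    "lights": ["noun", "verb"],
    "will": ["modal", "noun"],
    "light": ["verb", "noun", "adjective"],
    "the": ["determiner"],
    "play": ["noun", "verb"],
    "room": ["noun"],
    "so": ["adverb"],
    "that": ["conjunction"],
    "people": ["noun"],
    "can": ["modal", "noun"],
}

SENTENCE = "Many lights will light the play room so that the light people can play."


def count_taggings(sentence):
    """Count tag sequences when each word may take any of its candidate tags."""
    words = sentence.lower().rstrip(".").split()
    return prod(len(CANDIDATE_TAGS[word]) for word in words)


print(count_taggings(SENTENCE))
```

Under this toy lexicon the sentence has 288 candidate taggings (2 x 2 x 3 x 2 x 3 x 2 x 2), which is why context has to be used to select the single correct one.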
The problem is important because identifying parts of speech is one of the first stages in many natural language processing tasks, such as speech recognition, translation, and information retrieval and extraction.
It is difficult to write down by hand rules for the POS tag of a word in its context. We instead use learning techniques, based on the SNoW learning architecture, to generate those "rules". Our system reads many correctly tagged sentences, used as training data, and learns a function that can be used to tag any English sentence.
SNoW is a learning architecture tailored for learning in the presence of a very large number of information sources (features). SNoW learns a network of linear functions. For the POS tagger, each target node in this network corresponds to a distinct part of speech. Each part of speech is represented as a function of the words in the sentence and the POS tags of the words in the neighborhood of the target word.
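The idea of one linear function per part of speech over sparse features can be sketched as follows. This is not SNoW itself: the class LinearTagNetwork, the feature templates in extract_features, the toy training data, and the perceptron-style update are all assumptions made for illustration, since the text does not state which features or update rule the tagger uses.

```python
from collections import defaultdict


def extract_features(words, left_tags, i):
    """Sparse features for position i: the word, its neighbours, and the tag to its left."""
    feats = [f"w0={words[i].lower()}"]
    feats.append(f"w-1={words[i - 1].lower()}" if i > 0 else "w-1=<s>")
    feats.append(f"w+1={words[i + 1].lower()}" if i + 1 < len(words) else "w+1=</s>")
    feats.append(f"t-1={left_tags[i - 1]}" if i > 0 else "t-1=<s>")
    return feats


class LinearTagNetwork:
    """One sparse linear function (target node) per tag; the highest score wins."""

    def __init__(self, tags):
        self.tags = list(tags)
        self.weights = {tag: defaultdict(float) for tag in self.tags}

    def score(self, tag, feats):
        w = self.weights[tag]
        return sum(w[f] for f in feats if f in w)

    def predict(self, feats):
        return max(self.tags, key=lambda tag: self.score(tag, feats))

    def update(self, feats, gold):
        """Mistake-driven additive update (a perceptron-style stand-in)."""
        guess = self.predict(feats)
        if guess != gold:
            for f in feats:
                self.weights[gold][f] += 1.0
                self.weights[guess][f] -= 1.0

    def train(self, tagged_sentences, epochs=5):
        for _ in range(epochs):
            for sent in tagged_sentences:
                words = [w for w, _ in sent]
                gold = [t for _, t in sent]
                for i in range(len(words)):
                    # Gold tags supply the left context during training.
                    self.update(extract_features(words, gold, i), gold[i])

    def tag(self, words):
        tags = []
        for i in range(len(words)):
            # At tagging time the left context is the tagger's own earlier output.
            tags.append(self.predict(extract_features(words, tags, i)))
        return tags


# Toy, hypothetical training data: "light" is a noun after "the", a verb after "can".
TOY_DATA = [
    [("the", "DT"), ("light", "NN")],
    [("can", "MD"), ("light", "VB")],
]
net = LinearTagNetwork(["DT", "NN", "VB", "MD"])
net.train(TOY_DATA)
print(net.tag(["can", "light"]))
```

Because only the features active in an example are touched, each linear function stays sparse even when the overall feature space (all words and all neighbouring tags) is very large.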
The POS tagger makes use of the Sequential Model, which facilitates both learning and evaluating the learned function when the number of potential targets for each decision is large (here, there are about 50 different potential POS tags).
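The text does not spell out how the Sequential Model is applied. One reading, assumed here, is that a cheap first stage narrows the candidate tags for each word (in this sketch, to the tags the word carried in training), and a second stage scores only the survivors. The function names and the toy data below are hypothetical.

```python
from collections import defaultdict


def build_tag_lexicon(tagged_sentences):
    """Map each lowercased word to the set of tags it carried in training."""
    lexicon = defaultdict(set)
    for sent in tagged_sentences:
        for word, tag in sent:
            lexicon[word.lower()].add(tag)
    return lexicon


def candidate_tags(word, lexicon, all_tags):
    """Stage 1: restrict the candidates; unseen words fall back to the full tag set."""
    seen = lexicon.get(word.lower())
    return sorted(seen) if seen else sorted(all_tags)


def choose_tag(word, lexicon, all_tags, score):
    """Stage 2: pick the best-scoring tag among the restricted candidates only."""
    return max(candidate_tags(word, lexicon, all_tags), key=score)


# Toy, hypothetical data and scores.
ALL_TAGS = {"DT", "NN", "VB", "MD"}
lexicon = build_tag_lexicon([
    [("the", "DT"), ("light", "NN")],
    [("can", "MD"), ("light", "VB")],
])
toy_scores = {"DT": 9.0, "NN": 1.0, "VB": 2.0}
print(choose_tag("light", lexicon, ALL_TAGS, lambda tag: toy_scores.get(tag, 0.0)))
```

In this toy run "DT" has the highest raw score but is never considered for "light", because the first stage has already ruled it out. With about 50 tags, such a restriction means most words are scored against only a handful of targets.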
The current system has been trained on a collection of Wall Street Journal articles, about 1 million words in total, that were tagged for POS by the Penn Treebank project.