LBJ Part of Speech Tagger

This POS tagger is substantially the same as our SNoW-based POS tagger, except that it performs better, outputs a more standardized tag set, and accepts raw natural language text as input (i.e., the input should not be sentence-split or word-split beforehand). The output format is the same.

Another difference between this version and the SNoW-based POS tagger is that LBJ makes this tagger much easier to incorporate into other Java applications. Simply import the tagger and call it on an LBJ2.nlp.seg.Token object.

See the online Javadoc documentation.

Using the Part of Speech Tagger

Testing

The tagger's performance can be tested on labeled test data with the following command:

   
java LBJ2.classify.TestDiscrete LBJ2.nlp.pos.POSTagger \
     LBJ2.nlp.pos.POSLabel LBJ2.nlp.pos.POSBracketToToken <test data>

where <test data> is the path to the labeled test data. See the online Javadoc documentation of the LBJ2.classify.TestDiscrete class for more information.
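
Judging from its arguments, the command above compares the tag predicted by the classifier (POSTagger) with the tag produced by the labeler (POSLabel) for each token the parser yields. The core of such a comparison can be sketched in plain Java as follows. This is only an illustration of the idea, not the implementation of TestDiscrete; the class name TagAccuracy and the sample tag lists are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the LBJ library.
final class TagAccuracy {
  // Fraction of positions at which the predicted tag equals the gold tag.
  static double accuracy(List<String> gold, List<String> predicted) {
    if (gold.size() != predicted.size()) {
      throw new IllegalArgumentException("gold and predicted must have the same length");
    }
    if (gold.isEmpty()) return 0.0;
    int correct = 0;
    for (int i = 0; i < gold.size(); i++) {
      if (gold.get(i).equals(predicted.get(i))) correct++;
    }
    return (double) correct / gold.size();
  }

  public static void main(String[] args) {
    // Made-up tags for illustration only.
    List<String> gold = Arrays.asList("DT", "NN", "VBZ", "JJ");
    List<String> predicted = Arrays.asList("DT", "NN", "VBD", "JJ");
    System.out.println(accuracy(gold, predicted)); // prints 0.75
  }
}
```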

Evaluating

A stand-alone program that tags plain, unannotated text is also provided. Its input should be raw natural language text that has not been sentence-split or word-split. Run it with the following command:

   
java LBJ2.nlp.pos.POSTagPlain <input file>

Importing

The LBJ part of speech tagger expects that words are represented internally using the LBJ library's LBJ2.nlp.seg.Token class. If your LBJ source code defines a learning classifier that also takes a Token as input, you can import the POS tagger and use it like so:

   
// Begin Foo.lbj

import LBJ2.nlp.pos.POSTagger;
import LBJ2.nlp.seg.Token;

discrete FooClassifier(Token w) <-
learn FooLabeler
  using Feature1, Feature2, POSTagger
  ...
end

If your Java application uses the Token class as well, you can import the POS tagger and use it like so:

   
// Begin Bar.java

import LBJ2.nlp.pos.POSTagger;
import LBJ2.nlp.seg.Token;
import LBJ2.nlp.SentenceSplitter;
import LBJ2.nlp.WordSplitter;
import LBJ2.nlp.seg.PlainToTokenParser;
import LBJ2.parse.ChildrenFromVectors;

public class Bar
{
  ...
  void myMethod(String plainTextFile)
  {
    ...
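    // Instantiate the tagger.
    POSTagger tagger = new POSTagger();
    ...
    // Chain the parsers: split the file into sentences, then into words,
    // then wrap each word in a Token.
    PlainToTokenParser parser =
      new PlainToTokenParser(
        new WordSplitter(
          new SentenceSplitter(plainTextFile)));
    // Fetch the first Token and ask the tagger for its part of speech tag.
    Token w = (Token) parser.next();
    String tag = tagger.discreteValue(w);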
    ...
  }
  ...
}

The list of tags returned by the discreteValue(Object) method in the context shown above can be found in the online Javadoc at

http://flake.cs.uiuc.edu/~rizzolo/LBJ2/library/LBJ2/nlp/POS.html#tokens
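
The Java snippet above tags only the first token. To tag a whole file, the same call can be repeated until the parser runs out of input. The sketch below shows that loop in self-contained form, assuming the parser signals the end of input by returning null from next(). TokenSource, fromList, tagAll, and the toy tagger are hypothetical stand-ins so that the example runs without the LBJ library; in a real application, the PlainToTokenParser and POSTagger objects would take their place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not part of the LBJ library.
final class TagAllTokens {
  // Stand-in for a parser: next() yields one word per call, then null at the end (assumed).
  interface TokenSource { String next(); }

  // Wraps a list of words so it behaves like the stand-in parser.
  static TokenSource fromList(List<String> words) {
    Iterator<String> it = words.iterator();
    return () -> it.hasNext() ? it.next() : null;
  }

  // Drains the source, tagging every word; the "word/TAG" format is arbitrary.
  static List<String> tagAll(TokenSource source, Function<String, String> tagger) {
    List<String> tagged = new ArrayList<>();
    for (String w = source.next(); w != null; w = source.next()) {
      tagged.add(w + "/" + tagger.apply(w));
    }
    return tagged;
  }

  public static void main(String[] args) {
    // Toy tagger for illustration only; a real tagger would be used here.
    Function<String, String> toyTagger = w -> w.equals("the") ? "DT" : "NN";
    System.out.println(tagAll(fromList(Arrays.asList("the", "dog")), toyTagger));
    // prints [the/DT, dog/NN]
  }
}
```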

Participants: