LBJ Chunker

A chunker (or "shallow parser") is a program that partitions plain text into sequences of semantically related words and labels each sequence with its type. For example:

   
[NP Jack and Jill ] [VP went ] [ADVP up ] [NP the hill ]
[VP to fetch ] [NP a pail ] [PP of ] [NP water ] .

This task is simpler than "full parsing" (in which a parse tree indicating nested phrase structure is produced), and it was originally intended to be an aid for full parsers.

See the online Javadoc documentation.

Using the Chunker

Testing

Assuming the chunker's class files are on the CLASSPATH, its performance can be tested on test data labeled in the same format as the CoNLL 2000 corpus with the following command:

   
java LBJ2.nlp.chunk.ChunkTester <test data>

where <test data> is the path to the labeled test data. This very simple program makes use of the LBJ2.nlp.seg.BIOTester class, which collects precision, recall, and F1 statistics over the segments (i.e., chunks, in this case) discovered by a "BIO" style classifier (such as this chunker; see below for a description of the tags produced).
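
To make the segment-level statistics concrete, here is a minimal sketch of how precision, recall, and F1 can be computed over chunks rather than over individual words. It is not the BIOTester implementation; the class SegmentScores and its methods are hypothetical, and it assumes the common convention that a predicted chunk counts as correct only when both its boundaries and its type match a gold chunk exactly.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch (not LBJ's BIOTester): chunk-level precision, recall, and F1. */
final class SegmentScores {
  private SegmentScores() {}

  /** Returns each chunk in a BIO tag sequence encoded as "start-end-type" (end exclusive). */
  static Set<String> segments(String[] tags) {
    Set<String> result = new HashSet<>();
    int start = -1;
    String type = null;   // type of the chunk currently open, or null if none
    for (int i = 0; i <= tags.length; i++) {
      String tag = i < tags.length ? tags[i] : "O";   // sentinel closes a final chunk
      String t = tag.equals("O") ? null : tag.substring(2);
      boolean continues = type != null && tag.startsWith("I-") && type.equals(t);
      if (!continues) {
        if (type != null) result.add(start + "-" + i + "-" + type);
        start = i;
        type = t;
      }
    }
    return result;
  }

  /** Returns {precision, recall, F1}; a chunk is correct only if span and type both match. */
  static double[] score(String[] gold, String[] predicted) {
    Set<String> g = segments(gold);
    Set<String> p = segments(predicted);
    Set<String> correct = new HashSet<>(g);
    correct.retainAll(p);
    double precision = p.isEmpty() ? 0.0 : (double) correct.size() / p.size();
    double recall = g.isEmpty() ? 0.0 : (double) correct.size() / g.size();
    double f1 = precision + recall == 0.0
        ? 0.0
        : 2 * precision * recall / (precision + recall);
    return new double[] {precision, recall, f1};
  }
}
```

For gold tags B-NP I-NP B-VP O and predicted tags B-NP I-NP B-VP B-NP, two of the three predicted chunks are correct and both gold chunks are found, so precision is 2/3, recall is 1, and F1 is 0.8.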

If your data has chunk labels but no part-of-speech tags, use the same CoNLL 2000 corpus format with a single dash in place of each POS tag. These tags will then be computed automatically during feature extraction.
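
As an illustration of the dash convention, the sketch below converts two-column "word chunk-tag" lines into the three-column CoNLL 2000 layout (word, POS tag, chunk tag, separated by spaces, with a blank line between sentences) by inserting a dash as the POS column. The class DashPos is a hypothetical helper, not part of LBJ, and the two-column input layout is an assumption about how such data might be stored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of LBJ): adds a "-" POS column to "word chunk" lines. */
final class DashPos {
  private DashPos() {}

  /** Turns each "word chunk" line into "word - chunk"; blank lines pass through unchanged. */
  static List<String> addDashes(List<String> lines) {
    List<String> out = new ArrayList<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {   // sentence boundary
        out.add("");
        continue;
      }
      String[] cols = trimmed.split("\\s+");
      if (cols.length != 2) {
        throw new IllegalArgumentException("expected 2 columns: " + line);
      }
      out.add(cols[0] + " - " + cols[1]);
    }
    return out;
  }
}
```

A line such as "He B-NP" becomes "He - B-NP", which can then be written to the file passed to ChunkTester.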

Evaluating

The LBJ runtime library contains a class that implements a general purpose segmenter based on a word classifier that returns "BIO" style tags, such as this chunker. To invoke this program, type:

   
java LBJ2.nlp.seg.SegmentTagPlain LBJ2.nlp.chunk.Chunker <plain text file>

For more information about the SegmentTagPlain program, see LBJ's online documentation.

Importing

This implementation uses the LBJ library's Token class to internally represent the words whose chunk tags it computes. If your Java application uses the Token class as well, you can import the chunker and use it like so:

   
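// Begin Foo.java

import LBJ2.nlp.chunk.Chunker;
import LBJ2.nlp.seg.Token;

public class Foo
{
  ...
  void myMethod()
  {
    ...
    // Instantiate the chunker.
    Chunker tagger = new Chunker();
    ...
    // A Token from your application; how it is obtained is elided here.
    Token word = ...
    ...
    // The chunk tag predicted for this word, e.g. "B-NP".
    String tag = tagger.discreteValue(word);
    ...
  }
  ...
}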

Note that if your word object does not have its partOfSpeech field filled, the LBJ POS tagger (which must be on your CLASSPATH) will be loaded automatically by the chunker to compute the tag for use as a feature.

Used as shown above, the chunker will return one of the following tags for each word:

Tag       Explanation: "The chunker predicts that the word ..."
B-ADJP    begins an adjective phrase.
I-ADJP    is inside an adjective phrase.
B-ADVP    begins an adverbial phrase.
I-ADVP    is inside an adverbial phrase.
B-CONJP   begins a conjunctive phrase.
I-CONJP   is inside a conjunctive phrase.
B-INTJ    begins an interjection.
I-INTJ    is inside an interjection.
B-LST     begins a list marker.
I-LST     is inside a list marker.
B-NP      begins a noun phrase.
I-NP      is inside a noun phrase.
B-PP      begins a prepositional phrase.
I-PP      is inside a prepositional phrase.
B-PRT     begins a particle.
I-PRT     is inside a particle.
B-SBAR    begins a subordinated clause.
I-SBAR    is inside a subordinated clause.
B-UCP     begins an unlike coordinated phrase.
I-UCP     is inside an unlike coordinated phrase.
B-VP      begins a verb phrase.
I-VP      is inside a verb phrase.
O         is outside of any chunk.
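
Since the chunker returns one tag per word, an application usually has to group those tags back into chunks. The following is a minimal sketch of that grouping, producing the bracket notation used in the example at the top of this page. The class BioBrackets is hypothetical and not part of LBJ; it takes plain string arrays rather than Token objects so that it stands alone, and it assumes that an I- tag with no matching open chunk starts a new chunk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of LBJ): renders BIO-tagged words in bracket notation. */
final class BioBrackets {
  private BioBrackets() {}

  /**
   * Groups words into chunks according to their BIO tags. A chunk starts at a B- tag,
   * or at an I- tag whose type differs from that of the currently open chunk.
   */
  static String bracket(String[] words, String[] tags) {
    if (words.length != tags.length) {
      throw new IllegalArgumentException("words and tags must have the same length");
    }
    List<String> pieces = new ArrayList<>();
    StringBuilder open = null;   // text of the chunk being built, or null if none
    String openType = null;      // type of the chunk being built, e.g. "NP"

    for (int i = 0; i < words.length; i++) {
      String tag = tags[i];
      boolean outside = tag.equals("O");
      String type = outside ? null : tag.substring(2);
      boolean startsNew = !outside && (tag.startsWith("B-") || !type.equals(openType));

      // Close the open chunk when this word leaves it or begins another one.
      if (open != null && (outside || startsNew)) {
        pieces.add(open.append(" ]").toString());
        open = null;
        openType = null;
      }
      if (outside) {
        pieces.add(words[i]);
      } else if (startsNew) {
        open = new StringBuilder("[" + type + " " + words[i]);
        openType = type;
      } else {
        open.append(' ').append(words[i]);
      }
    }
    if (open != null) pieces.add(open.append(" ]").toString());
    return String.join(" ", pieces);
  }
}
```

Given the words "Jack and Jill went up the hill ." with the tags B-NP I-NP I-NP B-VP B-ADVP B-NP I-NP O, the method returns "[NP Jack and Jill ] [VP went ] [ADVP up ] [NP the hill ] .".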

Participants: