Learning Based Java

What is LBJ?

Learning Based Java is a modeling language for the rapid development of software systems with one or more learned functions, designed for use with the Java™ programming language. LBJ offers a convenient, declarative syntax for defining classifiers and constraints directly in terms of the objects in the programmer's application. With LBJ, the details of feature extraction, learning, model evaluation, and inference are all abstracted away, leaving programmers free to reason more directly about their application.

Introduction

Many software systems are in need of functions that are simple to describe but that no one knows how to implement. Recently, more and more designers of such systems have turned to machine learning to plug these gaps. Given data, a discriminative machine learning algorithm yields a function that classifies instances from some problem domain into one of a set of categories. For example, given an instance from the domain of email messages (i.e., given an email), we may desire a function that classifies that email as either "spam" or "not spam". Given data (in particular, a set of emails for which the correct classification is known), a machine learning algorithm can provide such a function. We call systems that utilize machine learning technology learning based programs.
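
As a rough illustration of how data can yield such a function, the sketch below trains a tiny perceptron over word occurrences. It is plain Java, not LBJ code: the class name, the four training emails, and the choice of the perceptron as the learning algorithm are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not part of LBJ.
final class SpamPerceptronSketch {
  private final Map<String, Double> weights = new HashMap<>();
  private double bias = 0.0;

  // Sum of the bias and the weight of every word occurrence in the email.
  private double score(String email) {
    double sum = bias;
    for (String word : email.toLowerCase().split("\\s+"))
      sum += weights.getOrDefault(word, 0.0);
    return sum;
  }

  // The learned function: classifies an email as spam (true) or not (false).
  boolean isSpam(String email) { return score(email) > 0; }

  // Perceptron training: the weights change only when a prediction is wrong.
  void train(String[] emails, boolean[] spam, int rounds) {
    for (int r = 0; r < rounds; ++r) {
      for (int i = 0; i < emails.length; ++i) {
        if (isSpam(emails[i]) == spam[i]) continue;
        double delta = spam[i] ? 1.0 : -1.0;
        bias += delta;
        for (String word : emails[i].toLowerCase().split("\\s+"))
          weights.merge(word, delta, Double::sum);
      }
    }
  }

  // Trains on four made-up emails whose correct classification is known.
  static SpamPerceptronSketch demo() {
    String[] emails = {
      "win free money now", "meeting agenda for monday",
      "free prize claim now", "lunch on monday"
    };
    boolean[] spam = { true, false, true, false };
    SpamPerceptronSketch classifier = new SpamPerceptronSketch();
    classifier.train(emails, spam, 5);
    return classifier;
  }

  public static void main(String[] args) {
    SpamPerceptronSketch classifier = demo();
    System.out.println(classifier.isSpam("free money"));        // prints true
    System.out.println(classifier.isSpam("agenda for lunch"));  // prints false
  }
}
```

The application only ever calls `isSpam`; how the weights were obtained stays hidden behind that method.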

Modern learning based programs often involve several learning components (or at least a single learning component applied repeatedly) whose classifications depend on each other. There are many approaches to designing such programs; here, we focus on the following one. Given data, the various learning components are trained entirely independently of each other, each optimizing its own loss function. Then, when the learned functions are applied in the wild, the independent predictions made by each function are reconciled according to user-specified constraints. This approach has been applied successfully to complicated domains such as Semantic Role Labeling.
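
The reconciliation step can be pictured with a minimal Java sketch. It does not use LBJ's inference machinery; the class, the `LabelConstraint` interface, the scores, and the constraint itself are hypothetical. Two independently trained classifiers each score the same label set, and inference picks the highest-scoring joint assignment that the constraint allows.

```java
// Hypothetical illustration; this is not LBJ's inference API.
final class ConstrainedInferenceSketch {
  // A user-specified constraint over the labels chosen by two classifiers.
  interface LabelConstraint {
    boolean allows(int labelA, int labelB);
  }

  // Returns the label pair {a, b} that maximizes scoresA[a] + scoresB[b]
  // among the pairs the constraint allows, or null if none is allowed.
  static int[] bestJointAssignment(double[] scoresA, double[] scoresB,
                                   LabelConstraint constraint) {
    int[] best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int a = 0; a < scoresA.length; ++a) {
      for (int b = 0; b < scoresB.length; ++b) {
        if (!constraint.allows(a, b)) continue;
        double score = scoresA[a] + scoresB[b];
        if (score > bestScore) {
          bestScore = score;
          best = new int[] { a, b };
        }
      }
    }
    return best;
  }

  // Made-up scores over the labels 0 = AGENT, 1 = PATIENT, 2 = NONE.
  static int[] demo() {
    double[] first = { 0.9, 0.3, 0.1 };
    double[] second = { 0.8, 0.7, 0.2 };
    // Constraint: the two predictions may share a label only if it is NONE.
    return bestJointAssignment(first, second, (a, b) -> a != b || a == 2);
  }

  public static void main(String[] args) {
    int[] result = demo();
    System.out.println(result[0] + " " + result[1]);  // prints "0 1"
  }
}
```

Taken independently, both classifiers prefer label 0; the constraint forces the second one to fall back to its next-best label. Exhaustive search is used here only because the example is tiny.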

LBJ

Learning Based Java (LBJ) is a modeling language that expedites the development of learning based programs, designed for use with the Java™ programming language. The LBJ compiler accepts the programmer's classifier and constraint specifications as input, automatically generating efficient Java code and applying learning algorithms (i.e., performing training) as necessary to implement the classifiers' entire computation from raw data (e.g., text or images) to output decision (e.g., a part of speech tag or the type of a recognized object). The details of feature extraction, model evaluation (i.e., evaluating the function that the learning algorithm returned), and inference (i.e., reconciling the predictions in terms of the constraints at runtime) are abstracted away from the programmer.

A classifier may be defined by:

  • coding it explicitly in Java,
  • using operators to build it from existing classifiers, or
  • identifying feature extraction classifiers and a data source to learn it over.
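
The three options above can be mimicked in plain Java. The sketch below is not LBJ syntax and not the LBJ library: classifiers are modeled as `Function<String, String>`, the `conjunction` operator and the frequency-counting `learn` method are stand-ins invented for this example, and the training words are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration; classifiers are modeled as plain functions.
final class ClassifierDefinitionSketch {
  // 1. Classifiers coded explicitly in Java.
  static final Function<String, String> CAPITALIZED =
      word -> String.valueOf(!word.isEmpty() && Character.isUpperCase(word.charAt(0)));
  static final Function<String, String> LENGTH =
      word -> word.length() > 4 ? "long" : "short";

  // 2. A classifier built from existing classifiers with an operator.
  static Function<String, String> conjunction(Function<String, String> f,
                                              Function<String, String> g) {
    return word -> f.apply(word) + "&" + g.apply(word);
  }

  // 3. A classifier learned from a feature extractor and a data source.
  //    Each row of labeledWords is {word, label}; the "learning algorithm"
  //    just remembers the most frequent label for each feature value.
  static Function<String, String> learn(Function<String, String> feature,
                                        String[][] labeledWords) {
    Map<String, Map<String, Integer>> counts = new HashMap<>();
    for (String[] example : labeledWords)
      counts.computeIfAbsent(feature.apply(example[0]), k -> new HashMap<>())
            .merge(example[1], 1, Integer::sum);
    return word -> {
      Map<String, Integer> labelCounts = counts.get(feature.apply(word));
      if (labelCounts == null) return "unknown";
      return labelCounts.entrySet().stream()
          .max(Map.Entry.comparingByValue()).get().getKey();
    };
  }

  static Function<String, String> demo() {
    String[][] data = {
      { "Paris", "NAME" }, { "London", "NAME" }, { "Bob", "NAME" },
      { "table", "OTHER" }, { "cat", "OTHER" }
    };
    return learn(conjunction(CAPITALIZED, LENGTH), data);
  }

  public static void main(String[] args) {
    Function<String, String> nameTagger = demo();
    System.out.println(nameTagger.apply("Berlin"));  // prints NAME
    System.out.println(nameTagger.apply("chair"));   // prints OTHER
  }
}
```

All three kinds of classifier share one calling convention, which is what lets a learned classifier serve as a feature for another one.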

Under the LBJ programming philosophy, the designer of a learning based program first designs an object-oriented internal representation (IR) of the application's raw data using pure Java. A classifier is then any method that produces one or more discrete or real-valued classifications with respect to a single object from the programmer's IR. Using LBJ, these classifications are easily interpretable either at face value, as the application requires, or as features amenable for input to a learning algorithm. Learning algorithms are employed to create learning classifiers, which are classifiers that can change their representation with experience. Once the LBJ compiler has generated these representations from their specifications and user-supplied training objects, the application, written in pure Java, simply invokes any classifier on an IR object just like any other method. When programming with LBJ, practitioners reason directly in terms of their data, disregarding the cumbersome implementation details of feature extraction and learning.

LBJ is supported by a library of interfaces and classes that implement standardized functionality for features and classifiers. The library includes learning and inference algorithm implementations, general-purpose and domain-specific internal representations, and domain-specific parsers.

The LBJ compiler also works much like a makefile-driven build. When changes are made to one or more supporting classifiers, the compiler re-trains only those learned classifiers that were affected by the changes.
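
One way to picture this behavior is as a reachability question over a dependency graph. The sketch below is an assumption about how such a decision could be computed, not a description of the LBJ compiler's internals; the class name, the `affected` method, and the dependency table are hypothetical (the classifier names are borrowed from the chunker example for familiarity).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not the LBJ compiler's actual algorithm.
final class RetrainPlannerSketch {
  // dependsOn maps a classifier to the classifiers it uses directly.
  // Returns the learned classifiers that changed or that depend, directly
  // or transitively, on a changed classifier.
  static Set<String> affected(Map<String, List<String>> dependsOn,
                              Set<String> learned, Set<String> changed) {
    Set<String> dirty = new TreeSet<>(changed);
    boolean grew = true;
    while (grew) {  // propagate until no new classifier becomes dirty
      grew = false;
      for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
        if (!dirty.contains(entry.getKey())
            && !Collections.disjoint(entry.getValue(), dirty)) {
          dirty.add(entry.getKey());
          grew = true;
        }
      }
    }
    dirty.retainAll(learned);  // hard-coded classifiers need no training
    return dirty;
  }

  static Set<String> demo() {
    Map<String, List<String>> dependsOn = new HashMap<>();
    dependsOn.put("Chunker", Arrays.asList("Forms", "POSWindow"));
    dependsOn.put("POSWindow", Arrays.asList("POSTagger"));
    dependsOn.put("POSTagger", Arrays.asList("Forms"));
    Set<String> learned = new HashSet<>(Arrays.asList("Chunker", "POSTagger"));
    Set<String> changed = new HashSet<>(Arrays.asList("POSWindow"));
    return affected(dependsOn, learned, changed);
  }

  public static void main(String[] args) {
    System.out.println(demo());  // prints [Chunker]
  }
}
```

In this made-up graph, editing `POSWindow` marks `Chunker` for re-training but leaves `POSTagger` alone, since nothing the tagger uses has changed.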


Feature Library

LBJ makes it easy to develop and use classifiers as features. In addition to the simple, hard-coded classifiers that come packaged with LBJ (see the online Javadoc), a constantly growing suite of learned classifiers is available. Simply import them into your LBJ or Java source code, and call them just like methods (see below for examples).

LBJ Part of Speech Tagger

This is an implementation of our SNoW-based POS tagger for use with LBJ.

LBJ Chunker

A classifier that partitions plain text into sequences of semantically related words, indicating a shallow (i.e., non-hierarchical) phrase structure.

LBJ Coreference

A Coreference Resolver, based on LBJ, trained on the ACE corpus.

LBJ Named Entity Tagger

This is a state-of-the-art NE tagger that tags plain text with named entities (people / organizations / locations / miscellaneous). It uses gazetteers extracted from Wikipedia, word class models derived from unlabeled text, and expressive non-local features. Its best performance is 90.8 F1 on the CoNLL03 shared task data.



Example: The LBJ Chunker

The LBJ Chunker is written in 33 lines of LBJ:

// chunk.lbj
package LBJ2.nlp.chunk;

// Note how classifiers defined elsewhere, e.g. those defined in LBJ's NLP
// library, learned or hard-coded, can simply be imported and invoked.
import LBJ2.nlp.*;
import LBJ2.nlp.pos.POSTagger;
import LBJ2.nlp.seg.Token;


// This hard-coded classifier computes features whose values come from the
// learned classifier defined below.
discrete% PreviousTags(Token word) <-
{
  int i;
  Token w = word;
  for (i = 0; i > -2 && w.previous != null; --i) w = (Token) w.previous;

  for (; w != word; w = (Token) w.next)
  {
    if (Chunker.isTraining) sense i++ : w.label;
    else sense i++ : Chunker(w);
  }
}

// The features computed by this classifier are the parts of speech in a
// window around the target word.
discrete% POSWindow(Token word) <-
{
  int i;
  Token w = word, last = word;
  for (i = 0; i <= 2 && last != null; ++i) last = (Token) last.next;
  for (i = 0; i > -2 && w.previous != null; --i) w = (Token) w.previous;

  for (; w != last; w = (Token) w.next) sense i++ : POSTagger(w);
}

discrete ChunkLabel(Token word) <- { return word.label; }

// The chunker is learned "using" certain classifiers to extract features,
// "from" data, "with" a learning algorithm.
discrete Chunker(Token word) cachedin word.type <-
learn ChunkLabel
  using Forms, Capitalization, WordTypeInformation, Affixes, PreviousTags,
        POSWindow
  from
    new ChildrenFromVectors(new CoNLL2000Parser(Constants.trainingData))
    50 rounds
  with new SparseNetworkLearner(new SparseAveragedPerceptron(.1, 0, 2))
end
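
To make the window traversal in POSWindow easier to follow, here is a plain Java re-creation of its three loops. The `Tok` class and the pre-assigned `tag` field are stand-ins for LBJ's `Token` and for a call to `POSTagger`, and the returned strings of the form `position:tag` are only an assumed way of displaying what each `sense` statement produces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the traversal in POSWindow; not LBJ code.
final class WindowFeatureSketch {
  // Stand-in for LBJ's Token: a doubly linked word with a known tag.
  static final class Tok {
    final String tag;
    Tok previous, next;
    Tok(String tag) { this.tag = tag; }
  }

  // Collects the tags in a window of up to two words on either side of the
  // target word, each paired with its position relative to the target.
  static List<String> posWindow(Tok word) {
    int i;
    Tok w = word, last = word;
    // Advance "last" up to three steps: it marks the exclusive window end.
    for (i = 0; i <= 2 && last != null; ++i) last = last.next;
    // Move "w" back up to two steps; i ends at the first relative position.
    for (i = 0; i > -2 && w.previous != null; --i) w = w.previous;

    List<String> features = new ArrayList<>();
    for (; w != last; w = w.next) features.add((i++) + ":" + w.tag);
    return features;
  }

  // Builds the tag sequence DT NN VBD IN DT and extracts the window
  // around the word at the given index.
  static List<String> demo(int index) {
    String[] tags = { "DT", "NN", "VBD", "IN", "DT" };
    Tok[] sentence = new Tok[tags.length];
    for (int k = 0; k < tags.length; ++k) {
      sentence[k] = new Tok(tags[k]);
      if (k > 0) {
        sentence[k].previous = sentence[k - 1];
        sentence[k - 1].next = sentence[k];
      }
    }
    return posWindow(sentence[index]);
  }

  public static void main(String[] args) {
    System.out.println(demo(2));  // prints [-2:DT, -1:NN, 0:VBD, 1:IN, 2:DT]
    System.out.println(demo(1));  // prints [-1:DT, 0:NN, 1:VBD, 2:IN]
  }
}
```

Near a sentence boundary the window simply shrinks, which is why the second call yields four features instead of five.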


Example: Using the Part of Speech Tagger

In the previous example, we saw how easy it was to import and use the learned POSTagger classifier for feature extraction. (It was learned using an LBJ source file very similar to the chunker's above.) In this example, we'll see how to import it into a Java application.

// POSTagPlain.java
package LBJ2.nlp.pos;

import LBJ2.nlp.*;
import LBJ2.parse.*;


public class POSTagPlain
{
  public static void main(String[] args)
  {
    // Code that parses the command line omitted.

    // Here, the POSTagger, translated to a regular Java class by the LBJ
    // compiler, is instantiated.
    POSTagger tagger = new POSTagger();
    // A chain of utility parsers from the LBJ library parses plain text into
    // Word objects.
    Parser parser =
      new ChildrenFromVectors(
          new WordSplitter(new SentenceSplitter(testingFile)));
    String sentence = "";

    for (Word word = (Word) parser.next(); word != null;
         word = (Word) parser.next())
    {
      // The tagger is invoked, and a part of speech is returned as a String.
      word.partOfSpeech = tagger.discreteValue(word);
      sentence += " (" + word.partOfSpeech + " " + word.form + ")";

      if (word.next == null)
      {
        System.out.println(sentence.substring(1));
        sentence = "";
      }
    }
  }
}
