
What is LBJ?
Learning Based Java is a modeling language for the rapid development of software systems with one or more learned functions, designed for use with the Java™ programming language. LBJ offers a convenient, declarative syntax for classifier and constraint definition directly in terms of the objects in the programmer's application. With LBJ, the details of feature extraction, learning, model evaluation, and inference are all abstracted away from the programmer, leaving him to reason more directly about his application.
Many software systems are in need of functions that are simple to describe but that no one knows how to implement. Recently, more and more designers of such systems have turned to machine learning to plug these gaps. Given data, a discriminative machine learning algorithm yields a function that classifies instances from some problem domain into one of a set of categories. For example, given an instance from the domain of email messages (i.e., given an email), we may desire a function that classifies that email as either "spam" or "not spam". Given data (in particular, a set of emails for which the correct classification is known), a machine learning algorithm can provide such a function. We call systems that utilize machine learning technology learning based programs.
Modern learning based programs often involve several learning components (or at least a single learning component applied repeatedly) whose classifications depend on each other. There are many approaches to designing such programs; here, we focus on the following one. Given data, the various learning components are trained entirely independently of each other, each optimizing its own loss function. Then, when the learned functions are applied in the wild, the independent predictions made by each function are reconciled according to user-specified constraints. This approach has been applied successfully to complicated domains such as Semantic Role Labeling.
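The reconciliation step described above can be sketched in plain Java. This is only an illustration, not LBJ's inference algorithm: it assumes each independently trained component exposes a score per label, and it enumerates every joint assignment to find the highest-scoring one that satisfies the constraint. The class name, the label set, the scores, and the "at most one A0" constraint are all hypothetical.

```java
import java.util.Arrays;
import java.util.function.Predicate;

final class ConstrainedInferenceSketch {
    /**
     * scores[i][l] is the score that independent classifier i gives to label l.
     * Returns the joint label assignment with the highest total score among
     * those accepted by the constraint, or null if none is accepted.
     */
    static int[] infer(double[][] scores, Predicate<int[]> constraint) {
        int n = scores.length;
        int[] current = new int[n];          // starts at the all-zero assignment
        int[] best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        while (true) {
            if (constraint.test(current)) {
                double total = 0.0;
                for (int i = 0; i < n; i++) {
                    total += scores[i][current[i]];
                }
                if (total > bestScore) {
                    bestScore = total;
                    best = current.clone();
                }
            }
            // Advance to the next assignment, odometer style.
            int pos = n - 1;
            while (pos >= 0 && ++current[pos] == scores[pos].length) {
                current[pos] = 0;
                pos--;
            }
            if (pos < 0) {
                break;                       // every assignment has been visited
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical labels: 0 = A0, 1 = A1, 2 = NONE.
        double[][] scores = {
            {0.6, 0.3, 0.1},                 // first candidate argument
            {0.5, 0.4, 0.1},                 // second candidate argument
        };
        // Hypothetical constraint: at most one argument may be labeled A0.
        Predicate<int[]> atMostOneA0 =
            a -> Arrays.stream(a).filter(l -> l == 0).count() <= 1;
        // Independent argmax would pick A0 twice; inference repairs that.
        System.out.println(Arrays.toString(infer(scores, atMostOneA0))); // prints [0, 1]
    }
}
```

Exhaustive enumeration is exponential in the number of components, so it only serves to make the idea concrete; a real system would use a dedicated inference algorithm.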
Learning Based Java (LBJ) is a modeling language that expedites the
development of learning based programs, designed for use with the
Java™ programming language. The LBJ compiler accepts
the programmer's classifier and constraint specifications as input,
automatically generating efficient Java code and applying learning algorithms
(i.e., performing training) as necessary to implement the classifiers' entire
computation from raw data (e.g., text or images) to output decision (e.g.,
a part of speech tag or the type of a recognized object). The details of feature
extraction, model evaluation (i.e., evaluating the function that the learning
algorithm returned), and inference (i.e., reconciling the predictions in terms
of the constraints at runtime) are abstracted away from the programmer.
A classifier may be defined by:
Under the LBJ programming philosophy, the designer of a learning based program first designs an object-oriented internal representation (IR) of the application's raw data using pure Java. A classifier is then any method that produces one or more discrete or real-valued classifications with respect to a single object from the programmer's IR. Using LBJ, these classifications are easily interpretable either at face value, as the application requires, or as features amenable for input to a learning algorithm. Learning algorithms are employed to create learning classifiers, which are classifiers that can change their representation with experience. Once the LBJ compiler has generated these representations from their specifications and user-supplied training objects, the application, written in pure Java, simply invokes any classifier on an IR object just like any other method. Programming with LBJ, the practitioner reasons in terms of his data directly, disregarding the cumbersome implementation details of feature extraction and learning.
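To make this philosophy concrete, here is a minimal plain-Java sketch. It does not use LBJ syntax or the LBJ library; every name in it (`Email`, `Classifier`, `BagOfWords`, `SpamPerceptron`) is hypothetical, and the perceptron is merely a stand-in for whatever learning algorithm is chosen. It shows an IR class, a hand-coded classifier whose outputs serve as features, and a learning classifier that the application invokes like any other method.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClassifierSketch {
    /** Hypothetical IR class wrapping the raw data. */
    static final class Email {
        final String body;
        Email(String body) { this.body = body; }
    }

    /** A classifier yields one or more discrete classifications for one IR object. */
    interface Classifier<T> {
        List<String> classify(T example);
    }

    /** Hand-coded classifier: one classification per lower-cased word. */
    static final class BagOfWords implements Classifier<Email> {
        public List<String> classify(Email e) {
            List<String> words = new ArrayList<>();
            for (String w : e.body.toLowerCase().split("\\W+")) {
                if (!w.isEmpty()) words.add(w);
            }
            return words;
        }
    }

    /** Learning classifier: a binary perceptron over another classifier's outputs. */
    static final class SpamPerceptron implements Classifier<Email> {
        private final Classifier<Email> features;
        private final Map<String, Double> weights = new HashMap<>();

        SpamPerceptron(Classifier<Email> features) { this.features = features; }

        private double score(Email e) {
            double s = 0.0;
            for (String f : features.classify(e)) s += weights.getOrDefault(f, 0.0);
            return s;
        }

        /** Mistake-driven update: adjust weights only when the prediction is wrong. */
        void learn(Email e, boolean isSpam) {
            if ((score(e) > 0) == isSpam) return;
            double delta = isSpam ? 1.0 : -1.0;
            for (String f : features.classify(e)) weights.merge(f, delta, Double::sum);
        }

        public List<String> classify(Email e) {
            return Collections.singletonList(score(e) > 0 ? "spam" : "not spam");
        }
    }

    /** Trains on a tiny, made-up data set. */
    static SpamPerceptron train() {
        SpamPerceptron spam = new SpamPerceptron(new BagOfWords());
        String[] texts = {"win cash now", "meeting at noon", "cash prize win", "lunch meeting tomorrow"};
        boolean[] labels = {true, false, true, false};
        for (int epoch = 0; epoch < 5; epoch++) {
            for (int i = 0; i < texts.length; i++) spam.learn(new Email(texts[i]), labels[i]);
        }
        return spam;
    }

    public static void main(String[] args) {
        SpamPerceptron spam = train();
        // The application calls the learned classifier like any other method.
        System.out.println(spam.classify(new Email("win a cash prize")));
    }
}
```

In this sketch, both the feature extractor and the learned function implement the same interface, which is what lets one classifier's output feed another as features.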
LBJ is supported by a library of interfaces and classes that implement standardized functionality for features and classifiers. The library includes learning and inference algorithm implementations, general-purpose and domain-specific internal representations, and domain-specific parsers.
The LBJ compiler also operates similarly to a makefile. When changes are made
to one or more supporting classifiers, the compiler only re-trains those
learned classifiers that were affected by the changes.
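The makefile-like behavior can be pictured as a walk over a dependency graph. The sketch below is an assumption about how such a plan could be computed, not a description of the LBJ compiler's internals; the classifier names and the `toRetrain` method are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class RetrainPlannerSketch {
    /**
     * dependsOn maps each classifier to the classifiers it uses directly.
     * Returns the learned classifiers that were changed themselves or that
     * depend, directly or transitively, on a changed classifier.
     */
    static Set<String> toRetrain(Map<String, List<String>> dependsOn,
                                 Set<String> learned,
                                 Set<String> changed) {
        // Invert the edges: for each classifier, who uses it?
        Map<String, List<String>> usedBy = new HashMap<>();
        for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
            for (String dep : e.getValue()) {
                usedBy.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Breadth-first search from the changed classifiers along "used by" edges.
        Set<String> affected = new HashSet<>(changed);
        Deque<String> queue = new ArrayDeque<>(changed);
        while (!queue.isEmpty()) {
            String name = queue.poll();
            for (String user : usedBy.getOrDefault(name, Collections.emptyList())) {
                if (affected.add(user)) queue.add(user);
            }
        }
        // Only learned classifiers need training; hand-coded ones are just recompiled.
        Set<String> result = new TreeSet<>(affected);
        result.retainAll(learned);
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new HashMap<>();
        dependsOn.put("posTagger", Arrays.asList("wordFeatures"));
        dependsOn.put("chunker", Arrays.asList("posTagger", "wordFeatures"));
        dependsOn.put("nerTagger", Arrays.asList("capitalization"));
        Set<String> learned = new HashSet<>(Arrays.asList("posTagger", "chunker", "nerTagger"));
        Set<String> changed = Collections.singleton("wordFeatures");
        // nerTagger does not depend on wordFeatures, so it is left alone.
        System.out.println(toRetrain(dependsOn, learned, changed)); // prints [chunker, posTagger]
    }
}
```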
LBJ Part of Speech Tagger
An implementation of our SNoW-based POS tagger for use with LBJ.

LBJ Chunker
A classifier that partitions plain text into sequences of semantically related words, indicating a shallow (i.e., non-hierarchical) phrase structure.

LBJ Coreference
A coreference resolver, based on LBJ, trained on the ACE corpus.

LBJ Named Entity Tagger
A state-of-the-art NE tagger that tags plain text with named entities (people / organizations / locations / miscellaneous). It uses gazetteers extracted from Wikipedia, a word class model derived from unlabeled text, and expressive non-local features. Its best performance is 90.8 F1 on the CoNLL03 shared task data.
The LBJ Chunker is written in 33 lines of LBJ:
In the previous example, we saw how easy it was to import and use the
learned POSTagger classifier for feature extraction. (It was learned
using an LBJ source file very similar to the chunker's above.) In this
example, we'll see how to import it into a Java application.