Semisupervised Learning with Prior Knowledge

Overview:

We are interested in weakly-supervised learning of statistical models using high-level knowledge and constraints. We are particularly interested in situations where making decisions depends on reasoning over the outcomes of several different classifiers. Furthemore, we assume the availability of prior knowledge that is encoded in two froms:

  1. A set of base classifiers with low recall but high precision.
  2. A set of high-level constraints that specify the interactions between the classifiers.

For example, consider the task of extracting the author, the title, the year of publication and the journal of a scientific paper citations.

The high-confidence low-recall classifiers can be:

  • A capital letter followed by a dot is part of a name.
  • A four digit letter starting with 19xx or 200x is a year of the publication.
  • The word 'proceedings' is always part of the journal description.
  • A dot that does not follow a capital letter indicates transition between fields.
  • A question mark can appear only in the title of the paper.

The constraints could be:

  • Each field appears at most once.
  • The author and title fields have to appear
  • The author field cannot be more than 10 words long

The high-precision classifiers will allow only a partial labeing of the data. Our goal is to provide a model that will label all the data while being consistent with the base classifiers and the constraints.