Machine Learning and Natural Language

Spring 2009

Experimental Assignment I. Language Models and Prediction: Verb Prediction (Due 02/25/09)

General

Verb Prediction

The goal of this problem set is to write a program that will predict the most likely verb in a given context. Specifically, given text from which all the verbs have been omitted, your program is expected to determine which verb should occur in the context of this sentence.
You will investigate two general approaches to this problem: a language-model-based approach and a classification-based approach. In each case, there are a few required experiments to run, and a range of other experiments you may want to perform if you are interested in studying some question or improving your results.

Attached is a paper that studied this problem. Your task is not identical to the one presented in the paper, but the paper will give you more information about the difficulties involved and about several approaches that have been tried on this problem.
 

Y. Even-Zohar and D. Roth
A Classification Approach to Word Prediction,
NAACL'00, The North American meeting of the Association for Computational Linguistics (2000)

Two other papers that might be useful are:
 
A. R. Golding and D. Roth
A Winnow-Based Approach to Context-Sensitive Spelling Correction,
Machine Learning, 34, 107-130 (1999)

which studies a related problem, and
Ido Dagan, Lillian Lee, and Fernando Pereira.
Similarity-Based Models of Word Cooccurrence Probabilities. Machine Learning 34(1-3), special issue on natural language learning

which studies a language model approach to a related problem.

The Assignment

The assignment has two parts: you will build a language-model-based predictor and a classification-based predictor. In each case, your work is to (1) pre-process the text and (2) design an approach and evaluate it on the text.

You will be given a training corpus, which is simply a collection of sentences. One or several words (verbs) in each sentence will be designated as target words (the words we are interested in). For example:

                    She << dropped >> the ball.

(In fact, the corpus does not have << >> around the target words; the format will be different, and additional information such as predicted POS tags and chunking information will be provided.) The sentences in the training corpus are meant to help you develop some characterization of the contexts in which each verb may occur.
Specifically, after training your program (either the language model or the classifier), you will write a program that predicts which verb (out of all possible verbs in English) may occur in the "hole" left by dropping a verb from a sentence:

                    She <<  >> the ball ?

We hope that your program will predict drop or kick but not read.
Your program will be evaluated with respect to a given test corpus, supplied by the Preparation and Evaluation team.

The format of the test corpus will be identical to that of the training corpus, including the target word between the <<>> marks, but your program is supposed to disregard this word; use it only to evaluate the performance of your classifier. (In fact, when you use the SNoW learning program in the second part, this evaluation is done automatically, although you may choose to use several other evaluation metrics.) Below I outline some of the minimal requirements your program needs to satisfy and some of the key questions I would like you to address.
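As a rough illustration of this evaluation protocol, the following Python sketch hides the marked target verb, asks a predictor for a guess based on the context alone, and reports accuracy. It assumes the illustrative << >> notation used in the examples on this page, not the actual corpus format (which differs). The names split_target, accuracy, and always_dropped are hypothetical and introduced only for this example.

```python
import re

# Assumption: targets are marked as "<< verb >>", as in the examples on this
# page. The real corpus uses a different format, so the parsing must be adapted.
TARGET = re.compile(r"<<\s*(\S+)\s*>>")


def split_target(sentence):
    """Return (left context tokens, target verb, right context tokens)."""
    match = TARGET.search(sentence)
    if match is None:
        raise ValueError("no marked target in: " + sentence)
    left = sentence[:match.start()].split()
    right = sentence[match.end():].split()
    return left, match.group(1), right


def accuracy(predict, sentences):
    """Fraction of sentences whose hidden target is predicted exactly.

    `predict` receives only the left and right context, never the target.
    """
    correct = 0
    for sentence in sentences:
        left, gold, right = split_target(sentence)
        if predict(left, right) == gold:
            correct += 1
    return correct / len(sentences)


def always_dropped(left, right):
    """Toy baseline that ignores the context entirely."""
    return "dropped"


test_sentences = ["She << dropped >> the ball.", "He << read >> the book."]
print(accuracy(always_dropped, test_sentences))
```

Exact-match accuracy is only one possible metric; a ranked list of candidates (for example, whether the gold verb is among the top few guesses) may be more informative when several verbs fit the same context.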

Part I: Language Model

As a minimum, your prediction at this stage will be based on:

Notice that there are several decisions you need to make before you even start to construct your predictor. There are also several crucial computational decisions you need to make (e.g., how to smooth).
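To make the smoothing decision concrete, here is a minimal Python sketch of one possible design: a bigram model with add-k smoothing that scores each candidate verb v in the context "prev _ next" by P(v | prev) * P(next | v). This is an assumption chosen for illustration, not the required model. The class name, the toy data, and the value k = 0.1 are all hypothetical.

```python
from collections import Counter

# Toy training data: each item is (tokens, index of the target verb).
# Assumption: sentences are already tokenized and the verb position is known.
TRAIN = [
    (["she", "dropped", "the", "ball"], 1),
    (["he", "kicked", "the", "ball"], 1),
    (["she", "read", "a", "book"], 1),
    (["he", "dropped", "the", "cup"], 1),
]


class BigramVerbPredictor:
    """Scores a verb v in 'prev _ next' by P(v | prev) * P(next | v)."""

    def __init__(self, k=0.1):
        self.k = k  # add-k smoothing constant (hypothetical default)
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.verbs = set()
        self.vocab_size = 0

    def train(self, data):
        for tokens, verb_index in data:
            padded = ["<s>"] + tokens + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
            self.verbs.add(tokens[verb_index])
        self.vocab_size = len(self.unigrams)

    def prob(self, word, prev):
        """Add-k smoothed estimate of P(word | prev)."""
        numerator = self.bigrams[(prev, word)] + self.k
        denominator = self.unigrams[prev] + self.k * self.vocab_size
        return numerator / denominator

    def predict(self, prev, nxt):
        """Return the candidate verb with the highest smoothed score."""
        return max(
            sorted(self.verbs),  # sorted for a deterministic tie-break
            key=lambda v: self.prob(v, prev) * self.prob(nxt, v),
        )


model = BigramVerbPredictor()
model.train(TRAIN)
print(model.predict("she", "the"))
```

Without the constant k, any verb never seen after "she" would receive probability zero; other smoothing schemes (backoff, interpolation) are equally legitimate choices and may behave quite differently on sparse data.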

Report on the Language Model Predictor

  1. Describe what you did, the specifics of your models, and the rationale behind your decisions.
  2. Provide the code for your preprocessing and for your model estimation and evaluation.
  3. Present the output of your program on the training corpus and the test corpus.
  4. Package the code so that it can also be run on a different corpus (details below).
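One way to satisfy the packaging requirement in item 4 is to take the corpus locations from the command line rather than hard-coding them. The Python sketch below shows the idea. The flag names (--train, --test, --smoothing) and the file names are hypothetical, and the body of main is only a placeholder.

```python
import argparse


def build_parser():
    """Command-line interface; all flag names are hypothetical."""
    parser = argparse.ArgumentParser(description="Verb predictor")
    parser.add_argument("--train", required=True,
                        help="path to the training corpus")
    parser.add_argument("--test", required=True,
                        help="path to the test corpus")
    parser.add_argument("--smoothing", type=float, default=0.1,
                        help="smoothing constant passed to the model")
    return parser


def main(argv=None):
    """Parse arguments; with argv=None, argparse reads sys.argv instead."""
    args = build_parser().parse_args(argv)
    # Placeholder: load args.train, estimate the model, evaluate on args.test.
    print("training on", args.train, "and evaluating on", args.test)
    return args


# Example invocation with explicit arguments. A real script would call
# main() with no arguments so that the command line is used.
args = main(["--train", "train.txt", "--test", "test.txt"])
```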



Part II: A Classification-Based Predictor

In this part of the problem set you will build on your experience in the first stage, and build a verb predictor that is based on a learning program.

For the learning part of this assignment you will use the SNoW learning program (follow the software path). The user guide can also be downloaded from that webpage.

I suggest that you use the FEX program (follow the software path) to generate the input to SNoW, based on your selection of features. Please consult the user guide.
FEX gives you the flexibility to choose what types of features you think are important (beyond the information sources readily available to the program.)
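To give a feel for what a feature selection might look like, the following Python sketch collects word and POS-tag features in a small window around the target verb. It is not FEX and does not use FEX's script syntax (see the user guide for that). The function name, the window size, and feature names such as "w-1=she" are hypothetical conventions chosen for this example.

```python
def context_features(tokens, pos_tags, target_index, window=2):
    """Collect word and POS features in a window around the target verb.

    The target word itself is never used as a feature. Feature names such
    as "w-1=she" are a hypothetical convention chosen for this sketch.
    """
    features = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the hidden target position
        position = target_index + offset
        if 0 <= position < len(tokens):
            features.append("w%+d=%s" % (offset, tokens[position].lower()))
            features.append("t%+d=%s" % (offset, pos_tags[position]))
    return features


tokens = ["She", "dropped", "the", "ball"]
pos_tags = ["PRP", "VBD", "DT", "NN"]
features = context_features(tokens, pos_tags, target_index=1)
print(features)
```

Each example then becomes a sparse set of active features, which is the kind of representation a linear learner over Boolean features expects. Richer features (conjunctions of neighbors, chunk labels) are natural things to experiment with.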

You can also read slides from a FEX/SNoW tutorial available here.
Once you know how to do that, you will run a few experiments with the data, using SNoW. As a minimum, your prediction at this stage will be based on:

On using the learning program:

I will not describe training and testing with SNoW here. Please consult the user guide.

Note that the learning program learns a specific architecture of linear separators. You get some freedom in choosing the architecture, the freedom to choose the update rule for the linear separator (Winnow, Perceptron or naive Bayes), and a few other parameters. For Winnow and Perceptron you can choose to cycle through the training data a few times.
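For intuition about the update rules mentioned above, here is a minimal Python sketch of the textbook Winnow rule for a single binary linear separator over sparse Boolean features. It is not SNoW's implementation or interface. The parameter values (alpha, beta, theta, the initial weight) and the toy examples are hypothetical. Perceptron differs only in that it adds or subtracts a constant on a mistake instead of multiplying.

```python
class Winnow:
    """Textbook Winnow over sparse Boolean features (one binary separator)."""

    def __init__(self, alpha=2.0, beta=0.5, theta=1.0, initial_weight=0.5):
        # All parameter defaults are hypothetical choices for this sketch.
        self.alpha = alpha  # promotion factor, greater than 1
        self.beta = beta    # demotion factor, between 0 and 1
        self.theta = theta  # decision threshold
        self.initial_weight = initial_weight
        self.weights = {}   # weights of features seen in an update so far

    def score(self, active):
        return sum(self.weights.get(f, self.initial_weight) for f in active)

    def predict(self, active):
        return self.score(active) >= self.theta

    def update(self, active, label):
        """Mistake-driven multiplicative update on the active features only."""
        if self.predict(active) == label:
            return  # no mistake, no change
        factor = self.alpha if label else self.beta
        for f in active:
            self.weights[f] = self.weights.get(f, self.initial_weight) * factor

    def train(self, examples, epochs=3):
        """Cycle through the training data a few times."""
        for _ in range(epochs):
            for active, label in examples:
                self.update(active, label)


# Toy one-vs-rest problem for the verb "drop": True means the target is "drop".
examples = [
    (["w-1=she", "w+2=ball"], True),
    (["w-1=she", "w+2=book"], False),
]
learner = Winnow()
learner.train(examples, epochs=3)
print(learner.predict(["w-1=she", "w+2=ball"]))
print(learner.predict(["w-1=she", "w+2=book"]))
```

On this toy data the shared feature "w-1=she" ends up with a moderate weight, while "w+2=ball" is promoted and "w+2=book" is demoted, which is enough to separate the two examples. For verb prediction one such separator would be learned per candidate verb, and the highest-scoring one would be chosen.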

The program allows you to rank features and discard some of them, or to keep the system from seeing them at all (using the eligibility flag). You may choose not to use this option at all (I suggest that you start, at least, without it). You can also combine linear separators to form a more expressive decision surface (forming a cloud), but we will not use this option for this assignment.

Some decisions you will make on how to use the program may make a huge difference in the performance you will see.
Read the documentation that comes with the program.

Report on the Classification-Based Predictor

  1. Describe what you did, the specifics of your models, and the rationale behind your decisions.
  2. Provide the FEX scripts and code for your preprocessing (if relevant), and any other code if you choose to use other classifiers.
  3. Present the output of your program on the training corpus and the test corpus.
  4. Package the code so that it can also be run on a different corpus (details below).

Detailed Description, data

Note about the data: the data is provided to you only for use in this assignment. You should not redistribute it or make it accessible to others (e.g., for download from your web page). The same rule applies to other data distributed in this class (e.g., for the next assignments).



Grading

Your grade depends on:
  1. The quality of your report.
  2. The accuracy of your predictors.
  3. Your originality in going beyond the minimal requirements.
The work should be divided as equally as possible between group members. Each member of each group will need to send me a couple of lines by email explaining their part in the project.

Due date

Wednesday, Feb 25.