Machine Learning and Natural Language

Fall 2000

Problem Set I                   Disambiguation I: Context sensitive text correction (Due 9/28/00)

General

Context Sensitive Text Correction

The goal of this problem set is to write a program that will identify and correct context sensitive mistakes in text. Specifically, your program will deal with two confusion sets:
[accept, except] and [then, than].

Given a sentence in which one of [accept, except] occurs ([then, than], resp.) your program will determine which of the two should occur in the context of this sentence.
The first part of the assignment is devoted to preprocessing of the corpus and the manual development of a program that solves this problem. In the second part, only after you finish and submit the first part, you will develop a learning program for this task, based on the preprocessing done earlier.

Attached is a paper that studied this problem. It gives more information about the difficulties of the task and several approaches that have been tried.
 

A. R. Golding and D. Roth
A Winnow based Approach to Context Sensitive Spelling Correction,
Machine Learning, 34, 107-130 (1999)

The Assignment

The assignment has two parts: you will build a manual classifier and learn a classifier. In each case, your work is to (1) pre-process the text and (2) design a classifier and evaluate it on the text.

You are given a training corpus, which is simply a collection of sentences, each containing one of the target words (the words we are interested in). For example:

                    We did not << accept >> the diagnosis at once , but gradually we are coming to .

(Actually the corpus will not have the <<  >> around the target words, assuming that you know what you are after.) The sentences in the training corpus are supposed to help you to develop some characterization of the contexts in which [accept, except] might occur.
Specifically, you will write a program that predicts which of the two (and the same for [then, than]) should
occur. Only at a later time, you will be given a test corpus in which the target word is missing, as in:

                    Why won't you <<  >> the facts ?

and you will use your program to determine which of [accept, except] should occur in this context.

The format of the test corpus will, in fact, be identical to that of the training corpus, including the
target word in between the << >> marks, but you are supposed to disregard this word, and use it
only so that you can evaluate the performance of your classifier. (In fact, this is done automatically in the second part, when you use the SNoW learning program; in the first part, you will have to do it yourself.)

Your classifier will take as input a text corpus (text corpus 1 or text corpus 2) and will output a few statistics that indicate how well it is doing. Below I provide some more details on the two parts of the assignment.

Part I: Manual Classifier

In the first part of the assignment, the manual part, you are the classifier. In the second part a learning program will generate the classifier.

1. Pre-Processing

In order to build your classifier you will need to consult a text corpus that will be given as input (the training data).
In order to facilitate the generation of the classifier you will first pre-process the data and transform
each occurrence of a target word into an example. This is a formatted data element that contains all the information from the sentence that (you think) is required as input to the classifier. The example is a list of fields which we will call features. You will determine yourself which features you want to represent when you convert the input sentence to an example. As a minimum, you will extract the following features from each sentence:
    -1w          The word just before the target word.
    +1w         The word just after the target word.
    -2w          The 2nd word before the target word.
    +2w         The 2nd word after the target word.
    -1P          The word just before the target word has property P.
    +1P          The word just after the target word has property P.
    iwP          The ith word (i=+/-1,+/-2,...) has property P.
You can define any number of properties P (e.g., word is capitalized, word is part of a given list, etc.) and use them to generate features of the last three types. To help you, the text corpus will be given to you in two formats (you can choose which one to use). The first format (text corpus 1, text corpus 2) simply contains the sentences, as in:
 
               We did not << accept >> the diagnosis at once , but gradually we are coming to .

The second format (text-pos corpus 1, text-pos corpus 2) will contain, in addition, for each word, the part of speech tag of the word in the context of the sentence, as in:

             (PRP We) (VBD did) (RB not) (VB accept) (DT the) (NN diagnosis) (IN at) (RB once) (, ,) (CC but) (RB gradually) (PRP we) (VBP are) (VBG coming) (TO to) (. .)

(In "real life", if you believe this information is required, you will have to generate it somehow; in this case, I've used a part of speech tagger and saved you some of the preprocessing.)
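As a rough illustration of reading the second format, here is a minimal Python sketch. It assumes every token appears as "(TAG word)" with a single space inside the parentheses, as in the sample line above; the function name parse_pos_line is only illustrative, and the real corpus may contain cases this pattern does not cover.

```python
import re

# Assumption: each token is written as "(TAG word)" with exactly one space
# between the tag and the word, as in the sample line of the handout.
TOKEN_RE = re.compile(r"\((\S+) (\S+)\)")

def parse_pos_line(line):
    """Return a list of (word, tag) pairs for one line of the pos format."""
    return [(word, tag) for tag, word in TOKEN_RE.findall(line)]

line = "(PRP We) (VBD did) (RB not) (VB accept) (DT the) (NN diagnosis) (, ,) (. .)"
pairs = parse_pos_line(line)
words = [w for w, _ in pairs]
tags = [t for _, t in pairs]
```

Keeping words and tags as two parallel lists makes it easy to look up both the word and the tag at any offset from the target.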

Generate examples:
Given your chosen set of features, generate an example for each occurrence of a target word.
Present these as a table in which each target occupies a row and the columns correspond to the feature
values.
     
    Example: The sentence given above will generate one example, since it contains one target word.
    If we choose the features -1w, +1w, -2w, +2w, -1wTag, +1wTag we will get:

    Label    -1w    +1w    -2w    +2w          -1wTag    +1wTag

    accept   not    the    did    diagnosis    RB        DT
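The example row above could be produced by something like the following Python sketch. It assumes the sentence has already been split into parallel lists of words and tags; the helper names (at, extract_examples) and the "<none>" padding value for positions outside the sentence are illustrative choices, not part of the assignment.

```python
def at(seq, j, pad="<none>"):
    """Return seq[j], or a padding value when j falls outside the sentence."""
    return seq[j] if 0 <= j < len(seq) else pad

def extract_examples(words, tags, targets):
    """Build one feature dictionary per occurrence of a target word."""
    examples = []
    for i, word in enumerate(words):
        if word.lower() not in targets:
            continue
        examples.append({
            "Label": word.lower(),
            "-1w": at(words, i - 1), "+1w": at(words, i + 1),
            "-2w": at(words, i - 2), "+2w": at(words, i + 2),
            "-1wTag": at(tags, i - 1), "+1wTag": at(tags, i + 1),
        })
    return examples

words = "We did not accept the diagnosis at once , but gradually we are coming to .".split()
tags = "PRP VBD RB VB DT NN IN RB , CC RB PRP VBP VBG TO .".split()
examples = extract_examples(words, tags, {"accept", "except"})
```

Because the loop visits every position, a sentence with several target occurrences yields several examples, one per occurrence.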
     
     

    Notice that there are still several decisions you need to make before you even start to construct the classifier.
    You need to decide on your feature set (that is, what type of information is required in order to make a decision) and, in particular, what "properties" you want to use. Notice that deciding on the type of features still does not determine the features. You will need to write a program that extracts features of this type from the sentence. You will also have to handle multiple occurrences of targets in a sentence, problems in the corpus, etc.

    2. Building a classifier

    Given the output of the preprocessing, you now need to write a program that decides, for each occurrence of  a target word (i.e., each example) which of the target words should occur.
    To simplify the evaluation, your program will receive as input two files: a file of sentences and a file of sentences with pos tags (it can choose to disregard one of them).
    Package your code so that one can supply a different corpus file on the command line, instead of the one
    you used for training. For example, a command line may look like:
    spell -c confusion_set -x text_file -t pos_file
    where: "spell" is your program
               "confusion_set" is an indication of the confusion set you are processing (1 for [accept, except] and 2 for [then, than]);
               "text_file" is an input file with sentences in the text format, and
               "pos_file" is an input file with sentences in the pos format.
     
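If you happen to write your program in Python, the command line above could be handled roughly as follows. This is only a sketch: the flags -c, -x and -t come from the handout, while the file names train.txt and train.pos and the CONFUSION_SETS table are placeholders invented for the illustration.

```python
import argparse

# Mapping from the numeric code given with -c to the confusion set it denotes.
CONFUSION_SETS = {1: ("accept", "except"), 2: ("then", "than")}

def build_parser():
    """Command-line parser for: spell -c confusion_set -x text_file -t pos_file"""
    parser = argparse.ArgumentParser(prog="spell")
    parser.add_argument("-c", dest="confusion_set", type=int, choices=[1, 2],
                        required=True, help="1 = accept/except, 2 = then/than")
    parser.add_argument("-x", dest="text_file", help="sentences in the text format")
    parser.add_argument("-t", dest="pos_file", help="sentences in the pos format")
    return parser

# Hypothetical invocation; the file names are placeholders.
args = build_parser().parse_args(["-c", "1", "-x", "train.txt", "-t", "train.pos"])
targets = CONFUSION_SETS[args.confusion_set]
```

Both file arguments are left optional here so that a program that disregards one of the two formats can still be run.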

    In order not to influence your design of the classifier I will supply you the test data only later. Meanwhile, as you develop it, you can split your training data file and use some of the sentences there as your test data.
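One simple way to set aside part of the training file as temporary test data is sketched below in Python. The 20% held-out fraction and the fixed random seed are arbitrary assumptions made for the illustration; any split that keeps the two parts disjoint would do.

```python
import random

def split_held_out(sentences, held_out_fraction=0.2, seed=0):
    """Shuffle the sentences reproducibly and return (development, held_out)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_held = int(len(shuffled) * held_out_fraction)
    return shuffled[n_held:], shuffled[:n_held]

sentences = ["sentence %d" % i for i in range(10)]
development, held_out = split_held_out(sentences)
```

Fixing the seed keeps the split identical from run to run, so that changes in accuracy reflect changes in the classifier and not in the data.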

    Your program is supposed to report its accuracy on the data. In each case, the program will print:

    # of times each word in the confusion set occurred in the file (A,B).
    # of times each of these words was identified correctly (M,N).
    Total accuracy:  (M+N)/(A+B)
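The three statistics can be computed from the list of correct labels and the list of predictions, as in this Python sketch. The function name report and the exact wording of the printed lines are illustrative assumptions; only the quantities A, B, M, N and the accuracy formula come from the handout.

```python
from collections import Counter

def report(gold, predicted, confusion_set):
    """Print and return A, B (occurrences), M, N (correct), and total accuracy."""
    first, second = confusion_set
    occurred = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    a, b = occurred[first], occurred[second]
    m, n = correct[first], correct[second]
    accuracy = (m + n) / (a + b) if a + b else 0.0
    print("occurrences: %s=%d %s=%d" % (first, a, second, b))
    print("correct:     %s=%d %s=%d" % (first, m, second, n))
    print("accuracy:    %.3f" % accuracy)
    return a, b, m, n, accuracy

gold = ["accept", "accept", "except", "accept"]
predicted = ["accept", "except", "except", "accept"]
stats = report(gold, predicted, ("accept", "except"))
```

Here three of the four predictions agree with the correct labels, so the total accuracy is (2+1)/(3+1).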
     

    Report on the Manual Classifier:

    1. Describe what you did and the choice of your features.
    2. Describe your classifier.
    3. Provide the code for your preprocessing and for your classifier.
    4. Present the output of your program on the training corpus.
    5. Package the code so that one can run it on Solaris with a different corpus, as above.



    Part II: Learning a Classifier

    In this part of the problem set you will build on your experience in processing a text corpus and your familiarity with the problem of context sensitive text correction to design a learning program that corrects these kinds of mistakes. Specifically, you will write a program that receives as input a confusion set along with corresponding training data, and learns how to correct context sensitive mistakes that might occur between the elements of the confusion set.

    For the learning part of this assignment you will use the SNoW learning program (follow the software path).
    Optionally, you can also develop a 1-DL learning program and compare it against learning with SNoW (1-DLs will be covered in the class of 9/19).

    You will build on your experience in the first part of the assignment in generating the input for SNoW. However, instead of doing it yourself, you can use the FEX program to generate the input to SNoW, based on your selection of features.
    FEX gives you the flexibility to choose what types of features you think are important (this goes beyond the information sources available to the program; you can also define functions - such as conjunctions - over the information sources and have them be the features.)
    Once you have done that, you will run a few experiments with the data, using SNoW. Here is a brief description of the learning program. See details in the user manual. Also, the paper

      M. Munoz, V. Punyakanok, D. Roth, D. Zimak
    A Learning Approach to Shallow Parsing
    In Submission, 1999

    provides more detailed information on SNoW.

    General description of the learning program:

    During training, SNoW learns a network which consists of two sub-networks -- one representing the target accept and the other representing the target except.
    When learning the first sub-network, accept-labeled examples are treated as positive examples and except-labeled examples as negative. The except sub-network, on the other hand, is learned using except-labeled examples as positive and accept-labeled examples as negative.

    The input in the testing stage consists of examples (having the same format as in Training) and the networks generated during the training stage. For each example, the program first disregards the label, makes a prediction by evaluating both sub-networks and comparing their outputs (choosing the higher value) and then reports the number of agreements with the correct labels.
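To make the two sub-network picture concrete, here is a toy Python sketch of one linear separator per target, trained with a Winnow-style multiplicative update and compared at prediction time. It is not SNoW's implementation: the class name WinnowNode, the parameter values (promotion 2.0, demotion 0.5, threshold 2.0, initial weight 1.0), the feature strings, and the choice to ignore unseen features at test time are all assumptions made for the illustration.

```python
class WinnowNode:
    """One target node: a sparse linear separator with multiplicative updates."""

    def __init__(self, alpha=2.0, beta=0.5, theta=2.0, init_weight=1.0):
        self.alpha, self.beta = alpha, beta      # promotion / demotion factors
        self.theta, self.init_weight = theta, init_weight
        self.weights = {}

    def score(self, features):
        # Features never seen in training contribute nothing.
        return sum(self.weights.get(f, 0.0) for f in features)

    def update(self, features, is_positive):
        for f in features:
            self.weights.setdefault(f, self.init_weight)
        predicted_positive = self.score(features) >= self.theta
        if is_positive and not predicted_positive:
            factor = self.alpha                  # mistake on a positive: promote
        elif predicted_positive and not is_positive:
            factor = self.beta                   # mistake on a negative: demote
        else:
            return                               # no mistake, no update
        for f in features:
            self.weights[f] *= factor

def train(examples, targets, epochs=2):
    nodes = {t: WinnowNode() for t in targets}
    for _ in range(epochs):
        for label, features in examples:
            for target, node in nodes.items():
                node.update(features, is_positive=(label == target))
    return nodes

def predict(nodes, features):
    """Evaluate every sub-network and choose the target with the higher output."""
    return max(nodes, key=lambda t: nodes[t].score(features))

examples = [
    ("accept", ["-1w=not", "+1w=the"]),
    ("except", ["-1w=,", "+1w=for"]),
    ("accept", ["-1w=to", "+1w=the"]),
    ("except", ["-1w=all", "+1w=for"]),
]
nodes = train(examples, ("accept", "except"))
agreements = sum(predict(nodes, feats) == label for label, feats in examples)
```

Each example updates both nodes, once as a positive example for its own target and once as a negative example for the other, which mirrors the description above.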

    The learning program learns a specific architecture of linear separators. You get some freedom in choosing the architecture, the freedom to choose the update rule for the linear separator (Winnow, Perceptron or naive Bayes), and a few other parameters. For Winnow and Perceptron you can choose to cycle through the training data a few times.

    The program allows you to rank features and discard some of them, or to keep the system from seeing them at all (using the eligibility flag). You may choose not to play with this option at all (I suggest that you start, at least, without using it). You can also combine linear separators to form a more expressive decision surface (forming a cloud) but we will not use this option for this assignment.
    Read the documentation that comes with the program. Comments are welcome. Send mail either to Andy Carlson

    The Assignment

    Given a training corpus you will use FEX to generate a lexicon of features and a set of SNoW examples.

    The first deliverable is a FEX script and the corresponding lexicon file: a list of all features and their indices.
    As a matter of convention, the targets are also considered features and FEX will give them the indices 0 and 1. In your work, you will train your classifier on the given training data and test it on a separate set of examples that will be given to you later. It is important that this split is defined before you run FEX on the data. The lexicon is generated using the training data only; it will then be given as input to FEX when generating the examples from the test data. The reason is that it is possible that some of the features in the test data were never observed in training; this way, they will not be present in the lexicon and therefore will not be in the test examples. (Think about this issue!)
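The lexicon idea can be illustrated with a few lines of Python. This sketch does not reproduce FEX or its file formats; the function names build_lexicon and encode and the feature strings are made up for the illustration. It only shows the principle that indices are assigned from the training data and that test features missing from the lexicon are dropped.

```python
def build_lexicon(train_examples, targets):
    """Assign an index to every feature seen in training; targets get 0 and 1."""
    lexicon = {t: i for i, t in enumerate(targets)}
    for _, features in train_examples:
        for f in features:
            if f not in lexicon:
                lexicon[f] = len(lexicon)
    return lexicon

def encode(label, features, lexicon):
    """Translate an example to indices, silently dropping unknown features."""
    return [lexicon[label]] + sorted(lexicon[f] for f in features if f in lexicon)

train_examples = [
    ("accept", ["-1w=not", "+1w=the"]),
    ("except", ["-1w=,", "+1w=for"]),
]
lexicon = build_lexicon(train_examples, ("accept", "except"))

# "-1w=will" was never observed in training, so it does not appear in the output.
test_example = encode("accept", ["-1w=will", "+1w=the"], lexicon)
```

The test example keeps only the target index and the index of "+1w=the", since that is the only one of its features the lexicon knows.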

    The second deliverable consists of two example files (in the SNoW format), one generated from
    the training data and the other from the test data.

    Given a file of examples, you can run the learning program on it.
    Run it first in train mode on the training examples, then run it in test mode
    on the test examples. These will be your "official" results.

    Also, test it on the same file you train on, and see what you get then.

    As a minimum -- use all three learning algorithms with the default setting, train on the training data and test on the test data.
    Any other experiments that you can come up with and learn something from are of interest. For example, you may want to build a learning curve and record the performance as a function of the size of your training set.

    In some sense, this is a test of your feature set. As a minimum, present your results for two feature sets which are different in their expressivity. One will not include any conjunctions and the second will. Also, use the same set of features that you have used in your manual classifier.

    If you also choose to learn a 1-DL, please use the same set of features, so that we can make
    sense of the comparison.

    Finally, present the results of your manual classifier. Make sure that the
    comparison is fair and is made using the same features and the same number of examples.
     

    What to submit (Updated)

      1. Describe what you did, your feature sets, your experiments, and your conclusions.
        Your final document has to be a text file, a PostScript file, or a PDF (not a Word document). If it's a ps file, please make sure it's readable from unix!
      2. Your FEX scripts
      3. The lexicons (one for each confusion set)
      4. Two example files (corresponding to the Train and the Test files).
      5. SNoW scripts that are used to run your experiments (don't type things manually; use scripts, e.g., csh) as well as the corresponding parameter files.

      Please package all the deliverables in one tar file and uuencode it. Do it so that the file opens a separate directory in your name.

      The uufiles script will do it for you. Please use your name as the directory name and have a text readme file with a description of your files. Once you have the file, attach it to your mail message to me.
      Package your program so that one can run your classifier as follows:

      spell -c confusion_set -x text_file -t pos_file [other options]

      where:
      confusion_set is 1={`accept',`except'} or 2={`then',`than'}
      text_file is a file of examples,
      pos_file is the same examples, with part of speech tags given.

      Do not output all the examples by default (you can use one of the optional flags for that).

      Grading

      Your grade depends on:
      1. The quality of your report
      2. The accuracy of your learned classifier.
      The quality of the manual classifier will be less important, as long as it is not too trivial; in the first part it is more important to perform some of the preprocessing and think about the transformations you would like to perform.

      Due date

      Tuesday, September 28.
      Dan Roth