
Problem Set I. Disambiguation I: Context-sensitive text correction (Due 9/28/00)
Given a sentence in which one of [accept, except] occurs ([then,
than], resp.), your program will determine which of the two should
occur in the context of this sentence.
The first part of the assignment is devoted to preprocessing of the
corpus and the manual development of a program that solves this
problem. In the second part, only after you finish and submit the first
part, you will develop a learning program for this task, based on
the preprocessing done earlier.
Attached is a paper that studied this problem. It will give you more
information about the difficulties and the several approaches tried on
this problem.
You are given a training corpus, which is simply a collection of sentences, each containing one of the target words (the words we are interested in). For example:
We did not << accept >> the diagnosis at once , but gradually we are coming to .
(Actually the corpus will not have the << >> around the
target words, assuming that you know what you are after.) The
sentences in the training corpus are supposed to help you to develop
some characterization of the contexts in which [accept,
except] might occur.
Specifically, you will write a program that predicts which of the two
(and the same for [then, than]) should
occur.
Only at a later time, you will be given a test corpus in which the target
word is missing, as in:
Why won't you << >> the facts ?
and you will use your program to determine which of [accept, except] should occur in this context.
The format of the test corpus, in fact, will be identical to that
of the training corpus, including the target word in between the
<< >> marks, but you are supposed to disregard this word and use it
only so that you can evaluate the performance of your
classifier. (In fact, this is done automatically in the second part,
when you use the SNoW learning program; in the first part, you will
have to do it yourself.)
Your classifier will take as input a text corpus (text corpus 1 or text corpus 2) and will output a few statistics that indicate how well it is doing. Below I provide some more details on the two parts of the assignment.
The second format (text-pos corpus 1, text-pos corpus 2) will contain,
in addition, for each word, the part of speech tag of the word in the
context of the sentence.
(In "real life", if you believe this information is required, you will have to generate it somehow; in this case, I've used a part of speech tagger and saved you some of the preprocessing.)
Label    -1w   +1w   -2w   +2w         -1wTag   +1wTag
accept   not   the   did   diagnosis   RB       DT
Notice that there are still several decisions you need to make before
you even start to construct the classifier.
You need to decide on your feature set (that is, what type of information
is required in order to make a decision) and, in particular, what "properties"
you want to use. Notice that deciding on the type of features
still does not determine the features. You will need to write a program
that extracts features of this type from the sentence. You will also
have to handle multiple occurrences of targets in a sentence, problems
in the corpus, etc.
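To make the idea of a feature extractor concrete, here is a minimal Python sketch, not a required design. It assumes whitespace-separated tokens, finds targets by membership in the confusion set, and emits the word features named in the table above (-1w, +1w, -2w, +2w). The function name window_features and the width parameter are my own choices for illustration; tag features are left out because the exact layout of the text-pos corpus is not shown here.

```python
def window_features(sentence, confusion_set, width=2):
    """Return one (label, features) pair per target occurrence in the sentence."""
    # Drop the << >> markers if they are present, so both variants are handled.
    tokens = [t for t in sentence.split() if t not in ("<<", ">>")]
    examples = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in confusion_set:
            continue
        feats = {}
        for k in range(1, width + 1):
            if i - k >= 0:                      # word k positions to the left
                feats[f"-{k}w"] = tokens[i - k].lower()
            if i + k < len(tokens):             # word k positions to the right
                feats[f"+{k}w"] = tokens[i + k].lower()
        examples.append((tok.lower(), feats))
    return examples


sentence = "We did not << accept >> the diagnosis at once , but gradually we are coming to ."
examples = window_features(sentence, {"accept", "except"})
print(examples)
```

Because the function returns a list, a sentence with several targets yields several examples; a sentence with none yields an empty list, which is one simple way to surface problems in the corpus.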
In order not to influence your design of the classifier, I will supply the test data only later. Meanwhile, as you develop it, you can split your training data file and use some of the sentences there as your test data.
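One simple way to do such a split is sketched below in Python. The 20% held-out fraction, the fixed seed, and the name split_corpus are assumptions made for illustration; any reproducible split will do.

```python
import random


def split_corpus(sentences, held_out_fraction=0.2, seed=0):
    """Shuffle reproducibly and return (training part, held-out part)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_held_out = int(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


corpus = [f"sentence {i}" for i in range(10)]
train, held_out = split_corpus(corpus)
print(len(train), len(held_out))
```

Fixing the seed keeps the held-out sentences the same from run to run, so changes in accuracy reflect changes in the classifier rather than in the split.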
Your program is supposed to report its accuracy on the data. In each case, the program will print:
# of times each word in the confusion set occurred in the file (A,B).
# of times each of these words was identified correctly (M,N).
Total accuracy: (M+N)/(A+B)
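The report can be computed along the following lines. This is a sketch in Python; the function name report and the exact wording of the printed lines are assumptions, and only the three quantities (A,B), (M,N) and the total accuracy come from the description above.

```python
from collections import Counter


def report(gold, predicted, confusion_set):
    """Print per-word counts and return total accuracy (M+N)/(A+B)."""
    occurred = Counter(gold)                                      # A, B
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)  # M, N
    for word in confusion_set:
        print(f"{word}: occurred {occurred[word]}, identified correctly {correct[word]}")
    total = sum(occurred[w] for w in confusion_set)
    right = sum(correct[w] for w in confusion_set)
    accuracy = right / total if total else 0.0
    print(f"Total accuracy: {accuracy:.3f}")
    return accuracy


gold = ["accept", "accept", "except", "accept"]
predicted = ["accept", "except", "except", "accept"]
accuracy = report(gold, predicted, ("accept", "except"))
```

In this toy run A=3, B=1, M=2 and N=1, so the total accuracy is (2+1)/(3+1).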
For the learning part of this assignment you will use the SNoW learning
program (follow the software path).
Optionally, you can also develop a 1-DL learning program and
compare it against learning with SNoW (1-DLs will be covered in the
class of 9/19).
You will build on your experience in the first part of the
assignment in generating the input for SNoW. However, instead of
doing it yourself, you can use the FEX program to generate the
input to SNoW, based on your selection of features.
FEX gives you the flexibility to choose what types of
features you think are important (this goes beyond the information
sources available to the program; you can also define functions - such
as conjunctions - over the information sources and have them be the
features.)
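To illustrate what a conjunction over information sources means, here is a small Python sketch. It is not FEX script syntax (see the FEX documentation for that); the function add_conjunctions and the "a&b" naming scheme are hypothetical and only show the idea of pairing two existing features into a new one.

```python
def add_conjunctions(feats, pairs):
    """Return a copy of feats extended with one conjunction per available pair."""
    out = dict(feats)
    for a, b in pairs:
        # A conjunction is active only when both of its parts are present.
        if a in feats and b in feats:
            out[f"{a}&{b}"] = f"{feats[a]}_{feats[b]}"
    return out


base = {"-1w": "not", "+1w": "the", "-1wTag": "RB", "+1wTag": "DT"}
extended = add_conjunctions(base, [("-1w", "+1w"), ("-1wTag", "+1wTag"), ("-2w", "-1w")])
print(extended)
```

The third pair produces nothing here because -2w is missing from the base features, which is the behavior you would want near sentence boundaries.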
Once you have done that, you will run a few experiments with the
data, using SNoW. Here is a brief description of the learning
program. See details in the user manual. Also, The paper
General description of the learning program:
During training, SNoW learns a network which consists of two
sub-networks -- one representing the target accept and the other
representing the target except.
When learning the first sub-network, accept-labeled examples are
treated as positive examples and except-labeled examples are
negative. The except sub-network, on the other hand, is
learned using except-labeled examples as positive and
accept-labeled examples as negative.
The input in the testing stage consists of examples (having the same format as in Training) and the networks generated during the training stage. For each example, the program first disregards the label, makes a prediction by evaluating both sub-networks and comparing their outputs (choosing the higher value) and then reports the number of agreements with the correct labels.
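The scheme described above (one sub-network per target, each trained with its own target as positive, prediction by the higher output) can be sketched in Python as follows. This is an illustration of the idea only, not SNoW's implementation: it uses a plain Perceptron update, and the class and function names, the learning rate, and the toy examples are all assumptions.

```python
class SubNetwork:
    """One linear separator over sparse binary features (Perceptron update)."""

    def __init__(self, rate=1.0):
        self.weights = {}
        self.bias = 0.0
        self.rate = rate

    def score(self, features):
        return self.bias + sum(self.weights.get(f, 0.0) for f in features)

    def update(self, features, positive):
        if (self.score(features) > 0) == positive:
            return                                  # no mistake, no update
        step = self.rate if positive else -self.rate
        self.bias += step
        for f in features:
            self.weights[f] = self.weights.get(f, 0.0) + step


def train(examples, targets, cycles=5):
    nets = {t: SubNetwork() for t in targets}
    for _ in range(cycles):
        for label, features in examples:
            for target, net in nets.items():
                # Each sub-network treats its own target as the positive class.
                net.update(features, label == target)
    return nets


def predict(nets, features):
    # Evaluate both sub-networks and choose the one with the higher output.
    return max(nets, key=lambda t: nets[t].score(features))


examples = [
    ("accept", {"-1w=not", "+1w=the"}),
    ("except", {"-1w=,", "+1w=for"}),
    ("accept", {"-1w=to", "+1w=the"}),
    ("except", {"-1w=all", "+1w=for"}),
]
nets = train(examples, ("accept", "except"))
agreements = sum(predict(nets, feats) == label for label, feats in examples)
print(f"{agreements} of {len(examples)} predictions agree with the labels")
```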
The learning program learns a specific architecture of linear separators. You get some freedom in choosing the architecture, the freedom to choose the update rule for the linear separator (Winnow, Perceptron or naive Bayes), and a few other parameters. For Winnow and Perceptron you can choose to cycle through the training data a few times.
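For comparison with the additive Perceptron rule, here is a sketch of a Winnow-style multiplicative update with several cycles through the training data. The promotion factor alpha, the demotion factor beta, the threshold of half the number of features, and the function name are illustrative assumptions, not SNoW's defaults; consult the user manual for the actual parameters.

```python
def train_winnow(examples, n_features, alpha=2.0, beta=0.5, theta=None, cycles=3):
    """examples: list of (active feature indices, is_positive) pairs."""
    theta = n_features / 2 if theta is None else theta
    weights = [1.0] * n_features          # Winnow starts from positive weights
    for _ in range(cycles):
        for active, positive in examples:
            predicted = sum(weights[i] for i in active) >= theta
            if predicted and not positive:
                for i in active:          # false positive: demote active weights
                    weights[i] *= beta
            elif positive and not predicted:
                for i in active:          # false negative: promote active weights
                    weights[i] *= alpha
    return weights, theta


data = [([0, 1], True), ([2, 3], False), ([0, 2], True), ([1, 3], False)]
weights, theta = train_winnow(data, n_features=4)
print(weights, theta)
```

Only the weights of features active in a misclassified example change, and they change by a multiplicative factor, which is the main contrast with the Perceptron rule.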
The program allows you to rank features and
discard some of them, or not let the system even see them
(using the eligibility flag). You may choose not to play
with this option at all (I suggest that you start, at least, without
using it). You can also combine linear separators to form a more
expressive decision surface (forming a cloud), but we will not
use this option for this assignment.
Read the documentation that comes with the program.
Comments are welcome. Send mail either to
Andy Carlson
The first deliverable is a FEX script and the
corresponding lexicon file, a list of all features and their
indices.
As a matter of convention, the targets are also
considered features, and FEX will give them the indices 0 and 1.
In your work, you will train your classifier on the given training
data and test it on a separate set of examples, that will be given to
you later. It is important that this split is defined before
you run FEX on the data. The lexicon is generated using the training
data only; it will then be given as input to FEX, when generating the
examples from the test data. The reason is that it is possible that
some of the features in the test data were never observed in training;
this way, they will not be present in the lexicon and therefore will
not be in the test examples. (Think about this issue!).
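The effect of building the lexicon from the training data only can be seen in a small Python sketch. This mimics the idea, not FEX itself: the functions build_lexicon and encode and the "name=value" keys are hypothetical, and the real lexicon file format is described in the FEX documentation. The only conventions taken from the text are that targets get the indices 0 and 1 and that unseen test features are dropped.

```python
def build_lexicon(train_examples, targets):
    """Assign indices to targets first, then to features seen in training."""
    lexicon = {t: i for i, t in enumerate(targets)}
    for _, feats in train_examples:
        for name, value in sorted(feats.items()):
            key = f"{name}={value}"
            if key not in lexicon:
                lexicon[key] = len(lexicon)
    return lexicon


def encode(feats, lexicon):
    """Map features to indices, silently dropping any not in the lexicon."""
    keys = (f"{n}={v}" for n, v in feats.items())
    return sorted(lexicon[k] for k in keys if k in lexicon)


train_examples = [
    ("accept", {"-1w": "not", "+1w": "the"}),
    ("except", {"-1w": ",", "+1w": "for"}),
]
lexicon = build_lexicon(train_examples, ("accept", "except"))
test_indices = encode({"-1w": "not", "+1w": "a"}, lexicon)
print(lexicon)
print(test_indices)
```

The test feature +1w=a never occurred in training, so it has no index and does not appear in the encoded test example.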
The second deliverable is two example files (in the SNoW format),
one generated from the Training data and the other from the Test data.
Given a file of examples, you can run the learning program on it.
Run it first in train mode on the training examples,
then run it in test mode on the Test examples. These will
be your "official" results.
Also, test it on the same file you train on, and see what you get then.
As a minimum -- use
all three learning algorithms with the default setting, train on the training
data and test on the test data.
Any other experiments that you can come up with and learn something
from are of interest. For example, you may want to build a learning
curve and record the performance as a function of the size of your
training set.
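A learning curve of this kind can be produced with a loop like the one below. This is a Python sketch under stated assumptions: fit and predict stand for whatever learner you use (for SNoW you would run the program on truncated example files instead), and the majority-label baseline here is only a placeholder so that the sketch runs on its own.

```python
from collections import Counter


def learning_curve(train, test, sizes, fit, predict):
    """Return (training size, test accuracy) pairs for growing prefixes of train."""
    curve = []
    for n in sizes:
        model = fit(train[:n])
        correct = sum(predict(model, x) == y for x, y in test)
        curve.append((n, correct / len(test)))
    return curve


def fit_majority(examples):
    # Placeholder learner: remember the most frequent label.
    return Counter(y for _, y in examples).most_common(1)[0][0]


def predict_majority(model, x):
    return model


train = [({}, "except"), ({}, "accept"), ({}, "accept"), ({}, "accept")]
test = [({}, "accept"), ({}, "accept"), ({}, "except"), ({}, "accept")]
curve = learning_curve(train, test, [1, 4], fit_majority, predict_majority)
print(curve)
```

Keep the test set fixed while the training prefix grows, so that the points on the curve are comparable.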
In some sense, this is a test of your feature set. As a minimum, present your results for two feature sets which are different in their expressivity. One will not include any conjunctions and the second will. Also, use the same set of features that you have used in your manual classifier.
If you also choose to learn a 1-DL, please use the same set of
features, so that we can make sense of the comparison.
Finally, present the results of your manual classifier.
Make sure that the comparison is fair and is made using the same
features and the same number of examples.
Please package all the deliverables in one tar file and uuencode it.
Do it so that the file unpacks into a separate directory under your
name. The uufiles script will do it for you. Please use your name as
the directory name and include a text readme file with a description
of your files.
Once you have the file, attach it to your mail message to me.
Package your program so that one can run your classifier as follows: