These files contain the standard data used for the problem of scaling up context-sensitive spelling correction, addressed by A. Carlson, J. Rosen, and D. Roth (2001). Each line corresponds to a sentence in which each word is represented by a pair, enclosed in parentheses, consisting of a part-of-speech (POS) tag and the actual word. The data were extracted from the Wall Street Journal (WSJ) Treebank and the TDT2 corpus. POS tags were assigned using the SNoW-based Part of Speech Tagger.
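The sentence format described above can be read with a few lines of Python. This is only a minimal sketch: it assumes each pair is written as `(TAG word)`, with the tag first and a single space before the word, which the description suggests but does not spell out. The function name `parse_tagged_sentence` and the sample sentence are hypothetical and are not taken from the corpus.

```python
import re

# Assumed layout of one pair: "(TAG word)". The exact delimiter and the
# order inside the parentheses are assumptions, not confirmed by the source.
PAIR_RE = re.compile(r"\((\S+)\s+(\S+)\)")


def parse_tagged_sentence(line):
    """Return a list of (pos_tag, word) tuples found in one sentence line."""
    return PAIR_RE.findall(line)


# Hypothetical sample line, for illustration only.
sample = "(DT The) (NN dog) (VBZ barks)"
pairs = parse_tagged_sentence(sample)
words = [word for _tag, word in pairs]
```

If the actual files use a different order or separator inside the parentheses, only the regular expression needs to change.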
The files have one of three extensions:
Each line in a .feat file is an example: a list of active features, separated by commas (,) and terminated by a colon (:). The class label is treated as one of the features. It takes values from 0 to k-1, where k is the number of possible labels (the size of the confusion set). The remaining features are indexed with numbers above k. The feature indices are sorted, so the class label appears first in the list.
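As an illustration of the .feat layout described above, the following sketch splits one example line into its class label and its remaining feature indices. It assumes that whitespace around the commas is insignificant and that the caller knows k (the size of the confusion set). The function name `parse_feat_line` and the sample line are hypothetical.

```python
def parse_feat_line(line, k):
    """Split one .feat example into (label, feature_indices).

    Assumes the line looks like "1, 45, 67, 102:": comma-separated integer
    indices, sorted in ascending order, terminated by a colon.
    """
    body = line.strip()
    if not body.endswith(":"):
        raise ValueError("example does not end with ':'")

    # Drop the trailing colon and convert every non-empty token to an int.
    indices = [int(tok) for tok in body[:-1].split(",") if tok.strip()]
    if not indices:
        raise ValueError("example has no active features")

    # Indices are sorted, so the class label (0 .. k-1) comes first.
    label, features = indices[0], indices[1:]
    if not 0 <= label < k:
        raise ValueError("first index %d is not a label in 0..%d" % (label, k - 1))
    return label, features


# Hypothetical example line for a two-word confusion set (k = 2).
label, features = parse_feat_line("1, 45, 67, 102:", k=2)
```

Checking the first index against k guards against lines whose label is missing, since in that case the first index would be an ordinary feature index rather than a label.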
Filenames with the suffix 20 are test files, and those with the suffix 80 are training files (corresponding to 20% and 80% of the data, respectively).
Download the entire corpus here.