Context-Sensitive Spelling Correction Data


TDT/WSJ Corpora

These files contain the standard data used for the context-sensitive spelling correction problem addressed by A. Carlson, J. Rosen, and D. Roth (2001). Each line corresponds to one sentence, in which each word is represented by a pair consisting of a part-of-speech (POS) tag and the actual word, in parentheses. The data were extracted from the Wall Street Journal (WSJ) Treebank and the TDT2 corpus. POS tags were assigned using the SNoW-based Part of Speech Tagger.
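The line format described above can be read with a few lines of Python. This is only a minimal sketch: it assumes each token looks like "(TAG word)", with the tag first and a single space before the word, which is not confirmed by this page. The function name parse_sentence and the sample sentence are made up for illustration; check a real line from the files before relying on it.

```python
import re
from typing import List, Tuple

# ASSUMPTION: each token is written as "(TAG word)", e.g. "(DT The)".
# The real files may order or separate the two fields differently.
TOKEN_RE = re.compile(r"\((\S+)\s+(\S+)\)")


def parse_sentence(line: str) -> List[Tuple[str, str]]:
    """Return the (POS tag, word) pairs found on one line of the data."""
    return TOKEN_RE.findall(line)


# Hypothetical sample line, not taken from the corpus.
example = "(DT The) (NN dog) (VBD barked) (. .)"
pairs = parse_sentence(example)

# Recover the plain word sequence by dropping the tags.
words = [word for _tag, word in pairs]
```

Keeping the tag and the word together as a tuple makes it easy to build either word-based or tag-based context features later on.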


Brown Corpus

The files have one of three extensions:

Files whose names end in the suffix 20 are test files, and those ending in the suffix 80 are training files (20% and 80% of the data, respectively).
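As a rough illustration of the naming convention, the sketch below sorts filenames into training and test sets by their trailing digits. It assumes the suffix is simply the last two characters of the filename. The helper split_role and the filenames in the usage lines are hypothetical, not names taken from the corpus.

```python
from typing import Optional


def split_role(filename: str) -> Optional[str]:
    """Guess a file's role from its suffix.

    ASSUMPTION: the suffix is the final two characters of the name.
    Returns "train" for suffix 80, "test" for suffix 20, otherwise None.
    """
    if filename.endswith("80"):
        return "train"
    if filename.endswith("20"):
        return "test"
    return None


# Hypothetical filenames, used only to show the behaviour.
roles = {name: split_role(name) for name in ["example.80", "example.20", "example.txt"]}
```

Any file with a different suffix is reported as None rather than guessed at.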

Download the entire corpus here.