Context-Sensitive Spelling Correction  


This is the task of fixing spelling errors that happen to result in valid words, such as substituting to for too or casual for causal, as well as simple word usage errors such as using amount instead of number.

We are not trying to detect spelling errors that result in non-words; that simpler task is handled quite well by conventional spell checkers such as those found in Microsoft Word or the Unix ispell utility.

The task involves learning to characterize the contexts in which different words, such as piece and peace, tend to occur. This includes fixing "classic" types of spelling mistakes. One type is the homophone error (using a word that sounds the same as the intended word but is spelled differently and has a different meaning), as in "I would like to get a peace of cake for dessert.", where peace has been used in place of the correct word piece. Another is the typographic error, as in "I got the ball form the locker.", where from was replaced by form.
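To make "characterizing a context" concrete, here is a minimal sketch of how the words around a target word could be turned into features. The function name context_features, the window size, and the "near=", "prev=" and "next=" feature labels are all assumptions made for illustration; the text above does not say which features the system actually uses.

```python
def context_features(tokens, index, window=3):
    """Describe the context of tokens[index] as a set of string features.

    Hypothetical feature scheme: every word within `window` positions of
    the target, plus the immediately preceding and following words.
    """
    features = set()
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    for i in range(start, end):
        if i != index:  # the target word itself is not part of its context
            features.add("near=" + tokens[i].lower())
    if index > 0:
        features.add("prev=" + tokens[index - 1].lower())
    if index + 1 < len(tokens):
        features.add("next=" + tokens[index + 1].lower())
    return features


tokens = "I would like to get a peace of cake for dessert".split()
features = context_features(tokens, tokens.index("peace"))
for feature in sorted(features):
    print(feature)
```

A nearby word such as cake is the kind of evidence that could favor piece over peace, while the adjacent words a and of capture the local pattern "a ___ of".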

We can also fix mistakes that are more commonly regarded as grammatical errors, such as confusing among and between or using an incorrect form of a pronoun, as in "I had a great time on the trip with his." The techniques are also capable of correcting errors that cross word boundaries (e.g., maybe and may be), but we haven't implemented this in the demo.

While in theory our system can learn to represent every word, memory limitations have led us to restrict the words that we actually represent (and correct mistakes for) to about 1000. The system we have developed attempts to learn the correct usage of words from a large body of text. We take the text from sources that are likely to contain few mistakes and assume that the text we use is correct.
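The idea of learning correct usage from text that is assumed to be correct can be sketched as follows. This is only an illustration: it uses a simple naive Bayes model over nearby words, which is an assumption and not necessarily the learning method used by the system described here. The confusion sets, the tiny corpus, and the function names train, score and correct are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical confusion sets; the real system covers about 1000 words.
CONFUSION_SETS = [{"piece", "peace"}, {"from", "form"}]


def nearby(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i]."""
    start, end = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(start, end) if j != i]


def train(sentences, window=3):
    """Count how often each target word and each (target, nearby word) pair occurs."""
    word_counts = Counter()
    feature_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            if any(word in s for s in CONFUSION_SETS):
                word_counts[word] += 1
                feature_counts[word].update(nearby(tokens, i, window))
    return word_counts, feature_counts


def score(candidate, context, word_counts, feature_counts, vocab_size):
    """Log of a smoothed naive Bayes score for `candidate` in `context`."""
    total = sum(feature_counts[candidate].values())
    result = math.log(word_counts[candidate] + 1)
    for word in context:
        count = feature_counts[candidate][word]
        result += math.log((count + 1) / (total + vocab_size))
    return result


def correct(sentence, word_counts, feature_counts, window=3):
    """Replace each confusable word with the best-scoring member of its set."""
    tokens = sentence.lower().split()
    vocab = {w for counts in feature_counts.values() for w in counts}
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    result = list(tokens)
    for i, word in enumerate(tokens):
        for confusion_set in CONFUSION_SETS:
            if word in confusion_set:
                context = nearby(tokens, i, window)
                result[i] = max(
                    sorted(confusion_set),
                    key=lambda c: score(c, context, word_counts,
                                        feature_counts, vocab_size),
                )
    return " ".join(result)


# Toy training text, assumed to be free of mistakes.
corpus = [
    "she ate a piece of cake",
    "he wanted a piece of pie",
    "they signed a peace treaty",
    "the war ended and peace returned",
    "i got a letter from the bank",
    "please fill in the form today",
]
word_counts, feature_counts = train(corpus)
print(correct("i would like a peace of cake", word_counts, feature_counts))
```

With this toy corpus the sketch proposes piece in place of peace, because the nearby words of and cake were seen around piece during training. A corpus this small is only good for demonstration; useful statistics require far more text.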

One of the sources we used for this system is a collection of articles from the Wall Street Journal. This collection includes over 100,000 sentences containing over 2 million words. Reading all this text and learning about 1000 word representations (that is, training the system) takes only about 40 minutes.