We use the SNoW system, which was developed in our group. SNoW is an architecture, together with several learning algorithms, tailored for learning in the presence of a very large number of information sources (features).
We use the system as an on-line learning system -- it learns as it goes, tests itself all the time, and when it makes a mistake it uses the feedback (if supplied) to update its representation. A given input (sentence, in this case) may be used to update several word representations simultaneously.
For the current task we use the system to learn word representations. Each word of interest is learned as a function of other words or linguistic predicates. When a decision needs to be made -- in our case, when we need to predict the correct word in a given sentence -- all word representations compete to be the correct word, and the winner is selected.
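The competition among word representations can be sketched as follows. This is only an illustration, not SNoW's actual code: it assumes each representation is a linear sum of weights over the active features, and the function name predict_word, the feature strings, and the weight values are all made up for the example.

```python
def predict_word(active_features, weights_by_word):
    """Return the candidate word whose representation scores highest.

    weights_by_word maps each candidate word to a dict of feature weights.
    Both the data layout and the feature names are assumptions for this sketch.
    """
    def score(word):
        weights = weights_by_word[word]
        # Linear sum over the features that are active in the sentence.
        return sum(weights.get(feature, 0.0) for feature in active_features)

    # All representations compete; the highest score wins.
    return max(weights_by_word, key=score)


# Hypothetical weights for two competing words.
weights_by_word = {
    "dessert": {"word:chocolate": 2.0, "word:cake": 1.5},
    "desert": {"word:sand": 2.0, "word:cake": 0.5},
}
active = {"word:chocolate", "word:cake", "word:for"}
winner = predict_word(active, weights_by_word)
```

With these invented weights, "dessert" scores 3.5 and "desert" scores 0.5, so "dessert" is selected.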
Words are learned as linear functions of other words and
linguistic predicates (which we call "features"). A given sentence
is transformed (using a simple feature extraction language that we
have developed) into a collection of features. In the present demo
we use as features:
That is, for the sentence "I would like some chocolate cake for
dessert.", assuming that we care only about the word "dessert",
we would supply SNoW with the following features:
During training, this example would be treated as a positive example for the word representation of "dessert" and as a negative example for all other words. Later, during prediction, we pretend that the slot of the target word in the sentence is empty, and try to predict which word is most suitable for this slot. Seeing the words "chocolate" and "cake", along with the structure of the sentence, would be a good indication that "dessert" is a better choice than other words.
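A minimal sketch of how one sentence could yield training labels for every word representation, and how the target slot is hidden at prediction time. The helper name make_example, the "<SLOT>" placeholder, and the two-word vocabulary are assumptions for illustration; the real system works on extracted features rather than raw tokens.

```python
def make_example(tokens, target_index, vocabulary):
    """Hide the target word and label the sentence for each candidate word.

    Returns the context with the target replaced by a placeholder, and a
    dict marking the example positive for the target word and negative
    for every other word in the vocabulary.
    """
    target = tokens[target_index]
    context = tokens[:target_index] + ["<SLOT>"] + tokens[target_index + 1:]
    labels = {word: (word == target) for word in vocabulary}
    return context, labels


tokens = "I would like some chocolate cake for dessert".split()
context, labels = make_example(tokens, 7, ["dessert", "desert"])
```

Here the context ends in the empty slot, and the labels mark the sentence as positive for "dessert" and negative for "desert".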
The main update rule used within SNoW is the Winnow algorithm, developed by Littlestone in 1987. (Other update rules for linear representations can also be used.) See the SNoW page for details.
In its basic form, Winnow maintains a weight for each feature it associates with a concept (in this case, a target word). Given an input sentence, the sum of the weights of the features that are active in it is evaluated to determine whether the target word is appropriate in this context. When a mistake is made, the weights are updated using the Winnow update rule. The rule is multiplicative -- a weight goes up or down by a constant factor each time the corresponding feature contributes to a mistake. This multiplicative update is the main advantage of Winnow and the reason for its suitability for this domain. It can be shown that using this update rule gives us the ability to tolerate a very large number of features, most of which are irrelevant, to be robust in noisy conditions, and to adapt to changing contexts (e.g., training and testing under slightly different conditions). These properties are what make the algorithm so suitable for knowledge-intensive inferences and the kind of applications we present here.
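The mistake-driven multiplicative update can be sketched for a single target word as follows. This is a simplified illustration, not SNoW's implementation: the class name, the threshold, the promotion and demotion factors, and the choice to give unseen features an initial weight of 1.0 are all assumed values.

```python
class Winnow:
    """A simplified Winnow learner for one target word (illustrative only)."""

    def __init__(self, threshold=4.0, promotion=2.0, demotion=0.5,
                 initial_weight=1.0):
        self.threshold = threshold
        self.promotion = promotion
        self.demotion = demotion
        self.initial_weight = initial_weight
        self.weights = {}  # feature -> weight, filled in as features are seen

    def score(self, active_features):
        # Sum of the weights of the active features only.
        return sum(self.weights.get(f, self.initial_weight)
                   for f in active_features)

    def predict(self, active_features):
        return self.score(active_features) >= self.threshold

    def update(self, active_features, label):
        """Update on a mistake; return True if the weights changed."""
        if self.predict(active_features) == label:
            return False  # correct prediction, nothing to do
        # Promote after a missed positive, demote after a false positive.
        factor = self.promotion if label else self.demotion
        for f in active_features:
            self.weights[f] = self.weights.get(f, self.initial_weight) * factor
        return True


learner = Winnow()
features = {"word:chocolate", "word:cake"}
first = learner.update(features, True)    # score 2.0 < 4.0: mistake, promote
second = learner.update(features, True)   # score 4.0 >= 4.0: correct, no change
```

Only the features active in the misclassified example are touched, so irrelevant features that rarely take part in mistakes keep small weights.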