Abstracts of papers by Dan Roth



Relational Learning via Propositional Algorithms: An Information Extraction Case Study

This paper develops a new paradigm for relational learning which allows for the representation and learning of relational information using propositional means. This paradigm suggests different tradeoffs than those in the traditional approach to this problem -- the ILP approach -- and as a result it enjoys several significant advantages over it. In particular, the new paradigm is more flexible and allows the use of any propositional algorithm, including probabilistic algorithms, within it. We evaluate the new approach on an important and relation-intensive task - Information Extraction - and show that it outperforms existing methods while being orders of magnitude more efficient.

The Use of Classifiers in Sequential Inference

We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.

Learning to Recognize 3D objects

A learning account for the problem of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability. The key assumption underlying this work is that objects can be recognized (or, discriminated) using simple representations in terms of ``syntactically" simple relations over the raw image. Although the potential number of these simple relations could be huge, only a few of them are actually present in each observed image and a fairly small number of those observed is relevant to discriminating an object. We show that these properties can be exploited to yield an efficient learning approach in terms of sample and computational complexity, within the PAC model. No assumptions are needed on the distribution of the observed objects and the learning performance is quantified relative to its past experience. Most importantly, the success of learning an object representation is naturally tied to the ability to represent it as a function of some intermediate representations extracted from the image. We evaluate this approach in a large scale experimental study in which the SNoW learning architecture is used to learn representations for the 100 objects in the Columbia Object Image Database (COIL-100). Experimental results exhibit good generalization and robustness properties of the SNoW-based method relative to other approaches. SNoW's recognition rate degrades more gracefully when the training data contains fewer views and it shows similar behavior also in some preliminary experiments with partially occluded objects.

Learning in Natural Language: Theory and Algorithmic Approaches

This article summarizes work on developing a learning theory account for the major learning and statistics based approaches used in natural language processing. It shows that these approaches can all be explained using a single distribution free inductive principle related to the pac model of learning. Furthermore, they all make predictions using the same simple knowledge representation -- a linear representation over a common feature space. This is significant both to explaining the generalization and robustness properties of these methods and to understanding how these methods might be extended to learn from more structured, knowledge intensive examples, as part of a learning centered approach to higher level natural language inferences.

Toward a Theory of Learning Coherent Concepts

We develop a theory for learning scenarios where multiple learners co-exist but there are mutual compatibility constraints on their outcomes. This is natural in cognitive learning situations, where ``natural'' compatibility constraints are imposed on the outcomes of classifiers so that a valid sentence, image or any other domain representation is produced. We suggest that work in this direction may help to resolve the contrast between the hardness of learning as predicted by the current theoretical models and the apparent ease at which cognitive systems seem to learn. A model of concept learning is studied in which the target concept is required to {\em cohere} with other concepts of interest. The {\em coherency} is expressed via a (Boolean) constraint that the concepts have to satisfy. Under this model, learning a concept is shown to be easier (in terms of sample complexity and mistake bounds) and the concepts learned are shown to be more robust to noise in their input (attribute noise). These properties are established for half spaces and the connection to large margin theory is discussed.

Relational Representations that Facilitate Learning

Given a collection of objects in the world, along with some relations that hold among them, a fundamental problem is how to learn definitions of some relations and concepts of interest in terms of the given relations. These definitions might be quite complex and, inevitably, might require the use of quantified expressions. Attempts to use first order languages for these purposes are hampered by the fact that relational inference is intractable and, consequently, so is the problem of learning relational definitions. This work develops an expressive relational representation language that allows the use of propositional learning algorithms when learning relational definitions. The representation serves as an intermediate level between a raw description of observations in the world and a propositional learning system that attempts to learn definitions for concepts and relations. It allows for hierarchical composition of relational expressions that can be evaluated efficiently on the observations and thus supports learning complex definitions by learning simple functions of the intermediate representations. The approach is illustrated using examples from natural language and visual processing.

A Classification Approach to Word Prediction

The eventual goal of a language model is to accurately predict the value of a missing word given its context. We present an approach to word prediction that is based on learning a representation for each word as a function of words and linguistics predicates in its context. This approach raises a few new questions that we address. First, in order to learn good word representations it is necessary to use an expressive representation of the context. We present a way that uses external knowledge to generate expressive context representations, along with a learning method capable of handling the large number of features generated this way that can, potentially, contribute to each prediction. % Second, since the number of words ``competing'' for each prediction is large, there is a need to ``focus the attention'' on a smaller subset of these. We exhibit the contribution of a ``focus of attention'' mechanism to the performance of the word predictor. Finally, we describe a large scale experimental study in which the approach presented is shown to yield significant improvements in word prediction tasks.

Learning to Recognize Objects

A learning account for the problem of object recognition is developed within the PAC (Probably Approximately Correct) model of learnability. The proposed approach makes no assumptions on the distribution of the observed objects, but quantifies success relative to its past experience. Most importantly, the success of learning an object representation is naturally tied to the ability to represent it as a function of some intermediate representations extracted from the image. We evaluate this approach in a large scale experimental study in which the SNoW learning architecture is used to learn representations for the 100 objects in the Columbia Object Image Database (COIL-100). The SNoW-based method is shown to outperform other methods in terms of recognition rates; its performance degrades gracefully when the training data contains fewer views and in the presence of occlusion noise.

View-Based 3D Object Recognition Using SNoW

This paper describes a novel view-based algorithm for 3D object recognition using a network of linear units. The SNoW learning architecture is a sparse network of linear functions over a pre-defined or incrementally learned feature space and is specifically tailored for learning in the presence of a very large number of features. We use the pixel-level representation in the experiments and compare the performance of SNoW with Support Vector Machines and nearest neighbor methods on 3D object recognition using the 100 objects in the Columbia Image Object Database (COIL-100). Experimental results show that SNoW-based method outperform SVM-based system in terms of recognition rate and the computational cost involved in learning. The empirical results also provide insight for practical and theoretical considerations on view-based methods for 3D object recognition.

Learning in Natural Language

Statistics-based classifiers in natural language are developed typically by assuming a generative model for the data, estimating its parameters from training data and then using Bayes rule to obtain a classifier. For many problems the assumptions made by the generative models are evidently wrong, leaving open the question of why these approaches work.

This paper presents a learning theory account of the major statistical approaches to learning in natural language. A class of Linear Statistical Queries (LSQ) hypotheses is defined and learning with it is shown to exhibit some robustness properties. Many statistical learners used in natural language, including naive Bayes, Markov Models and Maximum Entropy models are shown to be LSQ hypotheses, explaining the robustness of these predictors even when the underlying probabilistic assumptions do not hold. This coherent view of when and why learning approaches work in this context may help to develop better learning methods and an understanding of the role of learning in natural language inferences.



Relational Learning for NLP using Linear Threshold Elements

We describe a coherent view of learning and reasoning with relational representations in the context of natural language processing. In particular, we discuss the Neuroidal Architecture, Inductive Logic Programming and the SNoW system explaining the relationships among these, and thereby offer an explanation of the theoretical basis for the SNoW system. We suggest that extensions of this system along the lines suggested by the theory may provide new levels of scalability and functionality.

A Learning Approach to Shallow Parsing

A SNoW based learning approach to shallow parsing tasks is presented and studied experimentally. The approach learns to identify syntactic patterns by combining simple predictors to produce a coherent inference. Two instantiations of this approach are studied and experimental results for Noun-Phrases (NP) and Subject-Verb (SV) phrases that compare favorably with the best published results are presented. In doing that, we compare two ways of modeling the problem of learning to recognize patterns and suggest that shallow parsing patterns are better learned using open/close predictors than using inside/outside predictors.}

Word Prediction and Clustering

We study the simultaneous learning of many word predictors and the relation it has to word similarity and clustering. Our approach is shown to be robust and perform quite well even when the competing words are very similar. Some of the issues involved in scaling up to learn many predictors are investigated, as well as how this approach gives rise to the emergence of semantic categories.

Memory Based Learning in NLP

We study memory-based learning methods and show that they can be viewed as learning linear predictors over commonly used feature spaces.

We suggest that this view allows one to study memory based methods and other successful learning algorithms used in NLP within the same computational framework and may result in improved algorithms and a better understanding for the role of learning in natural language inferences.



A Winnow-Based Approach to Context-Sensitive Spelling Correction

A large class of machine-learning problems in natural language require the characterization of linguistic context. Two characteristic properties of such problems are that their feature space is of very high dimensionality, and their target concepts depend on only a small subset of the features in the space. Under such conditions, multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good theoretical properties. In the work reported here, we present an algorithm combining variants of Winnow and weighted-majority voting, and apply it to a problem in the aforementioned class: Context-Sensitive Spelling Correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting to for too, casual for causal, and so on. We evaluate our algorithm, WinSpell, by comparing it against BaySpell, a statistics-based method representing the state of the art for this task.

We find: (1) When run with a full (unpruned) set of features, WinSpell achieves accuracies significantly higher than BaySpell was able to achieve in either the pruned or unpruned condition; (2) When compared with other systems in the literature, WinSpell exhibits the highest performance; (3) While several aspects of WinSpell's architecture contribute to its superiority over BaySpel, the primary factor is that it is able to learn a better linear separator than BaySpell learns; (4) When run on a test set drawn from a different corpus than the training set was drawn from, WinSpell is better able than BaySpell to adapt, using a strategy we will present that combines supervised learning on the training set with unsupervised learning on the (noisy) test set.



Learning to Resolve Natural Language Ambiguities: A Unified Approach

We analyze a few of the commonly used statistics based and machine learning algorithms for natural language disambiguation tasks and observe that they can be re-cast as learning linear separators in the feature space. Each of the methods makes a priori assumptions, which it employs, given the data, when searching for its hypothesis. Nevertheless, as we show, it searches a space that is as rich as the space of {\em all} linear separators. We use this to build an argument for a data driven approach which merely searches for a {\em good} linear separator in the feature space, without further assumptions on the domain or a specific problem.

We present such an approach - a sparse network of linear separators, utilizing the Winnow learning algorithm - and show how to use it in a variety of ambiguity resolution problems. The learning approach presented is attribute-efficient and, therefore, appropriate for domains having very large number of attributes.

In particular, we present an extensive experimental comparison of our approach with other methods on several well studied lexical disambiguation tasks such as context-sensitive spelling correction, prepositional phrase attachment and part of speech tagging. In all cases we show that our approach either outperforms other methods tried for these tasks or performs comparably to the best.



Part of Speech Tagging Using a Network of Linear Separators

We present an architecture and an on-line learning algorithm and apply it to the problem of part-of-speech tagging. The architecture presented, \snow, is a network of linear separators in the feature space, utilizing the Winnow update algorithm.

Multiplicative weight-update algorithms such as Winnow have been shown to have exceptionally good behavior when applied to very high dimensional problems, and especially when the target concepts depend on only a small subset of the features in the feature space. In this paper we describe an architecture that utilizes this mistake-driven algorithm for multi-class prediction -- selecting the part of speech of a word. The experimental analysis presented here provides more evidence to that these algorithms are suitable for natural language problems.

The algorithm used is an on-line algorithm: every example is used by the algorithm only once, and is then discarded. This has significance in terms of efficiency, as well as quick adaptation to new contexts.

We present an extensive experimental study of our algorithm under various conditions; in particular, it is shown that the algorithm performs comparably to the best known algorithms for pos.



Incorporating Knowledge in Natural Language Learning: A Case Study

Incorporating external information during a learning process is expected to improve its efficiency. We study a method for incorporating noun-class information, in the context of learning to resolve Prepositional Phrase Attachment (PPA) disambiguation. This is done within a recently introduced architecture, \snow, a sparse network of threshold gates utilizing the Winnow learning algorithm. That architecture has already been demonstrated to perform remarkably well on a number of natural language learning tasks.

The knowledge sources used were compiled from the WordNet database for general linguistic purposes, irrespective of the \ppa\ problem, and are being incorporated into the learning algorithm by enriching its feature space. We study two strategies of using enriched features and the effects of using class information at different granularities, as well as randomly-generated knowledge which serves as a control set.

Incorporating external knowledge sources within \snow\ yields a statistically significant performance improvement. In addition, we find an interesting relation between the granularity of the knowledge sources used and the magnitude of the improvement. The encouraging results with noun-class data provide a motivation for carrying out more work on generating better linguistic knowledge sources.



Clustering Appearances of 3D Objects

We introduce a method for unsupervised clustering of images of 3D objects. Our method examines the space of all images and partitions the images into sets that form smooth and parallel surfaces in this space. It further uses sequences of images to obtain more reliable clustering. Finally, since our method relies on a non-Euclidean similarity measure we introduce algebraic techniques to estimating local properties of these surfaces without first embedding the images in a Euclidean space. We demonstrate our method by applying it to a large database of images.

On the Hardness of Approximate Reasoning

Many AI problems, when formalized, reduce to evaluating the probability that a propositional expression is true. In this paper we show that this problem is computationally intractable even in surprisingly restricted cases and even if we settle for an approximation to this probability.

We consider various methods used in approximate reasoning such as computing degree of belief and Bayesian belief networks, as well as reasoning techniques such as constraint satisfaction and knowledge compilation, that use approximation to avoid computational difficulties, and reduce them to model-counting problems over a propositional domain.

We prove that counting satisfying assignments of propositional languages is intractable even for Horn and monotone formulae, and even when the size of clauses and number of occurrences of the variables are extremely limited. This should be contrasted with the case of deductive reasoning, where Horn theories and theories with binary clauses are distinguished by the existence of linear time satisfiability algorithms. What is even more surprising is that, as we show, even approximating the number of satisfying assignments (i.e., ``approximating'' approximate reasoning), is intractable for most of these restricted theories.

We also identify some restricted classes of propositional formulae for which efficient algorithms for counting satisfying assignments can be given.



Reasoning with Models

We develop a model-based approach to reasoning, in which the knowledge base is represented as a set of models (satisfying assignments) rather than a logical formula, and the set of queries is restricted. We show that for every propositional knowledge base ($KB$) there exists a set of {\em characteristic models} with the property that a query is true in $KB$ if and only if it is satisfied by the models in this set. We characterize a set of functions for which the model-based representation is compact and provides efficient reasoning. These include cases where the formula-based representation does not support efficient reasoning. In addition, we consider the model-based approach to {\em abductive reasoning} and show that for any propositional $KB$, reasoning with its model-based representation yields an abductive explanation in time that is polynomial in its size. Some of our technical results make use of the {\em Monotone Theory}, a new characterization of Boolean functions recently introduced.

The notion of {\em restricted queries} is inherent in our approach. This is a wide class of queries for which reasoning is efficient and exact, even when the model-based representation $KB$ provides only an approximate representation of the domain in question.

Moreover, we show that the theory developed here generalizes the model-based approach to reasoning with Horn expressions and captures even the notion of reasoning with Horn-approximations.



On Learning Read-k-Satisfy-j DNF

We study the learnability of Read-$k$-Satisfy-$j$ (RkSj) DNF formulas. These are boolean formulas in disjunctive normal form (DNF), in which the maximum number of occurrences of a variable is bounded by $k$, and the number of terms satisfied by any assignment is at most $j$. After motivating the investigation of this class of DNF formulas, we present an algorithm that for any unknown RkSj DNF formula to be learned, with high probability finds a logically equivalent DNF formula using the well-studied protocol of equivalence and membership queries. The algorithm runs in polynomial time for $k\cdot j=O({\log n\over\log\log n})$, where $n$ is the number of input variables.

Finding the Maximum Area Axis-Parallel Rectangle in a Polygon

We consider the geometric optimization problem of finding the \MAAPR~ (MAAPR) in an $n$-vertex general polygon. We characterize the MAAPR for general polygons by considering different cases based on the types of contacts between the rectangle and the polygon. We present a general framework for solving a key subcase of the MAAPR problem which dominates the running time for a variety of polygon types. Using this framework, we obtain the following MAAPR time results: $\Theta(n)$ for xy-monotone polygons, ${\rm O}(n \alpha(n))$ for \OCPs, ${\rm O}(n \alpha(n) \log n)$ for horizontally (vertically) convex polygons, (where $\alpha(n)$ is the slowly growing inverse of Ackermann's function), ${\rm O}(n \log n)$ for a special type of horizontally convex polygon, and ${\rm O}(n \log^2 n)$ for general polygons. For all these types of non-rectilinear polygons, we match the running time of the best known algorithms for their rectilinear counterparts. We prove a lower bound of time in $\Omega(n \log n)$ for finding the MAAPR in both self-intersecting polygons and general polygons with holes. The latter result gives us both a lower bound of $\Omega(n \log n)$ and an upper bound of ${\rm O}(n \log^2 n)$ for general polygons with holes.

On Learning Visual Concepts and DNF Formulae

We consider the problem of learning DNF formulae in the mistake-bound and the PAC models. We develop a new approach, which is called {\em polynomial explainability}, that is shown to be useful for learning some new subclasses of DNF (and CNF) formulae that were not known to be learnable before. Unlike previous learnability results for DNF (and CNF) formulae, these subclasses are not limited in the number of terms or in the number of variables performance term; yet, they contain the subclasses of $k$-DNF and $k$-term-DNF (and the corresponding classes of CNF) as special cases. We apply our DNF results to the problem of learning {\em visual concepts} and obtain learning algorithms for several natural subclasses of visual concepts that appear to have no natural boolean counterpart. On the other hand, we show that learning some other natural subclasses of visual concepts is as hard as learning the class of all DNF formulae. We also consider the robustness of these results under various types of noise.

Defaults and Relevance in Model Based Reasoning

Reasoning with model-based representations is an intuitive paradigm, which has been shown to be theoretically sound and to possess some computational advantages over reasoning with formula-based representations of knowledge. In this paper we present more evidence to the value of such representations. Our results hinge on the notion of relevance, and model based representations are shown to be useful in capturing relevant information, and in allowing to ignore irrelevant information. In particular, we consider situations where context-specific information is used in the process of reasoning. We show that reasoning with model-based representations can be done efficiently in the presence of varying context information. We then consider the task of default reasoning. We show that default reasoning is a generalization of reasoning within context, in which the reasoner has many ``context" rules, which may be conflicting. We develop model-based algorithms that handle efficiently fragments of Reiter's default logic. Our intuition about relevance is best captured in the model for reasoning within context, where model-based representations enable us to filter out irrelevant information. Interestingly, default logic is somewhat in contrast with our intuition about relevance. Default rules do not tell us explicitly what the context information is. Instead, we have to figure out what are the possible ``extensions" and then use those as possible contexts. As we show, model-based representations capture all possible extensions in an accessible form, thereby supporting efficient default reasoning whenever possible. Lastly, we argue that these results support an incremental view of reasoning in a natural way. We discuss the Learning to Reason framework, which emphasizes this view, and the notion of relevance as manifested in it. In particular, we discuss results on Learning to Reason in which model-based representations are used to represent relevant knowledge.

Reasoning with Examples: Propositional Formulae and Database Dependencies

For humans, looking at how concrete examples behave is an intuitive way of deriving conclusions. The drawback with this method is that it does not necessarily give the correct results. However, under certain conditions example-based deduction can be used to obtain a correct and complete inference procedure. This is the case for Boolean formulae (reasoning with models) and for certain types of database integrity constraints (the use of Armstrong relations). We show that these approaches are closely related, and use the relationship to prove new results about the existence and sizes of Armstrong relations for Boolean dependencies. Further, we study the problem of translating between different representations of relational databases, in particular we consider Armstrong relations and Boolean dependencies, and prove some positive results in that context. Finally, we discuss the close relations between the questions of finding keys in relational databases and that of finding abductive explanations.

Learning to Reason

We introduce a new framework for the study of reasoning. The Learning (in order) to Reason approach developed here combines the interfaces to the world used by known learning models with the reasoning task and a performance criterion suitable for it. In this framework the intelligent agent is given access to her favorite learning interface, and is also given a grace period in which she can interact with this interface and construct her representation KB of the world $W$. Her reasoning performance is measured only after this period, when she is presented with queries $\alpha$ from some query language, relevant to the world, and has to answer whether $W$ implies $\alpha$. The approach is meant to overcome the main computational difficulties in the traditional treatment of reasoning which stem from its separation from the ``world''. First, by allowing the reasoning task to interface the world (as in the known learning models), we avoid the rigid syntactic restriction on the intermediate knowledge representation. Second, we make explicit the dependence of the reasoning performance on the input from the environment. This is possible only because the agent interacts with the world when constructing her knowledge representation. We show how previous results from learning theory and reasoning fit into this framework and illustrate the usefulness of the Learning to Reason approach by exhibiting new results that are not possible in the traditional setting. First, we give a Learning to Reason algorithm for a class of propositional languages for which there are no efficient reasoning algorithms, when represented as a traditional (formula-based) knowledge base. Second, we exhibit a Learning to Reason algorithm for a class of propositional languages that is not known to be learnable in the traditional sense.

Mistake-Driven Learning in Text Categorization

Learning problems in the text processing domain often map the text to a space whose dimensions are the measured features of the text, e.g., its words. Three characteristic properties of this domain are (a) very high dimensionality, (b) both the learned concepts and the instances reside very sparsely in the feature space, and (c) a high variation in the number of active features in an instance. In this work we study three mistake-driven learning algorithms for a typical task of this nature -- text categorization. We argue that these algorithms -- which categorize documents by learning a linear separator in the feature space -- have a few properties that make them ideal for this domain. We then show that a quantum leap in performance is achieved when we further modify the algorithms to better address some of the specific characteristics of the domain. In particular, we demonstrate (1) how variation in document length can be tolerated by either normalizing feature weights or by using negative weights, (2) the positive effect of applying a threshold range in training, (3) alternatives in considering feature frequency, and (4) the benefits of discarding features while training. Overall, we present an algorithm, a variation of Littlestone's Winnow, which performs significantly better than any other algorithm tested on this task using a similar feature set.

Applying Winnow to Context-Sensitive Spelling Correction

Multiplicative weight-updating algorithms such as Winnow have been studied extensively in the COLT literature, but only recently have people started to use them in applications. In this paper, we apply a Winnow-based algorithm to a task in natural language: context-sensitive spelling correction. This is the task of fixing spelling errors that happen to result in valid words, such as substituting {\it to\/} for {\it too}, {\it casual\/} for {\it causal}, and so on. Previous approaches to this problem have been statistics-based; we compare Winnow to one of the more successful such approaches, which uses Bayesian classifiers. We find that: (1)~When the standard (heavily-pruned) set of features is used to describe problem instances, Winnow performs comparably to the Bayesian method; (2)~When the full (unpruned) set of features is used, Winnow is able to exploit the new features and convincingly outperform Bayes; and (3)~When a test set is encountered that is dissimilar to the training set, Winnow is better than Bayes at adapting to the unfamiliar test set, using a strategy we will present for combining learning on the training set with unsupervised learning on the (noisy) test set.

Learning Active Classifiers

Many classification algorithms are ``passive'', in that they assign a class-label to each instance based only on the description given, even if that description is incomplete. In contrast, an *active* classifier can --- at some cost --- obtain the values of missing attributes, before deciding upon a class label. The expected utility of using an active classifier depends on both the cost required to obtain the additional attribute values and the penalty incurred if it outputs the wrong classification. This paper considers the problem of *learning* near-optimal active classifiers, using a variant of the probably-approximately-correct (PAC) model. After defining the framework --- which is perhaps the main contribution of this paper --- we describe a situation where this task can be achieved efficiently, but then show that the task is often intractable.

A Connectionist Framework for Reasoning: Reasoning with Examples

We present a connectionist architecture that supports almost instantaneous deductive and abductive reasoning. The deduction algorithm responds in few steps for single rule queries and in general, takes time that is linear with the number of rules in the query. The abduction algorithm produces an explanation in few steps and the best explanation in time linear with the size of the assumption set. The size of the network is polynomially related to the size of other representations of the domain, and may even be smaller. We base our connectionist model on Valiant's Neuroidal model \cite{Valiant94} and thus make minimal assumptions about the computing elements, which are assumed to be classical threshold elements with states. Within this model we develop a reasoning framework that utilizes a model-based approach to reasoning \cite{\KKS,\KRm}. In particular, we suggest to interpret the connectionist architecture as encoding {\em examples} of the domain we reason about and show how to perform various reasoning tasks with this interpretation. We then show that the representations used can be acquired efficiently from interactions with the environment and discuss how this learning process influences the reasoning performance of the network.

Learning to Reason: The Non-Monotonic Case

We suggest a new approach for the study of the non-monotonicity of human commonsense reasoning. The two main premises that underlie this work are that commonsense reasoning is an inductive phenomenon, and that missing information in the interaction of the agent with the environment may be as informative for future interactions as observed information. This intuition is formalized and the problem of reasoning from incomplete information is presented as a problem of learning attribute functions over a generalized domain. We consider examples that illustrate various aspects of the non-monotonic reasoning phenomena, which have been used over the years as ``bench-marks'' for various formalisms, and translate them into Learning to Reason problems. We demonstrate that these have concise representations over the generalized domain and prove that these representations can be learned efficiently. The framework developed suggests an ``operational'' approach to studying reasoning that is nevertheless rigorous and amenable to analysis. We show that this approach efficiently supports reasoning with incomplete information and at the same time matches our expectations of plausible patterns of reasoning in cases where other theories do not. This work continues previous works in the Learning to Reason framework, and supports the thesis that in order to develop a computational account for commonsense reasoning one should study the phenomena of learning and reasoning together.

Learning to Reason with Restricted View

The current emphasis of the research in learning theory is on the study of inductive learning (from examples) of concepts (binary classifications of examples). The work in AI identifies other tasks, such as reasoning, as essential for intelligent agents, but those are not supported by the current learning models. The Learning to Reason framework was devised to reconcile inductive learning and efficient reasoning. The framework highlights the fact that new learning questions arise when {\em learning in order to reason}. This paper addresses the task of deductive reasoning, and investigates learning to reason problems in which the examples seen are only partially specified. The paper presents several interpretations for partial information in the interface with the environment, and develops model based representations and reasoning algorithms that are suitable to deal with partially observable worlds. Then, learning to reason algorithms that cope with partial information are developed. These results exhibit a tradeoff between learnability, the strength of the oracles used in the interface and the expressiveness of the queries asked. This work shows that one can learn to reason with respect to expressive worlds, that cannot be learned efficiently in the traditional learning framework and do not support efficient reasoning in the traditional reasoning framework.

Linearizable Read/Write Objects

We study the cost of implementing {\em linearizable} read/write objects for shared-memory multiprocessors under various assumptions on the available timing information. We take as cost measure the {\em worst-case response time} of performing an operation in distributed implementations of virtual shared memory consisting of such objects. It is assumed that processes have clocks that run at the same rate as real time and all messages incur a delay in the range $[d-u,d]$ for some known constants $u$ and $d$, $0 \leq u \leq d$. In the {\em perfect clocks} model, where processes have perfectly synchronized clocks and every message incurs a delay of exactly $d$, we present a family of optimal linearizable implementations, parameterized by a constant $\beta$, $0 \leq \beta \leq 1$, for which the worst-case response times for read and write operations are $\beta d$ and $(1-\beta)d$, respectively. The parameter $\beta$ may be appropriately chosen to account for the relative frequencies of read and write operations. Our main result is the first known linearizable implementation for the {\em imperfect clocks} model, where clocks are not initially synchronized and message delays can vary, i.e., $u > 0$; it achieves worst-case response times of less than $4u+b$ ($b>0$ is an arbitrarily small constant) and $d+3u$ for read and write operations, respectively. This implementation employs novel synchronization techniques in order to utilize the lower bound on message delay time and achieve bounds on worst-case response times that depend on the message delay uncertainty $u$. For a wide range of values of $u$, these bounds improve upon previously known ones for implementations that support consistency conditions even weaker than linearizability.

A Learning Approach to Shallow Parsing

A SNoW based learning approach to shallow parsing tasks is presented and studied experimentally. The shallow parsing method suggested learns to identify syntactic patterns by combining simple predictors to produce a coherent inference. Learned predictors are cascaded and their outcome is used as an input to an inference algorithm that produces the final phrases. We discuss the requirements from a learning system for this approach to be applicable and present a system based on the SNoW learning architecture that satisfies these conditions. Two instantiations of this approach are studied and experimental results for Noun-Phrases (NP) and Subject-Verb (SV) phrases that compare favorably with the best published results are presented. In doing that, we compare two ways of modeling the problem of learning to recognize patterns and suggest that shallow parsing patterns are better learned using open/close predictors than using inside/outside predictors.}

Linear concepts and hidden variables

We study a learning problem which allows for a ``fair'' comparison between unsupervised learning methods---probabilistic model construction, and more traditional algorithms that directly learn a classification. The merits of each approach are intuitively clear: inducing a model is more expensive computationally, but may support a wider range of predictions. Its performance, however, will depend on how well the postulated probabilistic model fits that data. To compare the paradigms we consider a model which postulates a single binary-valued {\em hidden variable\/} on which all other attributes depend. In this model, finding the most likely value of any one variable (given known values for the others) reduces to testing a linear function of the observed values. We learn the model with two techniques: the standard EM algorithm, and a new algorithm we develop based on covariances. We compare these, in a controlled fashion, against an algorithm (a version of {\em Winnow}) that attempts to find a good linear classifier directly. Our conclusions help delimit the fragility of using a model that is even ``slightly'' simpler than the distribution actually generating the data, vs. the relative robustness of directly searching for a good predictor.

Dan Roth