Abstracts of papers by Dan Roth
Relational Learning via Propositional Algorithms: An Information Extraction Case Study
This paper develops a new paradigm for relational learning which
allows for the representation and learning of relational information
using propositional means. This paradigm suggests different tradeoffs
than those in the traditional approach to this problem -- the ILP
approach -- and as a result it enjoys several significant advantages
over it.
In particular, the new paradigm is more flexible and allows the use of
any propositional algorithm, including probabilistic algorithms,
within it.
We evaluate the new approach on an important and
relation-intensive task - Information Extraction - and show that
it outperforms existing methods while being orders of magnitude
more efficient.
The Use of Classifiers in Sequential Inference
We study the problem of combining the outcomes of several different
classifiers in a way that provides a coherent inference that satisfies
some constraints. In particular, we develop two general approaches for
an important subproblem - identifying phrase structure. The first is
a Markovian approach that extends standard HMMs to allow the use of a
rich observation structure and of general classifiers to model
state-observation dependencies. The second is an extension of
constraint satisfaction formalisms. We develop efficient combination
algorithms under both models and study them experimentally in the
context of shallow parsing.
Learning to Recognize 3D objects
A learning account for the problem of object recognition is developed
within the PAC (Probably Approximately Correct) model of learnability.
The key assumption underlying this work is that objects can be
recognized (or, discriminated) using simple representations in terms
of ``syntactically" simple relations over the raw image. Although the
potential number of these simple relations could be huge, only a few
of them are actually present in each observed image and a fairly small
number of those observed is relevant to discriminating an object.
We show that these properties can be exploited to yield an efficient
learning approach in terms of sample and computational complexity,
within the PAC model. No assumptions are needed on the distribution
of the observed objects and the learning performance is quantified
relative to its past experience. Most importantly, the success of
learning an object representation is naturally tied to the ability to
represent it as a function of some intermediate representations
extracted from the image.
We evaluate this approach in a large scale experimental study in which
the SNoW learning architecture is used to learn representations for
the 100 objects in the Columbia Object Image Database (COIL-100).
Experimental results exhibit good generalization and robustness
properties of the SNoW-based method relative to other
approaches. SNoW's recognition rate degrades more gracefully when the
training data contains fewer views and it shows similar behavior also
in some preliminary experiments with partially occluded objects.
Learning in Natural Language: Theory and Algorithmic
Approaches
This article summarizes work on developing a learning theory account
for the major learning and statistics based approaches used in natural
language processing. It shows that these approaches can all be
explained using a single distribution free inductive principle related
to the pac model of learning. Furthermore, they all make predictions
using the same simple knowledge representation -- a linear
representation over a common feature space. This is significant both
to explaining the generalization and robustness properties of these
methods and to understanding how these methods might be extended to
learn from more structured, knowledge intensive examples, as part of a
learning centered approach to higher level natural language
inferences.
Toward a Theory of Learning Coherent Concepts
We develop a theory for learning scenarios where multiple learners
co-exist but there are mutual compatibility constraints on their
outcomes. This is natural in cognitive learning situations, where
``natural'' compatibility constraints are imposed on the outcomes of
classifiers so that a valid sentence, image or any other domain
representation is produced.
We suggest that work in this direction may help to resolve the
contrast between the hardness of learning as predicted by the current
theoretical models and the apparent ease at which cognitive systems
seem to learn.
A model of concept learning is studied in which the target concept is
required to {\em cohere} with other concepts of interest. The {\em
coherency} is expressed via a (Boolean) constraint that the concepts
have to satisfy. Under this model, learning a concept is shown to be
easier (in terms of sample complexity and mistake bounds) and the
concepts learned are shown to be more robust to noise in their input
(attribute noise). These properties are established for half spaces
and the connection to large margin theory is discussed.
Relational Representations that Facilitate Learning
Given a collection of objects in the world, along with some relations
that hold among them, a fundamental problem is how to learn
definitions of some relations and concepts of interest in terms of the
given relations. These definitions might be quite complex and,
inevitably, might require the use of quantified expressions. Attempts
to use first order languages for these purposes are hampered by the
fact that relational inference is intractable and, consequently, so is
the problem of learning relational definitions. This work develops an
expressive relational representation language that allows the use of
propositional learning algorithms when learning relational
definitions.
The representation serves as an intermediate level between a raw
description of observations in the world and a propositional learning
system that attempts to learn definitions for concepts and relations.
It allows for hierarchical composition of relational expressions that
can be evaluated efficiently on the observations and thus supports
learning complex definitions by learning simple functions of the
intermediate representations. The approach is illustrated using
examples from natural language and visual processing.
A Classification Approach to Word Prediction
The eventual goal of a language model is to accurately predict the
value of a missing word given its context. We present an approach to
word prediction that is based on learning a representation for each
word as a function of words and linguistics predicates in its
context. This approach raises a few new questions that we address.
First, in order to learn good word representations it is necessary to
use an expressive representation of the context. We present a way that
uses external knowledge to generate expressive context
representations, along with a learning method capable of handling the
large number of features generated this way that can, potentially,
contribute to each prediction. % Second, since the number of words
``competing'' for each prediction is large, there is a need to ``focus
the attention'' on a smaller subset of these. We exhibit the
contribution of a ``focus of attention'' mechanism to the performance
of the word predictor. Finally, we describe a large scale
experimental study in which the approach presented is shown to yield
significant improvements in word prediction tasks.
Learning to Recognize Objects
A learning account for the problem of object recognition is developed
within the PAC (Probably Approximately Correct) model of learnability.
The proposed approach makes no assumptions on the distribution of the
observed objects, but quantifies success relative to its past
experience. Most importantly, the success of learning an object
representation is naturally tied to the ability to represent it as a
function of some intermediate representations extracted from the
image.
We evaluate this approach in a large scale experimental study in which
the SNoW learning architecture is used to learn representations for
the 100 objects in the Columbia Object Image Database (COIL-100). The
SNoW-based method is shown to outperform other methods in terms of
recognition rates; its performance degrades gracefully when the
training data contains fewer views and in the presence of occlusion
noise.
View-Based 3D Object Recognition Using SNoW
This paper describes a novel view-based algorithm
for 3D object recognition using a network of linear units.
The SNoW learning architecture is a sparse network of
linear functions over a pre-defined or incrementally learned feature
space and is specifically tailored for learning in the presence of a
very large number of features.
We use the pixel-level representation in the experiments and
compare the performance of SNoW
with Support Vector Machines and nearest neighbor methods
on 3D object recognition using the 100 objects in
the Columbia Image Object Database (COIL-100).
Experimental results show that SNoW-based method outperform SVM-based
system in terms
of recognition rate and the computational cost involved in learning.
The empirical results also provide insight for practical and theoretical
considerations on view-based methods for 3D object recognition.
Learning in Natural Language
Statistics-based classifiers in natural language are developed
typically by assuming a generative model for the data, estimating its
parameters from training data and then using Bayes rule to obtain a
classifier. For many problems the assumptions made by the generative
models are evidently wrong, leaving open the question of why these
approaches work.
This paper presents a learning theory account of the major statistical
approaches to learning in natural language. A class of Linear
Statistical Queries (LSQ) hypotheses is defined and learning with it
is shown to exhibit some robustness properties. Many statistical
learners used in natural language, including naive Bayes, Markov
Models and Maximum Entropy models are shown to be LSQ
hypotheses, explaining the robustness of these predictors even when
the underlying probabilistic assumptions do not hold.
This coherent view of when and why learning approaches work in this
context may help to develop better learning methods and an
understanding of the role of learning in natural language inferences.
Relational Learning for NLP using Linear Threshold Elements
We describe a coherent view of learning and reasoning with
relational representations in the context of natural language processing.
In particular, we discuss the Neuroidal Architecture,
Inductive Logic Programming and the SNoW system
explaining the relationships among these, and thereby offer an
explanation of the theoretical basis for the SNoW system.
We suggest that extensions of this system along the lines suggested by
the theory may provide new levels of scalability and functionality.
A Learning Approach to Shallow Parsing
A SNoW based learning approach to shallow parsing tasks is presented
and studied experimentally. The approach learns to identify syntactic
patterns by combining simple predictors to produce a coherent
inference. Two instantiations of this approach are studied and
experimental results for Noun-Phrases (NP) and Subject-Verb (SV)
phrases that compare favorably with the best published results are
presented.
In doing that, we compare two ways of modeling the problem of
learning to recognize patterns and suggest that shallow parsing
patterns are better learned using open/close predictors than
using inside/outside predictors.}
Word Prediction and Clustering
We study the simultaneous learning of many word predictors
and the relation it has to word similarity and clustering. Our
approach is shown to be robust and perform quite well even when the
competing words are very similar. Some of the issues involved in
scaling up to learn many predictors are investigated, as well as how
this approach gives rise to the emergence of semantic categories.
Memory Based Learning in NLP
We study memory-based learning methods and show that they can be
viewed as learning linear predictors over commonly used feature
spaces.
We suggest that this view allows one to study memory based methods and
other successful learning algorithms used in NLP within the same
computational framework and may result in improved algorithms and a
better understanding for the role of learning in natural language
inferences.
A Winnow-Based Approach to Context-Sensitive Spelling Correction
A large class of machine-learning problems in natural language require
the characterization of linguistic context. Two characteristic
properties of such problems are that their feature space is of very
high dimensionality, and their target concepts depend on only a small
subset of the features in the space. Under such conditions,
multiplicative weight-update algorithms such as Winnow have been shown
to have exceptionally good theoretical properties. In the work
reported here, we present an algorithm combining variants of Winnow
and weighted-majority voting, and apply it to a problem in the
aforementioned class: Context-Sensitive Spelling Correction.
This is the task of fixing spelling errors that happen to result
in valid words, such as substituting to for too,
casual for causal, and so on.
We evaluate our algorithm, WinSpell,
by comparing it against BaySpell, a statistics-based method
representing the state of the art for this task.
We find: (1) When run with a full (unpruned) set of features, WinSpell
achieves accuracies significantly higher than BaySpell was able to
achieve in either the pruned or unpruned condition; (2) When compared
with other systems in the literature, WinSpell exhibits the highest
performance; (3) While several aspects of WinSpell's architecture
contribute to its superiority over BaySpel, the primary factor is that
it is able to learn a better linear separator than BaySpell learns;
(4) When run on a test set drawn from a different corpus than the
training set was drawn from, WinSpell is better able than BaySpell to
adapt, using a strategy we will present that combines supervised
learning on the training set with unsupervised learning on the (noisy)
test set.
Learning to Resolve Natural Language Ambiguities:
A Unified Approach
We analyze a few of the commonly used statistics based and machine
learning algorithms for natural language disambiguation tasks and
observe that they can be re-cast as learning linear separators in the
feature space. Each of the methods makes a priori assumptions, which
it employs, given the data, when searching for its hypothesis.
Nevertheless, as we show, it searches a space that is as rich as the
space of {\em all} linear separators. We use this to build an
argument for a data driven approach which merely searches for a {\em
good} linear separator in the feature space, without further
assumptions on the domain or a specific problem.
We present such an approach - a sparse network of linear
separators, utilizing the Winnow learning algorithm - and show how to
use it in a variety of ambiguity resolution problems. The learning
approach presented is attribute-efficient and, therefore, appropriate
for domains having very large number of attributes.
In particular, we present an extensive experimental comparison of our
approach with other methods on several well studied lexical
disambiguation tasks such as context-sensitive spelling correction,
prepositional phrase attachment and part of speech tagging. In all
cases we show that our approach either outperforms other methods tried
for these tasks or performs comparably to the best.
Part of Speech Tagging Using a Network of Linear Separators
We present an architecture and an on-line learning algorithm and apply
it to the problem of part-of-speech tagging. The architecture
presented, \snow, is a network of linear separators in the feature space,
utilizing the Winnow update algorithm.
Multiplicative weight-update algorithms such as Winnow have been shown
to have exceptionally good behavior when applied to
very high dimensional problems,
and especially when the target concepts depend on only a small subset
of the features in the feature space. In this paper we describe an
architecture that utilizes this mistake-driven algorithm for
multi-class prediction -- selecting the part of speech of a word.
The experimental analysis presented here provides more evidence to
that these algorithms are suitable for natural language problems.
The algorithm used is an on-line algorithm: every example is used by
the algorithm only once, and is then discarded. This has significance
in terms of efficiency, as well as quick adaptation to new contexts.
We present an extensive experimental study of our algorithm under
various conditions; in particular, it is shown that the algorithm
performs comparably to the best known algorithms for pos.
Incorporating Knowledge in Natural Language Learning: A Case Study
Incorporating external information during a learning process is
expected to improve its efficiency. We study a method for
incorporating noun-class information, in the context of learning to
resolve Prepositional Phrase Attachment (PPA) disambiguation. This is
done within a recently introduced architecture, \snow, a sparse
network of threshold gates utilizing the Winnow learning
algorithm. That architecture has already been demonstrated to perform
remarkably well on a number of natural language learning tasks.
The knowledge sources used were compiled from the WordNet database for
general linguistic purposes, irrespective of the \ppa\ problem, and
are being incorporated into the learning algorithm by enriching its
feature space. We study two strategies of using enriched features and
the effects of using class information at different granularities, as
well as randomly-generated knowledge which serves as a control set.
Incorporating external knowledge sources within \snow\ yields a
statistically significant performance improvement. In addition, we
find an interesting relation between the granularity of the knowledge
sources used and the magnitude of the improvement. The encouraging
results with noun-class data provide a motivation for carrying out
more work on generating better linguistic knowledge sources.
Clustering Appearances of 3D Objects
We introduce a method for unsupervised clustering of images of 3D
objects. Our method examines the space of all images and partitions
the images into sets that form smooth and parallel surfaces in this
space. It further uses sequences of images to obtain more reliable
clustering. Finally, since our method relies on a non-Euclidean
similarity measure we introduce algebraic techniques to estimating
local properties of these surfaces without first embedding the images
in a Euclidean space. We demonstrate our method by applying it to a
large database of images.
On the Hardness of Approximate Reasoning
Many AI problems, when formalized, reduce to evaluating the probability that
a propositional expression is true. In this paper we show that this problem
is computationally intractable even in surprisingly restricted cases and
even if we settle for an approximation to this probability.
We consider various methods used in approximate reasoning such as computing
degree of belief and Bayesian belief networks, as well as reasoning techniques
such as constraint satisfaction and knowledge compilation,
that use approximation to avoid computational difficulties,
and reduce them to model-counting problems over a propositional domain.
We prove that counting satisfying assignments of propositional languages
is intractable even for Horn and monotone formulae, and even when the size of
clauses and number of occurrences of the variables are extremely limited.
This should be contrasted with the case of deductive reasoning, where
Horn theories and theories with binary clauses are distinguished by the
existence of linear time satisfiability algorithms.
What is even more surprising is that, as we show, even
approximating the number of satisfying assignments (i.e.,
``approximating'' approximate reasoning),
is intractable for most of these restricted theories.
We also identify some restricted classes of propositional formulae for which
efficient algorithms for counting satisfying assignments can be given.
Reasoning with Models
We develop a model-based approach to reasoning, in which the knowledge
base is represented as a set of models (satisfying assignments) rather than
a logical formula, and the set of queries is restricted.
We show that for every propositional knowledge base ($KB$)
there exists a set of {\em characteristic models} with the property that a
query is true in $KB$ if and only if it is satisfied by the
models in this set.
We characterize a set of functions
for which the model-based representation is compact and provides
efficient reasoning.
These include cases where the formula-based representation does not
support efficient reasoning.
In addition, we consider the model-based approach to {\em abductive reasoning}
and show that for any propositional $KB$, reasoning with its model-based
representation yields an abductive
explanation in time that is polynomial in its size.
Some of our technical results make use of the {\em Monotone Theory},
a new characterization of Boolean functions
recently introduced.
The notion of {\em restricted queries} is inherent in our approach.
This is a wide class of queries for which reasoning is efficient
and exact, even when the model-based representation $KB$ provides only an approximate
representation of the domain in question.
Moreover, we show that the theory developed here generalizes
the model-based approach to reasoning with Horn expressions
and captures even the notion of reasoning with
Horn-approximations.
On Learning Read-k-Satisfy-j DNF
We study the learnability of Read-$k$-Satisfy-$j$ (RkSj) DNF formulas.
These are boolean formulas in disjunctive normal form (DNF), in which
the maximum number of occurrences of a variable is bounded by $k$,
and the number of terms satisfied by any assignment is at most $j$.
After motivating the investigation of this class of DNF formulas,
we present an algorithm that for any unknown
RkSj DNF formula to be learned, with high probability
finds a logically equivalent DNF formula using the well-studied
protocol of equivalence and membership queries. The algorithm runs in
polynomial time for $k\cdot j=O({\log n\over\log\log n})$, where $n$
is the number of input variables.
Finding the Maximum Area Axis-Parallel Rectangle in a Polygon
We consider the geometric optimization problem of finding the \MAAPR~
(MAAPR) in an $n$-vertex general polygon. We characterize the MAAPR
for general polygons by considering different cases based on the types
of contacts between the rectangle and the polygon. We present a
general framework for solving a key subcase of the MAAPR problem which
dominates the running time for a variety of polygon types.
Using this framework, we obtain the following
MAAPR time results: $\Theta(n)$ for xy-monotone polygons, ${\rm O}(n
\alpha(n))$ for \OCPs, ${\rm O}(n \alpha(n) \log n)$ for
horizontally (vertically) convex polygons, (where $\alpha(n)$ is the
slowly growing inverse of Ackermann's function), ${\rm O}(n \log n)$
for a special type of horizontally convex polygon, and ${\rm O}(n
\log^2 n)$ for general polygons.
For all these types of non-rectilinear polygons, we match the running
time of the best known algorithms for their rectilinear counterparts.
We prove a lower bound of time in $\Omega(n \log n)$ for finding the
MAAPR in both self-intersecting polygons and general polygons with
holes. The latter result gives us both a lower bound of $\Omega(n
\log n)$ and an upper bound of ${\rm O}(n
\log^2 n)$ for general polygons with holes.
On Learning Visual Concepts and DNF Formulae
We consider the problem of learning DNF formulae in the mistake-bound and
the PAC models. We develop a new approach, which is called
{\em polynomial explainability}, that is shown to be useful for
learning some new subclasses of DNF (and CNF) formulae that were not known to be
learnable before. Unlike previous learnability results for DNF
(and CNF) formulae,
these subclasses are not limited in the number of terms or
in the number of variables performance term; yet, they contain the subclasses of
$k$-DNF and $k$-term-DNF (and the corresponding classes of CNF)
as special cases.
We apply our DNF results
to the problem of learning {\em visual concepts}
and obtain learning algorithms for several natural subclasses of visual
concepts that appear to have no natural boolean counterpart.
On the other hand, we show that learning some other natural subclasses of
visual concepts is as hard as learning the class of all DNF formulae.
We also consider the robustness of these results under various types of
noise.
Defaults and Relevance in Model Based Reasoning
Reasoning with model-based representations is an intuitive paradigm,
which has been shown
to be theoretically sound and to possess some computational advantages
over reasoning with formula-based representations of knowledge.
In this paper we present more evidence to the value of such representations.
Our results hinge on the notion of relevance, and model based representations
are shown to be useful in capturing relevant information, and in
allowing to ignore irrelevant information.
In particular, we consider situations where
context-specific information is used in the process of reasoning.
We show that reasoning with
model-based representations can be done efficiently
in the presence of varying context information.
We then consider the task of default reasoning.
We show that default reasoning is a generalization of reasoning within
context, in which the reasoner has many ``context" rules, which
may be conflicting.
We develop model-based algorithms that handle efficiently fragments
of Reiter's default logic.
Our intuition about relevance is best captured in the model for reasoning
within context, where model-based representations enable us to filter out
irrelevant information. Interestingly,
default logic is somewhat in contrast with our intuition about relevance.
Default rules do not tell us explicitly what
the context information is. Instead, we have to figure out what are the
possible ``extensions" and then use those as possible contexts.
As we show, model-based representations capture all possible extensions
in an accessible form, thereby supporting efficient default reasoning
whenever possible.
Lastly, we argue that these results support an incremental view of
reasoning in a natural way. We discuss the Learning to Reason framework,
which emphasizes this view, and the notion of relevance as manifested in
it. In particular, we discuss results on Learning to Reason in which
model-based representations are used to represent relevant knowledge.
Reasoning with Examples:
Propositional Formulae and Database Dependencies
For humans, looking at how concrete examples behave is an intuitive way of
deriving conclusions.
The drawback with this method is that it does not necessarily
give the correct results.
However, under certain conditions example-based deduction
can be used to obtain a correct and complete inference procedure.
This is the case for Boolean formulae (reasoning with models)
and for certain types of database integrity constraints
(the use of Armstrong relations).
We show that these approaches are closely related,
and use the relationship to prove new results about
the existence and sizes of Armstrong relations for Boolean dependencies.
Further, we study the problem of translating between
different representations of relational databases, in particular
we consider Armstrong
relations and Boolean dependencies,
and prove some positive results in that context.
Finally, we discuss the close relations between
the questions of finding keys in relational databases and that of finding
abductive explanations.
Learning to Reason
We introduce a new framework for the study of reasoning.
The Learning (in order) to Reason approach developed here
combines the interfaces to the world used by known learning models
with the reasoning task and a performance criterion suitable for it.
In this framework the intelligent agent is given access to her favorite
learning interface, and is also given a grace period in which she
can interact with this interface and construct her representation
KB of the world $W$.
Her reasoning performance is measured only after this period,
when she is presented with queries $\alpha$ from some query language,
relevant to the world, and has to answer whether $W$ implies $\alpha$.
The approach is meant to overcome the main computational
difficulties in the traditional treatment of
reasoning which stem from its separation from the ``world''.
First, by allowing the reasoning task to interface the world
(as in the known learning models),
we avoid the rigid syntactic restriction on the intermediate knowledge
representation.
Second, we make explicit the dependence of the reasoning performance
on the input from the environment. This is possible only because the
agent interacts with the world when constructing her knowledge representation.
We show how previous results from learning theory and reasoning fit
into this framework and illustrate
the usefulness of the Learning to Reason approach by
exhibiting new results that are not possible in the traditional setting.
First, we give a Learning to Reason algorithm for a class of propositional
languages for which there are no efficient reasoning algorithms,
when represented as a traditional (formula-based) knowledge base.
Second, we exhibit a Learning to Reason algorithm for a class of
propositional languages that is not known to be learnable in the
traditional sense.
Mistake-Driven Learning in Text Categorization
Learning problems in the text processing domain often map the text to
a space whose dimensions are the measured features of the text, e.g.,
its words. Three characteristic properties of this domain are (a) very
high dimensionality, (b) both the learned concepts and the instances
reside very sparsely in the feature space, and (c) a high variation in
the number of active features in an instance. In this work we study
three mistake-driven learning algorithms for a typical task of this
nature -- text categorization.
We argue that these algorithms --
which categorize documents by learning a linear separator in the feature space --
have a few properties that make them ideal for this domain. We then
show that a quantum leap in performance is achieved when we further
modify the algorithms to better address some of the specific
characteristics of the domain. In particular, we
demonstrate (1) how variation in document length can be tolerated by
either normalizing feature weights or by using negative
weights, (2) the positive effect of applying a threshold range in
training, (3) alternatives in considering feature frequency, and (4)
the benefits of discarding features while training.
Overall, we present an algorithm, a variation of Littlestone's Winnow,
which performs significantly better than any other algorithm
tested on this task using a similar feature set.
Applying Winnow to Context-Sensitive Spelling Correction
Multiplicative weight-updating algorithms such as Winnow
have been studied extensively in the COLT literature,
but only recently have people started to use them in applications.
In this paper, we apply a Winnow-based algorithm
to a task in natural language: context-sensitive spelling correction.
This is the task of fixing spelling errors that happen to result
in valid words, such as substituting {\it to\/} for {\it too},
{\it casual\/} for {\it causal}, and so on.
Previous approaches to this problem have been statistics-based;
we compare Winnow to one of the more successful such approaches,
which uses Bayesian classifiers.
We find that: (1)~When the standard (heavily-pruned) set of features
is used to describe problem instances, Winnow performs comparably to
the Bayesian method;
(2)~When the full (unpruned) set of features is used,
Winnow is able to exploit
the new features and convincingly outperform Bayes;
and (3)~When a test set is encountered that is dissimilar
to the training set, Winnow is better than Bayes
at adapting to the unfamiliar test set,
using a strategy we will present for combining learning
on the training set with unsupervised learning on the (noisy) test set.
Learning Active Classifiers
Many classification algorithms are ``passive'', in that they assign a
class-label to each instance based only on the description given, even
if that description is incomplete. In contrast, an *active* classifier
can --- at some cost --- obtain the values of missing attributes, before
deciding upon a class label. The expected utility of using an active
classifier depends on both the cost required to obtain the additional
attribute values and the penalty incurred if it outputs the wrong
classification. This paper considers the problem of *learning* near-optimal
active classifiers, using a variant of the probably-approximately-correct
(PAC) model. After defining the framework --- which is perhaps the main
contribution of this paper --- we describe a situation where this task can be
achieved efficiently, but then show that the task is often intractable.
A Connectionist Framework for Reasoning: Reasoning with Examples
We present a connectionist architecture that supports
almost instantaneous deductive and abductive reasoning.
The deduction algorithm responds in few steps for single rule queries
and in general, takes time that is linear with the number of rules in the query.
The abduction algorithm produces an explanation in few steps and the best
explanation in time linear with the size of the assumption set.
The size of the network is polynomially related to the size of other
representations of the domain, and may even be smaller.
We base our connectionist model on Valiant's Neuroidal model
\cite{Valiant94}
and thus make minimal assumptions about the computing
elements, which are assumed to be classical threshold elements
with states.
Within this model we develop a reasoning framework that utilizes
a model-based approach to reasoning \cite{\KKS,\KRm}.
In particular, we suggest to interpret the connectionist architecture
as encoding {\em examples} of the domain we reason about
and show how to perform various reasoning tasks with this interpretation.
We then show that the representations used can be acquired efficiently
from interactions with the environment and discuss how this learning process
influences the reasoning performance of the network.
Learning to Reason: The Non-Monotonic Case
We suggest a new approach for the study of the non-monotonicity
of human commonsense reasoning.
The two main premises that underlie this work are that
commonsense reasoning is an inductive phenomenon, and that
missing information in the interaction of the agent
with the environment may be as informative for future interactions
as observed information.
This intuition is formalized and the problem of reasoning from incomplete
information is presented as a problem of learning attribute functions
over a generalized domain.
We consider examples that illustrate various aspects of the non-monotonic
reasoning phenomena, which have been used over the years as ``bench-marks''
for various formalisms, and translate them into Learning to Reason problems.
We demonstrate that these
have concise representations over the generalized domain and
prove that these representations can be learned efficiently.
The framework developed suggests an ``operational'' approach to
studying reasoning that is nevertheless rigorous and amenable to analysis.
We show that this approach efficiently supports
reasoning with incomplete information
and at the same time matches our expectations of plausible
patterns of reasoning in cases where other theories do not.
This work continues previous works in the Learning to Reason framework,
and supports the thesis that in order to develop a computational account for
commonsense reasoning one should study the phenomena of learning and reasoning
together.
Learning to Reason with Restricted View
The current emphasis of the research in learning theory is on the study of inductive learning
(from examples) of concepts (binary classifications of examples).
The work in AI identifies other tasks, such as reasoning, as essential
for intelligent agents, but those are not supported by the current
learning models.
The Learning to Reason framework
was devised to reconcile inductive learning and efficient reasoning.
The framework highlights the fact that new learning questions
arise when {\em learning in order to reason}.
This paper addresses the task of deductive reasoning, and
investigates learning to reason problems in which the
examples seen are only partially specified.
The paper presents several interpretations for partial information
in the interface with the environment, and develops model based
representations and reasoning algorithms that are suitable to deal
with partially observable worlds. Then, learning to reason algorithms
that cope with partial information are developed.
These results exhibit a tradeoff between learnability,
the strength of the oracles used in the interface
and the expressiveness of the queries asked.
This work shows that one can learn to reason with
respect to expressive worlds, that cannot be learned efficiently
in the traditional learning framework and do not support efficient
reasoning in the traditional reasoning framework.
Linearizable Read/Write Objects
We study the cost of implementing {\em linearizable} read/write
objects for shared-memory multiprocessors
under various assumptions on the available timing information.
We take as cost measure the {\em worst-case response time} of performing an
operation in distributed
implementations of virtual shared memory consisting of such
objects.
It is assumed that processes have clocks that run at the same rate as
real time and all messages incur a delay in the range $[d-u,d]$ for
some known constants $u$ and $d$, $0 \leq u \leq d$.
In the {\em perfect clocks} model, where processes have perfectly
synchronized clocks and every message incurs a delay of exactly $d$,
we present a family of optimal linearizable implementations,
parameterized by a constant $\beta$, $0 \leq \beta \leq 1$, for which
the worst-case response times for read and write operations are $\beta
d$ and $(1-\beta)d$, respectively.
The parameter $\beta$ may be appropriately chosen to account for
the relative frequencies of read and write operations.
Our main result is the first known linearizable implementation for
the {\em imperfect clocks} model, where clocks are not initially
synchronized and message delays can vary, i.e., $u > 0$; it
achieves worst-case response times of less than $4u+b$ ($b>0$ is an
arbitrarily small constant) and $d+3u$ for read and write operations,
respectively.
This implementation employs novel synchronization techniques in order
to utilize the lower bound on message delay time and achieve bounds on
worst-case response times that depend on the message delay uncertainty
$u$. For a wide range of values of $u$, these bounds improve upon
previously known ones for implementations that support
consistency conditions even weaker than linearizability.
A Learning Approach to Shallow Parsing
A SNoW based learning approach to shallow parsing tasks is presented
and studied experimentally. The shallow parsing method suggested
learns to identify syntactic patterns by combining simple predictors
to produce a coherent inference. Learned predictors are cascaded and
their outcome is used as an input to an inference algorithm that
produces the final phrases.
We discuss the requirements from a learning system for this
approach to be applicable and present a system based on the SNoW
learning architecture that satisfies these conditions.
Two instantiations of this approach are studied and experimental
results for Noun-Phrases (NP) and Subject-Verb (SV) phrases that
compare favorably with the best published results are presented.
In doing that, we compare two ways of modeling the problem of
learning to recognize patterns and suggest that shallow parsing
patterns are better learned using open/close predictors than
using inside/outside predictors.}
Linear concepts and hidden variables
We study a learning problem which allows for a ``fair'' comparison
between unsupervised learning methods---probabilistic model
construction, and more traditional algorithms that directly learn a
classification.
The merits of each approach are intuitively clear: inducing a model is
more expensive computationally, but may support a wider range of
predictions. Its performance, however, will depend on how well the
postulated probabilistic model fits that data.
To compare the paradigms we consider a model which postulates a single
binary-valued {\em hidden variable\/} on which all other attributes
depend. In this model, finding the most likely value of any one
variable (given known values for the others) reduces to testing a
linear function of the observed values.
We learn the model with two techniques: the standard EM algorithm, and
a new algorithm we develop based on covariances. We compare these, in
a controlled fashion, against an algorithm (a version of {\em Winnow})
that attempts to find a good linear classifier directly.
Our conclusions help delimit the fragility of using a model that is
even ``slightly'' simpler than the distribution actually generating
the data, vs. the relative robustness of directly searching for a good
predictor.
Dan Roth