As semantic technologies mature, more and more people come to understand that automated reasoning is hard, though not necessarily why it is hard. This blog post explains in detail, using a very simple, easy-to-understand example, why automated reasoning is computationally hard.
By the way, if you want to play with the toy example, we’ve put it up as
a gist. Try the Pellet command-line,
pellet realize non-determinism.owl.
In order to understand why expressive reasoning is computationally hard, let’s take a brief trip back to Computer Science 101.
Computational Complexity; or, Why Does the Universe Hate Us?
Computational complexity is the technical term for, roughly, how much work a computer (any computer) has to perform in order to determine an answer to some request. What does “how much work” mean? There are two measures—time and space (i.e., computer memory). In other words, how long the computer runs and how much space the computer uses, respectively.
There are three interesting measures of complexity: best case, worst case, and average case. Computational complexity usually focuses on worst case complexity; that is, given the very hardest data, how much work does a computer have to do when executing an algorithm to solve a problem? Average case complexity is also important, although it is typically much harder to determine.
Wait, that was a lie: there’s a fourth kind and it’s the most important: your case. Given the data and questions you care about, how much work does the computer (any computer) have to do?
Non-Determinism and Expressive Reasoning
Okay, so let’s turn to the toy example: it’s simply some axioms (i.e., logical definitions) about the university domain, including students, professors, instructors, courses, etc.
First, we say that a student is either a graduate student or an undergraduate student: Student equivalentTo GraduateStudent or UndergraduateStudent. Then we define an instructor as either a professor or a graduate student: Instructor equivalentTo Professor or GraduateStudent. Next we say that a person cannot be both a student and a professor: Professor disjointWith Student. (Again, this is a toy example to explain the underlying difficulties; the modeling is intentionally simplistic.)
So far we’ve only said things about the schema. Now let’s say something
about the data. We have a person who is an instructor and a student; or, put another way, there’s an individual in the knowledge base that’s an instance of both the Instructor and Student classes: John type Instructor, Student.
So far, so simple. Right? Given this data, let’s answer a simple question: is John a grad student?
What Does Pellet Have to Do?
Can Pellet conclude from this data that John is an instance of GraduateStudent? To infer this relation, Pellet must consider different possibilities and backtrack whenever it makes a wrong choice. Every time it has to backtrack, because it made the wrong (essentially random) choice, it has to do more work.

Since we know that John is an Instructor, Pellet knows it needs to pick either Professor or GraduateStudent for him. If it picked Professor, it would see that there is a problem: Professor is disjoint with Student, and John is known to be an instance of Student. So it would backtrack and pick GraduateStudent instead.
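Pellet’s actual tableau algorithm is far more sophisticated, but the choose-and-backtrack pattern it faces for John can be sketched in a few lines of plain Python. Everything below (the function names and the way disjointness is encoded) is an illustrative assumption based on the axioms in this post, not Pellet’s implementation:

```python
# Toy sketch of "choose a branch, clash, backtrack" -- an illustration
# of non-deterministic reasoning, NOT Pellet's tableau algorithm.

# Instance data: what we know about John.
known_types = {"Instructor", "Student"}

# Schema: an Instructor is either a Professor or a GraduateStudent,
# and Professor is disjoint with Student.
instructor_choices = ["Professor", "GraduateStudent"]
disjoint_pairs = {("Professor", "Student"), ("Student", "Professor")}

def consistent(types):
    """A set of types is consistent if it contains no disjoint pair."""
    return all((a, b) not in disjoint_pairs for a in types for b in types)

def classify(known):
    """Try each branch of the disjunction; backtrack on a clash."""
    for choice in instructor_choices:
        candidate = known | {choice}
        if consistent(candidate):
            return choice  # this branch survives
        # Clash: backtrack and try the next branch.
    return None  # no branch is consistent: the data is contradictory

print(classify(known_types))  # GraduateStudent
```

The Professor branch clashes with the disjointness axiom, so the search backtracks and only the GraduateStudent branch survives; the wasted work on the failed branch is exactly the cost of non-determinism.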
Why is this simple domain hard? Because it contains uncertainty: instructors are sometimes professors; but sometimes they are grad students. We may not know which is which in any particular case without knowing some more information.
In our modeling, we’ve encoded uncertainty that’s the result of incomplete information (not knowing explicitly, for every instructor, whether they are a professor or a grad student). Pellet will reason about this incomplete, uncertain information, but doing so is hard because it is non-deterministic; that is, it requires making choices that may be wrong, which means the reasoner has to do more work.
Pellet includes many optimizations to make the right choices and learn from the mistakes it made in the past. However, doing so in a domain-independent way is not easy, which means it’s hard to give guarantees about performance in every possible case. That’s why we said the only case that really matters is your case.
Non-Determinism is Expressive…And Hard!
As you can imagine, having many of these non-deterministic choices during reasoning complicates things. Adding or removing statements can invalidate choices Pellet has already made, forcing it to repeat a lot of work.
To sum up, the need for non-determinism in reasoning is a result of uncertainty in the data, where we have incomplete information about John. The data tells us John is a student, but it does not say if he is a graduate or undergraduate. If we knew this information, we could use a simpler, deterministic reasoning method.
Let’s give another example. A course has exactly two instructors, a professor and a graduate student: Course subClassOf (hasInstructor exactly 2 Instructor) and (hasInstructor exactly 1 Professor) and (hasInstructor exactly 1 GraduateStudent). And let’s say that a publication with at least one student author is a student publication: StudentPublication equivalentTo Publication and hasAuthor min 1 Student.
Then we have some more instance data saying CS501 has instructors Alice and Bob, and there’s a publication which Alice and Bob author:
CS501 type Course; hasInstructor Alice, Bob.
AliceBobPublication type Publication; hasAuthor Alice, Bob.
Given this information, can Pellet infer that AliceBobPublication is a StudentPublication? Yes, it can, even though we don’t know which of Alice or Bob is the student. But we do know that one of them must be a student, because every course has exactly one graduate student instructor, and every graduate student is a student. It doesn’t matter whether Alice or Bob is the student; we know one of them must be, and that’s sufficient.
Pellet again will try different possibilities: either Alice is a
professor and Bob is a student, or Bob is a professor and Alice is a
student. Reasoning is even harder in this case because either possibility
might be true, and Pellet needs to try every possibility to infer that
AliceBobPublication has at least one student author. So
the reasoner performs a case-by-case analysis to conclude this
inference holds in every possible world. Again, if we had the most specific
types for Alice and Bob in the data then we wouldn’t need this analysis.
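The case-by-case analysis above can be sketched in plain Python as well. This sketch enumerates the “possible worlds” consistent with the cardinality restrictions and checks the inference in each one; the names and the encoding are assumptions for illustration, not Pellet’s algorithm:

```python
from itertools import product

# Toy case-by-case analysis for the CS501 example: exactly one of the two
# instructors is a Professor and exactly one is a GraduateStudent, but we
# don't know which. (Illustration only, not Pellet's tableau algorithm.)

instructors = ["Alice", "Bob"]
authors = ["Alice", "Bob"]

def possible_worlds():
    """Yield type assignments where exactly one instructor is a Professor
    and exactly one is a GraduateStudent, as the Course axiom requires."""
    for types in product(["Professor", "GraduateStudent"],
                         repeat=len(instructors)):
        if types.count("Professor") == 1 and types.count("GraduateStudent") == 1:
            yield dict(zip(instructors, types))

def is_student(person, world):
    # In the toy ontology, every GraduateStudent is a Student.
    return world[person] == "GraduateStudent"

# The inference holds only if the publication has at least one student
# author in EVERY possible world -- a case-by-case analysis.
entailed = all(any(is_student(a, w) for a in authors)
               for w in possible_worlds())
print(entailed)  # True
```

There are only two possible worlds here (Alice the professor, or Bob), and in both the publication has a student author, so the entailment holds. With n unresolved individuals the number of worlds grows exponentially, which is why this kind of reasoning is expensive.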
That’s why we know the universe hates us and wants us to be unhappy: nearly all useful advanced computer science is computationally very hard. The universe is a cruel and bitter tease.
Expressive reasoning of the non-deterministic variety can be helpful, especially when the instance data has uncertainty or is incomplete. However, this comes at a price of higher complexity. Pellet includes many optimizations and, in practice, the average case complexity is usually not that bad; but there is always some unpredictability associated with non-determinism.
If you are tempted to conclude that all this talk about uncertainty and incomplete information isn’t a practical problem, because in the real world databases and other applications always know everything that’s relevant, then you are forgetting about information integration, the eternal bugaboo of information technology in the real world.
When information relevant to some task exists in many databases or applications—or is spread all over the Web in many different sites—then from the point of view of any single system, its information is incomplete and uncertain. And even after doing the hard work of integrating and aligning these diverse systems, some uncertainty and incompleteness remains. Reasoners and other semantic technologies help solve these kinds of problems.