Introducing Spanner: From Documents to Linked Data Apps

TL;DR? Spanner is a new product that automagically turns documents (of any kind) into a full-featured semantic web—or, if you prefer, “Linked Data”—application that is easily customized or extended via JavaScript.

Interested? Read on for the long play version, where I address some common objections to semantic technologies to show how Spanner handles them.

From What to What?

A common lament about semantic technologies: where do the RDF and OWL come from? If you need to integrate a lot of databases and other structured sources, converting to RDF and OWL is feasible. In fact, once you know how to do it, it’s pretty simple.

But what if your data is unstructured, i.e., what if most of it lives in documents?

Spanner extracts information from documents and converts that information into RDF and OWL. It’s especially good at entity extraction when a gazetteer (a list of names of entities, organized by type) is provided, but it works reasonably well without one, too. Spanner uses machine learning to discover connections between documents, between entities, and between entities and documents; it will also learn categories or tags for documents if some of the documents are already tagged. Finally, it will extract keywords and a single key sentence from every document.

It works for pretty much any kind of MS Office document, as well as HTML, plain text, email, PDF, etc.

We call this part of the process “data bootstrapping”, and it’s fully automated (of course, if you provide gazetteers or other inputs, the quality of the results improves, but the only required input is documents).
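To give a flavor of what “fully automated” means in practice, here’s a minimal sketch of driving the bootstrapping step over HTTP. The endpoint path, field names, and response shape below are illustrative assumptions, not the final API:

```javascript
// Hypothetical sketch: upload a document (plus an optional gazetteer)
// to a bootstrapping endpoint. Paths and field names are assumptions.
import { readFile } from "node:fs/promises";

async function bootstrap(docPath, gazetteerPath) {
  const form = new FormData();
  form.append("document", new Blob([await readFile(docPath)]), docPath);
  if (gazetteerPath) {
    // A gazetteer (typed list of entity names) improves extraction quality.
    form.append("gazetteer", new Blob([await readFile(gazetteerPath)]), gazetteerPath);
  }
  const res = await fetch("http://localhost:8080/spanner/bootstrap", {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`bootstrap failed: ${res.status}`);
  // Assumed response: extracted entities, tags, keywords, and the key sentence.
  return res.json();
}

bootstrap("quarterly-report.pdf", "people-and-orgs.txt")
  .then(console.log)
  .catch(console.error);
```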

We Just Want to Publish Linked Data

We’ve got you covered. The core of Spanner is a Linked Data publishing solution: give it an RDF file or a SPARQL endpoint, and it will publish that data as Linked Open Data automagically, with very minimal configuration.
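As a sketch of what “very minimal configuration” might look like in practice (the property names below are illustrative guesses, not actual configuration keys):

```javascript
// Hypothetical minimal configuration for Linked Data publishing.
// Every key and value here is an illustrative assumption.
const publishConfig = {
  // Point Spanner at your data: a local RDF file...
  source: { type: "rdf-file", path: "./data/catalog.ttl" },
  // ...or, alternatively, an existing SPARQL endpoint:
  // source: { type: "sparql", endpoint: "http://example.org/sparql" },

  // Base URI under which resources are published, so every entity gets
  // a dereferenceable URL, per Linked Data conventions.
  baseUri: "http://data.example.org/",
};
```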

Going even further, in Spanner 1.1 you won’t even have to convert information to RDF or stand up a SPARQL endpoint: Spanner 1.1 will support publishing native RDBMS data as RDF dynamically, on-the-holy-crap-is-that-cool-fly… If you don’t need to do anything else but publish Linked Data, Spanner has you covered.
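Purely for illustration (1.1 isn’t out yet), here’s the kind of table-to-RDF mapping such a feature implies; every name below is hypothetical:

```javascript
// Hypothetical table-to-RDF mapping for dynamic RDBMS publishing.
// Table, URI pattern, and property choices are illustrative only.
const mapping = {
  table: "employees",
  // Each row is minted as a resource under this URI template:
  subject: "http://data.example.org/employee/{id}",
  // rdf:type for the minted resources (FOAF is just an example vocabulary):
  type: "http://xmlns.com/foaf/0.1/Person",
  // Column-to-property mappings:
  columns: {
    name: "http://xmlns.com/foaf/0.1/name",
    email: "http://xmlns.com/foaf/0.1/mbox",
  },
};
```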

Making an Ontology or Schema is Too Hard

There’s good news and there’s better news.

You don’t need an ontology or schema before Spanner performs data bootstrapping. Don’t have an ontology? Can’t find a publicly available one? Don’t want to build one? Don’t worry about it. That’s the good news.

The better news is that if you have an ontology or schema, the data bootstrapping process will just work better.

There is no bad news.

My Org isn’t Full of SemWeb Developers

You don’t need any semantic technology expertise to use Spanner. Most of Spanner can be extended, customized, skinned, rearranged, or otherwise manipulated by writing ordinary JavaScript code against the Spanner APIs, which are thin and simple and RESTful.
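For instance, wiring a custom widget to Spanner data is a fetch call away. (The endpoint path and JSON fields below are invented for illustration, not the actual API.)

```javascript
// Hypothetical customization: fetch an entity's description from a
// RESTful endpoint and render it into a page. The URL and response
// shape are assumptions for illustration.
async function renderEntity(entityId, container) {
  const res = await fetch(`/spanner/api/entities/${encodeURIComponent(entityId)}`, {
    headers: { Accept: "application/json" },
  });
  const entity = await res.json();
  container.textContent = `${entity.label}: ${entity.keySentence}`;
}

renderEntity("acme-corp", document.querySelector("#entity-card"));
```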

The better news is that you probably won’t need to do much other than customize the default look-and-feel because Spanner is quite feature-rich:

  • semantic search (via Waldo)
  • faceted browsing (via Pelorus)
  • auto-generated edit forms for data curation (via Annex)
  • SPARQL query (via Empire)
  • full-spectrum OWL reasoning (via Pellet and friends)
  • RDF & OWL-aware machine learning (via Corleone)
  • integrity constraint validation (via Pellet ICV)
  • role-based access controls
  • asynchronous job service
  • RESTful HTTP API wrapping all Spanner features and services (via PelletServer)

Anyone who can write JavaScript and use a RESTful interface will be a savvy semantic technology developer when using Spanner. And, yes, that means that there still isn’t any bad news.
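To make that concrete, here’s what a SPARQL query looks like from plain JavaScript, assuming the endpoint speaks the standard W3C SPARQL protocol (the URL and the predicate in the query are placeholders):

```javascript
// Query a SPARQL endpoint over HTTP. The ?query= parameter and the JSON
// results format are the W3C SPARQL standards; the endpoint URL is a
// placeholder.
const ENDPOINT = "http://localhost:8080/spanner/sparql"; // placeholder

async function listTaggedDocs() {
  const query = `
    SELECT ?doc ?tag WHERE {
      ?doc <http://example.org/vocab/tag> ?tag .
    } LIMIT 10`;
  const res = await fetch(`${ENDPOINT}?query=${encodeURIComponent(query)}`, {
    headers: { Accept: "application/sparql-results+json" },
  });
  const { results } = await res.json();
  for (const row of results.bindings) {
    console.log(row.doc.value, row.tag.value);
  }
}

listTaggedDocs().catch(console.error);
```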

NLP isn’t Perfect, Or: What About Data Quality?

That’s right—sometimes the results of Natural Language Processing are awful. What does that mean for users or developers?

Let’s be honest: it means that you’re going to have to take on some data curation and data quality burden; but, hey, you knew that already. Any system that helps you pivot from document-based information management to, well, anything else… is going to require you to curate data. The trick is doing that at the lowest possible cost.

We realized that, by using machine learning (both unsupervised and semi-supervised) plus some other tricks, we could build a system that offers users a flexible means to improve data quality explicitly, while also allowing (and training) machine learning systems to improve data quality automatically, too. (We’ll post more technical details about this aspect of the system as the Spanner 1.0 launch date gets closer.)
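In the meantime, here’s a rough sketch of what that curation loop could look like from a developer’s point of view; the endpoint and payload are assumptions, not the shipped API:

```javascript
// Hypothetical curation call: a user's accept/reject verdict on an
// extracted statement both fixes the data and doubles as a labeled
// training example for the semi-supervised learner.
async function reviewStatement(statementId, accepted) {
  await fetch(`/spanner/api/curation/${encodeURIComponent(statementId)}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      verdict: accepted ? "accept" : "reject", // explicit data-quality fix
      train: true, // also feed the machine-learning side
    }),
  });
}
```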

Spanner lets regular users—who don’t know anything about any of this technology stuff—build complex, flexible, ontology-driven apps, all from very simple web pages…with no “technology leakage”.

There’s no such thing as a free lunch. Combining machine learning, training, and data curation into one system is the next best thing.

Next Steps

Spanner’s been under development for a year and is in production at some of our customers already. Everything I’ve talked about in this post is real.

Now we’re looking for a few more reference customers, i.e., early adopters who are willing to be guinea pigs as we finish up the last bits of polish. As the man says, if you need this stuff, you need it bad. We’re targeting late Q1 2011 for Spanner general availability to early adopters, reference customers, etc.

Let’s talk.
