In it, he wrote about how the oft-repeated mantra at CfA of “building interfaces to government that are simple, beautiful, and easy to use” is but one piece of the civic tech puzzle. He states:
“many of the problems government confronts with technology are fundamentally about data integration: taking the disparate data sets living in a variety of locations and formats and getting them into a place and shape where they’re actually usable.”
He complained that “the fact that I can go months hearing about ‘open data’ without a single mention of ETL is a problem. ETL is the pipes of your house: it’s how you open data.”
In a blog post citing Dave’s, Bob Lannon of the Sunlight Foundation added that
“the up-front costs of ETL don’t come up very often in the open data and civic hacking community. At hackathons, in funding pitches, and in our definitions of success, we tend to focus on outputs (apps, APIs, visualizations) and treat the data preparation as a collateral task, unavoidable and necessary but not worth “getting into the weeds” about.”
These posts really resonated with us because they acknowledged the elephant in the room – the messy spaghetti plumbing behind all these “beautiful interfaces” is only getting messier, not better. Even the sexy “big data” technologies that have emerged of late do not address the root cause of “dirty data” – that “publishers”, for the most part, have no incentive to publish “clean data”.
In general, ETL solutions are brute-force responses that only make sense now thanks to Moore’s Law. The cloud – with its near-infinite computing capacity and storage, schema-less databases, and MapReduce/aggregation frameworks – just lets us throw more hardware at the problem at relatively low cost, and, worse, create yet more silos.
That’s why when we heard about IBM’s Watson Challenge, we thought that perhaps there’s a third way beyond ETL as a permanent cost, and the nirvana of a clean data commons.
First off, it was an honor to be among the top 25 from hundreds of submissions to the challenge. IBM, in my book, was originally founded as a “civic tech” company. It had its roots in solving the challenge of tabulating the decennial census mandated by the US Constitution.
When the US Census Bureau took 7 years to manually tabulate the 1880 census, it knew it had a big problem with the 1890 census just three years out. Herman Hollerith’s invention of punch cards not only allowed the Bureau to complete the 1890 Census in 2.5 years, it formed the basis for the Computing-Tabulating-Recording Company, later renamed IBM.
And thus the first Computing Era was born – the Tabulating Era, lasting until the 1940s, when the first mainframes appeared (also largely built by IBM).
IBM also played a pivotal role in starting the Programmable Computing Era, from mainframes to personal computers – an era whose technology now underpins and enables the modern world.
I can still remember watching the Watson Jeopardy series with bated breath. As a life-long computer geek, I knew I was witnessing a major breakthrough that would fundamentally shift how we solve technological problems. Who would have thought that I’d be fortunate enough to be counted among the first to play with Watson!?!
With Watson, IBM is offering a new approach to attacking the problem of big, unstructured data. Instead of forcing humans to think like machines, Watson allows machines to think like us.
ETL, by contrast, forces humans to behave like machines. What is obvious to us humans – that fields named “SSN”, “SS ID”, and “SS#” all mean one and the same thing, Social Security Number – are totally different things to a machine (and that’s not counting whitespace, reserved characters, and case-sensitivity problems). What if machines could start to understand basic equivalencies like these?
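To make the point concrete, here is a minimal sketch of the kind of field-name equivalence that ETL forces us to hand-code today. The synonym map and field names are hypothetical illustrations, not any real schema:

```python
import re

# Hand-maintained synonym map: the brittle, human-curated piece that
# cognitive approaches aim to learn instead. Entries are illustrative.
SYNONYMS = {
    "social_security_number": {"ssn", "ss id", "ss#", "social security number"},
}

def canonical_field(name: str) -> str:
    """Map a raw column header to a canonical field name, if we know one."""
    # Normalize whitespace, case, and stray padding first.
    key = re.sub(r"\s+", " ", name.strip().lower())
    for canonical, variants in SYNONYMS.items():
        if key in variants:
            return canonical
    return key  # fall through: keep the cleaned-up original

print(canonical_field("  SS#  "))  # social_security_number
print(canonical_field("SS ID"))    # social_security_number
```

Every new data source means another round of extending a map like this by hand – which is exactly the permanent cost the post is complaining about.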
IBM calls this coming age of computing the Cognitive Computing Era. The first era, the Tabulating Era, traces its roots back to Hollerith’s punch cards; the Programmable Computing Era is the stage we’re in now.
An Answering People Interface?
In the limited time we had with Watson, it was impressive how we managed to create a working prototype of our long-dreamed API – our Answering People Interface.
An Application Programming Interface – the traditional meaning of API – is how machines talk to each other (or how developers program their machines to talk to other machines, by thinking like a machine). With our NYCpedia API Concierge prototype, we wanted to create a way for users to interact with the vast trove of data we’ve organized about New York through the best interface there is – natural language.
With NYCpedia, we’ve compiled more than a million entities, with each entity having its own unique ID (or URI), and more than 14 million facts about those entities (an entity being a building, a neighborhood, a subway station, a park, etc.). When we organized and curated (“ETLed”) all that data geospatially, I can say that we’ve built an Open Data interface that is “simple, beautiful and easy to use.”
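A toy sketch of the data model described above: each entity gets a stable URI, and facts are recorded against those URIs, here as simple (entity, property, value) triples. The URIs and properties below are invented for illustration (though the Empire State Building facts themselves are accurate):

```python
# Entity-centric storage: every fact hangs off a unique entity URI,
# and values can themselves be URIs pointing at other entities.
facts = [
    ("nycpedia:building/350-5th-ave", "name", "Empire State Building"),
    ("nycpedia:building/350-5th-ave", "year_built", 1931),
    ("nycpedia:building/350-5th-ave", "in_neighborhood", "nycpedia:nbhd/midtown"),
    ("nycpedia:nbhd/midtown", "name", "Midtown"),
]

def facts_about(uri):
    """Return all (property, value) pairs recorded for one entity."""
    return [(p, v) for (s, p, v) in facts if s == uri]

print(facts_about("nycpedia:building/350-5th-ave"))
```

Because values can reference other URIs, the fact store is already a primitive graph – which is what makes the geospatial and topical curation possible.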
But beyond clicking around the vast trove of data along the topics and discovery pathways we curated beforehand (jobs, transportation, community stats, crime, etc.), there was no way for users to ask the questions any first-time user would quickly have once they got past the stats of where they are.
And the thing with “traditional simple interfaces” – there is no quick way for users to find the answer beyond what the developer built. Not only do users not have the time, they don’t have the inclination to poke around, never mind fill out a feedback form (see attention economy). We tried building in faceted search and even toyed with geospatial search, but it wasn’t really addressing the problem of discoverability.
What we needed was a simple search box where users can ask a question the way they would ask the next person – “What’s the crime rate here?”, “Where’s the nearest subway stop?”, “Who owns this building?”
Watson enabled us to create that interface.
Next-gen ETL: From Raw Open Data to Open Knowledge with IBM Watson
Watson is much more than natural-language parsing. For it to give sensible answers to the questions we posed, it needed to understand the data we fed it. And unlike ETL where you have to map each field to another matching field explicitly, teaching Watson about our domain was a totally different experience.
It approximated how you would teach another person about a new subject – “Here you go, here are some books, websites, and PDFs. Read them all, and then, let’s talk about it next week.”
The difference is that Watson parsed thousands of books, websites, and PDFs; then, a few hours later, we had a Q&A training session where it answered our questions, showed its confidence in each answer, and revealed how it arrived at those answers.
And since we had done some of the hard first-level ETL work – every entity in the corpus we gave Watson had a URI, and we had linked some of those entities together in a primitive graph – it was amazing how the combination of hardware and “wetware” (our brains) allowed us to pose questions that our “simple user interface” could not have answered without a lot of clicking around. Watson’s sheer computing power let it plow through thousands of validated hypotheses, and our brains made the last-mile connection: we knew instantly when Watson answered a question correctly, and could see why it failed when it responded with a non sequitur.
And since all our entities had context, it allowed for more productive discovery – Watson would serve up the information you asked for, and you could then explore the underlying information in a form of directed serendipity. In the “who owns this building?” example, for instance, you can quickly find out the property tax, when the building was built, and what the crime rate is for that zip code.
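That “directed serendipity” can be sketched as a simple walk over linked entities: start from the entity that answered the question, then hop along its links to related facts. The entity IDs, properties, and values below are made up for illustration, not real NYCpedia data:

```python
# Toy linked-entity store: string values that match another entity's ID
# act as links, mirroring the primitive graph described in the post.
entities = {
    "building:123": {
        "owner": "Acme Realty LLC",
        "year_built": 1927,
        "property_tax": 48200,
        "zip": "zip:10001",
    },
    "zip:10001": {"crime_rate_per_1000": 12.4},
}

def explore(entity_id, depth=0):
    """Collect an entity's facts, recursing into any linked entities."""
    lines = []
    for prop, value in entities[entity_id].items():
        lines.append("  " * depth + f"{prop}: {value}")
        if isinstance(value, str) and value in entities:
            lines.extend(explore(value, depth + 1))  # follow the link
    return lines

print("\n".join(explore("building:123")))
```

Asking about the building’s owner surfaces its tax bill, its age, and (one hop away, via the zip-code entity) the local crime rate – exactly the exploration path described above.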
And the more you “converse” with it, the smarter it gets, becoming “Rain Man”-like in a relatively short amount of time.
We really enjoyed playing with Watson and we believe that Cognitive Computing may just be the missing ingredient in converting raw Open Data to Open Knowledge beyond the traditional search interfaces people have built in the Open Data industry thus far.
Even though we weren’t among the top three Watson Challenge winners (congrats all BTW!), we know Watson’s in our future.
Much the same way we repurposed API to mean Answering People Interface, we can refocus our ETL efforts on “Entity Transformation and Linking,” and let Watson handle the hard work of question-answering from there.