PDF stands for “PDF Data-Fication”

(counter-clockwise) Mayor Bill de Blasio with Dick Dadey, Executive Director of Citizens Union NY, Noel Hidalgo from BetaNYC, Rachael Fauss, director of Public Policy at Citizens Union, and yours truly

Last Thursday, I had the great honor of witnessing the passage of the City Record Online Law. In my book, it’s as big a milestone as the widely acclaimed PLUTO dataset release last year.

That’s because the City Record is the “logfile” of the City. It’s the City’s official daily newspaper and has information about public hearings, agency rule changes, procurement actions, contract awards and City employee salary changes going all the way back to 1873!

It’s a valuable resource that, up to now, has been known only to City insiders, even after the City started publishing the PDFs in 2011.

That’s because, as #opendata practitioners know, PDF is 1-star open data. It’s “digital paper”, much the same way the first automobiles were called “horseless carriages” – a direct translation of what we had before computers.

This is not to diminish the utility of PDFs. I, for one, was a big proponent when they were first introduced back in the mid-90s. Back then, when I first started my career in LIMS software development, a lot of our support problems came from customers being unable to share their reports with co-workers and customers without first having them install a specific version of a specialized, proprietary report viewer. I was responsible for implementing our report viewer, and I can still remember the various hair-pulling, DLL Hell customer episodes we had when critical Certificates of Analysis couldn’t be opened by our clients’ customers.

PDF was truly a “Portable Document Format.” It didn’t require the installation of anything beyond the now ubiquitous PDF reader. It was no surprise that it was universally adopted shortly after its introduction.

But Nobody Reads PDFs

In a World Bank study released earlier this summer titled “Which World Bank reports are widely read?”, World Bank researchers inspected their website traffic data and found that fully one-third of their 2008-2012 PDF reports had never been downloaded at all, another 40% had been downloaded fewer than 100 times, and less than 13% had been downloaded more than 250 times.



And that’s for a corpus of 1,611 reports spanning four years, which, by the researchers’ own reckoning, accounts for about one-quarter of the Bank’s country services budget, with a mean cost of $180,000 per report.

How much more so for the City Record, which truly reads like a logfile rather than thoroughly researched scholarly prose written by experts? Within its drab, black and white, 8-point font, single-spaced, three-column, information-packed pages, the City Record details all major contracts and employee salary adjustments, which last fiscal year alone amounted to $21.2 billion and $22.2 billion (total payroll) respectively! Together, that’s about 55 percent of the City’s $78.2 billion FY 2014 budget! (source: checkbooknyc.com)


If we had access to machine-readable City Record data, and cross-referenced it with the CheckBook NYC API, could we have detected the CityTime scandal in the making, when contractors “treated the City like an ATM machine”, inflating the project’s original $63 million 1998 budget to over $600 million by the time they were exposed by Daily News columnist Juan Gonzalez in 2009?

And that’s just for contracts and salaries. What about all the public hearing notices? Rule changes? Public property dispositions? And all the other notices that the City Charter mandates be published in the City Record?

Just imagine the possibilities. What can we do with this data once it’s liberated from PDFs?

Can we prototype an “Ombudsbot” – a Machine as Ombudsman that can automatically flag suspicious activities? Can we use technologies like Splunk on the liberated City Record data to gain operational intelligence – the Pulse of the City Administration? What about other municipal datasets trapped in information-packed PDFs, like Environmental Impact Statements, City Agency Audit Reports and all the other “knowledge products” largely disseminated as PDFs? Could this be a precedent for unlocking those data sources as well? Could we apply these techniques to other PDF-rich, data-poor, 1-star data troves – the World Bank’s PDFs, academic publications, annual reports, etc.?
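To make the “Ombudsbot” idea a little more concrete, here is a minimal sketch of one possible rule: flag any contract whose current value has grown past some multiple of its original budget. Everything about the record shape is an assumption on my part (the `ContractNotice` fields, the `max_growth` threshold and the vendor names are hypothetical, not an actual City Record or CheckBook NYC schema). The first sample row simply reuses the CityTime figures mentioned above.

```python
from dataclasses import dataclass


@dataclass
class ContractNotice:
    # Hypothetical record shape; the real machine-readable schema is not defined yet.
    contract_id: str
    vendor: str
    original_amount: float  # budget at award, in dollars
    current_amount: float   # latest reported total, in dollars


def flag_cost_overruns(notices, max_growth=2.0):
    """Return notices whose current amount exceeds max_growth times the original."""
    flagged = []
    for notice in notices:
        if notice.original_amount <= 0:
            # Skip records with no usable baseline rather than dividing by zero.
            continue
        if notice.current_amount / notice.original_amount > max_growth:
            flagged.append(notice)
    return flagged


# Illustrative data only: the first row uses the CityTime budget figures
# ($63 million growing to $600 million); the vendor names are made up.
sample = [
    ContractNotice("A-001", "Example Vendor A", 63_000_000, 600_000_000),
    ContractNotice("B-002", "Example Vendor B", 1_000_000, 1_200_000),
]

flagged_ids = [notice.contract_id for notice in flag_cost_overruns(sample)]
print(flagged_ids)
```

A real Ombudsbot would need far more than a single ratio check, but even a crude rule like this is only possible once the amounts are available as data rather than as text on a page.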

Enter the City Record Online Working Group (CROW)

And we don’t need to just ruminate on the possibilities. Not only was the City Record Online Law passed, but Mayor de Blasio also announced the creation of a public-private partnership centered around the release of more than 4,000 archival City Records dating back to 1998, with the City working with six partners – BetaNYC, Citizens Union, Dev Bootcamp, Socrata, Sunlight Foundation and Ontodia – to convert the historical archive to machine-readable information, taking it beyond just Optical Character Recognition (OCR) to Entity Recognition and Classification.

It’s an honor to be part of CROW, and we thank the Mayor, Manhattan Borough President Gale Brewer, Council Member Brad Lander, Council Member Ben Kallos, and Council Member James Vacca for their leadership and this amazing opportunity. We thank Citizens Union for their pioneering vision, and we look forward to working with the Department of Citywide Administrative Services (DCAS), the Department of Information Technology & Telecommunications (DoITT), and the rest of the team.

The time has come for PDFs to be redefined. PDF stands for “PDF Data-Fication.”

Join us at http://bit.ly/git-crow.

