A lot of government data is published as PDFs, which are machine-renderable but not machine-readable.
When it comes to Tim Berners-Lee’s 5-star deployment scheme for Open Data, PDF is at the bottom of the heap. And as we found out from last year’s World Bank study, hardly anybody reads PDFs!
Still, PDFs are a very useful, portable way of sharing information, so I’m not saying we should not use them.
On the contrary, I say we extend them! What if we leverage modern PDF features to embed machine-readable data wherever it’s referred to in a PDF? Tabular data is top of mind, but with the ISO PDF/A-3 standard, we can embed any file – JSON, XML, XLSX, etc.!
Better yet, if publishers don’t want file sizes to balloon, they can instead link to the data portal or web server where the data is hosted.
So on this day of Tabula’s 1.0 release, I present our initial PDDF (Portable Document & Data Format) experiment.
I used Tabula to extract the tables as CSVs, attached those CSVs back to the PDF using Adobe Acrobat Pro, and used Chris Whong’s Get the Data button to link to the data, which I also posted on the data portal as XLSX files.
If you want to extract the CSVs directly from the PDF viewer above, click 1) the sidebar button, then 2) the attachment button.
We can even give users a visual hint that they’re viewing a PDDF by tweaking PDF.js – the cross-platform PDF-rendering library that ships with Firefox, runs in Chrome and is used by CKAN.
What do you think? Is this a pragmatic way for publishers to keep their nice, pretty PDFs while still letting us extract machine-readable data from them?