Long time listener, first time hacker

Last weekend I spent a fun few hours following Australia’s 2013 #GovHack on Twitter. As the name suggests, this event aimed to encourage “open government and open data” by inviting teams to “mashup, reuse, and remix government data” at meetings held across the country. Unsurprisingly, there were some wonderful results. The theme of open data and reuse resonated strongly at the Fieldschool this week as we practised finding, extracting and manipulating not just government statistics but any open and available data. I had thought that the technical skills needed to remix data from the web were out of my reach because I wasn’t a programmer, but happily the Fieldschool proved me totally wrong. How did this happen?

1. Finding data is surprisingly simple.

Many organisations give away data in formats that are easy to interpret

Many governments (including the US, UK and New Zealand) provide giant datasets for people to reuse. Meanwhile, an increasing number of museums, galleries and online repositories are opening their data doors too. Often, all it takes is roaming around a website to find the ‘download data’ option. On top of this, data is often provided in formats that people can easily understand: a CSV file is no more complex than a spreadsheet. I would hazard a guess that simply knowing that usable data exists, and that it can often be easily understood, dismantles the first significant barrier to reuse.
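To show how little stands between a CSV file and something you can work with, here is a minimal Python sketch. The sample data, its column names and the `read_rows` helper are all invented for illustration; a real download would be read from a file rather than from a string.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded CSV file.
# The columns and values are invented for illustration only.
SAMPLE_CSV = """museum,city,visitors
Example Gallery,Wellington,1200
Sample Museum,Auckland,3400
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)

# Every value arrives as a string, so convert before doing arithmetic.
total_visitors = sum(int(row["visitors"]) for row in rows)

print(rows[0]["museum"])
print(total_visitors)
```

Each row behaves like one line of a spreadsheet, with the header row supplying the column names.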

APIs are incredible

Learning about APIs felt like being given the keys to the castle because they allow you to reuse data on-the-fly. To my non-programmer mind, APIs took a while to understand because you can only really ‘get’ how they work on their own terms (culprit #2: JavaScript functions), and the process of requesting data dynamically is more complex than downloading it once-off. APIs come in many flavours too, so you aren’t assured the same request and response format every time. But this week we learnt the basic recipe, and despite the increased complexity I would never hesitate to use an API: I at least feel confident that I can figure it out.
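The basic recipe, as I understand it, is: build a request URL with some parameters, send it, and pick what you need out of the structured response. Below is a rough Python sketch of that pattern. The endpoint `https://api.example.org/records`, its `q` and `format` parameters, and the shape of the JSON response are all assumptions made up for illustration; any real API will document its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only to illustrate the pattern.
BASE_URL = "https://api.example.org/records"


def build_request_url(base_url, **params):
    """Attach query parameters to the base URL."""
    return base_url + "?" + urlencode(params)


def parse_titles(response_text):
    """Pull the title of each record out of a JSON response.

    Assumes a response shaped like {"results": [{"title": ...}, ...]}.
    """
    payload = json.loads(response_text)
    return [record["title"] for record in payload["results"]]


def fetch_titles(base_url, **params):
    """Send the request over the network and parse the reply."""
    with urlopen(build_request_url(base_url, **params)) as response:
        return parse_titles(response.read().decode("utf-8"))


# Demonstrate the two halves without touching the network,
# using a canned response in the assumed format.
url = build_request_url(BASE_URL, q="maps", format="json")
canned = '{"results": [{"title": "Map of Wellington"}, {"title": "Map of Sydney"}]}'
titles = parse_titles(canned)

print(url)
print(titles)
```

Keeping the request-building and response-parsing steps separate makes it easier to adapt when a different API uses a different format for either one.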

You may be able to scrape it

Scraping is the process of extracting unstructured data from an HTML document (i.e. a webpage) and structuring it so that it can be manipulated for visualisation. Our technique was so straightforward that all we needed was a Google spreadsheet and tabular data from Wikipedia. I did learn that it is not a foolproof technique: my spreadsheet went a little bit haywire when I tried to scrape this table later on. (Bonus points for anyone who can figure out why it didn’t work.)
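The same idea can be sketched outside a spreadsheet. This is not the technique we used at the Fieldschool, just a minimal Python illustration using the standard library's `html.parser`; the `TableScraper` class and the sample table are invented for the example.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical page fragment; the values are made up.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Exampleville</td><td>1,000</td></tr>
  <tr><td>Sampleton</td><td>2,500</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
for row in scraper.rows:
    print(row)
```

A sketch this simple ignores merged cells (`rowspan` and `colspan`) and nested tables, so it will only behave on tidy, regular tables.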

2. Cleaning data is surprisingly fulfilling.

While we learnt that cleaning data is crucial to successfully reusing it, opinions vary on how enjoyable this process is. Using a powerful tool like OpenRefine, I was surprised at how much I enjoyed it. If you enjoy meticulous activities like jigsaw puzzles or knitting then take my word for it: cleaning data is genuinely absorbing.
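Much of the cleaning involves spotting values that are really the same thing written in different ways. The Python sketch below illustrates that idea by reducing each value to a normalised key and grouping values that share one. It is loosely inspired by key-based clustering and is not OpenRefine's own implementation; the `fingerprint` and `cluster` functions and the messy sample values are invented for illustration.

```python
import re
from collections import defaultdict


def fingerprint(value):
    """Reduce a value to a normalised key.

    Lowercases, trims, drops punctuation, then sorts the unique words
    so that word order and stray spaces no longer matter.
    """
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group the original values by their shared fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


# Hypothetical messy column of country names.
messy = ["New Zealand", "new zealand ", "Zealand, New", "Australia", "AUSTRALIA"]
clusters = cluster(messy)

for key, variants in clusters.items():
    print(key, "<-", variants)
```

Each group can then be reviewed by hand and merged into a single spelling, which is where the jigsaw-puzzle satisfaction comes in.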

3. Meanwhile: Data licensing is incredibly important

One important point we learnt this week (if not from the Fieldschool, then from the media) is that data is not neutral or free floating. When remixing, you have to be aware of the use limitations placed by the person providing the data. But even then, licensing is not the impenetrable brick wall that you might think it is. Navigating licensing can be as simple as familiarising yourself with the Creative Commons licences. A handy tip for the remainder of the Fieldschool is that visualisations are derivative copies.

4. Knowing what you want to do with data becomes wonderfully obvious

The last exciting discovery of this week is that I actually have ideas about what I’d like to make. I thought I’d have ‘hacker’s block’ about what to do with data, but I’m relieved to discover that’s definitely not the case. As soon as we learnt about various sources and techniques for extracting data, a thousand ideas appeared from nowhere. It obviously just took learning about what was possible for my mind to leap into action.

Essentially, while I can’t step out and immediately build the world’s best data-driven app, this week has proved that many of the barriers to remixing data I’d anticipated are roughly a day’s worth of (hard) concentrating away from being dismantled.
