Tuesday 25 June 2013

On the Naming, Identifying and Linking of Data


I've spent the past few days working further with the NOMISWEB API, and getting myself acquainted with the Neighbourhood API. Both provide a way to access the subsets of the 2011 Census data which have been released so far.

Help files are here and here; both are invaluable for doing anything with these services' endpoints!

The idea driving development is to take the metadata created by WISERD and stored within our own database, and link it up to the datasets stored behind these remote APIs. That way, "data feeds" can be pulled into the DataPortal on demand, as a user searches according to their needs.

This required multiple steps. Firstly, writing a wrapper around each API to discover what datasets are out there, in a way which can hopefully be reused in future to develop access to APIs not currently under investigation. For example, we plan to link the DataPortal up to the ONS API currently under development. I hope that new API is similar enough to the previous two that accessing it fits the current PHP class interface when I come to write it.
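
To make that concrete, here is a minimal sketch of the wrapper idea in PHP. The names here (SearchableApi, NomisApi, findDatasets) are hypothetical illustrations, not the actual DataPortal code:

<?php

// A hypothetical common interface: each external API gets a thin wrapper
// which turns a keyword into an array of dataset-id => dataset-name pairs.
interface SearchableApi
{
    public function findDatasets($keyword);
}

// Each service then gets a small class of its own, for example:
class NomisApi implements SearchableApi
{
    public function findDatasets($keyword)
    {
        // ...build the NOMIS search URL, fetch and parse the SDMX reply,
        // return an array of id => name pairs...
        return array();
    }
}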

I'll probably create an endpoint which wraps access to these other APIs in future, allowing a one-hit search, refine and download of datasets across multiple external APIs from the DataPortal's own API.
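
Building on the sketch above, that one-hit search could be little more than a loop over the wrappers. Again, hypothetical code rather than the real endpoint:

// Hypothetical one-hit search: ask every wrapped API the same question.
function searchAllApis($keyword, array $apis)
{
    $results = array();
    foreach ($apis as $source => $api) {
        $results[$source] = $api->findDatasets($keyword);
    }
    return $results;
}

// Usage, assuming wrapper classes like the NomisApi sketch above:
// $hits = searchAllApis('van', array(
//     'nomis'         => new NomisApi(),
//     'neighbourhood' => new NeighbourhoodApi(),
// ));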

It turned out to be relatively easy to perform keyword searches on the NOMIS and Neighbourhood APIs. The URLs accept easily readable GET variables, giving either:

https://www.nomisweb.co.uk/api/v01/dataset/def.sdmx.xml?search=*van*

http://www.neighbourhood.statistics.gov.uk/NDE2/Disco/FindDatasets?Metadata=van
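
In PHP each of those requests is a one-liner. A minimal sketch, assuming allow_url_fopen is enabled and ignoring error handling:

<?php

$keyword = 'van';

// NOMIS: keyword search over the SDMX dataset definitions.
$nomisXml = file_get_contents(
    'https://www.nomisweb.co.uk/api/v01/dataset/def.sdmx.xml?search=*'
    . urlencode($keyword) . '*'
);

// Neighbourhood Statistics: keyword search over dataset families.
$nessXml = file_get_contents(
    'http://www.neighbourhood.statistics.gov.uk/NDE2/Disco/FindDatasets?Metadata='
    . urlencode($keyword)
);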

My current favourite test-case question is the "2011 Census: Car or Van Availability, 2011", purely because "van" is quick to type and doesn't appear as part of many other words. Searching for "car" brought back datasets including "care", which slowed down my debugging. Also, WISERD has metadata for this census question, making it an ideal start for "question matching".

Without going into too much description of XML or JSON, the API requests above return a ton of metadata produced by each institution's own API.

A snippet of the NOMIS reply:

...
<KeyFamilies>
<structure:KeyFamily id="NM_548_1" agencyID="NOMIS" version="1.0" uri="Nm-548d1">
<structure:Name xml:lang="en">QS416EW - Car or van availability</structure:Name>
<Annotations>...

And a snippet of the neighbourhood statistics reply:

<MatchingDSFamilies>
<DSFamily>
<DSFamilyId>2511</DSFamilyId>
<Name>Car or Van Availability, 2011 (QS416EW)</Name>
</DSFamily>
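
Pulling the IDs and names out of those replies is straightforward with SimpleXML. A sketch, continuing from the fetches above; the namespace URI is the standard SDMX-ML v2.0 "structure" one, and the Neighbourhood parsing assumes the reply carries no default namespace, as in the snippet, so the real responses may need adjusting for:

// NOMIS: KeyFamily elements live in the SDMX "structure" namespace.
$nomis = simplexml_load_string($nomisXml);
$nomis->registerXPathNamespace(
    'structure',
    'http://www.SDMX.org/resources/SDMXML/schemas/v2_0/structure'
);
foreach ($nomis->xpath('//structure:KeyFamily') as $family) {
    $id   = (string) $family['id'];                              // e.g. NM_548_1
    $name = (string) $family->children('structure', true)->Name; // e.g. QS416EW - Car or van availability
}

// Neighbourhood Statistics: plain elements, no namespace prefix.
$ness = simplexml_load_string($nessXml);
foreach ($ness->MatchingDSFamilies->DSFamily as $family) {
    $id   = (string) $family->DSFamilyId; // e.g. 2511
    $name = (string) $family->Name;       // e.g. Car or Van Availability, 2011 (QS416EW)
}
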
It took me a while to get my head around all this. Putting the two replies side by side makes things clearer now, but at the time I was working only with NOMIS, so the conclusion was less obvious.

Basically, I needed an identifier provided by each API with which to link together all our datasets. Then, when a DataPortal user searches for question metadata, we have a reference with which to say "yes, there is other data/metadata out there, here it is for you!".

So it turns out each API has its own "unique" identifier for each question in its database.

For clarity, searching for "van" in each API gives:

WISERD DataPortal : qid_c11hqh12-S2
NOMISWEB API : NM_548_1 and Nm-548d1
Neighbourhood : 2511

I have issues with all of these!

The internals of the WISERD DB give some explanation as to how that ID is formed: question ID, census 2011, some key letters, question numbers and a partial breakdown of each major question into sub-parts.

NOMIS presumably has NM, a dataset ID and some sub-part. The response above contains both an ID and a URI - it wasn't initially obvious which to use here.

Neighbourhood was the least useful; I can only guess this is the 2511th question they have on file.

Either way, I made a discovery which was, for me, stunning. While I was looking for a tagged ID to refer *globally* to a questionnaire question, I realised that there was in fact an ID, but no-one was using it....
 QS416EW
A rant follows!

What this string of letters and numbers means isn't important; I'm never going to write a program which guesses unique identifiers. The fact that this string exists inside the "name" STRING in each API response, but doesn't have its own distinct tag in any of the responses, amazed me. Here is a way to identify something which, I can only assume, has existed from the start, and no-one used it, instead preferring to create their own (sometimes multiple) unique identifiers within their own systems.
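
And because the code is buried inside the name, the only way to get at it programmatically is string matching. A sketch, assuming all 2011 Census table codes follow the same letters-digits-letters shape as QS416EW (some table families vary, so the pattern might need widening):

// Hypothetical helper: fish a census table code out of a dataset name.
function extractTableCode($name)
{
    if (preg_match('/\b([A-Z]{2}\d{3}[A-Z]{1,2})\b/', $name, $matches)) {
        return $matches[1];
    }
    return null;
}

echo extractTableCode('Car or Van Availability, 2011 (QS416EW)'); // QS416EW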

Now, I understand that within any dataset you can't define globally which identifiers should be used. However, it would still make sense to recognise the existence of the original identifier for future reference.

End of rant.

So the solution becomes obvious: match, within the WISERD DataPortal database, the WISERD ID, the external resource IDs and, if available, the identifier provided by the creator of the survey - and make it all searchable. In future this will make defining datasets a lot easier for everyone, and at the very least information isn't lost along the way.
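
In database terms that's just one cross-walk record per question. A sketch of what a row might hold, with hypothetical column names, using the "van" question's identifiers from above:

// One hypothetical linking record for the car/van question.
$crosswalk = array(
    'wiserd_id'        => 'qid_c11hqh12-S2', // WISERD DataPortal
    'nomis_id'         => 'NM_548_1',        // NOMISWEB API
    'neighbourhood_id' => '2511',            // Neighbourhood Statistics
    'creator_code'     => 'QS416EW',         // the ID buried in the name strings
);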

One of the next blog posts I write will be an investigation into why unique identifiers can have problems, and the issues with data integrity/reliability within a database.

<spoiler alert>

caSe seNsitIvitY.


Tuesday 4 June 2013

On Usage of Development and Production Services

As the DataPortal is now at v2, it was necessary to move it to an environment more suitable for a production service.

The COMSC Cloud installation is being retained as a development and testing environment, where incremental code changes can be pushed between major releases. It may be possible to open this development service to other users if demand for early access becomes significant. It is almost intentionally unstable, however: once features become stable and functional, they will be pushed to the production server. Hopefully this three-stage development (the first stage being a build on several virtual machines on my local machine) will mean development can continue alongside user testing of released features, without one adversely affecting the other.

To this end, we set up a new cloud instance on Amazon's AWS service. As the previous development installation had been on Ubuntu in the COMSC Cloud, it was relatively trivial to set up an Ubuntu instance on EC2 and install everything again.

I took this one step further, keeping a record of every terminal command typed to set up the DataPortal, and created a series of scripts which automate the installation in future. This is generally good practice; I've often heard amongst developers that if anything you're doing is more than trivial, or requires Googling to find the correct command-line incantation, you should save it as a script for future use.

So now I have a script which installs the DataPortal with almost zero interaction, which is nice. It's still time-consuming, and it turns out that EC2 charges for disk I/O, but it's good to know that future builds won't be laborious and repetitive, leaving time to consider new features and implementations. It's all about efficiency.

On Releasing Software Products, and Managing User Expectations

We've recently taken the decision to "release" the WISERD DataPortal. This isn't to say that all the functionality we're after is 100% implemented, but what is there works in a fairly stable manner. There are bugs, but all software has bugs - I'll write a further post about how bugs can be defined and dealt with later on.

Releasing software, especially the first roll-out of a new service, is stressful. I'm having to force myself to say "This works, let people try and use it", when my gut feeling is more "But what if I've overlooked something major, and end up looking stupid?".

As far as I can tell from the releases of all software - and even hardware - from small "indie" games to huge operating-system-scale products, bugs are always going to be discovered, and it's up to the developers to manage these issues as they arise.

At this point, different companies and institutions have different ways of managing their users' expectations. I'll take three examples from common household names and describe their reactions to users' issues with their products. On the "low" end there's Mojang's Minecraft, considerably larger is Canonical's Ubuntu, and at the very high end there's Microsoft's Windows 8 operating system. For a bonus example in the hardware sector, I'll describe Apple's iPhone 4. I hope my reasoning for choosing these becomes clearer as I go.

Indie games are generally smaller products, produced by small teams with low budgets, and targeted at a fairly niche audience. These are not summer blockbusters, but rather more experimental pieces which can afford to drift slightly from the accepted expectations of what certain genres normally offer. Due to the need to get a revenue stream going as quickly as possible, alpha, beta and RC (release candidate) versions are pushed out to the public for testing very frequently, sometimes with multiple releases per week. The developers manage users' expectations through a continuous discourse on social media, where it is admitted openly that the product is not finished, that major features are likely to be broken or missing entirely, or that each version may be incompatible with previous releases.

This is not a lazy or incomplete way of working, as both the developers and the users gain from this relationship. Developers, as long as they listen to the feedback, essentially have a free testing panel, and new features can be added dynamically with the users' requests in mind. Obviously managerial oversight and vision are required, but the overall development process is very inclusive of the users' wishes. It is also a more personal approach, with users often getting to know the developers by name, and personalities can show through.

In contrast, the development and release of massive software packages such as those created by Microsoft, Apple or Google cannot be as collaborative with the user base. It would be easy to say there is zero relationship with the users until the release of the final version, though this is not entirely true.

Microsoft managed their users' expectations very early on, releasing initial press materials describing Windows 8 with screenshots, slideshows and videos of actors looking very happy to maximise application windows in new and exciting ways. A beta version of Windows 8 was released as the "Developer Preview", which was downloaded over half a million times in the first 12 hours of its release. This was a totally different type of user interaction, essentially talking to developers of future software designed to run on Windows 8. Regular users would have little input on features within this development; their feedback would mainly be in the form of bug reports or forum posts to customer-service representatives. While there is a development blog, the general vibe is still one-way, with information flowing from the developers outwards. As a market leader, new features are designed to show users what they can have, rather than creating what users want.

Bridging the gap between the fairly small and the extremely large, Canonical develops the Ubuntu operating system in a way that uses user input very collaboratively (users' code can end up in future releases) while retaining overall quality control and direction. Forums, blogs and press releases are all used; of specific interest here is the Brainstorm feature-request site, where users can create and vote upon ideas within the community. This is a great way to crowd-source the development process and target effort at the ideas most in demand.
 
To finish up, I'll mention the PR fiasco that occurred around the iPhone 4 when users complained about reception issues. It turned out that holding the phone a certain way caused a short circuit across the metal band around the phone, which acts as the aerial. User feedback was met with a very strange reply which confused and angered users, eventually ending in a class-action lawsuit and a free phone cover for iPhone 4 owners. The message here is not that lawsuits are a convenient way to improve product design, but rather that interaction with users should always be collaborative and productive for both sides.

So the message here, which I've taken a long route to get to, is that communication with users improves products and the user experience, as well as providing valuable insight into the needs of the user when planning future features and improvements. For smaller audiences, social media and a more personal interaction can be invaluable in building a user-base that actually makes the developer's life easier, rather than more stressful.