Tuesday 25 June 2013

On the Naming, Identifying and Linking of Data


I've spent the past few days working further with the NOMISWEB API, and getting myself acquainted with the Neighbourhood API. Both provide a way to access the subsets of the 2011 Census data which have been released so far.

Help files here and here, both invaluable in doing anything with these services end points!

The idea driving development is to take the metadata created by WISERD and stored within our own database and link it up to the datasets stored behind these remote APIs. As such, "data feeds" can be pulled into the DataPortal on demand, as a user searches according to their needs.

This required multiple steps. Firstly, writing some wrapper to each API to discover what datasets are out there, in some way which can hopefully be used in future to develop access to APIs not currently under investigation. For example, in future we plan to link the DataPortal up to the ONS API currently under development. I hope this new API is similar to either of the previous two so accessing it fits the current PHP class interface when I come to write it.

I'll probably create an end-point which wraps access to these other API's in future, allowing a one-hit search, refine and download of datasets, across multiple external APIs, from the DataPortals API.

It turned out to be relatively easy to perform keyword searches on the NOMIS and Neighbourhood APIs. The URLs accept easily readable GET variables to create either:

https://www.nomisweb.co.uk/api/v01/dataset/def.sdmx.xml?search=*van*

http://www.neighbourhood.statistics.gov.uk/NDE2/Disco/FindDatasets?Metadata=van

My current favourite test-case question is the "2011 Census: Car or Van Availability, 2011".  Purely because "van" is quick to type, and doesn't appear as part of many other words. Searching for "car" brought back datasets including "care", and slowed down my debugging. Also, WISERD has metadata for this census question, making it an ideal start for "question matching".

Without going into too much description of XML or JSON, the API requests above return a ton of metadata produced by each institutions own APIs.

A snippet of the NOMIS reply:

...
<KeyFamilies>
<structure:KeyFamily id="NM_548_1" agencyID="NOMIS" version="1.0" uri="Nm-548d1">
<structure:Name xml:lang="en">QS416EW - Car or van availability</structure:Name>
<Annotations>...

And a snippet of the neighbourhood statistics reply:

<MatchingDSFamilies>
<DSFamily>
<DSFamilyId>2511</DSFamilyId>
<Name>Car or Van Availability, 2011 (QS416EW)</Name>
</DSFamily>
It took me a while to get my head around all this. Putting them side by side is clearer now, but at the time I was working only with NOMIS, so the conclusion was less obvious.

Basically, I needed an identifier provided by each API with which to link together all our datasets. Then when a DataPortal user searches for question metadata, we have a reference with which to say "yes, there is other data/ metadata out there, here it is for you!".

So it turns out each API has its own "unique" identifier for each question in their databases.

For clarity, searching for "van" in each API gives:

WISERD DataPortal : qid_c11hqh12-S2
NOMISWEB API : NM_548_1 and Nm-548d
Neighbourhood : 2511
I have issues with all of these!

The internals of the WISERD DB gives some explanation as to how that id is formed. Question ID, census 2011, some key letters, question numbers and partial breakdown of each major question into sub parts.

NOMIS presumably has NM, a dataset id and some sub part. The response above contains an ID and a URI - it wasn't initially obvious which to use here.

Neighourhood was the least useful, I can only guess this is the 2511th question they have on file.

Either way, I made a discovery which was for me stunning. While I was looking for a tagged ID to actually refer *globally* to a questionnaire question, I realised that there in fact was an ID, but no-one was using it....
 QS416EW
A rant follows!

What this string of letters and numbers means isn't important, I'm never going to write a program which guesses unique identifiers. The fact that this string exists inside the "name" STRING in each API response, but doesn't have its own distinct tag in any of the responses amazed me. Here is a way to identify something, which I can only assume existed from the start, and no-one used it, instead preferring to create their own (sometimes multiple) unique identifiers within their own systems.

Now, I understand that within any dataset, you can't define globally which identifiers should be used. However it would still make sense to recognise its existence for future reference.

End of rant.

So the solution becomes obvious. Match within the WISERD DataPortal database the WISERD_ID, the external resource ID and, if available, the identifier provided by the creator of the survey. And make it searchable. In future this will make defining datasets a lot easier for everyone, and at the very least information isn't lost along the way.

One of the next blog posts I write will be an investigation into why unique identifiers can have problems, and the issues with data integrity/ reliability within a database.

<spoiler alert>

caSe seNsitIvitY.


No comments:

Post a Comment