Wednesday 15 May 2013

On data feeds, and the NOMISWEB API

An aim of the current development phase of the WISERD DataPortal is to use not only the survey metadata collated and stored within the local database, but also external data from a variety of sources.

The Office for National Statistics (ONS) is in the process of producing an API to allow access to survey data collected within the UK. This is due within the coming months, but in the meantime I have been looking into a few other sources of data which provide functionality similar to what is expected from the ONS API.

In particular, the API provided by NOMIS is very extensive and easy to use. They provide "Key Statistics and Quick Statistics" from the 2011 Census data, which is proving to be the perfect dataset for developing and testing new DataPortal functionality.

So far I have knocked together a simple web-based GUI which allows the user to search all the datasets provided by NOMIS by keyword. This retrieves metadata describing the geographic areas the results are broken down into, so it is possible to request data values for regions ranging from entire countries, through LSOAs, down to postcode granularity. These results are requested in JSON format (other formats are available) and can be very large, so further thought is needed to present the data in a way which is most usable to the user.
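
To give a flavour of what happens behind the GUI, below is a minimal PHP sketch of the kind of data request involved. The dataset id, geography and measure values are placeholders rather than the DataPortal's actual code, and the exact URL shape and parameter names should be checked against the NOMIS API documentation.

    <?php
    // Minimal sketch of a NOMIS data request (illustrative only).
    // 'NM_144_1', the geography selection and the measures value are
    // placeholder examples; real identifiers come from the dataset metadata
    // returned by the keyword search.
    $url = 'http://www.nomisweb.co.uk/api/v01/dataset/NM_144_1.data.json'
         . '?geography=TYPE464'   // hypothetical area-type code
         . '&measures=20100';     // hypothetical measure selection

    $response = file_get_contents($url);
    $data = json_decode($response, true);

    // Responses can be very large, so in practice they would be paged or
    // filtered before being passed on to the user interface.
    echo 'Top-level keys: ' . implode(', ', array_keys($data)) . "\n";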

The nomisweb API also offers the ability to download mapping data in KML format, which may be integrated into the DataPortal in the near future.

As WISERD has collected and collated its own metadata on the 2011 Census survey, it is possible to connect the incoming requested data streams to our own metadata. The idea is that this will provide a much richer view of the census questionnaires, in particular connecting the actual question asked to the resulting dataset.

It is a peculiarity of these datasets that the questions asked are not recorded along with the result data; or if they are recorded, it is done entirely within PDFs scanned from a printed census questionnaire. This is obviously a sub-optimal way of recording anything, as it is impossible to search or link data to such a document. Further thinking and research will be required to try to bridge this gap programmatically.

Ideally a method can be devised by which remote datasets can be linked to local metadata in a fairly automated way, to avoid lots of dull data entry. I'll post again in the future as I figure out a way to tackle these problems.
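
To make that idea slightly more concrete, here is a very rough PHP sketch of one possible (assumed, not implemented) approach: scoring each local question against a remote dataset title by the number of shared keywords, and flagging likely pairs for a human to confirm. All of the ids, titles and field names here are hypothetical.

    <?php
    // Hypothetical keyword-overlap matcher between remote dataset titles and
    // local question metadata. Real use would need stop-word filtering, a
    // higher threshold, and a manual confirmation step.
    function keywordOverlap($remoteTitle, $questionText)
    {
        $a = array_filter(preg_split('/\W+/', strtolower($remoteTitle)));
        $b = array_filter(preg_split('/\W+/', strtolower($questionText)));
        return count(array_intersect($a, $b));
    }

    // Tiny made-up samples; in reality these would come from the NOMIS
    // keyword search and the WISERD metadata database respectively.
    $remoteDatasets = array(
        array('id' => 'NM_XXX', 'title' => 'Household composition by religion'),
    );
    $localQuestions = array(
        array('id' => 'q17', 'text' => 'What is your religion?'),
    );

    foreach ($remoteDatasets as $dataset) {
        foreach ($localQuestions as $question) {
            if (keywordOverlap($dataset['title'], $question['text']) >= 1) {
                echo "Candidate link: {$dataset['id']} <-> {$question['id']}\n";
            }
        }
    }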

On the COMSC Cloud Infrastructure

Some facts and figures. 

  • 8 physical servers
  • 172 cores
  • 424 GB RAM
  • 20+ TB storage
  • 10 Gbit backbone networking
  • Multiple 1 Gbit connections to the outside world
  • Running OpenStack cloud software

The VMs used for the development and early production phases of the DataPortal v2 were usually 2-4 cores, 4 GB RAM and about 80 GB of storage.

This can be scaled up later if demand requires it, potentially with load balancers if the added complexity is considered necessary.

The webserver and database servers are both running Ubuntu 12.04, fresh instances of which can be spun up within the Cloud, on demand, in around 15 seconds. As major releases of the DataPortal are performed, "snapshots" of the current state of the VMs are taken to serve as backups and for disaster recovery.

There is also some level of redundancy within the Cloud. Having two "nodes", each containing approximately half of the available cores, means that if one node goes down (which, for an ever-changing, research-led piece of equipment, is an occasional challenge), the second node can be used to spin up previous snapshots, giving a higher level of uptime.

On the DataPortal's Software Stack

The WISERD GeoPortal v1 (alpha) stack looks like this:

and the DataPortal v2 is very similar, with IIS7 replaced by Apache, and ASP.NET replaced by Yii (PHP). See my previous post on why this change happened. The "Apache" that GeoServer sits on is Apache Tomcat (a servlet container, rather than the Apache HTTP server), and the two servers are currently running on different VMs in the COMSC Cloud.

On choices of programming language

Having inherited the alpha version of the GeoPortal codebase, I was tasked with resurrecting the GeoPortal and producing a working version in a production environment. I'll discuss the software stack in a later post, but I wanted to cover the initial choices I had to make first of all.

The WISERD GeoPortal v1 (alpha) backend is coded primarily in the ASP .NET framework which, being a Microsoft development, ideally requires a Microsoft environment to run. This was fine previously, as the GeoPortal was running on a single stand-alone machine running Windows XP, sat in an office at Glamorgan University. Having a small background in the use of Cloud computing, I quickly decided that I would attempt to get the GeoPortal code running in a cloud-y type way.

Thankfully, I have been given the option to use the Cloud created recently within the Cardiff School of Computer Science and Informatics (which I'll refer to as the COMSC Cloud). Specific shout-out to Kieran Evans there for maintaining the oft-temperamental beast, and for giving a load of help when I was figuring all this out. I'll post the specs of the COMSC Cloud later on too.

So having decided to use the Cloud, I had some decisions to make. Do I leave the code as-is and run a Windows OS in the Cloud, or change the code to fit a Linux environment? As I was new to this code, the ASP .NET language, and the IIS server, I could have spent time learning how it all works, only to find I'd wasted that time due to an inability to actually host any of it.

I'll mention now that I tried Mono, which has a Linux-friendly implementation of the ASP .NET MVC framework. This took a bit of time to get my head around, but was a useful expenditure of time, even though it wasn't the solution I eventually went with. Setting up the existing codebase to work with MonoDevelop, the Mono IDE, was tricky and eventually ineffective, as support for .NET v4 was not easily available, and I didn't get the feeling that I would be completely satisfied with the end result. Basically I was forcing the code to fit a hole it didn't want to go into, so I dropped this line of research after a few days. None of this is to say it wouldn't have worked - I just wasn't happy doing it.

In the end, the decision was almost made for me. Using Windows in a Cloud environment would eventually involve licensing issues, and I really didn't want to start down that rabbit hole. The COMSC Cloud was going to be used only as a development environment, so spending money on potentially multiple VMs for an unknown number of weeks or months was not ideal. The university also did not appear to have a Windows hosting environment available (at zero cost) to play with while I was getting to grips with the code. Perhaps it's just my own reluctance to spend money on things, but I was sure early on that the cost could be avoided.

So I turned to Yii. This is a PHP implementation of the MVC model which I'd seen used in previous projects I've worked on, and had dabbled in a little myself. I've used PHP quite a bit in recent months, primarily during my time on the i4Life project, working on the taxonomic Cross-Mapping tools. Essentially I would be far happier working with PHP, and I knew it could be easily hosted on a LAMP stack in the Cloud. If the ASP MVC code could be quickly rewritten into PHP, this would solve all the hosting problems, and I would be far faster at future developments due to my familiarity with the language.
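
For anyone unfamiliar with Yii, the ported controllers end up looking roughly like the sketch below. The class, model and action names are illustrative rather than taken from the DataPortal code.

    <?php
    // Illustrative Yii 1.x controller: a URL such as
    // index.php?r=survey/view&id=42 is routed to actionView() below.
    // 'Survey' is a hypothetical ActiveRecord model.
    class SurveyController extends CController
    {
        public function actionView($id)
        {
            // Fetch the record through the model layer...
            $survey = Survey::model()->findByPk((int) $id);

            // ...then hand it to the matching view template for rendering.
            $this->render('view', array('survey' => $survey));
        }
    }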

So this is what I did. In the space of two weeks during March, I tore through the ASP code and re-wrote it in PHP. Using PhpStorm from JetBrains - which recognises the Yii framework to some extent - I was able to go through every line of code and find a PHP replacement for it. In the end, I was writing regex strings to find-and-replace large blocks of code, which was a cool way to learn regular expressions. Every time I refreshed the GeoPortal page, an extra function worked on the LAMP server running in my development VM. That was a good feeling, and pushed me to keep going.
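
As an example of the sort of find-and-replace involved - a made-up pattern, not one of the ones I actually used - a single regular expression can strip the type from a C# local declaration and prefix the variable with PHP's '$':

    <?php
    // Hypothetical example of the port's find-and-replace style: turn
    //   string title = "Census 2011";
    // into
    //   $title = "Census 2011";
    $csharpLine = 'string title = "Census 2011";';

    $phpLine = preg_replace_callback(
        '/\b(?:string|int|bool|var)\s+(\w+)\s*=/',   // capture the identifier
        function ($matches) {
            return '$' . $matches[1] . ' =';          // drop the type, add '$'
        },
        $csharpLine
    );

    echo $phpLine . "\n"; // $title = "Census 2011";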

So now we have the DataPortal v2. Bit of a name change, and a rewritten backend. Crawling through every line of code forced me to really understand what each line was doing, and was a chance to optimise some of the previous implementation. The code is now on GitHub, so is essentially open source. I'll get to that in a future post too.

To answer the question I didn't really ask in the title - which programming language should I use for this? The answer appears to be:

Use the tool you know how to use

There are a million reasons why PHP is not an ideal language, generally, but porting the code dramatically increased the speed at which I was able to produce new functionality, and bug-fix previous issues.

That is, until I began to look at the JavaScript.