Thursday 31 October 2013

On services utilised by the WISERD DataPortal





The WISERD DataPortal uses a number of remote services to provide the various functions offered to the user. This post will describe the services currently in use, and those planned on being utilised in future.

PostgreSQL + PostGIS


Provided by PostgeSQL, this service is running alongside the web-server, but can be run on a separate VM, or even in a totally different location. There are actually 3 databases, users, survey data and qualitative data. Spatial data and functions provided by PostGIS are also incorporated.

GeoServer


Found here "GeoServer is a Java-based software server that allows users to view and edit geospatial data. Using open standards set forth by the Open Geospatial Consortium (OGC), GeoServer allows for great flexibility in map creation and data sharing".

Loaded with shapefiles of the "Heads of the Valleys", "Bangor Locality", "Aberystwyth Locality" and "Unitary Authorities", but also provides linkages to the PostGIS tools in the database to create new layers.

This service also receives "SLD" objects to produce map overlay layers. This "StyledLayerDescriptor" in the form of an XML document describing a layers property name, the ranges the properties values can fall into, and the colour each value range should be rendered as.

For example, as the main mapping option is to render the response rate of a Survey by region, the SLD describes the "successful" value for a survey, a set of ranges such as "0-100", "101-200", and an RGB colour definition such as "#FFFFFF".

EdinaUnlock


Edina offers a number of services, available though APIs, to provide tools for research and learning. Unlock is a "Service for georeferencing placenames and geographic data searching", from the website.

The DataPortal uses Unlock Places to produce an RDF document from a string of text. We use it for production of metadata for the Qualitative documents primarily to come up with place names mentioned as well as their GPS latitude and longitude. The RDF returned is decoded within the RDFPHP PHPAPI and is shown

Digimap OpenStream

The Digimap OpenStream service provides the DataPortal with its map tiles.

http://openstream.edina.ac.uk/registration/map

These tiles are made available from OS OpenData™ and exposed through another Edina API, saving the need to host the tiles ourselves which would be expensive in both computation and bandwidth.

AlchemyAPI


This is similar to the EdinaUnlock service, except run as a commercial product and provides a variety of open source SDKs to access the API more easily. The PHP SDK  creates PHP objects that are very easy to iterate through and produce results. (It's possible my opinion on this is twisted by my experience in trying to decode RDF in PHP.) The DataPortal isn't using this service "live" yet, it's just in experimental testing and implementation phases to see if any useful functionality can be gained.

Data Feed APIs


These are the APIs used to pull in actual data, linked to the WISERD DataPortal metadata. The aim is to allow users to find data and then download it either to the portal, or export it if necessary.

NomisWEB API

As I've described before, the Nomisweb API is searchable through a rest interface, and returns either XML (SDMX), JSON or browser-friendly HTML. I'm using the JSON format, as it decodes well for use within the Javascript based ExtJS GUI. This is the API I have the most functionality with, as it didn't involve figuring out RDF or SOAP.

NeighbourhoodStatistics

Basically used in the same way as the Nomis API, but with slightly different endpoint construction, and only returns XML (potentially RDF...). Discovery is fine, simple HTTP GET, but the data delivery is requested with a SOAP packet I haven't found time to figure out. The user guide is here.



ONS API

I have literally just discovered that an open access beta to this was released 2 days ago, so I haven't had a chance to look into how it works yet. That will be the content of an imminent blog posting!

Wednesday 30 October 2013

On database sanity, workarounds and cleanout

Insane in the mainframe, insane in the blame


The WISERD Database has grown in both number of records and number of tables during its development, and the individual tables have new columns and data types than in the original implementation.

This has inevitably led to a number of issues with the database becoming inconsistent in its usage and the way the data is presented.



This is a problem I've come up against time and time again as minor irritations, so this post is to describe some issues I've had to discover, debug and work around during development of new features.

Firstly, I'm aware that the easiest way to solve the problems below would be to crawl through the database and ensure that everything in there is formatted identically, references to other tables are correct, and all empty fields are properly populated. Unfortunately, this would be a truly mammoth task, and it would be difficult or impossible to ensure that everything was ever truly correct and homogeneous.

Secondly, quite often I didn't even realise a problem existed until I stumbled across an error while trying to achieve something completely different. Science is nothing without a large dose of serendipity.

In no particular order, I'll list of a few issues, the observation, the problem, the possible cause of the problem, and the solution I've come up with. Often there are also issues with the solution too.

1.

White-space in IDs.

It seems that the previous data entry tools took a survey questions ID to be the entire contents of the input dialogue box, including all the blank space after the typed characters. This means an ID in the database should be :

"qid_sceli86nsnv15"



but instead is recorded as :

"qid_sceli86nsnv15                                                                                                                                                                                                                                                                                           "



This is the id, followed by 283 spaces - invisible white-space characters. In turn, this means if any code takes this value and trims it, it'll no longer match the id in any database queries. Worse, these id's, white-space included, are used in linking tables which define which questions are in which surveys, with which responses, and associated Dublin Core data.

It was very tempting to run a script on the database which detects trailing white-space, trims it, and reinserts it into the database. I may still do this, but for now I'm leaving the database I've inherited alone, and am trying to combat this possible issue in the code using the DB. This means doing a search with the ID as seen in the "linking" table, and if that doesn't return anything, try again with the ID trimmed.

2/

Capitalisation of IDs

Similar to the above issue, the IDs in the linking table are not always identical to those in the Survey Questions or Dublin Core tables. Having spoken to a member of the data entry team, it became apparent that on occasion the IDs would be typed manually rather than being loaded from the DB, and so two different IDs exist which refer to the same record. I wrote a small error checking script into the code-base which attempted to guess why fewer numbers of surveys questions were returned than there should be, resulting in the following messages in my logs:



no question for *qid_sceli86kskq5*

......

neither does untrimmed qid *qid_sceli86kskq5                                  *

......

but lower-case qid matching trimmed *qid_sceli86kskq5* does.

Here the linking table originally told us that the question ID we're looking for is *qid_sceli86kskq5                                  *. Predicting the white-space issue, I trim the ID first, as this brings back the majority - but not all - records. If this doesn't work, we try again with the white-space attached. It's not common for this to return anything either, so then I try the lower() PostgreSQL method on the ID's. This normally finds the question we want. If it doesn't - well I haven't come up with a convincing way of guessing what the next issue could be.

There is an issue here, as future questions added to the database could result in a conflict of IDs. For example "ID1234" and "id1234" are both unique, but lower() would render them as identical, so there may be issues there.



In an Ideal World the questions IDs relate directly to their respective surveys, the "sceli86" part of the question ID above denotes it is part of the "Social Change and Economic Life Survey" from 1986, so hopefully question IDs wouldn't clash, but this is not insured against in the code.


3/


In some cases, values in the DB may be missing completely. Most of the time a null entry is rendered as "N/A" during data entry, but this can also be an empty string. It is unknown whether "N/A" refers to the data being unavailable, non-applicable, removed, deleted, skipped, or missing. This is not really an issue with the database or the code-base, but it is potentially something which could come up in future.


4/



The data entry software allows the user to add columns to a new table. 6 of the columns are created automatically, and populated as such, but there is a use case where the tables will also have columns added to refer to data which is only relevant in this specific case. This surprised me, and has led to some fairly tricky issues.




As it stands, it is now necessary to use the PostgreSQL query :

Select column_name as name from information_schema.columns where table_name =<tablename>;

to find the columns which have been added. Then it is possible to query the database for the data added to these previously unknown columns. This is not an ideal solution to the problem, but again, I considered it important to retain compatibility with the previous data entry tools. Future development will allow storage of this data in a different format within the database, allowing easier searching and retrieval, though this will inevitably result in the data being stored in 2 ways (the old way, and the new format).