Thursday 31 October 2013

On services utilised by the WISERD DataPortal





The WISERD DataPortal uses a number of remote services to provide the various functions offered to the user. This post describes the services currently in use, and those planned for future use.

PostgreSQL + PostGIS


Provided by PostgreSQL, this service runs alongside the web-server, but could equally run on a separate VM, or even in a totally different location. There are actually three databases: users, survey data and qualitative data. Spatial data and functions provided by PostGIS are also incorporated.
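As a rough illustration, a spatial search against the survey database might look something like the sketch below. The table and column names ("survey_regions", "geom", "region_name") are placeholders rather than the real WISERD schema:

<?php
// Minimal sketch of a PostGIS-backed spatial search via PDO.
$pdo = new PDO('pgsql:host=localhost;dbname=survey_data', 'wiserd', 'secret');

// Find every region whose geometry intersects a bounding box supplied by the GUI.
$sql = "SELECT region_name
        FROM survey_regions
        WHERE ST_Intersects(geom, ST_MakeEnvelope(:minx, :miny, :maxx, :maxy, 4326))";

$stmt = $pdo->prepare($sql);
$stmt->execute(array('minx' => -5.5, 'miny' => 51.3, 'maxx' => -2.6, 'maxy' => 53.5));
$regions = $stmt->fetchAll(PDO::FETCH_COLUMN);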

GeoServer


Found here "GeoServer is a Java-based software server that allows users to view and edit geospatial data. Using open standards set forth by the Open Geospatial Consortium (OGC), GeoServer allows for great flexibility in map creation and data sharing".

Our instance is loaded with shapefiles of the "Heads of the Valleys", "Bangor Locality", "Aberystwyth Locality" and "Unitary Authorities", but it also links to the PostGIS tools in the database to create new layers.

This service also receives "SLD" objects to produce map overlay layers. A "StyledLayerDescriptor" is an XML document describing a layer's property name, the ranges that property's values can fall into, and the colour each value range should be rendered in.

For example, as the main mapping option is to render the response rate of a Survey by region, the SLD describes the "successful" value for a survey, a set of ranges such as "0-100" and "101-200", and an RGB colour definition such as "#FFFFFF".
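A rough sketch of how such a rule might be assembled on the PHP side is below; the property name "successful" comes from the survey metadata, while the rest is standard SLD 1.0 boilerplate. This is an illustration rather than the exact markup the DataPortal sends:

<?php
// Build one SLD <Rule>: shade regions whose "successful" value falls in a range.
function buildSldRule($property, $lower, $upper, $colour)
{
    return sprintf(
        '<Rule>
           <ogc:Filter>
             <ogc:PropertyIsBetween>
               <ogc:PropertyName>%s</ogc:PropertyName>
               <ogc:LowerBoundary><ogc:Literal>%d</ogc:Literal></ogc:LowerBoundary>
               <ogc:UpperBoundary><ogc:Literal>%d</ogc:Literal></ogc:UpperBoundary>
             </ogc:PropertyIsBetween>
           </ogc:Filter>
           <PolygonSymbolizer>
             <Fill><CssParameter name="fill">%s</CssParameter></Fill>
           </PolygonSymbolizer>
         </Rule>',
        htmlspecialchars($property), $lower, $upper, $colour
    );
}

echo buildSldRule('successful', 0, 100, '#FFFFFF');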

EdinaUnlock


Edina offers a number of services, available through APIs, that provide tools for research and learning. Unlock is, to quote the website, a "Service for georeferencing placenames and geographic data searching".

The DataPortal uses Unlock Places to produce an RDF document from a string of text. We use it primarily to generate metadata for the Qualitative documents: the place names mentioned in a document, along with their latitude and longitude. The RDF returned is decoded with an RDF PHP API and the results are displayed in the DataPortal.

Digimap OpenStream

The Digimap OpenStream service provides the DataPortal with its map tiles.

http://openstream.edina.ac.uk/registration/map

These tiles are made available from OS OpenData™ and exposed through another Edina API, saving us the need to host the tiles ourselves, which would be expensive in both computation and bandwidth.

AlchemyAPI


This is similar to the EdinaUnlock service, except that it is run as a commercial product and provides a variety of open source SDKs to access the API more easily. The PHP SDK creates PHP objects that are very easy to iterate through to produce results. (It's possible my opinion on this is coloured by my experience of trying to decode RDF in PHP.) The DataPortal isn't using this service "live" yet; it's in experimental testing and implementation to see whether any useful functionality can be gained.

Data Feed APIs


These are the APIs used to pull in actual data, linked to the WISERD DataPortal metadata. The aim is to allow users to find data and then download it either to the portal, or export it if necessary.

NomisWEB API

As I've described before, the Nomisweb API is searchable through a REST interface, and returns either XML (SDMX), JSON or browser-friendly HTML. I'm using the JSON format, as it decodes well for use within the JavaScript-based ExtJS GUI. This is the API I have the most functionality with, as it didn't involve figuring out RDF or SOAP.
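A minimal sketch of the keyword search is below. The .json flavour of the discovery URL and the exact JSON key names are from memory, so treat them as illustrative:

<?php
// Search the Nomis dataset definitions for a keyword and list the matches.
$keyword = 'van';
$url = 'https://www.nomisweb.co.uk/api/v01/dataset/def.sdmx.json?search=*' . urlencode($keyword) . '*';

$data = json_decode(file_get_contents($url), true);

// Walk the keyfamily list and print the id and name of each matching dataset.
foreach ($data['structure']['keyfamilies']['keyfamily'] as $keyfamily) {
    echo $keyfamily['id'] . ' : ' . $keyfamily['name']['value'] . PHP_EOL;
}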

NeighbourhoodStatistics

This is used in basically the same way as the Nomis API, but with slightly different endpoint construction, and it only returns XML (potentially RDF...). Discovery is fine, a simple HTTP GET, but data delivery is requested with a SOAP packet I haven't found time to figure out yet. The user guide is here.
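The discovery side really is just a GET returning XML; a sketch using the FindDatasets endpoint shown elsewhere on this blog looks something like the following. The element names follow the snippet in a later post, and may need adjusting for namespaces or wrapper elements:

<?php
// Keyword discovery against the Neighbourhood Statistics NDE2 service.
$url = 'http://www.neighbourhood.statistics.gov.uk/NDE2/Disco/FindDatasets?Metadata=van';
$xml = simplexml_load_file($url);

// List each matching dataset family's id and name.
foreach ($xml->MatchingDSFamilies->DSFamily as $family) {
    echo $family->DSFamilyId . ' : ' . $family->Name . PHP_EOL;
}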



ONS API

I have literally just discovered that an open-access beta of this was released two days ago, so I haven't had a chance to look into how it works yet. That will be the content of an imminent blog posting!

Wednesday 30 October 2013

On database sanity, workarounds and cleanout

Insane in the mainframe, insane in the blame


The WISERD Database has grown in both number of records and number of tables during its development, and the individual tables have gained new columns and data types since the original implementation.

This has inevitably led to a number of issues with the database becoming inconsistent in its usage and the way the data is presented.



These are problems I've come up against time and time again as minor irritations, so this post describes some issues I've had to discover, debug and work around while developing new features.

Firstly, I'm aware that the easiest way to solve the problems below would be to crawl through the database and ensure that everything in there is formatted identically, references to other tables are correct, and all empty fields are properly populated. Unfortunately, this would be a truly mammoth task, and it would be difficult or impossible to ensure that everything was ever truly correct and homogeneous.

Secondly, quite often I didn't even realise a problem existed until I stumbled across an error while trying to achieve something completely different. Science is nothing without a large dose of serendipity.

In no particular order, I'll list a few issues: the observation, the problem, its possible cause, and the solution I've come up with. Often there are issues with the solution too.

1.

White-space in IDs.

It seems that the previous data entry tools took a survey question's ID to be the entire contents of the input dialogue box, including all the blank space after the typed characters. This means an ID in the database should be:

"qid_sceli86nsnv15"



but instead is recorded as :

"qid_sceli86nsnv15                                                                                                                                                                                                                                                                                           "



This is the ID, followed by 283 spaces of invisible white-space. In turn, this means that if any code takes this value and trims it, it'll no longer match the ID in any database queries. Worse, these IDs, white-space included, are used in the linking tables which define which questions are in which surveys, with which responses and associated Dublin Core data.

It was very tempting to run a script on the database which detects trailing white-space, trims it, and re-inserts the cleaned values. I may still do this, but for now I'm leaving the database I've inherited alone and trying to combat the issue in the code which uses the DB. This means doing a search with the ID as seen in the "linking" table and, if that doesn't return anything, trying again with the ID trimmed.

2.

Capitalisation of IDs

Similar to the above issue, the IDs in the linking table are not always identical to those in the Survey Questions or Dublin Core tables. After speaking to a member of the data entry team, it became apparent that on occasion the IDs were typed manually rather than loaded from the DB, so two different IDs can exist which refer to the same record. I wrote a small error checking script into the code-base which attempts to guess why fewer survey questions were returned than there should have been, resulting in the following messages in my logs:



no question for *qid_sceli86kskq5*

......

neither does untrimmed qid *qid_sceli86kskq5                                  *

......

but lower-case qid matching trimmed *qid_sceli86kskq5* does.

Here the linking table originally told us that the question ID we're looking for is *qid_sceli86kskq5                                  *. Anticipating the white-space issue, I trim the ID first, as this brings back the majority (but not all) of the records. If that doesn't work, we try again with the white-space attached. It's not common for this to return anything either, so then I try the PostgreSQL lower() function on the IDs. This normally finds the question we want. If it doesn't, well, I haven't come up with a convincing way of guessing what the next issue could be.
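In code, the cascade looks roughly like the sketch below; the table and column names ("survey_questions", "qid") are placeholders for the real schema:

<?php
// Try increasingly forgiving matches for a question ID from the linking table.
function findQuestion(PDO $pdo, $linkedId)
{
    // 1. The ID with its trailing white-space trimmed off.
    $stmt = $pdo->prepare('SELECT * FROM survey_questions WHERE qid = :qid');
    $stmt->execute(array('qid' => trim($linkedId)));
    if ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        return $row;
    }

    // 2. The ID exactly as stored in the linking table, white-space and all.
    $stmt->execute(array('qid' => $linkedId));
    if ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        return $row;
    }

    // 3. Last resort: case-insensitive comparison using PostgreSQL's lower().
    $stmt = $pdo->prepare('SELECT * FROM survey_questions WHERE lower(qid) = lower(:qid)');
    $stmt->execute(array('qid' => trim($linkedId)));
    return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
}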

There is a risk here, as future questions added to the database could produce conflicting IDs. For example, "ID1234" and "id1234" are both unique, but lower() would render them identical.



In an ideal world the question IDs relate directly to their respective surveys: the "sceli86" part of the question ID above denotes that it is part of the "Social Change and Economic Life" survey from 1986, so hopefully question IDs won't clash, but this is not guarded against in the code.


3.


In some cases, values in the DB may be missing completely. Most of the time a null entry is rendered as "N/A" during data entry, but this can also be an empty string. It is unknown whether "N/A" refers to the data being unavailable, non-applicable, removed, deleted, skipped, or missing. This is not really an issue with the database or the code-base, but it is potentially something which could come up in future.


4.



The data entry software allows the user to add columns to a new table. Six of the columns are created automatically and populated as such, but there is a use case where a table will also have extra columns added for data which is only relevant to that specific survey. This surprised me, and has led to some fairly tricky issues.




As it stands, it is now necessary to use the PostgreSQL query:

SELECT column_name AS name FROM information_schema.columns WHERE table_name = '<tablename>';

to find the columns which have been added. Then it is possible to query the database for the data added to these previously unknown columns. This is not an ideal solution to the problem, but again, I considered it important to retain compatibility with the previous data entry tools. Future development will allow storage of this data in a different format within the database, allowing easier searching and retrieval, though this will inevitably result in the data being stored in two ways (the old way, and the new format).
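Something along the lines of the sketch below is what the code ends up doing; the six "standard" column names listed here are placeholders, not the real ones:

<?php
// Find the columns a data entry tool added to a survey table, then read them back.
function fetchExtraColumns(PDO $pdo, $tableName)
{
    // Which columns does this table actually have?
    $stmt = $pdo->prepare(
        "SELECT column_name AS name FROM information_schema.columns WHERE table_name = :t"
    );
    $stmt->execute(array('t' => $tableName));
    $columns = $stmt->fetchAll(PDO::FETCH_COLUMN);

    // Ignore the six columns the tool always creates (placeholder names here).
    $standard = array('id', 'qid', 'title', 'description', 'created', 'updated');
    $extra = array_diff($columns, $standard);
    if (empty($extra)) {
        return array();
    }

    // The column names are now whitelisted against information_schema, so it is
    // safe enough (for a sketch) to interpolate them into the query.
    $sql = 'SELECT "' . implode('", "', $extra) . '" FROM "' . $tableName . '"';
    return $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);
}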




Saturday 24 August 2013

On Users, Projects, Visibility and Permissions


This week, I've been thinking a lot about the way the DataPortal will be used in the future - specifically the usage by future project users.

Currently every survey stored by WISERD, or described within the WISERD metadata and stored elsewhere (such as the UK Data Archive), has its metadata globally searchable and viewable by anyone who has registered and logged in.

This is because there is currently no concept within the DataPortal of any users being different from any other users, so all users have the same viewing permissions and download abilities.

While this is fine for any information which has been cleared as "Open" - and obviously this is the ideal situation - it is easy to conceive of situations where information could be stored within the DataPortal which should not be visible to the entire world.

I've been thinking of three main use-cases to try to provide a solution for:

  • Data Entry team members may add part of a survey's metadata, but don't want the survey to be "live" until all the data has been entered.
  • The WISERD hub team want to moderate the data being entered into the DataPortal by external partners, and to control what is shown "live" until it has been sanity checked.
  • A survey entered during the lifetime of a project is not due to be made public until the end of the project, but can be viewed by project members in the meantime.
To implement this, five new database tables have been added to define three different "concepts".

The rest of this post uses Capitalised Words to denote the concepts which now exist. Their definitions follow, as they are currently envisioned, or where implementation has begun:

There is now the idea that a User has a "Role", Users and Surveys belong to a "Project", and a Survey has a "Visibility".

The tables map as follows:

Project ID - Project Name
Standard, every project has an id.

User ID - Project ID
The User ID is taken from the regular user details table; any number of Users can be matched with any number of Projects.

Survey ID - Project ID
Each Survey is owned by one Project. The Survey can be transferred between Projects if and when ownership or responsibility shifts between them.

Visibility ID - Visibility State Name
These visibilities denote what type of users can view a survey. So far there is Global (anyone can see), Project (visible by users listed as being in the Project which owns that Survey), and it is anticipated that a "UserOnly" visibility will be added in future.

Survey ID - Visibility ID
Assigning a Survey a Visibility. A Survey has one-and-only-one Visibility, so ideally there should be as many entries in this table as there are entries in the Survey table. Some defaults may be assumed if there is no Visibility set: if the Survey has a Project owning it, assign it project-only visibility until set otherwise; if the Survey has no Project set, set its visibility to global and assume it is owned by the "wiserd_global" project.

The "wiserd_global" project is a catch-all Project for unclaimed Surveys, and has all its Survey's as globally visible, on the assumption that they were added prior to this Survey-owning paradigm being set in place.

I have also coded up a GUI to go along with the data entry tools, which allows a logged in user to assign the entered Survey to a Project of which that User is a member. Users with Roles which have certain Privileges can create Surveys in the Projects they have access to, and higher level Roles allow assigning Users to such Projects. These hierarchies will be drawn out in further detail after communication with project heads and the WISERD Hub data entry team to ensure that the right number of Roles will exist, with the right permissions and functions available to them.

Obviously the aim here is for people to continue doing their job as they have done before, without adding undue layers of complexity. Hopefully, once a user has been assigned the appropriate Role and has been added to the relevant Projects, they should notice no difference to their everyday tasks. Additionally, changing a User's list of accessible Projects or Roles should have no adverse effects other than changing their ability to interact with a given Survey.

The Roles listed here are not currently implemented, but will be designed with the following ideas in mind. They will be defined better once I have the Data Entry Tools properly adding new Surveys to completion. The Roles will be something along the lines of:

User
Any standard logged in user; can search only Global Visibility Surveys

Project User
A User assigned as a member of a Project; can view all Global Surveys, and has View and Create abilities on Surveys within Projects they are a member of.

WISERD Hub Admin
Able to View, Create and change Visibilities of Surveys and Create Projects. Project Users may tag Surveys for Global Visibility, but the Survey will first be moderated by a Hub Admin before the Global Visibility can be assigned.

SuperAdmin 
Can Create, View, Modify, Moderate, change Visibility of Surveys, elevate Users to different Roles, create new Projects and assign Users to those Projects. Any and all other tasks can be completed by a SuperAdmin, and it is assumed that only a few users will be required with this Role. It is also likely that the Role will involve direct access to the database, in situations where a GUI does not exist to provide the required functionality.

Obviously any often-repeated task, for any User with any Role, should be replaced by an automated task or a GUI which aids its completion.


So far, the implementation in no way affects the functionality of the DataPortal, and the new database tables and GUI additions sit on top of the existing service. In this way, backward compatibility should be assured, with the defaults described above filling in where new functions require information not provided by the previous database.

I'll post again soon on how the Roles are defined, specifically how Yii's inbuilt RBAC (Role Based Access Control) abilities are used to provide permissions and access control to specific users.
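As a taster, a very rough sketch of how that might look with Yii 1.x's CAuthManager is below; the item names ('viewSurvey', 'createSurvey', 'projectUser') are my own placeholders rather than the final Role names:

<?php
// Define a couple of operations, bundle them into a role, and check access.
$auth = Yii::app()->authManager;

$auth->createOperation('viewSurvey', 'view a Survey belonging to a Project');
$auth->createOperation('createSurvey', 'create a Survey within a Project');

$role = $auth->createRole('projectUser', 'member of one or more Projects');
$role->addChild('viewSurvey');
$role->addChild('createSurvey');

// Assign the role to the current user, then gate a controller action on it.
$auth->assign('projectUser', Yii::app()->user->id);

if (Yii::app()->user->checkAccess('viewSurvey')) {
    // render the Survey...
}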







Friday 9 August 2013

On Data Entry Tools and Services


This week, having returned from my Summer break, I've been looking at the data entry tools originally used to add Survey and Qualitative Metadata to the DataPortal.

These originally (GeoPortal v. alpha) took the form of a VB.NET based web front-end for adding Qualitative metadata and a stand-alone VB application "wizard" tool for adding Survey Metadata.

As hinted at in a previous post here, I'm not a fan of VB, and as the main DataPortal code has now been moved over to more cross-platform friendly languages, I decided to convert the code from these tools too.

I have started rewriting the Survey Metadata entry tool in Java, although I am yet to decide if the eventual tool will be in this form. I'll come back to this later.

The Qualitative tool GUI was relatively trivial to move over to JavaScript (ExtJS). The metadata stored is a list of the most commonly used words, output from Edina Unlock, and output from OpenCalais.

The most common words are easily found by iterating through the text in PHP; we store words used more often than a set threshold, averaged over the number of pages.
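The word counting itself is nothing clever; a sketch of the sort of thing the code does is below (the default threshold value here is arbitrary):

<?php
// Count word frequencies and keep anything used more often, per page, than a threshold.
function commonWords($text, $pageCount, $threshold = 2)
{
    // Lower-case, strip punctuation, split on white-space.
    $clean = strtolower(preg_replace('/[^a-z\s]/i', ' ', $text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);

    $common = array();
    foreach ($counts as $word => $count) {
        if (($count / max(1, $pageCount)) >= $threshold) {
            $common[$word] = $count;
        }
    }
    arsort($common);
    return $common;
}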

The OpenCalais output is retrieved from the Enlighten web service with a simple HTTP POST containing an API key, the text to scan, and a small piece of XML defining the search parameters. The result is RDF, in theory...

As it turns out, the RDF returned from OpenCalais is embedded in a "String" tag, inside some XML. This means that a standard XML parser will fail, as it finds the RDF and bugs out, while an RDF parser will fail because of the funky XML in the way. After several hours of frustration, I googled the correct magic words to turn up an API which, bizarrely, extends the default PHP xml_parser with some new methods to clear out the junk wrapper, return the RDF proper, and hand it off to "RAP - RDF API for PHP" (website, SourceForge) to create a PHP model of the document. This can then be iterated through to find the relevant tags we're interested in. The RAP API is actually really well designed, and can clearly do a lot more than I require here; it's one to keep in mind for later. I'm sure future RDF-style metadata objects will be created as we go along with this project.
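For what it's worth, the unwrapping step on its own is fairly short. The sketch below shows the general shape of the request and the unwrap; the endpoint URL and POST field names are from memory, so check them against the OpenCalais documentation before trusting them:

<?php
// Post a document to the Enlighten REST endpoint and pull the RDF out of the wrapper.
$apiKey       = 'YOUR_CALAIS_KEY';
$documentText = 'The interviews took place in Aberystwyth and Bangor...';
$paramsXml    = '<!-- the small piece of XML defining the search parameters -->';

$context = stream_context_create(array('http' => array(
    'method'  => 'POST',
    'header'  => 'Content-Type: application/x-www-form-urlencoded',
    'content' => http_build_query(array(
        'licenseID' => $apiKey,
        'content'   => $documentText,
        'paramsXML' => $paramsXml,
    )),
)));
$response = file_get_contents('http://api.opencalais.com/enlighten/rest/', false, $context);

// The RDF comes back as the text content of a single <string> element, so grab
// that rather than trying to parse the whole response as RDF directly.
$doc = new DOMDocument();
$doc->loadXML($response);
$rdf = $doc->documentElement->textContent;   // the actual RDF/XML, ready for RAP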

Edina Unlock is the next challenge. It seems that at some time in the last two years (since GeoPortal alpha was first created) the service has changed its API, and the old integration no longer works. This will be one of the next challenges going forward, but the documentation seems sound enough.

There are two APIs, "text" and "places". At first glance, it appears that "text" jobs are now submitted and executed remotely; the user polls for progress, then retrieves the results once they are produced. This is wrapped in a user auth model so you poll for a specific user's job. It's a bit more technical to implement, and will involve our data entry team watching a progress bar until the results are ready, but I guess it makes sense on their end if they're receiving a ton of requests. I'm not sure yet if this service is required, as we have Edina Unlock working already; we'll see.

The "places" API seems far easier to use, with simple jsonp or HTTP GET's returning JSON formatted data.

I'll update this post with my experiences of this once I've figured it out.

As always, the DataPortal code is on GitHub, so you can check out my solutions should you desire more in depth demonstrations of how I've managed to get it all working.




Wednesday 17 July 2013

On the DataPortal Users Knowledge Exchange Event


On Friday, June 12th 2013, WISERD hosted a Knowledge Exchange event demoing the WISERD DataPortal.

This was a chance to interact with a variety of members from academic and professional communities. We had a great turnout, with lots of positive comments and conversations.

The day comprised a summary presentation of the current state of metadata with regard to statistical data collection and the aims of the project, followed by a demonstration of the primary functionality of the DataPortal, and ended with an opportunity for the attendees to try out the DataPortal in a computing lab while providing feedback.

Overall, personally I think the day was pretty successful, and I'm very thankful for how enthusiastic and engaged everyone who attended was with their comments and feedback. To summarise the outcomes of the sessions, I'm going to cover two aspects of what I discovered throughout the day - the user feedback, and the technical aspects of hosting the event.

User feedback:

We put a bit of thought into how the day should be organised, based on what we wanted to get out of it. From my development perspective, I wanted to get some good user testing done - this event was unique in that the majority of the users had little or no experience with the DataPortal, and so would come to it completely fresh. This was a factor worth capitalising upon, as it would influence the users' experience greatly. How the DataPortal was demonstrated and introduced would change the quality of the feedback hugely.

I saw that there were two options:

  • Give an in depth tutorial of how to use the software, and then use the following lab session as a test of how easy users found it to achieve tasks specific to their own interests. This would give us insight into what users expect in such a product, without them having to learn through intuition alone.
or:
  • Give a quick overview of the software, so that we could see how usable the software was to someone completely new to it. This would be more of a usability testing scenario, rather than discovering what features users required to complete tasks relevant to their areas of expertise.
In the end, the lab session ran longer than I initially expected, and we had a range of users with different technical backgrounds and experience. This meant that watching the room as a whole would give wide ranging information about user expectations (some comments were repeated from nearly everyone in the room, indicating a strong level of expectation in some areas). Meanwhile, users could be questioned individually, and could ask questions themselves, as they attempted to achieve different tasks on their own.

So with enough attention, it was possible to gain knowledge on usability at a basic level (which is essentially bug reporting when features are obviously missing, broken or misbehaving), and much higher level feature requests for future versions to consider meeting.

This was brilliant feedback, and again I thank everyone who attended. Perhaps they didn't realise how closely I was paying attention, or even hoped I was watching closer, but I believe I've collected a lot of feedback which will guide future developments.

Now, the second aspect of the day...

Technical :

Here I'm going to describe how the software framework was set up in advance of the event, and lessons learnt through its usage (and inevitable failure).

I was aiming throughout the development part of this project to keep things as cost effective as possible, which basically meant not spending any money unless I couldn't think of any possible way to avoid it. The logic of this is easily pulled apart (how many man-hours should be spent saving a few £s?). But if I reword the problem as "how can we most efficiently meet demand here?", then maybe it's a more reasonable challenge to attempt.

To this end, I was attempting to put the WISERD MetaData database, as well as the website webserver, on the same tiny VM in the Amazon EC2 cloud.

This was a mistake. It was obvious before we started that this would be nowhere near enough (the Amazon "free" allocation is meant for testing only, and should never have been expected to support the demands I was throwing at it). But it was with an almost morbid curiosity that I stood at the back of the room as people registered and logged into the service, waiting for it to fall over. By 5 minutes in I had almost persuaded myself that it could even work - everyone had the home page open (in Google Chrome, more to follow on that one...), and the server seemed fine. It was only when people attempted to perform searches on the database that things began to slow down... then stop.

So, predictably, you can't have 20 people doing relatively complex searches, including spatial searches, on >2GB of database, with only 680 MB of RAM, and effectively half of a CPU core.

If anything, I'm surprised at how patiently some users are willing to stare at a loading screen, when I would have become bored and angry long before. This experiment (to pretend it was intentional) also gave insight into what users expect in the way of error messages, and what they consider an error to be.

It's important for me, as the developer, to agree with everything a user says at this point. Of course, I could become defensive and point out how to gain the expected results in the correct way, but that's not useful. It's much more important for the developer to change the functionality of the system to match the usage pattern learnt when using other similar systems.

I know how to use the system, but no-one is going to be impressed by that - because I wrote it. Of course I know how to use it!

So when a search takes more than 5 minutes to complete and then returns a blank page, I know this means the communication between the web-server and the database has taken longer than the default time-out, and so has been dropped. The JavaScript Ajax query then returns and continues to load the results form, but has no data to show, so the table remains empty.

This isn't how our imaginary user sees it (no resemblance to real-life events is intended here!). If a progress bar takes 5 minutes, that's OK; perhaps there is a lot of data to search. Different computers take different amounts of time to achieve things, so waiting is a guessing game at best. However, returning a blank form is just wrong. It gives the impression that zero results were found, which is not the case. In fact, zero results were returned from the server, because the search failed. If it had been successful, it might have returned hundreds of results, but that isn't what happened. This behaviour presents untrue *information*: that the metadata does not contain anything related to the search query.

Events similar to this description did occur, and it's important to point out that this is entirely the fault of the developer, i.e., me. If something fails, the user should be told it has failed, in terminology understandable by humans if possible. Carrying on and pretending the error didn’t happen gives false information.
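On the PHP side of an Ajax search, "telling the user" might look something like the sketch below. The table name and message wording are placeholders, and the real code would want to be rather more careful with the query string:

<?php
// Run the search, but send an explicit error instead of an empty 200 response on failure.
header('Content-Type: application/json');
try {
    $pdo = new PDO('pgsql:host=localhost;dbname=survey_data', 'wiserd', 'secret', array(
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ));
    $stmt = $pdo->prepare('SELECT * FROM surveys WHERE title ILIKE :q');
    $stmt->execute(array('q' => '%' . $_GET['query'] . '%'));

    echo json_encode(array('success' => true, 'results' => $stmt->fetchAll(PDO::FETCH_ASSOC)));
} catch (Exception $e) {
    header('HTTP/1.1 500 Internal Server Error');
    echo json_encode(array(
        'success' => false,
        'message' => 'The search failed or timed out. Please try again.',
    ));
}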

The fall-back plan :

A lot of the errors and unexpected behaviour were due to the performance of the server, which was expected. So after about 15 minutes I released the address of the "development server" - a complete clone of the DataPortal, running on a much more capable machine. I had built this software stack a few days before on the COMSC cloud (described in a previous post), with around 80GB of disk space, 4GB of RAM, and 2 dedicated cores. This proved to be much more stable, and we split the users between the "live" Amazon server and the "dev" server in a roughly 30-70 split. With the load reduced, and some users running both and switching to whichever seemed most responsive at the time, people were able to actually use the DataPortal and give valuable feedback on how it might actually be useful to them.

The COMSC Cloud proved to be a life-saver (thanks once again to Kieran Evans for the loan of these resources) and allowed the demo to continue relatively seamlessly. I was still restarting GeoServer every few minutes, due to a bug I have yet to track down, but the underlying infrastructure was sound.

So another lesson learnt there. I'm now roughly aware of what requirements our VM should be scaled to in future. It's not worth listing it here, as this information will be out of date far too fast to be useful, but needless to say, more is better.

I've also been looking at database optimisation techniques, PHP configuration changes to reduce load, minification of JavaScript files, and a large number of other tweaks here and there which would greatly speed up a live system in future.

I'll also scale the cloud infrastructure to meet demand ahead of time, even if it's for a few hours, then scale it back again to reduce costs. This is the main selling point of cloud computing, but it's good to have a real life demonstration of its necessity, as well as a possible solution. The fact that I had predicted the problem, prepared a backup server as such a solution, and saw it work almost immediately felt pretty good too.

So again, thanks to everyone who attended. I'll be writing a blog post more accurately covering the feedback in the coming weeks.

Tuesday 25 June 2013

On the Naming, Identifying and Linking of Data


I've spent the past few days working further with the NOMISWEB API, and getting myself acquainted with the Neighbourhood API. Both provide a way to access the subsets of the 2011 Census data which have been released so far.

Help files here and here; both are invaluable for doing anything with these services' end points!

The idea driving development is to take the metadata created by WISERD and stored within our own database and link it up to the datasets stored behind these remote APIs. As such, "data feeds" can be pulled into the DataPortal on demand, as a user searches according to their needs.

This requires multiple steps. Firstly, writing a wrapper for each API to discover what datasets are out there, in a way which can hopefully be reused in future to access APIs not currently under investigation. For example, in future we plan to link the DataPortal up to the ONS API currently under development. I hope this new API is similar to either of the previous two, so that accessing it fits the current PHP class interface when I come to write it.

I'll probably create an end-point which wraps access to these other APIs in future, allowing a one-hit search, refine and download of datasets across multiple external APIs, from the DataPortal's own API.

It turned out to be relatively easy to perform keyword searches on the NOMIS and Neighbourhood APIs. The URLs accept easily readable GET variables to create either:

https://www.nomisweb.co.uk/api/v01/dataset/def.sdmx.xml?search=*van*

http://www.neighbourhood.statistics.gov.uk/NDE2/Disco/FindDatasets?Metadata=van

My current favourite test-case question is the "2011 Census: Car or Van Availability, 2011", purely because "van" is quick to type and doesn't appear as part of many other words. Searching for "car" brought back datasets including "care", and slowed down my debugging. Also, WISERD has metadata for this census question, making it an ideal start for "question matching".

Without going into too much description of XML or JSON, the API requests above return a ton of metadata produced by each institution's own API.

A snippet of the NOMIS reply:

...
<KeyFamilies>
<structure:KeyFamily id="NM_548_1" agencyID="NOMIS" version="1.0" uri="Nm-548d1">
<structure:Name xml:lang="en">QS416EW - Car or van availability</structure:Name>
<Annotations>...

And a snippet of the neighbourhood statistics reply:

<MatchingDSFamilies>
<DSFamily>
<DSFamilyId>2511</DSFamilyId>
<Name>Car or Van Availability, 2011 (QS416EW)</Name>
</DSFamily>
It took me a while to get my head around all this. Putting them side by side is clearer now, but at the time I was working only with NOMIS, so the conclusion was less obvious.

Basically, I needed an identifier provided by each API with which to link together all our datasets. Then when a DataPortal user searches for question metadata, we have a reference with which to say "yes, there is other data/ metadata out there, here it is for you!".

So it turns out each API has its own "unique" identifier for each question in their databases.

For clarity, searching for "van" in each API gives:

WISERD DataPortal : qid_c11hqh12-S2
NOMISWEB API : NM_548_1 and Nm-548d
Neighbourhood : 2511
I have issues with all of these!

The internals of the WISERD DB give some explanation of how that ID is formed: question ID, census 2011, some key letters, question numbers and a partial breakdown of each major question into sub-parts.

NOMIS presumably has NM, a dataset ID and some sub-part. The response above contains an ID and a URI - it wasn't initially obvious which to use here.

Neighbourhood was the least useful; I can only guess this is the 2511th question they have on file.

Either way, I made a discovery which was, for me, stunning. While I was looking for a tagged ID to refer *globally* to a questionnaire question, I realised that there was in fact such an ID, but no-one was using it...
 QS416EW
A rant follows!

What this string of letters and numbers means isn't important; I'm never going to write a program which guesses unique identifiers. What amazed me is that this string exists inside the "name" STRING in each API response, but doesn't have its own distinct tag in any of the responses. Here is a way to identify something, which I can only assume existed from the start, and no-one used it, instead preferring to create their own (sometimes multiple) unique identifiers within their own systems.

Now, I understand that within any dataset, you can't define globally which identifiers should be used. However it would still make sense to recognise its existence for future reference.

End of rant.

So the solution becomes obvious. Match within the WISERD DataPortal database the WISERD_ID, the external resource ID and, if available, the identifier provided by the creator of the survey. And make it searchable. In future this will make defining datasets a lot easier for everyone, and at the very least no information is lost along the way.
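A sketch of what that lookup might look like is below; the table name ("id_crosswalk") and its columns are hypothetical, but the IDs in the closing comment are the real ones from the list above:

<?php
// Given any one identifier, return every identifier we know refers to the same question.
function findLinkedIds(PDO $pdo, $anyId)
{
    $stmt = $pdo->prepare(
        'SELECT wiserd_id, source_name, external_id, creator_id
         FROM id_crosswalk
         WHERE wiserd_id = :a OR external_id = :b OR lower(creator_id) = lower(:c)'
    );
    $stmt->execute(array('a' => $anyId, 'b' => $anyId, 'c' => $anyId));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Once populated, findLinkedIds($pdo, 'QS416EW') should bring back qid_c11hqh12-S2,
// NM_548_1 and the Neighbourhood Statistics family 2511 in one hit.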

One of the next blog posts I write will be an investigation into why unique identifiers can have problems, and the issues with data integrity/ reliability within a database.

<spoiler alert>

caSe seNsitIvitY.


Tuesday 4 June 2013

On Usage of Development and Production Services

As the DataPortal is now at v2, it was necessary to move it to an environment more suitable for a production service.

The COMSC Cloud installation is being retained as a development and testing environment, to which incremental code changes are pushed between major releases. It may be possible to open this development service to other users if demand for early access becomes significant. It is almost intentionally unstable, however: once features become stable and functional, they are pushed to the Production server. Hopefully this three-stage development (the first stage being a build on several virtual machines on my local machine) will mean development can continue alongside user testing of released features, without one adversely affecting the other.

To this end, we set up a new cloud instance on Amazon's AWS service. As the previous development installation had been on Ubuntu in the COMSC Cloud, it was relatively trivial to set up an Ubuntu instance with EC2 and install everything again.

I took this one step further and kept a record of every terminal command typed to set up the DataPortal, and created a series of scripts which automate the installation in future. This is generally good practice; I've often heard amongst developers that if anything you're doing is more than trivial, or requires Googling to find the correct command-line incantation, you should save it as a script for future use.

So now I have a script which installs the DataPortal with almost zero interaction, which is nice. It's still time consuming, and it turns out that EC2 charges for disk I/O, but it's nice to know that future builds won't be laborious and repetitive, leaving time to consider new features and implementations. It's all about efficiency.

On Releasing Software Products, and Managing User Expectations

We've recently taken the decision to "release" the WISERD DataPortal. This isn't to say that all the functionality we're after is 100% implemented, but that which is there is working in a fairly stable manner. There are bugs, but all software has bugs - I'll write a further post about how bugs can be defined and dealt with later on.

Releasing software, especially the first roll-out of a new service, is stressful. I'm having to force myself to say "This works, let people try and use it", when my gut feeling is more "But what if I've overlooked something major, and end up looking stupid?".

As far as I can tell from the releases of all software (and even hardware), from small "indie" games to huge operating-system-scale products, bugs are always going to be discovered and it's up to the developers to manage these issues as they arise.

At this point, different companies and institutions have different ways of managing their users' expectations. I'll take three examples from common household names and describe their reactions to users' issues with their products. On the "low" end there's Mojang's Minecraft, considerably larger is Canonical's Ubuntu, and at the very high end there's Microsoft's Windows 8 operating system. For a bonus example from the hardware sector I'll describe Apple's iPhone 4. I hope my reasoning for choosing these becomes clearer as I go.

Indie games are generally smaller products, produced by small teams, with low budgets, and are targeted at a fairly niche audience. These are not summer blockbusters, but rather more experimental pieces which can afford to drift slightly from the accepted expectations of what certain genres normally offer. Due to the need to get a revenue stream going as quickly as possible, alpha, beta and RC (release candidate) versions are pushed out to the public for testing very frequently, up to multiple releases per week. The users' expectations are managed by the developers through a continuous discourse on social media, where it is admitted openly that the product is not finished, major features are likely to be broken or missing entirely, or that each version may be incompatible with previous releases.

This is not a lazy or incomplete way of working, as both the developer and the users gain from this relationship. Developers, as long as they listen to the user feedback, essentially have a free testing panel, and new features can be added dynamically with the users' requests in mind. Obviously managerial oversight and vision is required, but the overall development process is very inclusive of the users' wishes. It is also a more personal approach, with users often getting to know the developers by name, and personalities can show through.

In contrast, the development and release of massive software packages such as those created by Microsoft, Apple or Google are not able to be as collaborative with the user base. It is easy to say there is zero relationship with the users until the release of the final version, though this is not entirely true.

Microsoft managed their users' expectations by releasing initial press releases describing Windows 8 very early on, with screenshots, slideshows and videos of actors looking very happy to maximise application windows in new and exciting ways. A beta version of Windows 8 was released for those interested in the "Developer Preview", which was downloaded over half a million times in the first 12 hours of its release. This was a totally different type of user interaction, essentially talking to developers of future software designed to run on Windows 8. Regular users would have little input on features within this development; their feedback would mainly take the form of bug reports or forum posts to customer services representatives. While there is a development blog, the general vibe is still one-way, with information flowing from the developers outwards. As a market leader, new features are designed to show users what they can have, rather than creating what users want.

Bridging the gap between the fairly small and the extremely large, Canonical develops the Ubuntu operating system in a way that uses user input very collaboratively (users' code can end up in future releases) while retaining overall quality control and direction. Forums, blogs and press releases are all used; of specific interest here is the Brainstorm feature-request site, where users can create and vote upon ideas within the community. This is a great way to crowd-source the development process and target efforts at the features most in demand.
 
To finish up, I'll mention the PR fiasco that occurred around the iPhone 4 when users complained about reception issues. It turned out that holding the phone a certain way caused a short circuit across the metal band around the phone which acts as the aerial. User feedback was met with a very strange reply which confused and angered users, eventually ending in a class-action lawsuit and a free phone cover for all iPhone owners. The message here is not that lawsuits are a convenient way to improve product design, but rather that interaction with users should always be collaborative and productive to both sides.

So the message here, which I've taken a long route to get to, is that communication with users improves products and user experience, as well as providing valuable insight into the needs of the user when planning future features and improvements. For smaller audiences, social media and a more personal interaction can be invaluable in building a user-base that can actually make the developer's life easier, rather than more stressful.


Wednesday 15 May 2013

On data feeds, and the NOMISWEB API

It is an aim of the current development phase of the WISERD DataPortal to be able to use not only the survey metadata collated and stored within the local database, but also external data from a variety of different sources.

The Office for National Statistics (ONS) is in the process of producing an API to allow access to survey data collected within the UK. This is due within the coming months, but in the meantime I have been looking into a few other sources of data which provide similar functionality as is to be expected from the ONS API.

In particular, the API provided by NOMIS is very extensive and easily usable. They are providing "Key Statistics and Quick Statistics" from the 2011 census data, which is proving to be the perfect dataset with which to develop and test new DataPortal functionality.

So far I have knocked together a simple web-based GUI which allows the user to search all the datasets provided by NOMIS by keyword. This retrieves metadata relating to the geographic areas the data results have been broken down into, so it's possible to request data values for regions ranging from entire countries, through LSOAs, down to postcode granularity. These results are requested in JSON format (other formats are available) and can be very large, so further thought is needed on how to provide the data in the way most usable to the user.

The nomisweb API also offers the ability to download mapping data in KML format, which may be integrated into the DataPortal in the near future.

As WISERD has collected and collated its own metadata on the 2011 census survey, it is possible to connect the incoming data streams to our own metadata. The idea is that this will provide a much richer view of the census questionnaires, in particular connecting the actual question asked to the resulting dataset.

It is a peculiarity of these datasets that the questions asked are not recorded along with the result data; or, if they are recorded, it is done entirely within PDFs scanned from a printed census questionnaire. This is obviously a sub-optimal way of recording anything, as it is impossible to search or link data to such a document. Further thinking and research will be required to try to bridge this gap programmatically.

Ideally a method can be devised by which remote datasets can be linked to local metadata in a fairly automated way, to avoid lots of dull data entry. I'll post again in the future as I figure out a way to tackle these problems.

On the COMSC Cloud Infrastructure

Some facts and figures:

  • 8 Physical servers
  • 172 Cores
  • 424 GB RAM
  • 20+ TB Storage
  • 10GBit Backbone networking
  • Multiple 1GB connections to the outside world.
  • Running OpenStack Cloud software
The VMs used for the development and early production phases of the DataPortal v2 usually had 2-4 cores, 4 GB RAM and about 80 GB storage.

This can be scaled up later if demand requires it, potentially with load balancers if the extra complexity is considered necessary.

The webserver and database servers are both running Ubuntu 12.04, fresh instances of which can be spun up within the cloud, on demand, in around 15 seconds. As the major releases of the DataPortal are performed, "snapshots" of the current state of the VMs are taken in order to use as backups, or disaster recovery.

There is also some level of redundancy within the Cloud. Having two "nodes", each containing approximately half of the available cores, means that if one node goes down (which, for an ever-changing, research-led piece of equipment, is an occasional challenge), there is a chance the second node can be used to spin up previous snapshots, achieving a higher level of uptime.

On the DataPortal's Software Stack

The WISERD GeoPortal v1 (alpha) stack looks like this :

and the DataPortal v2 stack is very similar, with IIS7 replaced by Apache and ASP.NET replaced by Yii (PHP). See my previous post on why this change happened. GeoServer sits on Apache Tomcat, and the two servers are currently running on different VMs in the COMSC Cloud.

On choices of programming language

Having inherited the alpha version of the GeoPortal codebase, I was tasked with resurrecting the GeoPortal and producing a working version in a production environment. I'll discuss the software stack in a later post, but I wanted to cover the initial choices I had to make first of all.

The WISERD GeoPortal v1 (alpha) backend is coded primarily in the ASP .NET framework which, being a Microsoft development, ideally requires a Microsoft environment to run in. This was fine previously, as the GeoPortal was running on a single stand-alone Windows XP machine sat in an office at Glamorgan University. Having a small background in Cloud computing, I quickly decided that I would attempt to get the GeoPortal code running in a cloud-y type of way.

Thankfully, I have been given the option to use the Cloud created recently within the Cardiff School of Computer Science and Informatics (which I'll refer to as the COMSC Cloud). A specific shout-out to Kieran Evans there, for maintaining the oft-temperamental beast and giving a load of help while I was figuring all this out. I'll post the specs of the COMSC Cloud later on too.

So having decided to use the Cloud, I had some decisions to make. Do I leave the code as-is and run a Windows OS in the Cloud, or change the code to fit a Linux environment? As I was new to this code, to ASP .NET, and to the IIS server, I could spend time learning how it all works, and then find I'd wasted that time due to an inability to actually host it.

I'll mention now that I tried Mono, which has a Linux-friendly implementation of the ASP .NET MVC framework. This took a bit of time to get my head around, but was a useful expenditure of time, even though it wasn't the solution I eventually went with. Setting up the existing codebase to work with MonoDevelop, the Mono IDE, was tricky and ultimately ineffective, as support for .NET v4 was not easily available, and I didn't get the feeling that I would be completely satisfied with the end result. Basically I was forcing the code to fit a hole it didn't want to go into, so I dropped this line of research after a few days. None of this is to say it wouldn't have worked - I just wasn't happy doing it.

In the end, the decision was almost made for me. Using Windows in a Cloud environment is eventually going to involve licensing issues, and I really didn't want to start down that rabbit hole. The COMSC Cloud was going to be used only as a development environment, and so spending money on potentially multiple VMs for an unknown number of weeks or months was not ideal. The university also did not appear to have a Windows hosting environment available (at zero cost) to play with while I was getting to grips with the code. Perhaps it's my own inability to spend money on things, but I was sure early on that it could be avoided.

So I turned to Yii. This is a PHP implementation of the MVC model which I'd seen used in previous projects I've worked on, and had dabbled in a little myself. I've used PHP quite a bit in recent months, primarily during my time on the i4Life project, working on the taxonomic Cross-Mapping tools. Essentially I would be far happier working with PHP, and I knew it could be easily hosted on a LAMP stack in the Cloud. If the ASP MVC code could be quickly rewritten into PHP, this would solve all the hosting problems, and I would be far faster at future developments due to my familiarity with the language.

So this is what I did. In the space of two weeks during March, I tore through the ASP code, and re-wrote it in PHP. Using PHPStorm from JetBrains - which recognises the Yii framework to some extent - I was able to go through every line of code and find a PHP replacement for it. In the end, I was producing REGEX strings to find-and-replace large blocks of code, which was a cool way to learn regular expressions. Every time I refreshed the GeoPortal page, an extra function worked using the LAMP server I was running in my development VM. So that was a good feeling, and pushed me to keep going.

So now we have the DataPortal v2. Bit of a name change, and a rewritten backend. Crawling through every line of code forced me to really understand what every line was doing, and was a chance to optimise some of the previous implementation. The code is now on GitHub, so it is essentially open source. I'll get to that in a future post too.

To answer the question I didn't really ask in the title - Which programming language should I use for this? The answer appears to be :

Use the tool you know how to use

There are a million reasons why PHP is not an ideal language, generally, but porting the code dramatically increased the speed at which I was able to produce new functionality, and bug-fix previous issues.

That is, until I began to look at the JavaScript.