WISERD DataPortal Development

Wednesday, 30 September 2015

On how we reached here, standing on the shoulders on giants

This post will be a progressively updated list of links to external sites, with an explanation on their usefulness generally and their utility within the dataportal.

Currently in no specific order:

https://www.nomisweb.co.uk/api/v01/help

http://blog.ouseful.info/2015/03/09/sketching-out-a-python-pandas-remote-data-access-wrapper-for-the-nomis-api/

https://gist.github.com/psychemedia/b1fadd04af01ccd6a3a1#file-nomis_api_wrapper-py

http://www.r-bloggers.com/opening-up-access-to-data-why-apis-may-not-be-enough/

http://gka.github.io/chroma.js/#chroma-contrast

http://blog.webkid.io/maps-with-leaflet-and-topojson/

http://www.empreinte-urbaine.eu/mapping/geostats/

http://redditgraphs.com/blog/2015/01/04/coloring-maps-with-d3/

Monday, 28 September 2015

On returning to developments, new beginnings and future plans

After a hiatus of just over a year, new development work has begun on the WISERD DataPortal.

The previous version remains at the same URL as previously (data.wiserd.ac.uk). This will remain in place for the foreseeable future, while developments on the new "version 3" of the DataPortal continues.

Version 3 of the DataPortal is close to a complete rewrite, though all the previous data will be retained. The objective is to have the new version use updated software libraries and technologies, and be easily expandable to have the features required for the most recent WISERD projects demands.

And approximate run down of the new software stack follows:

Frontend - Bootstrap, leaflet, datatables, jquery-ui

Backend - Python/ Django (webserver and database models/access), Gunicorn, Nginx

Database - PostgreSQL, PostGIS

Ancillary tools - MapShaper, Redis, Celery

More info to come!

Thursday, 31 October 2013

On services utilised by the WISERD DataPortal

The WISERD DataPortal uses a number of remote services to provide the various functions offered to the user. This post will describe the services currently in use, and those planned on being utilised in future.

PostgreSQL + PostGIS

Provided by PostgeSQL, this service is running alongside the web-server, but can be run on a separate VM, or even in a totally different location. There are actually 3 databases, users, survey data and qualitative data. Spatial data and functions provided by PostGIS are also incorporated.

GeoServer

Found here "GeoServer is a Java-based software server that allows users to view and edit geospatial data. Using open standards set forth by the Open Geospatial Consortium (OGC), GeoServer allows for great flexibility in map creation and data sharing".

Loaded with shapefiles of the "Heads of the Valleys", "Bangor Locality", "Aberystwyth Locality" and "Unitary Authorities", but also provides linkages to the PostGIS tools in the database to create new layers.

This service also receives "SLD" objects to produce map overlay layers. This "StyledLayerDescriptor" in the form of an XML document describing a layers property name, the ranges the properties values can fall into, and the colour each value range should be rendered as.

For example, as the main mapping option is to render the response rate of a Survey by region, the SLD describes the "successful" value for a survey, a set of ranges such as "0-100", "101-200", and an RGB colour definition such as "#FFFFFF".

EdinaUnlock

Edina offers a number of services, available though APIs, to provide tools for research and learning. Unlock is a "Service for georeferencing placenames and geographic data searching", from the website.

The DataPortal uses Unlock Places to produce an RDF document from a string of text. We use it for production of metadata for the Qualitative documents primarily to come up with place names mentioned as well as their GPS latitude and longitude. The RDF returned is decoded within the RDFPHP PHPAPI and is shown

Digimap OpenStream

The Digimap OpenStream service provides the DataPortal with its map tiles.

http://openstream.edina.ac.uk/registration/map

These tiles are made available from OS OpenData™ and exposed through another Edina API, saving the need to host the tiles ourselves which would be expensive in both computation and bandwidth.

AlchemyAPI

This is similar to the EdinaUnlock service, except run as a commercial product and provides a variety of open source SDKs to access the API more easily. The PHP SDK creates PHP objects that are very easy to iterate through and produce results. (It's possible my opinion on this is twisted by my experience in trying to decode RDF in PHP.) The DataPortal isn't using this service "live" yet, it's just in experimental testing and implementation phases to see if any useful functionality can be gained.

Data Feed APIs

These are the APIs used to pull in actual data, linked to the WISERD DataPortal metadata. The aim is to allow users to find data and then download it either to the portal, or export it if necessary.

NomisWEB API

As I've described before, the Nomisweb API is searchable through a rest interface, and returns either XML (SDMX), JSON or browser-friendly HTML. I'm using the JSON format, as it decodes well for use within the Javascript based ExtJS GUI. This is the API I have the most functionality with, as it didn't involve figuring out RDF or SOAP.

NeighbourhoodStatistics

Basically used in the same way as the Nomis API, but with slightly different endpoint construction, and only returns XML (potentially RDF...). Discovery is fine, simple HTTP GET, but the data delivery is requested with a SOAP packet I haven't found time to figure out. The user guide is here.

ONS API

I have literally just discovered that an open access beta to this was released 2 days ago, so I haven't had a chance to look into how it works yet. That will be the content of an imminent blog posting!

Wednesday, 30 October 2013

On database sanity, workarounds and cleanout

Insane in the mainframe, insane in the blame

The WISERD Database has grown in both number of records and number of tables during its development, and the individual tables have new columns and data types than in the original implementation.

This has inevitably led to a number of issues with the database becoming inconsistent in its usage and the way the data is presented.

This is a problem I've come up against time and time again as minor irritations, so this post is to describe some issues I've had to discover, debug and work around during development of new features.

Firstly, I'm aware that the easiest way to solve the problems below would be to crawl through the database and ensure that everything in there is formatted identically, references to other tables are correct, and all empty fields are properly populated. Unfortunately, this would be a truly mammoth task, and it would be difficult or impossible to ensure that everything was ever truly correct and homogeneous.

Secondly, quite often I didn't even realise a problem existed until I stumbled across an error while trying to achieve something completely different. Science is nothing without a large dose of serendipity.

In no particular order, I'll list of a few issues, the observation, the problem, the possible cause of the problem, and the solution I've come up with. Often there are also issues with the solution too.

1.

White-space in IDs.

It seems that the previous data entry tools took a survey questions ID to be the entire contents of the input dialogue box, including all the blank space after the typed characters. This means an ID in the database should be :

"qid_sceli86nsnv15"

but instead is recorded as :

"qid_sceli86nsnv15                                                                                                                                                                                                                                                                                           "

This is the id, followed by 283 spaces - invisible white-space characters. In turn, this means if any code takes this value and trims it, it'll no longer match the id in any database queries. Worse, these id's, white-space included, are used in linking tables which define which questions are in which surveys, with which responses, and associated Dublin Core data.

It was very tempting to run a script on the database which detects trailing white-space, trims it, and reinserts it into the database. I may still do this, but for now I'm leaving the database I've inherited alone, and am trying to combat this possible issue in the code using the DB. This means doing a search with the ID as seen in the "linking" table, and if that doesn't return anything, try again with the ID trimmed.

2/

Capitalisation of IDs

Similar to the above issue, the IDs in the linking table are not always identical to those in the Survey Questions or Dublin Core tables. Having spoken to a member of the data entry team, it became apparent that on occasion the IDs would be typed manually rather than being loaded from the DB, and so two different IDs exist which refer to the same record. I wrote a small error checking script into the code-base which attempted to guess why fewer numbers of surveys questions were returned than there should be, resulting in the following messages in my logs:

no question for *qid_sceli86kskq5*

......

neither does untrimmed qid *qid_sceli86kskq5                                  *

......

but lower-case qid matching trimmed *qid_sceli86kskq5* does.

Here the linking table originally told us that the question ID we're looking for is *qid_sceli86kskq5                                  *. Predicting the white-space issue, I trim the ID first, as this brings back the majority - but not all - records. If this doesn't work, we try again with the white-space attached. It's not common for this to return anything either, so then I try the lower() PostgreSQL method on the ID's. This normally finds the question we want. If it doesn't - well I haven't come up with a convincing way of guessing what the next issue could be.

There is an issue here, as future questions added to the database could result in a conflict of IDs. For example "ID1234" and "id1234" are both unique, but lower() would render them as identical, so there may be issues there.

In an Ideal World the questions IDs relate directly to their respective surveys, the "sceli86" part of the question ID above denotes it is part of the "Social Change and Economic Life Survey" from 1986, so hopefully question IDs wouldn't clash, but this is not insured against in the code.

3/

In some cases, values in the DB may be missing completely. Most of the time a null entry is rendered as "N/A" during data entry, but this can also be an empty string. It is unknown whether "N/A" refers to the data being unavailable, non-applicable, removed, deleted, skipped, or missing. This is not really an issue with the database or the code-base, but it is potentially something which could come up in future.

4/

The data entry software allows the user to add columns to a new table. 6 of the columns are created automatically, and populated as such, but there is a use case where the tables will also have columns added to refer to data which is only relevant in this specific case. This surprised me, and has led to some fairly tricky issues.

As it stands, it is now necessary to use the PostgreSQL query :

Select column_name as name from information_schema.columns where table_name =<tablename>;

to find the columns which have been added. Then it is possible to query the database for the data added to these previously unknown columns. This is not an ideal solution to the problem, but again, I considered it important to retain compatibility with the previous data entry tools. Future development will allow storage of this data in a different format within the database, allowing easier searching and retrieval, though this will inevitably result in the data being stored in 2 ways (the old way, and the new format).

Saturday, 24 August 2013

On Users, Projects, Visibility and Permissions

This week, I've been thinking a lot about the way the DataPortal will be used in the future - specifically the usage by future project users.

Currently every survey stored by WISERD, or described within the WISERD metadata and stored elsewhere (such as the UK Data Archive), has it's metadata globally searchable and viewable by anyone who has registered and logged in.

This is because there is currently no concept within the DataPortal of any users being different from any other users, so all users have the same viewing permissions and download abilities.

While this is fine for any information which has been cleared as "Open" - and obviously this is the ideal situation - it is easy to conceive of situations where information could be stored within the DataPortal which should not be visible to the entire world.

I've been thinking of 3 main use-cases to try to provide a solution for.

Data Entry team members may add part of a surveys metadata, but don't want the survey to be "live" until all the data has been entered

The WISERD hub team want to moderate the data being entered into the DataPortal by external partners, and can control what text is shown "live" until it has been sanity checked.

A Surveys entered during the lifetime of a project is not due to be made public until the end of the project, but can be viewed by project members during.

To implement this, 5 new Database tables have been added, to define 3 different "concepts".

The rest of this post uses Capitalised Words to denote the concepts which now exist. Their definitions follow, as they are currently envisioned, or where implementation has begun:

There is now the idea that a user has a "Role", Users and Surveys belong to a "Project" a survey has a "Visibility".

The tables map as follows:

Project ID - Project Name

Standard, every project has an id.

User ID - Project ID

User ID taken from the regular user details table, any number of users can be matched with any number of projects.

Survey ID - Project ID

Each Survey is owned by one Project. The survey can be transferred between projects if and when visibility ownership or responsibilities shift between projects.

Visibility ID - Visibility State Name

These visibilities denote what type of users can view a survey. So far there is Global (anyone can see), Project (visible by users listed as being in the Project which owns that Survey), and it is anticipated that a "UserOnly" visibility will be added in future.

Survey ID - Visibility ID
Assigning a Survey a Visibility. A Survey has one-and-only-one visibility, so ideally there should be as many entries in this table as there are entries in the Survey table. Some defaults may be assumed if there is no visibility set: if the Survey has a Project owning it, assign it as project only visibility until set otherwise; if the Survey has no Project set, set it as global ownership, and assume it is owned by the "wiserd_global" project.

The "wiserd_global" project is a catch-all Project for unclaimed Surveys, and has all its Survey's as globally visible, on the assumption that they were added prior to this Survey-owning paradigm being set in place.

I have also coded up a GUI to go along with the data entry tools, which allows a logged in user to assign the entered Survey to a Project of which that User is a member. Users with Roles which have certain Privileges can create Surveys in the Projects they have access to, and higher level Roles allow assigning Users to such Projects. These hierarchies will be drawn out in further detail after communication with project heads and the WISERD Hub data entry team to ensure that the right number of Roles will exist, with the right permissions and functions available to them.

Obviously the aim here is for people to continue doing their job as they have done before, without adding undue layers of complexity. Hopefully, once a user has been assigned the appropriate Role and has been added to the relevant Projects, they should notice no difference to their everyday tasks. Additionally, changing a Users list of accessible Projects or Roles should not cause any adverse effects to a User, other than changing their ability to interact with a given Survey.

Roles listed here are currently not implemented, but will be designed with the following ideas in mind.They will be defined better once I have the Data Entry Tools properly adding new Surveys to completion. The Roles will be something along the lines of :

User
Any standard logged in user; can search only Global Visibility Surveys

Project User
A User assigned as a member of a Project, can view all Global Surveys, as well as View and Create abilities on Surveys within Projects they are members of.

WISERD Hub Admin
Able to View, Create and change Visibilities of Surveys and Create Projects. Project Users may tag Surveys for Global Visibility, but the Survey will first be moderated by a Hub Admin before the Global Visibility can be assigned.

SuperAdmin
Can Create, View, Modify, Moderate, change Visibility of Surveys, elevate Users to different Roles, create new Projects and assign Users to those Projects. Any and all other tasks can be completed by a SuperAdmin, and it is assumed that only a few users will be required with this Role. It is also likely that the Role will involve direct access to the database, in situations where a GUI does not exist to provide the required functionality.

Obviously any often repeated tasks by any User with any Role should be replaced by an automated task, or a GUI which aids this tasks completion.

So far, the implementation in no way affects the functionality of the DataPortal, and the new database tables and GUI additions sit on top of the existing service. In this way, backward compatibility should be assured, with the defaults labelled above as the clarification required where new functions require additional information not provided by the previous database.

I'll post again soon on how the Roles are defined, specifically how Yii's inbuilt RBAC (Role Based Access Control) abilities are used to provide permissions and access control to specific users.

Friday, 9 August 2013

On Data Entry Tools and Services

This week, having returned from my Summer break, I've been looking at the data entry tools originally used to add Survey and Qualitative Metadata to the DataPortal.

These originally (GeoPortal v. alpha) took the form of a VB.NET based web front-end for adding Qualitative metadata and a stand-alone VB application "wizard" tool for adding Survey Metadata.

As hinted at in a previous post here, I'm not a fan of VB, and as the main DataPortal code has now been moved over to more cross-platform friendly languages, I decided to convert the code from these tools too.

I have started rewriting the Survey Metadata entry tool in Java, although I am yet to decide if the eventual tool will be in this form. I'll come back to this later.

The Qualitative tool GUI was relatively trivial to move over the Javascript (ExtJS). The metadata stored is a list of most commonly used words, output from Edina Unlock, and output from OpenCalais.

Most common words are easily found when iterating through the text in PHP, we store words used more than a set threshold averaged over the number of pages.

The OpenCalais output is retrieved from the Enlighten webservice, a simple HTTP POST with an API key, the text to scan, and a small piece of XML to define search parameters. The result returns RDF, in theory...

As it turns out, the RDF returned from OpenCalais is embedded in a "String" tag, inside some XML. This means that a standard XML parser will fail, as it finds the RDF and bugs out. An RDF parser will fail due to the funky XML in the way. After several hours of frustration, I googled the correct magic words to return this API which, bizarrely, modifies the default PHP xml_parser with some new methods to clear out the junk wrapper, returns the RDF proper, and hands it off to "RAP - RDF API for PHP" (website, SourceForge), to create a PHP model of the document. This can then be iterated through to find the relevant tags we're interested in. The RAP API is actually really well designed, and clearly can do a lot more than what I require here, and it's one to keep in mind for later. I'm sure future RDF style metadata objects will be created as we go along with this project.

Edina Unlock is the next challenge. It seems that at sometime in the last 2 years, (since GeoPortal alpha was first created), the service has changed it's API, and no longer works. This will be one of the next challenges going forward, but the documentation seems sound enough.

There are two API's, "text" and "places". At first glance, it appears that "text" jobs are now submitted, remotely executed, the user polls for progress, then retrieves results once they are produced. This is wrapped in a user auth model to poll for a specific users job. It's a bit more technical for implementation, and will involve our data entry team to watch a progress bar till the results are ready, but I guess it makes sense on their end if they're receiving a ton of requests. I'm not sure yet if this service is required, as we have Edina Unlock working already, we'll see.

The "places" API seems far easier to use, with simple jsonp or HTTP GET's returning JSON formatted data.

I'll update this post with my experiences of this once I've figured it out.

As always, the DataPortal code is on GitHub, so you can check out my solutions should you desire more in depth demonstrations of how I've managed to get it all working.

Wednesday, 17 July 2013

On the DataPortal Users Knowledge Exchange Event

On Friday, June 12th 2013, WISERD hosted a Knowledge Exchange event demoing the WISERD DataPortal.

This was a chance to interact with a variety of members from academic and professional communities. We had a great turnout, with lots of positive comments and conversations.

The day comprised of a summary presentation of the current state of metadata with regards to statistical data collection and the aims of the project, followed by a demonstration of the primary functionality of the DataPortal, and ended with an opportunity for the attendees to try out the DataPortal in a computing lab while providing feedback.

Overall, personally I think the day was pretty successful, and I'm very thankful for how enthusiastic and engaged everyone who attended was with their comments and feedback. To summarise the outcomes of the sessions, I'm going to cover two aspects of what I discovered throughout the day - the user feedback, and the technical aspects of hosting the event.

User feedback:

We put a bit of thought into how the day should be organised, based on what we wanted to get out of it. From my development perspective, I wanted to get some good user testing done - this event was unique in that the majority of the users had little or no experience with the DataPortal, and so would come to it completely fresh. This was a factor worth capitalising upon, as it would influence the users experience greatly. How the DataPortal was demonstrated and introduced would change the quality of the feedback hugely.

I saw there was two options:

Give an in depth tutorial of how to use the software, and then use the following lab session as a test of how easy users found it to achieve tasks specific to their own interests. This would give us insight into what users expect in such a product, without them having to learn through intuition alone.

or:

Give a quick overview of the software, so that we could see how usable the software was to someone completely new to it. This would be more of a usability testing scenario, rather than discovering what features users required to complete tasks relevant to their areas of expertise.

In the end, the lab session ran longer than I initially expected, and we had a range of users with different technical backgrounds and experience. This meant that watching the room as a whole would give wide ranging information about user expectations (some comments were repeated from nearly everyone in the room, indicating a strong level of expectation in some areas). Meanwhile, users could be questioned individually, and could ask questions themselves, as they attempted to achieve different tasks on their own.

So with enough attention, it was possible to gain knowledge on usability at a basic level (which is essentially bug reporting when features are obviously missing, broken or misbehaving), and much higher level feature requests for future versions to consider meeting.

This was brilliant feedback, and again I thank everyone who attended. Perhaps they didn't realise how closely I was paying attention, or even hoped I was watching closer, but I believe I've collected a lot of feedback which will guide future developments.

Now, the second aspect of the day...

Technical :

Here I'm going to describe how the software framework was set up in advance of the event, and lessons learnt through its usage (and inevitable failure).

I was aiming throughout the development part of this project to keep things as cost effective as possible, which basically meant not spending any money unless I couldn't think of any possible way to avoid it. The logic of this is easy pulled apart (how many man hours should be spent saving a few £'s??). But, if I reword the problem as "how can we most efficiently meet demand here?", then maybe it's more of a reasonable challenge to attempt.

To this end, I was attempting to put the WISERD MetaData database, as well as the website webserver, on the same tiny VM in the Amazon EC2 cloud.

This was a misteak. It was obvious before we started that this would be nowhere near enough (the Amazon "free" allocation is meant for testing only, and should never have been expected to support the demands I was throwing at it). But it was with an almost morbid curiosity that I stood in the back of the room as people registered and logged into the service, waiting for it to fall over. By 5 minutes in I had almost persuaded myself that it could even work - everyone had the home page open (in Google Chrome, more to follow on that one...), and the server seemed fine. It was only when people attempted to perform searches on the database that things began to slow down... then stop.

So, predictably, you can't have 20 people doing relatively complex searches, including spatial searches, on >2GB of database, with only 680 MB of RAM, and effectively half of a CPU core.

If anything, I'm surprised at how patiently some users are willing to stare at a loading screen, when I would have become bored and angry long before. This experiment (to pretend it was intentional) also gave insight into what users expect in the way of error messages, and what they consider an error to be.

It's important for me, as the developer, to agree with everything a user says at this point. Of course, I could become defensive and point out how to gain the expected results in the correct way, but that's not useful. It's much more important for the developer to change the functionality of the system to match the usage pattern learnt when using other similar systems.

I know how to use the system, but no-one is going to be impressed by that - because I wrote it. Of course I know how to use it!

So when a search takes more than 5 minutes to complete, and then returns a blank page, I know this means the communication between the web-server and the database has taken longer than the default time-out, and so has been dropped. This means the JavaScript Ajax query returns, it continues to load the results form, but has no data to show, so the table remains empty.

This isn't what our imaginary user sees (no resemblance to real life events is intended here!!). If a progress bar takes 5 minutes, that's OK, perhaps there is a lot of data to search. Different computers take different amounts of time to achieve things, so waiting is a guessing game at best. However - returning a blank form is just wrong. This gives the indication that zero results were found, which is not the case. In fact, zero results were returned from the server, because the search failed. If it had been successful, it may have returned hundreds of results, but that isn't what happened. This behaviour is indicating untrue *information*, that the metadata does not contain anything related to the search query.

Events similar to this description did occur, and it's important to point out that this is entirely the fault of the developer, i.e., me. If something fails, the user should be told it has failed, in terminology understandable by humans if possible. Carrying on and pretending the error didn’t happen gives false information.

The fall-back plan :

A lot of the errors and unexpected behaviour were due to the performance of the server, which was expected. So after about 15 minutes I released the address of the "development server" - a complete clone of the DataPortal, running on a much more capable machine. I had built this software stack a few days before on the COMSC cloud (described in a previous post), with around 80GB of disk space, 4GB of RAM, and 2 dedicated cores. This proved to be much more stable, and we split the users between the "live" Amazon server and the "dev" server in a roughly 30-70 split. With the load reduced, and some users running both and switching to whichever seemed most responsive at the time, people were able to actually use the DataPortal and give valuable feedback into how it may actually be useful to them.

Use of the COMSC Cloud proved to be a life-saver, thanks once again to Kieran Evans for loan of these resources, and allowed the demo to continue relatively seamlessly. I was still restarting the GeoServer every few minutes, due to a bug I am yet to track down, but the underlying infrastructure was sound.

So another lesson learnt there. I'm now roughly aware of what requirements our VM should be scaled to in future. It's not worth listing it here, as this information will be out of date far too fast to be useful, but needless to say, more is better.

I've also been looking at database optimisation techniques, configs in PHP to reduce load, minimizing of JavaScript files, and a large number of other tweaks here and there which would greatly speed up a live system in future.

I'll also scale the cloud infrastructure to meet demand ahead of time, even if it's for a few hours, then scale it back again to reduce costs. This is the main selling point of cloud computing, but it's good to have a real life demonstration of its necessity, as well as a possible solution. The fact that I had predicted the problem, prepared a backup server as such a solution, and saw it work almost immediately felt pretty good too.

So again, thanks to everyone who attended. I'll be writing a blog post more accurately covering the feedback in the coming weeks.