Wednesday 17 July 2013

On the DataPortal Users Knowledge Exchange Event


On Friday, June 12th 2013, WISERD hosted a Knowledge Exchange event demoing the WISERD DataPortal.

This was a chance to interact with a variety of members from academic and professional communities. We had a great turnout, with lots of positive comments and conversations.

The day began with a summary presentation of the current state of metadata with regard to statistical data collection and the aims of the project, followed by a demonstration of the DataPortal's primary functionality, and ended with an opportunity for attendees to try out the DataPortal in a computing lab while providing feedback.

Overall, I think the day was pretty successful, and I'm very thankful for how enthusiastic and engaged everyone who attended was with their comments and feedback. To summarise the outcomes of the sessions, I'm going to cover two aspects of the day - the user feedback, and the technical side of hosting the event.

User feedback:

We put a bit of thought into how the day should be organised, based on what we wanted to get out of it. From my development perspective, I wanted to get some good user testing done - this event was unique in that the majority of the users had little or no experience with the DataPortal, and so would come to it completely fresh. This was a factor worth capitalising on, as it would greatly influence the users' experience. How the DataPortal was demonstrated and introduced would change the quality of the feedback hugely.

I saw there were two options:

  • Give an in-depth tutorial of how to use the software, and then use the following lab session as a test of how easily users could achieve tasks specific to their own interests. This would give us insight into what users expect in such a product, without them having to learn through intuition alone.
or:
  • Give a quick overview of the software, so that we could see how usable it was to someone completely new to it. This would be more of a usability testing scenario, rather than discovering what features users required to complete tasks relevant to their areas of expertise.

In the end, the lab session ran longer than I initially expected, and we had a range of users with different technical backgrounds and experience. This meant that watching the room as a whole gave wide-ranging information about user expectations (some comments were repeated by nearly everyone in the room, indicating a strong level of expectation in some areas). Meanwhile, users could be questioned individually, and could ask questions themselves, as they attempted to achieve different tasks on their own.

So, with enough attention, it was possible to gather both basic usability feedback (essentially bug reports, where features are obviously missing, broken or misbehaving) and much higher-level feature requests for future versions to consider.

This was brilliant feedback, and again I thank everyone who attended. Perhaps they didn't realise how closely I was paying attention, or even hoped I was watching more closely, but I believe I've collected a lot of feedback which will guide future developments.

Now, the second aspect of the day...

Technical:

Here I'm going to describe how the software framework was set up in advance of the event, and lessons learnt through its usage (and inevitable failure).

I was aiming throughout the development part of this project to keep things as cost-effective as possible, which basically meant not spending any money unless I couldn't think of any possible way to avoid it. The logic of this is easily pulled apart (how many man-hours should be spent saving a few pounds?). But if I reword the problem as "how can we most efficiently meet demand here?", then maybe it becomes a more reasonable challenge to attempt.

To this end, I attempted to put the WISERD MetaData database, as well as the web server for the site, on the same tiny VM in the Amazon EC2 cloud.

This was a mistake. It was obvious before we started that this would be nowhere near enough (the Amazon "free" allocation is meant for testing only, and should never have been expected to support the demands I was throwing at it). But it was with an almost morbid curiosity that I stood at the back of the room as people registered and logged into the service, waiting for it to fall over. Five minutes in, I had almost persuaded myself that it might actually work - everyone had the home page open (in Google Chrome, more to follow on that one...), and the server seemed fine. It was only when people attempted to perform searches on the database that things began to slow down... then stop.

So, predictably, you can't have 20 people running relatively complex searches, including spatial searches, over a database of more than 2 GB, with only 680 MB of RAM and effectively half a CPU core.

If anything, I'm surprised at how patiently some users were willing to stare at a loading screen, when I would have become bored and angry long before. This experiment (to pretend it was intentional) also gave insight into what users expect in the way of error messages, and what they consider an error to be.

It's important for me, as the developer, to agree with everything a user says at this point. Of course, I could become defensive and point out how to get the expected results the correct way, but that's not useful. It's much more important for the developer to change the functionality of the system to match the usage patterns users have learnt from other, similar systems.

I know how to use the system, but no-one is going to be impressed by that - because I wrote it. Of course I know how to use it!

So when a search takes more than five minutes to complete, and then returns a blank page, I know this means the communication between the web server and the database has taken longer than the default time-out, and so has been dropped. The JavaScript Ajax call then returns, the page continues to load the results form, but there is no data to show, so the table remains empty.

This isn't how our imaginary user sees it (no resemblance to real-life events is intended here!). If a progress bar takes five minutes, that's OK - perhaps there is a lot of data to search. Different computers take different amounts of time to achieve things, so waiting is a guessing game at best. However, returning a blank form is just wrong. It gives the impression that zero results were found, which is not the case. In fact, zero results were returned from the server, because the search failed. If it had been successful, it might have returned hundreds of results, but that isn't what happened. This behaviour presents untrue *information*: that the metadata contains nothing related to the search query.

Events like this did occur, and it's important to point out that they are entirely the fault of the developer, i.e. me. If something fails, the user should be told it has failed, in terminology understandable by humans if possible. Carrying on and pretending the error didn't happen gives false information.
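
To make that concrete, here's a rough sketch of how the Ajax call could report a failure honestly. This isn't the DataPortal's actual code - it assumes jQuery, and the /search endpoint, the #status element and the renderResultsTable() helper are all made-up names for illustration:

    // Rough sketch only - the endpoint, element IDs and renderResultsTable()
    // are hypothetical, not the DataPortal's real names.
    function runSearch(searchTerms) {
        $("#status").text("Searching...");
        $.ajax({
            url: "/search",
            data: { query: searchTerms },
            dataType: "json",
            timeout: 30000  // give up after 30 seconds rather than hanging forever
        })
        .done(function (results) {
            // Assumes the endpoint returns a JSON array of result rows.
            if (results.length === 0) {
                // Zero rows really were found, so say so explicitly.
                $("#status").text("No metadata matched your search.");
            } else {
                $("#status").text(results.length + " results found.");
                renderResultsTable(results);  // hypothetical rendering helper
            }
        })
        .fail(function (jqXHR, textStatus) {
            // The request failed or timed out - tell the user in plain language,
            // rather than leaving an empty table that looks like "no results".
            $("#status").text("Sorry, the search failed (" + textStatus + "). Please try again.");
        });
    }

The important part is the .fail() handler: a failed search and a search that genuinely found nothing are different things, and the page should say which one happened.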

The fall-back plan:

A lot of the errors and unexpected behaviour were down to the performance of the server, which was expected. So after about 15 minutes I released the address of the "development server" - a complete clone of the DataPortal, running on a much more capable machine. I had built this software stack a few days before on the COMSC cloud (described in a previous post), with around 80 GB of disk space, 4 GB of RAM and 2 dedicated cores. This proved much more stable, and we split the users between the "live" Amazon server and the "dev" server, roughly 30:70. With the load reduced, and some users running both and switching to whichever seemed most responsive at the time, people were able to actually use the DataPortal and give valuable feedback on how it might actually be useful to them.

The COMSC cloud proved to be a life-saver (thanks once again to Kieran Evans for the loan of these resources) and allowed the demo to continue relatively seamlessly. I was still restarting GeoServer every few minutes, due to a bug I have yet to track down, but the underlying infrastructure was sound.

So, another lesson learnt. I'm now roughly aware of the specification our VM should be scaled to in future. It's not worth listing here, as the figures will be out of date far too quickly to be useful, but needless to say, more is better.

I've also been looking at database optimisation techniques, PHP configuration changes to reduce load, minification of the JavaScript files, and a large number of other tweaks here and there which would greatly speed up a live system in future.
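
As a quick example of the minification step, the JavaScript files can be squashed as part of a build script. This is just a sketch using the uglify-js 2.x Node module, with placeholder file names rather than the DataPortal's actual build:

    // Sketch of a minification step using the uglify-js 2.x Node module.
    // File names here are placeholders, not the real DataPortal scripts.
    var UglifyJS = require("uglify-js");
    var fs = require("fs");

    // Combine and minify the page scripts into a single, smaller file,
    // so there is less for the browser to download and parse on each load.
    var result = UglifyJS.minify(["dataportal.js", "search.js"]);
    fs.writeFileSync("dataportal.min.js", result.code);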

I'll also scale the cloud infrastructure to meet demand ahead of time, even if it's only for a few hours, then scale it back again to reduce costs. This is the main selling point of cloud computing, but it's good to have a real-life demonstration of its necessity, as well as a possible solution. The fact that I had predicted the problem, prepared a backup server as a solution, and saw it work almost immediately felt pretty good too.
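
For what it's worth, that kind of resize can even be scripted. Here's a rough sketch using the AWS SDK for Node.js - the instance ID, instance types and region are placeholders rather than our real deployment, and real code would want proper error handling:

    // Sketch of resizing an EC2 instance up before an event and back down
    // afterwards, using the AWS SDK for Node.js. The instance ID, types and
    // region are placeholders; credentials are assumed to be configured already.
    var AWS = require("aws-sdk");
    var ec2 = new AWS.EC2({ region: "eu-west-1" });

    function resizeInstance(instanceId, newType, callback) {
        // An instance has to be stopped before its type can be changed.
        ec2.stopInstances({ InstanceIds: [instanceId] }, function (err) {
            if (err) { return callback(err); }
            ec2.waitFor("instanceStopped", { InstanceIds: [instanceId] }, function (err) {
                if (err) { return callback(err); }
                ec2.modifyInstanceAttribute(
                    { InstanceId: instanceId, InstanceType: { Value: newType } },
                    function (err) {
                        if (err) { return callback(err); }
                        // Bring it back up on the new instance type.
                        ec2.startInstances({ InstanceIds: [instanceId] }, callback);
                    }
                );
            });
        });
    }

    // Scale up the morning of the demo (placeholder instance ID)...
    resizeInstance("i-12345678", "m3.large", function (err) {
        if (err) { console.error("Resize failed:", err); }
    });
    // ...and back down to a cheaper type afterwards with another call.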

So again, thanks to everyone who attended. I'll be writing a blog post more accurately covering the feedback in the coming weeks.