Friday 9 August 2013

On Data Entry Tools and Services


This week, having returned from my Summer break, I've been looking at the data entry tools originally used to add Survey and Qualitative Metadata to the DataPortal.

These originally (GeoPortal v. alpha) took the form of a VB.NET based web front-end for adding Qualitative metadata and a stand-alone VB application "wizard" tool for adding Survey Metadata.

As hinted at in a previous post here, I'm not a fan of VB, and as the main DataPortal code has now been moved over to more cross-platform friendly languages, I decided to convert the code from these tools too.

I have started rewriting the Survey Metadata entry tool in Java, although I am yet to decide if the eventual tool will be in this form. I'll come back to this later.

The Qualitative tool GUI was relatively trivial to move over the Javascript (ExtJS). The metadata stored is a list of most commonly used words, output from Edina Unlock, and output from OpenCalais.

Most common words are easily found when iterating through the text in PHP, we store words used more than a set threshold averaged over the number of pages.

The OpenCalais output is retrieved from the Enlighten webservice, a simple HTTP POST with an API key, the text to scan, and a small piece of XML to define search parameters. The result returns RDF, in theory...

As it turns out, the RDF returned from OpenCalais is embedded in a "String" tag, inside some XML. This means that a standard XML parser will fail, as it finds the RDF and bugs out. An RDF parser will fail due to the funky XML in the way. After several hours of frustration, I googled the correct magic words to return this API which, bizarrely, modifies the default PHP xml_parser with some new methods to clear out the junk wrapper, returns the RDF proper, and hands it off to "RAP - RDF API for PHP" (website, SourceForge), to create a PHP model of the document. This can then be iterated through to find the relevant tags we're interested in. The RAP API is actually really well designed, and clearly can do a lot more than what I require here, and it's one to keep in mind for later. I'm sure future RDF style metadata objects will be created as we go along with this project.

Edina Unlock is the next challenge. It seems that at sometime in the last 2 years, (since GeoPortal alpha was first created), the service has changed it's API, and no longer works. This will be one of the next challenges going forward, but the documentation seems sound enough.

There are two API's, "text" and "places". At first glance, it appears that "text" jobs are now submitted, remotely executed, the user polls for progress, then retrieves results once they are produced. This is wrapped in a user auth model to poll for a specific users job. It's a bit more technical for implementation, and will involve our data entry team to watch a progress bar till the results are ready, but I guess it makes sense on their end if they're receiving a ton of requests. I'm not sure yet if this service is required, as we have Edina Unlock working already, we'll see.

The "places" API seems far easier to use, with simple jsonp or HTTP GET's returning JSON formatted data.

I'll update this post with my experiences of this once I've figured it out.

As always, the DataPortal code is on GitHub, so you can check out my solutions should you desire more in depth demonstrations of how I've managed to get it all working.




No comments:

Post a Comment