Wednesday 30 October 2013

On database sanity, workarounds and cleanout

Insane in the mainframe, insane in the blame


The WISERD Database has grown in both number of records and number of tables during its development, and the individual tables have new columns and data types than in the original implementation.

This has inevitably led to a number of issues with the database becoming inconsistent in its usage and the way the data is presented.



This is a problem I've come up against time and time again as minor irritations, so this post is to describe some issues I've had to discover, debug and work around during development of new features.

Firstly, I'm aware that the easiest way to solve the problems below would be to crawl through the database and ensure that everything in there is formatted identically, references to other tables are correct, and all empty fields are properly populated. Unfortunately, this would be a truly mammoth task, and it would be difficult or impossible to ensure that everything was ever truly correct and homogeneous.

Secondly, quite often I didn't even realise a problem existed until I stumbled across an error while trying to achieve something completely different. Science is nothing without a large dose of serendipity.

In no particular order, I'll list of a few issues, the observation, the problem, the possible cause of the problem, and the solution I've come up with. Often there are also issues with the solution too.

1.

White-space in IDs.

It seems that the previous data entry tools took a survey questions ID to be the entire contents of the input dialogue box, including all the blank space after the typed characters. This means an ID in the database should be :

"qid_sceli86nsnv15"



but instead is recorded as :

"qid_sceli86nsnv15                                                                                                                                                                                                                                                                                           "



This is the id, followed by 283 spaces - invisible white-space characters. In turn, this means if any code takes this value and trims it, it'll no longer match the id in any database queries. Worse, these id's, white-space included, are used in linking tables which define which questions are in which surveys, with which responses, and associated Dublin Core data.

It was very tempting to run a script on the database which detects trailing white-space, trims it, and reinserts it into the database. I may still do this, but for now I'm leaving the database I've inherited alone, and am trying to combat this possible issue in the code using the DB. This means doing a search with the ID as seen in the "linking" table, and if that doesn't return anything, try again with the ID trimmed.

2/

Capitalisation of IDs

Similar to the above issue, the IDs in the linking table are not always identical to those in the Survey Questions or Dublin Core tables. Having spoken to a member of the data entry team, it became apparent that on occasion the IDs would be typed manually rather than being loaded from the DB, and so two different IDs exist which refer to the same record. I wrote a small error checking script into the code-base which attempted to guess why fewer numbers of surveys questions were returned than there should be, resulting in the following messages in my logs:



no question for *qid_sceli86kskq5*

......

neither does untrimmed qid *qid_sceli86kskq5                                  *

......

but lower-case qid matching trimmed *qid_sceli86kskq5* does.

Here the linking table originally told us that the question ID we're looking for is *qid_sceli86kskq5                                  *. Predicting the white-space issue, I trim the ID first, as this brings back the majority - but not all - records. If this doesn't work, we try again with the white-space attached. It's not common for this to return anything either, so then I try the lower() PostgreSQL method on the ID's. This normally finds the question we want. If it doesn't - well I haven't come up with a convincing way of guessing what the next issue could be.

There is an issue here, as future questions added to the database could result in a conflict of IDs. For example "ID1234" and "id1234" are both unique, but lower() would render them as identical, so there may be issues there.



In an Ideal World the questions IDs relate directly to their respective surveys, the "sceli86" part of the question ID above denotes it is part of the "Social Change and Economic Life Survey" from 1986, so hopefully question IDs wouldn't clash, but this is not insured against in the code.


3/


In some cases, values in the DB may be missing completely. Most of the time a null entry is rendered as "N/A" during data entry, but this can also be an empty string. It is unknown whether "N/A" refers to the data being unavailable, non-applicable, removed, deleted, skipped, or missing. This is not really an issue with the database or the code-base, but it is potentially something which could come up in future.


4/



The data entry software allows the user to add columns to a new table. 6 of the columns are created automatically, and populated as such, but there is a use case where the tables will also have columns added to refer to data which is only relevant in this specific case. This surprised me, and has led to some fairly tricky issues.




As it stands, it is now necessary to use the PostgreSQL query :

Select column_name as name from information_schema.columns where table_name =<tablename>;

to find the columns which have been added. Then it is possible to query the database for the data added to these previously unknown columns. This is not an ideal solution to the problem, but again, I considered it important to retain compatibility with the previous data entry tools. Future development will allow storage of this data in a different format within the database, allowing easier searching and retrieval, though this will inevitably result in the data being stored in 2 ways (the old way, and the new format).




No comments:

Post a Comment