Lib Dem leaflet chart fail

Coming up to the election, there’s no shortage of misleading statistics, charts and downright quantitative lies being flung around. One even made it through our letterbox today. It’s far from the worst available online, but such statistical slights always feel more personal when they get physically pushed into one’s abode.

Here goes the Liberal Democrats, being honest enough to admit that their main selling point around here is that they got more votes in our area last time than the next largest party did.

Lib Dem chart

For the avoidance of doubt – my research indicates that 28% is not usually more than twice the amount of 16% on a linear scale, so I have taken the liberty of correcting the chart proportions below for a somewhat more realistic look.

Lib Dem chart improved

Although the point that our constituency is traditionally very Conservative-with-a-big-C remains [sadly] true, the Yougov Nowcast is suggesting a very different result for place #2 at present, as shown here.

nowcast

Not that I (or Yougov) would claim that that’s a done deal – but what the Lib Dem leaflet fails to mention is that the last result does not always predict the next result.

The poor Liberal Democrats were apparently recently polling at a 25-year low, behind even the previously pretty numerically insignificant UKIP and Green parties. I think it’s safe to say that the Cleggmania-fuelled 2010 election is not the best model for the current Lib Dem performance, bless them…

Extracting SPSS variable labels and factors with Alteryx

SPSS is a nice statistics/analytics package that, since 1968 (!), seems to have been well-regarded program for classic statistics. It now has many new bits and pieces that target the predictive modelling market too.

In my experience it was previously mostly used in academia, especially the social sciences, but these days it seems it has made inroads into business and government data too.But it’s not yet common enough that we all have a copy sadly, and hence I faced a challenge of how to use a large data file in SPSS format within the other tools at our disposal.

Importing the data straight was not a big issue. Both Alteryx, and Tableau can read the files in natively which is great (this is a new feature in the latest release of Tableau – choose “Statistical file” in the “Connect” data box.)

However, they import the data as-is, and don’t seem to be able to access all the special metadata of an SPSS file all that well. This is very fair – but quite annoying if we are talking about 300,000 questionnaire responses with 400 fields somebody else designed, all helpfully encoded and code-booked as per best data practice 🙂

My first issue was that SPSS has the concept of:

  •  variable labels: e.g. a field called “QUESTION13PARTB” might have the label “What is your gender?”.
  • factors (a.k.a. value labels): meaning the answers to that question might actually be saved in the data as 0 or 1, they actually represent female” or “male”.

My goal was to undo the variable labels and value factors so that instead of:
“QUESTION13PARTB:0” I could get “What is your gender?: Female”.

Alteryx default settings can get you part of the way there very easily. When used on an SPSS (*.SAV) file the  “Input Data” tool has an option to “expand factors”. If not ticked you’ll get 0 or 1 in the above example, if ticked you get female or male. Super easy.

Input Data tool with SPSS file

But I could not find a way to extract the variable labels, i.e. translate QUESTION13PARTB into “What is your gender?”.

Alteryx Analytics Gallery to the rescue! It has a freely downloadable SPSS Input tool there which neatly slots into your Alteryx Designer software, courtesy of Neil Ryan.

Upon use you get the following options, which you can see pretty perfectly coincide with the task at hand.

SPSS input tool

The concept of SPSS Variable Labels is pretty much the same as Alteryx field descriptions, so the second tickbox there is basically recreating the original SPSS setup but in Alteryx. You will see the field descriptions in Alteryx tools like the select tool in that case – or you have the first option which literally replaces the coded field names with their SPSS descriptions if you prefer.

The gallery does warn that this tool is not super fast. I would agree, the default Alteryx input tool is much faster – but even with a recordset in the hundreds of thousands with a with of a few hundred columns I only had to remain patient for about 5 minutes before it was done.

Those without Alteryx – you can do this all in R (indeed that is apparently how this Alteryx tool actually works) – but it will quite possibly be a much less friendly experience if you’re coming to it from scratch! Here’s some docs on the read.spss command you might start from though for any r-curious.

Data dictionary functionality in Tableau Server 9

Whilst Tableau is by far the best dataviz/exploration tool I have ever had the pleasure of using, I’ve traditionally felt it’s not so strong in the boring “enterprise” areas around governance, metadata management, documentation and so on.

It’s quite clear to me why – the tool is/was aimed at data practitioners wanting to bypass all that slow, boring traditional IT process where data can’t be used until it has gone through 6 months worth of paperwork, signoff, tedious discussions and other delays that tend to render it a total waste of effort :-).

This is exactly what it needs to do for most of its target audience – and very successful it is too. But sometimes, even data-progressive organisations might like to have at least a core of “governed data” that is centrally documented, validated and deemed the official source of truth, even whilst not restricting their analysts to use only those sources.

Tableau has a great mechanism for creating and distributing datasets securely enterprise-wide – via using the Tableau Server as a datasource provider – and has the ability to create implicit metadata such as hierarchies, datatypes, groups, sets etc. However it was not immediately obvious how to create a business-user understandable data dictionary.

Our organisation has tried to get around this by creating a datasource in Excel that is essentially a list of all official datasources, fields and much more. It lists, line-by-line, things like formulae used, human-readable definitions, original system of record, who is responsible for its governance, refresh rate, which dashboards use it, and much more.

The system works, but it is a very manual process, subject to human error and will not scale well if ever there was a data onslaught. The upside though to this method is it can then be published whenever desirable in a consistent manner. We use it both in a standalone “data dictionary dashboard”, and also as a mandatory worksheet at the back of every “official” Tableau dashboard released to general business users.

Why would you want a data dictionary?

So far, the main use cases for this dictionary here have transpired to be:

  1. When someone is interacting with a dashboard they might want to know the precise definition of the measures or dimensions they are looking at. For instance, is “sales” net or gross of tax? What system did it come from? How is currency conversion handled? If a customer used a discount voucher is that included or excluded? And so on.Even the simplest of fields seems to be able to generate many questions. For this reason, we make sure every official dashboard contains the relevant parts of the data dictionary as per the above.
  2. When an analyst is wanting to create a new dashboard, they want to know if the data they need has an “official” version and if so, where can they find it? So if they wish to know number of customers, which Tableau datasource would have that in, and what are the various definition options for it?This is the main use case for the standalone data dictionary dashboard, which allows an analyst to type “sales” in and find every data source and every definition of sales that have passed some form of governance.Of course if the field they want is not available, they can still use the flexibility of Tableau to find another way to integrate it from a more adhoc source – but they should be directed to an official source if there is one, rather waste time and risk errors/inconsistencies in building their own version.
  3. When someone who maintains a central datasource is wanting to change or enhance it for some reason, they wish to know which workbooks already use the datasource and hence which might be affected or need testing.

Tableau has built in features to enable a lot of this though

Thanks to a chat with the ever-helpful Tableau support team though, it became apparent that Tableau 9 (and indeed many versions prior to this, albeit not presented quite as nicely) has several features along these lines built into it. It’s not quite enough yet to fulfill especially the second use-case above – but the many advantages of being built-in might outweigh the missing features for some users.

Below shows built in solutions to 4 key questions that a user might ask a data dictionary type system to answer.

  1. What does this field mean?
  2. What’s in this datasource?
  3. Which dashboards use this datasource?
  4. Where did the data in this datasource originate, and when was it last updated?

What does this field mean?

Initial data setup [by data publisher]

Within Tableau Desktop, every field can have a (manually entered) comment. To put this in, just right-click the name of the field, choose “Default properties” then “Comment”. In the resulting box, you can enter anything you like – so perhaps the business definition, any formula concerned, who governs it and so on.

1 2

Where will the end-user find this information?

I have seen it appear in 2 main places so far.

  1. When an analyst is designing a viz in Tableau Desktop if they hover over the field in the list or the pill on the dashboard then it is shown as a tooltip.
  2. When an analyst is designing a viz in their web browser via the online editing function3

4

This hover-over functionality is awesome for analysts to quickly get a reminder of what each field they have access to actually means.

Unfortunately it does not seem to appear anywhere in the read-only view of Tableau server, so only people with permission to edit the workbook online will be able to see these descriptions, and they will have to put the workbook into edit mode (and be careful not to mess it up!) to do so.

Wishlist for Tableau 🙂

  • Allow “Comment” to be populated by a field in the database – so one can store central definitions etc. that can flow into each datasource and remain consistent automatically if the same field is used in multiple datasources.
  • Have an option to show-on-hover the description in the tool tip when a dashboard user is using the dashboard on the server. It’s been quite rightly pointed out that it is not always useful/good to have lengthy descriptions in the default tool tip of a visualisation when it can be used for other more interesting stuff. But perhaps there could be a button under the tooltip that says “show comment” or similar that would reveal it when it is needed.

What’s in this datasource?

Initial data setup

Each datasource also needs a description so that an analyst can get an overview of the content, recency, granularity etc. at a glance. This can also be done in Tableau at the stage where one uses the “publish to server” functionality of a datasource to push it to the server as a datasource.

You will be presented with a box allowing you to describe the datasource as a whole as per the below.

5

Fill that in and hit publish and it’s done.

Where will the end-user find this information?

  1. In Tableau Desktop, when you choose to get data by connecting to a Tableau Server, it lists for you all datasources available on the server. There is a tiny “i” symbol next to the name of the datasource. Hover over that and you will see the description.
  2. On Tableau Server, if you go into the “data sources” section, click on the name of the data source concerned, and choose “details” on the top right it is shown there.

6

7

Wishlist for Tableau

  • Make this more discoverable as an end user. Have some optional function in the datasources list in desktop and server to list all of these descriptions to allow the user to chose the most appropriate datasource.
  • Perhaps make it searchable so I can see all datasources to do with “sales” [noted that this is one possible good use of tags]

Which dashboards use this datasource?

Initial data setup

There’s nothing to do! Tableau Server keeps track of this and shows it to you very nicely. Very useful if you are needing to change a datasource and want to know which workbooks are likely to be affected.

Where will the end-user find this information?

In Tableau server, if you go to the data sources section and click on the name of the datasource you’re interested in. Make sure “connected workbooks” section is selected and it will show you all the workbooks that use this datasource.

8


Where did the data in this datasource originate and when was it last updated?

Initial data setup

Again, in general there’s nothing to do! Tableau Server will keep track of this and show it to any user entitled to view the datasource section.

Where will the end-user find this information?

Go to the data sources section in Tableau Server. Here it will show you each datasource that was published including:

  • its name (as chosen by the publisher)
  • the original type of database it came from (e.g. Oracle, SQL server, Excel)
  • which database server/file it came from  – the case of file based systems like Excel it records the file name.
  • Whether’s it’s live or an extract, and if an extract, when the extract last ran.
  • Which project it’s associated with
  • Who owns it (“owns” in this sense just means who published it to the server)
  • When it was last modified

Data sources screen

A small subset of this information is also available in Tableau Desktop in the Connect to data -> Tableau Server screen where one chooses which source to use. In there you see the name, the owner, the project and last modified date only.

Wishlist for Tableau

  • Not a biggie – but would be nice if (optionally?) the full set of info shown in Tableau Server was presented in Tableau Desktop when connecting to such a datasource

Summary

Although a lot of this existed in version 8, the version 9 server interface makes it ever easier to see and understand your data sources. I would still suggest that Tableau is not super-fully featured when it comes to data dictionary type governance and – whilst understanding this is not at all their primary focus – I hope they go even further with these efforts in future.

But, especially if you have a low number of governed datasources, or they have little duplication between them, it may well be worth your while using the above features to create some inbuilt documentation and validation of appropriate data source. This will make your analysts’ lives safer and easier all being well. It’s also a much nicer approach than if you have to go to the effort of trying to design some more comprehensive, but painfully manual, system.

There are of course other products e.g. that sit on ones’ intranet that are specifically designed for this sort of governance (if you have the money to spend). However, they’re unlikely to enable the sort of hover-over tooltip definitions during the viz design process that is so conducive to the analytical flow that the inbuilt features of Tableau would.

“Move datasource” arrives in Tableau Server v9

Oh happy days – just noticed the new Tableau Server / Online version 9 now allows one to move a datasource!

Previously it was very easy to move a workbook from one project to another, but that wasn’t possible for a datasource. Instead one had to delete the original datasource and republish it seperately to a new project. However this broke any workbooks that used it (until they were updated and republished with the new location) so was not very fun.

However, now there’s a lovely new “move” button on the datasources page. This will save much frustration!

Capture

One use-case for this typically surrounds testing. Test datasources can be put into a suitable test folder, used for a while by those who want to. Then when confident that they work, they can be promoted to a more accessible and governed folder.

Every death in the Game of Thrones – a visualisation

TBronn is the #1 killerhe Washington Post published a nice visualisation concerning the many, many deaths in Game of Thrones yesterday – apparently there have been 456 such violent extravaganzas.

Coded by season, allegiance, importance of character, method of death and other such metadata it gives a nice refresh of the important parts of the storyline. Find out which location was deadliest, which character has the most kills and other such fascinating and vital facts.

One has to love the understated nature of the associated data. They record the death of Oberyn as being “Method category: Hands” which, whilst undoubtedly accurate, does not entirely set the scene as to the horror-fest that is more elucidated by Time magazine’s description of it as “his head popped like a grape”.

It certainly made me pull a face not dissimilar to the expression of the unfortunate bystander below.

Reaction to Oberyn's death

Of course the scene is on Youtube if you really must re-view.

Free data: Constituency Explorer – UK demographics, politics, behaviour

From some combination of the Office of National Statistics, the House of Commons and Durham library comes Constituency Explorer.

Constituency Explorer

Billing itself as “reliable evidence for politicians and journalists – data for everyone”, it allows interactive visualisation of many interesting demographics/behavioural/political attributes by UK political constituency. It’s easy to view distributions and compare between a specific contstituency, the region and the country on topics like

  • 2010 election results (turnout and results)
  • vehicle ownership
  • age
  • ethnicity
  • travel to work
  • household composition
  • qualifications
  • etc. etc.

Each chart has also a “download this data” link at the bottom left, which I would assume should give you a nice integratable spreadsheet/xml/something – but at the time of writing unfortunately one gets a “not found” error…

There’s also a fun “how well do you know your constituency” quiz which is nice for comparing one’s media-fueled perception of a given area to reality.