The Datasaurus: a monstrous Anscombe for the 21st century

Most people trained in the ways of data visualisation will be very familiar with Anscombe’s Quartet. For the uninitiated, it’s a set of 4 fairly simple looking X-Y scatterplots that look like this.

Anscombe's Quartet

What’s so great about those then? Well, the reason data vizzers get excited starts to become clear when you realise that the dotted grey lines I have superimposed on each small chart are in fact the mean average of X and Y in each case. And they’re basically the same for each chart.

The identikit summary stats go beyond mere averages. In fact, the variance of both X and Y (and hence the standard deviation) is also pretty much the same in every chart. As is the correlation coefficient of X and Y, and the regression line that would be the line of best fit were you to generate a linear model based on each of those 4 datasets.

The point is to show the true power of data visualisation. There are a bunch of clever-sounding summary stats (r-squared is a good one) that some nefarious statisticians might like to baffle the unaware with – but they are often so heavily summarised that they can leave you with an entirely misleading perception, especially if you are not also an adept statistician.

For example, if someone tells you that their fancy predictive model demonstrates that the relationship between x and y can be expressed as “y = 3 + 0.5x”, then you have no way of knowing whether the dataset the model was trained on was that from Anscombe 1, for which it may well be a reasonable model; Anscombe 2, for which it is not; or Anscombe 3 and 4, where outliers make that model sub-par in reality, to the point where a school child issued with a sheet of graph paper could probably make a better one.
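To see just how identical those summary stats are, here’s a minimal sketch in Python. It assumes you have seaborn and scipy installed, and uses the copy of Anscombe’s quartet that ships with seaborn’s example datasets (load_dataset fetches it from seaborn’s online data repository):

```python
# A quick check of the Anscombe summary stats, using the copy of the
# quartet that ships with seaborn's example datasets.
import seaborn as sns
from scipy.stats import linregress

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y

for name, group in anscombe.groupby("dataset"):
    fit = linregress(group["x"], group["y"])
    print(
        f"{name}: mean_x={group['x'].mean():.2f} mean_y={group['y'].mean():.2f} "
        f"var_x={group['x'].var():.2f} var_y={group['y'].var():.2f} "
        f"r={fit.rvalue:.2f} fit: y = {fit.intercept:.2f} + {fit.slope:.2f}x"
    )
```

Each of the four groups prints essentially the same means, variances, correlation and a fitted line of roughly y = 3 + 0.5x, despite the wildly different pictures.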

Yes, analytics end-users, demand pictures! OK, there are so many possible summary stats out there that someone expert in collating and mentally visualising the implications of a hand-picked collection of 30 decimal numbers could perhaps have a decent idea of the distribution of a given set of data – but, unless that’s a skill you already have (clue: if the word “kurtosis” isn’t intuitive to you, you don’t, and it’s nothing to be ashamed of), why spend years learning to mentally visualise such things when you could just go ahead and actually visualise them?

But anyway, the quartet was originally created by Mr Anscombe in 1973. Now, a few decades later, it’s time for an even more exciting scatterplot collection, courtesy of Justin Matejka and George Fitzmaurice, taken from their paper “Same Stats, Different Graphs”.

They’ve taken the time to create the Datasaurus Dozen. Here they are:

Datasaurus Dozen.png

What what? A star dataset has the same summary statistics as a bunch of lines, an X, a circle or a bunch of other patterns that look a bit like a migraine is coming on?

Yes indeed. Again, these 12 charts all have the same (well, extremely similar) X & Y means, the same X & Y standard deviations and variances, and also the same X & Y linear correlations.

12 charts are obviously more dramatic than 4, and the Datasaurus dozen certainly has a bunch of prettier shapes, but why did they call it Datasaurus? Purely click-bait? Actually no (well, maybe, but there is a valid reason as well!).

Because the 13th of the dozen (a baker’s dozen?) is the chart illustrated below. Please note that if you found Jurassic Park to be unbearably terrifying you should probably close your eyes immediately.

Datasaurus Main

Raa! And yes, this fearsome vision from the distant past also has an X mean of 54.26, a Y mean of 47.83, an X standard deviation of 16.76, a Y standard deviation of 26.93 and a correlation coefficient of -0.06, just like his twelve siblings above.

If it’s hard to believe, or you just want to play a bit, then the individual datapoints that I put into Tableau to generate the above set of charts are available in this Google sheet – or a basic interactive viz version can be found on Tableau Public here.

Full credit is due to Wikipedia for the Anscombe dataset and Autodesk Research for the Datasaurus data.

Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies any amount of the history of, or the best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These specific transformations of data-into-diagram have stuck with us through the mists of time in order to become examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.

snowmap

(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence led the authorities to remove the Broad Street pump handle; people could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about perhaps is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause ->  action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had actually already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? He had already found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but that certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: Here, then, is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by the people who lived on it. When the land was lower (relative to sea level, for example), cholera deaths seemed to increase.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera“. It included this table:

farrtable

The relationship seems quite clear: cholera deaths per 10,000 persons go up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a line chart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriological Reviews. Note the “observed mortality” line.

farrchart.gif

Based on that data, his elevation theory seems a plausible candidate, right?
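To make the shape of Farr’s descriptive relationship concrete, here’s a rough sketch of fitting the kind of inverse-elevation curve his table suggests, using scipy. To be clear, the elevation and mortality values below are illustrative placeholders for the sort of figures in the table above, not Farr’s actual data, and the functional form is just one plausible way of expressing “mortality rises as elevation falls”:

```python
# Illustrative only: fitting an inverse-elevation curve of the kind Farr's
# table suggests. The elevation/mortality pairs below are placeholder values,
# NOT Farr's actual 1852 figures.
import numpy as np
from scipy.optimize import curve_fit

elevation = np.array([10, 30, 50, 70, 90, 110, 350], dtype=float)   # feet (illustrative)
mortality = np.array([100, 55, 35, 27, 22, 17, 8], dtype=float)     # deaths per 10k (illustrative)

def farr_like(e, c, a):
    """Mortality falls off roughly in inverse proportion to elevation."""
    return c / (e + a)

(c, a), _ = curve_fit(farr_like, elevation, mortality, p0=(1000, 10))
print(f"fitted: mortality ≈ {c:.0f} / (elevation + {a:.0f})")
print("correlation between elevation and mortality:",
      np.corrcoef(elevation, mortality)[0, 1])
```

A descriptive fit like this can look very persuasive, which is exactly the trap the next few paragraphs explore.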

You might notice that the re-vizzed chart also contains a line showing the death rate as calculated according to “miasma theory”, which tracks the actual cholera death rate very closely on this metric. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later replaced by the knowledge of germs, but at the time miasma theory was a strong contender for explaining the distribution of disease. This was probably helped by the fact that some of the actions one might take to reduce “miasma” would evidently overlap with those that deal with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we now know, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed it in a reasonable way. He was testing a hypothesis based on the common sense of the time he was working in, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here, though, we can think of Karl Popper’s view of scientific knowledge being derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume the dominant one is certainly correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your own findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I were to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view on the explanation for the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, can happen in day-to-day business analysis. Preexisting knowledge changes over time, and differs between groups. Who hasn’t seen (or been) the poor analyst who revealed a deep, even dramatic, insight into business performance, only for the underlying data to turn out to have been affected by something else entirely?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to adding value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that, overnight, 100% of the public stopped buying anything at all from your previously highly successful store.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer data viz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.
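For readers who haven’t met it, here’s a minimal sketch of what fitting a logistic regression with those three kinds of variables looks like in Python, using statsmodels. To be clear, this is not Bingham et al.’s model or data – the rows below are synthetic placeholder values, and the column names are my own invention, purely to show the shape of the technique:

```python
# Minimal illustration of logistic regression with elevation, water-supply and
# crowding style predictors. The data below is synthetic placeholder data,
# not the registration-district data that Bingham et al. re-analysed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "elevation": rng.uniform(0, 350, n),   # feet above sea level
    "bad_water": rng.integers(0, 2, n),    # 1 = suspect water supply
    "crowding": rng.uniform(1, 3, n),      # persons per room
})
# Synthetic outcome: probability of a cholera death loosely tied to the predictors
logit_p = -1 + 2.0 * df["bad_water"] + 0.5 * df["crowding"] - 0.01 * df["elevation"]
df["death"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

model = smf.logit("death ~ elevation + bad_water + crowding", data=df).fit()
print(model.summary())  # coefficients show each variable's independent association
```

The point of the technique, as in the quoted conclusion, is that each variable’s association with the outcome is estimated while holding the others constant.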

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.

handle

As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely just as important to take action on the findings equally fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. That was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom of the time. At least modern-day analysts can take some solace in the knowledge that even our highest-regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened – it may well contribute towards great things in the future.

Books I read in 2016

Reading is one of the favoured hobbies in the DabblingWithData household. In 2016 my beloved fiance invited me to participate in the Goodreads Reading Challenge. It’s simple enough – you set a target and then see if you can read that many books.

The challenge does have its detractors; you can see that an obsession with it will perversely incentivise reading “Spot the Dog” over “Lord of the Rings“. But if you participate in good spirits, then you end up building a fun log of your reading which, if nothing else, gives you enough data that you’ll remember at least the titles of what you read in years hence.

I don’t quite recall where the figure came from, but I had my 2016 challenge set at 50 books. Fifty, you might say, that’s nearly one a week! Surely not possible – or so I thought. I note however that my chief competitor, following a successful year, has set this year’s target to 100, so apparently it’s very possible for some people.

Anyway, Goodreads has both a CSV export of the books you log as having read in the competition, and an API. I therefore thought I’d have a little explore of what I managed to read. Who knows, perhaps it’ll help improve my 2017 score!
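If you fancy doing something similar, the CSV export drops straight into pandas. A rough sketch below – the file name and column names are as I remember them from the Goodreads export, so do check them against your own download:

```python
# Rough sketch of slicing up a Goodreads library export.
# File and column names are from memory of the export - verify against your own CSV.
import pandas as pd

books = pd.read_csv("goodreads_library_export.csv", parse_dates=["Date Read"])
read_2016 = books[books["Date Read"].dt.year == 2016]

print("Books read:", len(read_2016))
print("Pages read:", read_2016["Number of Pages"].sum())
# Books finished per calendar month, to spot the holiday reading boost
print(read_2016.groupby(read_2016["Date Read"].dt.month)["Book Id"].count())
```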

Please click through for slightly more interactive versions of any chart, or follow this link directly. Most data is taken directly from Goodreads, with a little editing by hand.

How much did I read.png

Oh no, I missed my target 😦 Yes, fifty books proved too challenging for me in 2016 – although I got 80% of the way there, which I don’t think is too terrible. My 2017 target remains at fifty.

The cumulative chart shows a nice boost towards the end of August, which was summer holiday time for me. This has led me to conclude the following actionable step: have more holidays.

I was happy to see that I hadn’t subconsciously tried to cheat too much by reading only short books. From the nearly 14k page-equivalents I ploughed through, the single most voluminous book was Anathem. Anathem is a mix of sci-fi and philosophy, full of slightly made-up words just to slow you down further – an actual human:alien glossary is generously included in the back of the book.

The shortest was the Ladybird Book of the Meeting. This was essential reading for work purposes of course, and re-taught me eternal truths such as “Meetings are important because they give everyone a chance to talk about work. Which is easier than doing it”.

Most of my books were in the 200-400 page range – although of course different books make very different use of a “page”.

So what did I read about?

what-did-i-read

Science fiction is #1 by book volume. I have an affinity for most things that have been deemed geeky through history (and perhaps you do too, if you got this far in!), so this isn’t all that surprising.

Philosophy at #2 is a relatively new habit, at least as a concerted effort. I felt that I’d got into the habit of concentrating too much on data (heresy I know), technology and related subjects in previous years’ reading – so thought I’d broaden my horizons a bit by looking into, well, what Google tells me is merely the study of “the fundamental nature of knowledge, reality, and existence”. It’s very interesting, I promise. Although it can be pretty slow going, as every other sentence risks leaving you staring at the ceiling wondering whether the universe exists, and other such critical issues. Joking aside, the study of epistemology, reality and so on might not be a bad idea for analysty types.

Lower down we’ve got the cheap thriller and detective novels that are somewhat more relaxing, not requiring either a glossary or a headache tablet.

I was a little surprised at what a low proportion of my books were read in eBook format. For most – not all – books, I think eReaders give a much superior reading experience to ye olde paper. This, I’m aware, is a controversial minority opinion, but I’ll stick to it and point you towards a recent rant on the Hello Internet podcast to explain why.

 

So I’d have guessed an 80-90% eBook rate – but a fair number of paper books actually slipped in. Typically I suspect these are ones I borrowed, or ones that aren’t available in eBook formats. Some of Asimov’s books, of which I read a few this year, for instance are usually not available on Kindle.

On which subject, authors. Most included authors only fed my book habit once last year, although the afore-mentioned Asimov got his hooks into me. This was somewhat aided by the discovery of a cluster of his less well-known books fortuitously being available for 50p each at a charity sale. But if any readers are interested in predictive analytics and haven’t read the Foundation Trilogy, I’d fully recommend even a full price copy for an insight into what the world might have to cope with if your confusion matrix ever showed perfection in all domains.

Sam Harris was the second most read. That fits in with the philosophy theme. He’s also one of the rare people who can at times express opinions that intuitively I do not agree with at all, but does it in a way such that the train of thought that led him to his conclusions is apparent and often quite reasonable. He is, I’m aware, a controversial character on most sides of any political spectrum for one reason or another.

Back to format – I started dabbling with audio books, although at first did not get on so well with them; there’s a certain amount of concentration needed which comes easier to me when visual-reading than audio-reading. But I’m trying again this year, and it’s going better – practice makes perfect?

The “eBook/Audio” category refers to a couple of lecture series from the Great Courses, which give you a set of half-hour lectures to listen to and an accompanying book to follow along with. These are not free, but they cover a much wider range of topics than the average online MOOC seems to (plus you don’t feel bad about not doing assignments – there are none).

Lastly, the GoodReads rating. Do I read books that other people think are great choices? Well, without knowing the background distribution of ratings, and taking into account the number of reviews and from whom, it’s hard to do much except assume a relative ranking when the sample gets large enough.

It does look like my books are on the positive side of the 5-point scale, although definitely not amongst Goodreads’ most popular. Right now, that list starts with The Hunger Games, which I have read and enjoyed, but not in 2016. Looking down the global popularity list, I do see quite a few I’ve had a go at in the past, but, at first sight, almost none that I regret passing over in favour of my actual choices this year!

For the really interested readers out there, you can see the full list of my books and links to the relevant Goodreads pages on the last tab of the viz.

5 Power BI features that might make Tableau users a little jealous

New year, new blog post, new tool version to play with! It’s clear that the field of data-related stuff progresses extremely rapidly at present, and hence it behoves those of us of an analyst bent to, now and then, go explore tools that we don’t use day-to-day. We may already have our favourites in each category, but, unless we’ve done a recent review, it’s quite possible the lesser-loved packages have developed a whole new bunch of goodies since the last checkup.

With that in mind, I’ve taken a look at the latest version of Microsoft Power BI. It’s billed in this manner by its creators:

Power BI transforms your company’s data into rich visuals for you to collect and organize so you can focus on what matters to you.

It’s therefore an obvious competitor for software like Tableau, Qlikview, chart.io, and many others, and can largely replace Microsoft’s previous Power View offering, which was accessed directly via Excel. In a similar way to the Tableau suite, there’s a Power BI desktop package that analysts install locally on their computer, primarily to manipulate data and construct visuals, and a web-based Power BI service that allows for publication and distribution of the resulting file. Actually the online service is pretty powerful in terms of allowing you to create reports and dashboards via the web, and includes a few other nifty features designed to improve the usability of this software genre – so even some analysts might get a lot out of the web-based version alone.

A lot of Power BI is actually free of charge to use, although there is an enhanced “Pro” edition at around US$10 a month, replete with plenty of more enterprisey features as you can see on their comparison chart. If you’re working somewhere with an Office 365 subscription, you might find you already have access to Power BI, even if you didn’t know about it. So, there’s not much to stop you having a play with it if you’re even remotely interested.

Anyhow, this post is not a review of Power BI overall, but rather points out 5 features that stood out to me as not being present in my current dataviz software of choice, Tableau. These therefore aren’t necessarily the general “5 best features of Power BI” – both Tableau and Power BI can create a pretty line chart, so it’s not really worth pointing that out in this context. My choices should really be considered from the perspective of someone already deeply familiar with what Tableau and its competitors offer.

Also note that software packages aren’t supposed to be feature-identical; many programs aimed at solving the same sort of problems may be completely different in their philosophy of design. Adding some features necessitates a cost in terms of whether other features can be supported. This then is not a request to Tableau and competitors to copy these features. But I do vehemently think it’s useful for day-to-day data practitioners to remain aware of what software features are out there in the wild today, just in case it gives you a better option to solve a particular problem you encounter one day.

As a spoiler: for what it’s worth, my dive into Power BI hasn’t resulted in me throwing my lovely copy of Tableau away, not a chance; you can pry that from my cold dead hands etc. There’s a certain fluidity in Tableau, especially when used for adhoc analysis, that I’ve not yet encountered in its more obvious competitors, which seems very conducive to digging for insights.

But it has led me to believe that the Microsoft offering has improved substantially since the time years ago when I used to battle against v1 PowerPivot (which itself was great for some specific data manipulation activities… but eventually I got tired of the out-of-memory errors!). And, especially due to the way it’s licensed – to be blunt, far cheaper than Tableau for some configurations – it’ll remain in my mind when considering tools for future projects.

So, in no particular order, here’s some bits and pieces that piqued my curiosity:

1: Focus mode

Let’s start with a simple one. Dashboards typically contain several charts or tables that are designed to provide insight upon a given topic. Ideally the combination of content that makes up a dashboard should usually fit on a single screen, and an overall impression of “is it good or bad news?” should be available at a glance.

In designing dashboards, especially those that are useful for multiple audiences, there’s often therefore a tension between providing enough visualisations such that every user has the information they need, vs making the screen so cluttered or hard to navigate through that no user enjoys the experience of trying to decipher 1-inch square charts whatsoever.

For cases where a particular chart on a dashboard is of interest to a user, Power BI has a “focus” mode that allows the observer to zoom in and interact with that single chart on a dashboard or report on a near-fullscreen basis, without requiring any extra development work on the part of the analyst.

It’s a simple enough concept – the user just clicks a button on whichever visualisation they’re interested in, and it zooms in to fill up most of the screen until they click out of it. It keeps its original interactivity, plus displays some extra meta-information that might be useful (last refresh time etc.). But the main point is it becomes big enough to potentially help generate deeper insights for a particularly interested end user in a way that a little 1 inch square chart shoved at the bottom of a dashboard might struggle to do, even if the 1 inch version is more appropriate for the average dashboard viewer.

If that description isn’t clear, then it’s probably better seen in video form. For example:

 

2: Data driven alerts

Regular readers might have established that I’m a big fan of alerting, when it comes to trying to promote data driven decision making. I’m fairly convinced that many dashboards come with a form of “engagement decay”, where the stakeholder is initially obsessively excited with their ability to access data. But as time goes on they get quite bored of checking to see if everything’s OK – especially if everything usually is OK – and hence stop taking the time to consult a potentially valuable source of decision making.

So, for these types of busy execs, and anyone else wanting to optimise productivity, I like alerts. Just have the dashboard send some sort of notification whenever there’s actually something “interesting” to see.

Sure enough, Power BI has the capacity to alert the user upon certain KPI events, via its own web-based notification centre or, more usefully, email or phone app.

powerbialert

The implementation is pretty simple and somewhat restrictive at the moment. Alerts can only be set up on “numeric tiles featuring cards, KPIs, and gauges”, the alert triggers are basic above X or below X type affairs, and you’re restricted to being alerted once an hour or once a day. So there’s a lot of potential room for development – I’d like to see statistical triggers for instance – “alert me if something unusual happens”.
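To make the “alert me if something unusual happens” idea concrete, here’s the kind of statistical trigger I have in mind, sketched generically in Python. This is not something Power BI offers today – just an illustration of flagging a value that sits well outside its recent range; the function name and thresholds are my own:

```python
# Generic sketch of a statistical alert trigger - not a Power BI feature,
# just the sort of "tell me if something unusual happens" rule I'd like to see.
import pandas as pd

def unusual(series: pd.Series, window: int = 30, threshold: float = 3.0) -> bool:
    """Flag the latest value if it is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    history, latest = series.iloc[-(window + 1):-1], series.iloc[-1]
    if history.std() == 0:
        return False
    return abs(latest - history.mean()) / history.std() > threshold

# Example: daily sales figures, with an oddly quiet final day
sales = pd.Series([10500, 9800, 10200, 11000, 10750, 9900, 10400, 2100])
if unusual(sales, window=7):
    print("Alert: latest sales figure looks unusual - worth a look.")
```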

The good news for Tableau users is that Tableau has promised a similar feature will be coming to their software in the future (and to some extent an analyst can create similar functionality even now with the “don’t send email if view is empty” option recently added). But if you want a nice simple “send me an email whenever my sales drop below £10,000” feature that non-analytical folks can easily use, then Power BI can do that right now.

3: Custom visualisations

All mainstream dataviz products should be able to squeeze out the tried-and-tested basic varieties of visuals: line chart, bar chart, scatterplot et al. And 90%+ of the time this is enough – in fact usually the best approach for clarity. But sometimes, for better or worse, that’s not sufficient for certain use-cases. You can see this tension surfacing within the Tableau community where, despite the large number of proven chart types it can handle, there is an even larger number of blogs, reference documents and the like describing what form one has to coerce one’s data into in order to simulate more esoteric visualisation types within software that was not natively designed to produce them.

A couple of common examples in recent times would be Sankey charts or hexagonal binning. Yes, you can construct these types of viz in Tableau and other competing products – but it requires a bit of workaroundy pre-work, and entirely interrupts the naturalistic method of exploring data that these tools seek to provide. For example, an average user wishing to construct a Sankey chart in Tableau may want to search out and thoroughly read one or many of a profusion of useful posts, including those here, here, here, and here and several more places throughout the wilds of the web.
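To illustrate the gap between those workaround posts and native support, here’s what one of the examples looks like where it is supported directly: hexagonal binning is a one-liner in matplotlib (used here purely as an illustration – it’s neither Tableau nor Power BI, and the data is randomly generated):

```python
# Hexagonal binning where it is natively supported: a one-liner in matplotlib.
# Purely illustrative - this is neither Tableau nor Power BI.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x, y = rng.standard_normal(10_000), rng.standard_normal(10_000)

plt.hexbin(x, y, gridsize=30, cmap="viridis")  # the actual hex-binning step
plt.colorbar(label="points per hexagon")
plt.title("Hexagonal binning, natively supported")
plt.show()
```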

It’s very cool that these resources exist – but imagine if instead of having to rely on researching and recreating clever people’s ingenious workarounds, an expert could just provide a one-click solution to your problem. Or you could share your genius more directly with your peers.

Power BI presents an API where an advanced user can create their own visualisation types. These then integrate within Power BI’s toolbox, as though Microsoft had provided them in the base package. Hence data vizzers of all skill levels can use that type of visual without the need for any programming or mathematical workarounds. It should be noted that the procedure for creating these does require learning a superset of JavaScript called TypeScript, which would certainly not be expected of most Power BI audiences.

But this barrier is alleviated by the existence of a public gallery of these visualisations that Microsoft maintains, which allows generous developers to share their creations world-wide. A Power BI user wouldn’t have to think about the mathematical properties underlying a Sankey plot – they could just download a Sankey chart type add-in such as this one.

sankey.PNG

Now, this open access does introduce some risks of course. Thanks to Spiderman, we all know what great power comes with. And even on the public custom visuals gallery, you’ll see some entries that, well, let’s say Stephen Few might object to.

pyramid

Bonus feature: you can also display native R graphics in your Power BI dashboard, with some limitations.

4: “Pin anything to a dashboard” for non-analyst end users

To understand this one, you need to know something about the Power BI object types. Simply that a “report” is made out of a “dataset”, and a “dashboard” is usually, but not exclusively, made out of components of reports*. A dataviz expert can publish any combination of those (or even publish a mixed set of them as a content pack, which any interested users can download to use with a few clicks – another potentially nifty idea!).

(* Tableau users – you can then think of a report as a worksheet, but a worksheet that can support multiple vizzes with arbitrary placement.)

Reports are what they sound like; the electronic equivalent of a notebook with between zero and many data visualisations on each page concerning a particular topic. Note though an important limitation of being restricted to a single datasource per report. In Power BI you create reports with the simple drag and drop of charting components and configurations, after selecting the appropriate datasource. Charts stick around, in interactive form, wherever you drag them to, almost  as though you were making a Powerpoint slide. No “containers” needed, Tableau-fans 🙂

Dashboards however have a more fixed format, always appearing as a set of tiles, each with a different item in it. There’s no restriction on data sources, but there are some restrictions on functionality, such as no cross-filtering between independent tiles. A dashboard tile can be any viz from any report, a whole report itself (which can then cross-filter within the scope of that report) or some miscellaneous other stuff including “live” Excel workbooks, static images, and even answers to natural language questions you may have asked in the fancy Q&A functionality (“what were our sales last month?”).

So, what’s this about non-analysts? Well, a difference between Power BI dashboards and those from some other tools is that even people considered as being solely viz consumers can legitimately create their own dashboards. A non-analytical end-user can choose to pin any individual chart from any individual report (or the other types of items listed above) to a new dashboard and hence create a smorgasbord showing exactly the parts of each report / pre-made dashboard they are actually interested in, all on one page. After all, the individual viz consumer is by definition best placed to know what’s most important to them.

Here’s what that looks like in reality:

This is perhaps one approach to solving the problem that, often in reality, the analyst is designing a dashboard for a multi-person audience, within which each individual has slightly different needs. Each user might be interested in a different 3 of the 5 charts in your dashboard. Here, each user could choose to pin their favourite 3 to their own start-up page, or any other dashboard they have control over, together with their favourite data table from another report and most-loved Excel workbook, if they insist.

How this actually plays out in practice with novice users would be interesting to see. I think a certain type of non-analyst power user would find this pretty useful, and it’s a more realistic concept of “even non-analysts can make dashboards with no training” than a lot of these types of tools foolishly promise.

5: More powerful data manipulation tools

This one is more for advanced users. Power BI lets you manipulate the data (you might even say business-user “ETL”) before you start employing it in your visualisations. Most dashboarding tools likely let you do this to some extent – Tableau recently improved its ability to union data for instance, together with some cleaning features, and it’s had joining and blending for a while. You can also write VizQL formulae to produce calculations at the time of connecting to data.

Power BI’s query editor seems to be more powerful than many, with a couple of particular nice features.

Firstly, it uses a language called ‘M’ which is specifically designed with data mashups in mind. Once you’ve obtained your data with the query editor, you can then go on to use the DAX language (designed for data analysis, and whose CALCULATE() function holds a soft spot in my heart from previous projects) throughout Power BI to work with the data you’ve brought in.

The query editor is fully web-data enabled; even scraping data right off appropriately formatted web pages without any scripting work at all. Here’s the Microsoft team grabbing and applying a few transforms to IMDB data.

One query-editor feature I particularly like somewhat addresses a disadvantage that some of these user-friendly manipulation tools have vs scripting languages like R: reproducibility.

In Power BI, as you go through and apply countless modifications to your incoming dataset, a list of “applied steps” appears to the side of your data pane. Here’s an example from the getting started guide.

appliedsteps

It’s a chronological list of everything you’ve done to manipulate the data, and you also have the ability to go back and delete or edit the steps as you please. No more wondering “how on earth did I get the data into this format?” after an hour of fiddling around transforming data.

There are plenty of built-in options for cleaning up mucky data, including unpivoting, reordering, replacing values and a fill-down type operation that copies data down a column until it next sees a value, which handles those annoying Excel sheets where each group of rows only has its name filled in on the top row. Unioning and joining are of course very possible, and you’ll have access to a relationships diagram view, for anyone who fancies having a look at, or modifying, how tables relate to each other.
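For anyone more at home in code, those fill-down and unpivot steps have direct analogues in pandas, which I find a handy way to reason about what such a sequence of applied steps is doing (pandas here is purely a point of comparison, not something the query editor uses):

```python
# A rough pandas analogue of two common query-editor clean-up steps:
# fill-down (propagate group names into blank cells) and unpivot (wide -> long).
import pandas as pd

raw = pd.DataFrame({
    "Region": ["North", None, None, "South", None],  # name only on the top row of each group
    "Store": ["A", "B", "C", "D", "E"],
    "Jan": [100, 90, 120, 80, 95],
    "Feb": [110, 85, 130, 75, 100],
})

cleaned = (
    raw
    .assign(Region=raw["Region"].ffill())        # the "fill down" step
    .melt(id_vars=["Region", "Store"],           # the "unpivot" step
          var_name="Month", value_name="Sales")
)
print(cleaned)
```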

Analysts are not limited to connecting to existing data either. Non-DBA types can create new tables directly in Power BI and type or paste data directly into them if they wish (although I’d be wary of over-using this feature… be sure to future-proof your work!). You can also upload your standard Excel workbooks directly to the service, so that web-based Power BI can access their underlying data.

If Power BI already has the data tables you want, but they’re formatted suboptimally or are over-granular, then you can use DAX to create calculated tables, whereby you use the contents of other imported tables to build your own in-memory virtual table. This might allow you to, for instance, reduce your use of intermediate database temporary tables for some operations, perhaps performing some one-time aggregation before analysing.

Do good and bad viz choices exist?

Browsing the wonderful timeline of Twitter one evening, I noted an interesting discussion on subjects including Tableau Public, best practice, chart choices and dataviz critique. It’s perhaps too long to go into here, but this tweet from Chris Love caught my eye.

Not being particularly adept at summarising my thoughts into 140 characters, I wanted to explore some thoughts around the subject here. Overall, I would concur with the sentiment as expressed – particularly when it had to be crammed into such a small space, and taken out of context as I have here 🙂

But, to take the first premise, whilst there are probably no viz types that are inherently terrible or universally awesome, I think one can argue that there are good or bad viz choices in many situations. It might be the case in some instances that there’s no best or worst viz choice (although I think we may find that there often is, at least out of the limited selection most people are inclined to use). Here I am imagining something akin to a data-viz version of Harris’ “moral landscape”; it may not be clear what the best chart is, but there will be local maxima that are unquestionably better for purpose than some surrounding valleys.

So, how do we decide what the best, or at least a good, viz choice is? Well, it surely comes down to intention. What is the aim of the author?

This is not necessarily self-evident, although I would suggest defaulting to something like “clearly communicating an interesting insight based on an amalgamation of datapoints” as a common one. But there are others:

  • providing a mechanism to allow end-users to explore large datasets which may or may not contain insights,
  • providing propaganda to back up an argument,
  • or selling a lot of books or artwork

to name a few.

The reason we need to understand the intention is because that should be the measure of whether the viz is good or bad.

Imagine my aim is to communicate, to an audience of ten may-as-well-be-clones business managers, that 10% of my customers are so unprofitable that we would be better off without them – note that the details of the audience are very important here too.

I’ll go away and draw 2 different visualisations of the same data (perhaps a bar chart and, hey, why not, a 3-d hexmap radial chart 🙂 ). I’ll then give version 1 to five of the managers, and version 2 to the other five. Half an hour later, I’ll quiz them on what they learned. Simplistically, I shall feel satisfied that whichever version generated the correct understanding in the most managers was the better viz in this instance.
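If I did want to put a number on that comparison, even a crude one, the per-version counts of “understood it” vs “didn’t” could go through a standard significance test. A toy sketch in Python, with made-up results for the ten hypothetical managers:

```python
# Toy comparison of the two dashboard versions, using made-up results for the
# ten hypothetical managers: rows are versions, columns are understood / didn't.
from scipy.stats import fisher_exact

results = [[4, 1],   # version 1: 4 of 5 managers answered correctly
           [1, 4]]   # version 2: 1 of 5 managers answered correctly

odds_ratio, p_value = fisher_exact(results)
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.2f}")
# With only ten managers the p-value will rarely be convincing - a real study
# needs a far larger audience, which is exactly what published research uses.
```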

Yes yes, this isn’t a perfect double-blind controlled experiment, but hopefully the point is apparent. “Proper” formal research on optimising data visualisation is certainly done, and very necessary it is too. There are far too many examples to list, but classics in the field might include the paper “Graphical Perception” by Cleveland and McGill, which helped us understand which types of charts are conducive to being visually decoded accurately by us humans, given our built-in limitations.

Commercially, companies like IBM or Autodesk or Google have research departments tackling related questions. In academia, there are groups like the University of Washington Interactive Data Lab (which, interestingly enough, started out as the Stanford Visualization Group, whose work on “Polaris” was later released commercially as none other than Tableau software).

If you’re looking for ideas to contribute to on this front, Stephen Few maintains a list of some research he’d like to see done on the subject in future, and no doubt there are infinitely many more possibilities if none of those pique your curiosity.

But the point is: for certain given aims, it is often possible to use experimental procedures, and the resulting data, to say – as surely as we can say many things – that visualisation A is better than visualisation B at achieving its aim.

But let’s not go too far in expressing certainty here! There are several things to note, all contributing to the fact that very often there is not one best viz for a single dataset – context is key.

  • What is the aim of the viz? We covered that one already. Using a set of attractive colours may be more important than correctly labelled axes if you’re wanting to sell a poster, for instance. Certain types of chart make particular comparisons easier and more accurate than others do. And if you’re trying to learn or teach how to create a particular type of uber-creative chart in a certain tool, then you’re going to rather fail to accomplish that if you end up making a bar chart.
  • Who is the audience? For example, some charts can convey a lot of information in a small space; for instance box-and-whisker plots. An analyst or statistician will probably very happily receive these plots to understand and compare distributions and other descriptive stats in the right circumstances. I love them. However, extensive experience tells me that, no, the average person in the street does not. They are far less intuitive than bar or line charts to the non-analytically inclined/trained. However inefficient you might regard it, a table and 3 histograms might communicate the insight to them more successfully than a boxplot would. If they show an interest, by all means take the time to explain how to read a box plot; extol the virtues of the data-based lifestyle we all know; rejoice in being able to teach a fellow human a useful new piece of knowledge. But, in reality, your short-term job is more likely to be to communicate an important insight than to provide an A-level statistics course – and if you don’t do well at fulfilling what you’re being employed to do, then you might not be employed to do it for all that long.

As well as there being no single best viz type in a generic sense, there’s also no one universally worst viz type. If there was, the datarati would just ban it. Which, I guess, some people are inclined to do – but, sorry, pie charts still exist. And they’re still at least “locally-good” in some contexts – like this one (source: everywhere on the internet):

pie

But, hey, you don’t have the time to run multiple experiments on multiple audiences. Let’s imagine you’re also quite new to the game, with very little personal experience. How would you know which viz type to pick? Well, this is going to be a pretty boring answer, sorry – and there’s more to elaborate on later – but one way relates to the fact that, just like in any other field, there are actually “experts” in data viz. And outside of Michael Gove’s deluded rants, we should acknowledge they usually have some value.

In 1928, Bertrand Russell wrote an essay called ‘On the Value of Scepticism‘, where he laid out 3 guidelines for life in general.

 (1) that when the experts are agreed, the opposite opinion cannot be held to be certain;

(2) that when they are not agreed, no opinion can be regarded as certain by a non-expert;

and (3) that when they all hold that no sufficient grounds for a positive opinion exist, the ordinary man would do well to suspend his judgment.

So, we can bastardise these a bit to give them a dataviz context. If you’re really unsure of what viz to pick, then refer to some set of experts (to which we must acknowledge there’s subjectivity in picking… perhaps more on this in future).

If “experts” mostly think that data of type D used to convey an insight of type I to an audience of type A for purpose P is best represented in a line chart, then that’s probably the way to go if you don’t have substantial reason to believe otherwise. Russell would say that at least you can’t be held as being “certainly wrong” in your decision, even if your boss complains. Likewise, if there’s honestly no concurrence of opinion, then have a go and take your pick of the suggestions – again, no-one can claim you did something unquestionably wrong!

For example, my bias is towards feeling that, when communicating “standard” insights efficiently via charts to a literate but non-expert audience, you can’t go too far wrong in reading some of Stephen Few’s books. Harsh and austere they may seem at times, but I believe them to be based on quality research in fields such as human perception, as well as practical experience.

But that’s not to say that his well-founded, well-presented guidelines are always right. Just because 90% of the time you might be most successful in representing a certain type of time series as a line chart doesn’t mean that you always will be. Remember also, you may have a totally different aim or audience to those Mr Few aims his books at, in which case you cannot assume that the same best-practice standards would apply.

And, despite the above guidelines, because (amongst other reasons) not all possible information is ever available to us at any given time, sometimes experts are simply wrong. It turns out that the earth probably isn’t the centre of the universe, despite what you’d likely have heard from the experts of a millennium ago. You should just take care to find some decent reason to doubt the prevailing expertise, rather than simply ignoring it.

What we deem as the relative “goodness” of data viz techniques is also surely not static over time. For one, not all forms of data visualisation have existed since the dawn of mankind.

The aforementioned box and whisker plot is held to have been invented by John Tukey. He was only born in 1915, so if I were to travel back 200 years in time with my perfectly presented plot, it’s unlikely I’d find many people who would find it intuitive to interpret. Hence, if my aim was to communicate insights quickly and clearly, then on the balance of probabilities this would be a bad attempt. It may not be the worst attempt, as the concept is still valid and hence could likely be explained to some inhabitants of the time – but in terms of bang for buck, there’d no doubt be higher peaks in the “communicating data insights quickly” landscape available to me nearby.

We should also remember that time hasn’t stopped. Contrary to Francis Fukuyama’s famous essay and book, we probably haven’t reached the end of history even politically just yet, and we most certainly haven’t done so in the world of data. Given the rate of usable data creation, it might be that we’ve only dipped our toe in so far. So, what we think is best practice today will likely not be the same a hundred years hence; some of it may not be so even next year.

Some, but not all, obstacles or opportunities surround technology. Already the world has moved very quickly from graph paper, to desktop PCs, to people carrying around super-computers with small screens in their pockets. The most effective, most efficient ways to communicate data insights will differ in each case. As an example I’m very familiar with, the Tableau software application clearly acknowledged this in its last release, which includes facilities for displaying data differently depending on what device it’s being viewed on. Not that we need to throw the baby out with the bathwater, but even our hero Mr Tukey may not have had the iPhone 7 in mind when considering optimum data presentation.

Smartwatches have also appeared, albeit they are not so mainstream at the moment. How do you communicate data stories when you have literally an inch of screen to play with? Is it possible? Almost certainly so, but probably not in the same way as on a 32-inch screen; and are the personal characteristics and needs of smartwatch users the same as those of the audience who views vizzes on a larger screen anyway?

And what if Amazon (Echo), Google (Home) and others are right to think that in the future a substantial amount of our information-based interactions may be done verbally, to a box that sits on the kitchen counter and doesn’t even have a screen? What does “data visualisation” mean in this context? Is it even a thing? A lot of the questions I might want to ask my future good friend Alexa might well be questions that can only be answered by some transformation and re-presentation of data in audio form.

I can already verbally ask my phone to provide me some forms of dataviz. In the below example, it shows me a chart and a summary table. It also provides a very brief audio summary for the occasions where I can’t view the screen, shown in the bold text above the chart. But I can’t say I’ve heard a huge amount of discussion about how to optimise the audio part of the “viz” for insight. Perhaps there should be more.

image

Technology aside though, the field should not rest on its laurels; the line chart may or may not ever die, but experimentation and new ideas should always be welcomed. I’d argue that we may be able to prove, in many cases, that, today, for a given audience, for a given aim, with a given dataset, out of the various visualisations we most commonly have access to, one is demonstrably better than another, and that we can back that up via the scientific method.

But what if there’s an even better one out there we never even thought of? What if there is some form of time series that is best visualised in a pie chart? OK, it may seem pretty unlikely but, as per other fields of scientific endeavour, we shouldn’t stop people testing their hypotheses – as long as they remain ethical – or the march of progress may be severely hampered.

Plus, we might all be out of a job. If we fall into the trap of thinking the best of our knowledge today is the best of all knowledge that will ever be available, that the haphazard messy inefficiencies of creativity are a distraction from the proven-efficient execution of the task at hand, then it’ll not be too long before a lot of the typical role of a basic data analyst is swallowed up in the impending march of our robotic overlords.

Remember, a key job of a lot of data-people is really to answer important questions, not to draw charts. You do the second in order to facilitate the first; your personal approach to insight generation is often, in actuality, a means to another end.

Your customer wants to know "in what month were my sales highest?". And, lo and behold, when I open a spreadsheet in the sort of technology many people treat as the norm these days – Google Sheets – I find I can simply type or speak the question "What month were my sales highest?" and it tells me very clearly, for free, immediately, without employing anyone to do anything or waiting for someone to get back from their holiday.

capture

Yes, that feature only copes with pretty simplistic analysis at the moment, and you have to be careful how you phrase your questions – but the results are only going to get better over time, and spread into more and more products. Microsoft Power BI already has a basic natural language feature, and Tableau is, at a minimum, researching one. Just wait until this is all hooked up to the various "cognitive services" that are already on offer in some form or other. A reliable, auto-generated answer to "what will my sales be next week if I launch a new product category today?" may free up a few more people to spend time with their family, euphemistically or otherwise.
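Stripped of the natural-language wrapper, that kind of customer question is really just a one-line aggregation. Here's a minimal sketch in Python/pandas – nothing to do with how Google Sheets or Power BI actually implement their features, and the column names and figures are entirely made up – purely to show how small the gap between the question and the answer has become:

```python
import pandas as pd

# Entirely made-up sales figures, purely for illustration
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [1200, 950, 1800, 1400],
})

# "In what month were my sales highest?"
best_month = sales.groupby("month")["sales"].sum().idxmax()
print(best_month)  # -> "Mar" for this made-up data
```

The hard part, as ever, is knowing which question is worth asking in the first place – which is where the analyst still earns their keep.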

So in the name of progress we can, and should, per Chris' original tweet, be open to giving and receiving constructive criticism, whether positive or negative. There is value in this, even in the unlikely event that we have already hit on the single best, universal way of representing a particular dataset for all time.

Recall John Stuart Mill's famous essay, "On Liberty" (written in 1859 – yes, even before the boxplot existed). It's so very quotable for many parts of life, but let's take, for example, a paragraph from chapter two, regarding the "liberty of thought and discussion". Why shouldn't we ban opinions, even when we believe we know them to be bad opinions?

But the peculiar evil of silencing the expression of an opinion is, that it is robbing the human race; posterity as well as the existing generation; those who dissent from the opinion, still more than those who hold it.

If the opinion is right, they are deprived of the opportunity of exchanging error for truth: if wrong, they lose, what is almost as great a benefit, the clearer perception and livelier impression of truth, produced by its collision with error.

Are pie charts good for a specific combination of time series data, audience and aim?

Well – assuming a particularly charitable view of human discourse – after rational discussion we will either establish that yes, they actually are, in which case the naysayers can "exchange error for truth" to the benefit of our entire field.

Or, if the consensus view of "no way" holds strong then, having been tested, we will have reinforced why that is so in the minds of both the questioner and ourselves – helping us remember the good reasons we hold our opinions, and ensuring we never lapse into the depths of pseudo-religious dogma.

The Tableau #MakeoverMonday doesn’t need to be complicated

For a while, a couple of key members of the insatiably effervescent Tableau community, Andy Cotgreave and Andy Kriebel, have been running a "Makeover Monday" activity. Read more and get involved here – but a simplistic summary would be that they distribute a nicely processed dataset on a topic of the day, relating to someone else's existing visualisation, and the rest of us Tableau fans can have a go at making our own chart, dashboard or similar to share back with the community, so we can inspire and learn from each other.

It’s a great idea, and generates a whole bunch of interesting entries each week. But Andy K noticed that each Monday’s dataset was getting way more downloads than the number of charts later uploaded, and opened a discussion as to why.

There are of course many possible reasons, but one that came through strongly was that, whilst people were interested in the principle, they didn't think they had the time to produce something comparable to some of the masterpieces that frequent the submissions. That's a sentiment I wholeheartedly agree with and, in retrospect – albeit subconsciously – it's why I never gave it a go myself.

Chris Love, someone who likely interacts with far more Tableau users than most of us do, makes the same point in his post on the benefits of Keeping It Simple, Stupid. I believe it was written before the current MakeoverMonday discussion began in earnest, but it was certainly very prescient in its application to this question.

Despite this awesome community many new users I speak to are often put off sharing their work because of the high level of vizzes out there. They worry their work simply isn’t up to scratch because it doesn’t offer the same level of complexity.


To be clear, the original Makeover Monday guidelines did say that it was quite proper to just spend an hour fiddling around with the data. But firstly, after a hard day battling the dark forces of poor data quality and data-free decisions at work, it can be a struggle to keep on trucking for another hour, however fun it would be in other contexts.

And that's if you can persuade your family to let you keep tapping away for another hour doing what, from the outside, looks rather like you forgot to finish work. In fact, a lot of the worship I have for the Tableau Zens comes from how they fit what they do into their lives.

But, beyond that, an hour is not going to be enough to “compete” with the best of what you see other people doing in terms of presentation quality.

I like to think I’m quite adept with Tableau (hey, I have a qualification and everything :-)), but I doubt I could create and validate something like this beauty using an unfamiliar dataset on an unfamiliar topic in under an hour.


It’s beautiful; the authors of this and many other Monday Makeovers clearly have an immense amount of skill and vision. It is fascinating to see both the design ideas and technical implementation required to coerce Tableau into doing certain non-native things. I love seeing this stuff, and very much hope it continues.

But if one is not prepared to commit that sort of time to this activity regularly, then one has to get over the psychological difficulty of sharing a piece of work that one suspects will be thought of as "worse" than what's already there. This is through no fault of the MakeoverMonday chiefs, who make it very clear that producing a NYT infographic each week is not the aim here – but I can certainly see why it deters more of the data-downloaders from uploading their work. And it's great to see that topic being directly addressed.

After all, for those of us who use Tableau for the day-to-day joys of business, we probably don’t rush off and produce something like this wonderful piece every time some product owner comes along to ask us an “urgent” question.

Instead, we spend a few minutes making a line chart that gives them some insight into the answer to their question. We upload an interactive bar chart, with default Tableau colours and fonts, to let them explore a bit deeper, and so on. We sit in a meeting and dynamically provide an answer, enabling live decision-making that, before we had tools like this, would have had to wait a couple of weeks for a csv report. Real value is generated, and people are sometimes even impressed, despite the fact that we didn't include hand-drawn iconography, gradient-filled with the company colours.

Something like this perhaps:

Yes, it's "simple", and it's unlikely to go Tableau-viral, but it makes a key story held within that data very clear to see. And it's far more typical of the day-to-day Tableau use I see in the workplace.

For the average business question, we probably do not spend a few hours researching a beautiful colour scheme and working out the underlying maths needed to make a dashboard combining a hexmap, a Sankey chart and a network graph in a tool that is not primarily designed to do any of those things directly.

No-one doubts that you can cajole Tableau into such artistry, that there is sometimes real value in doing so, or that those who carry it out may be creative geniuses – but unless they have a day job very different from mine and my colleagues', I suspect it's not their day-to-day either. It's probably more an expression of their talent and passion for the Tableau product.

Pragmatically, if I need to make, for instance, a quick network chart for "business" then, all other things being equal, I'm afraid I'm more likely to get out a tool designed to do exactly that than to spend more time working out how to implement it in Tableau, no matter how much I love it. (By the way, Gephi is my tool of choice for that – it is nowhere near as user-friendly as Tableau, but it is specifically designed for that sort of graph visualisation; recent versions of Alteryx can also do the basics.) Honestly, it's rare for me that these more unusual charts need to be part of a standard dashboard; our organisation is simply not at a level of viz-maturity where these diagrams are the most useful option for most of the intended audience – and I suspect the same is true of many organisations.

And if you're a professional whose job is creating awesome newspaper-style infographics, then I suspect that, more often than not, you're not using Tableau to produce the final output either. That's not its key strength in my view, and that's not how they sell it – although they are justly proud of the design-thought that goes into the software in general. If the paper WSJ is your target audience, you might be better off using a more design-focused tool, like Adobe Illustrator (and Coursera will teach you that specific use-case, if you're interested).

I hope nothing here will cause offence. I do understand the excitement and admire anyone’s efforts to push the boundaries of the tool – I have done so myself, spending way more time than is strictly speaking necessary in terms of a theoretical metric of “insights generated per hour” to make something that looks cool, whether in or out of work. For a certain kind of person it’s fun, it is a nice challenge, it’s a change from a blue line on top of an orange line, and sometimes it might even produce a revelation that really does change the world in some way.

This work surely needs to be done; adherents to (a bastardised version of) Thomas Kuhn's theory of scientific revolutions might even claim this "pushing to the limits" as one of the ways of engendering the mini-crisis necessary to drive real progress forward in the field. I'm sure some of the valuable Tableau "ideas" that partly feed the development of the software have come from people pushing the envelope, finding value, and realising there should be an easier way to generate it.

There’s also the issue of engagement: depending on your aim, optimising your work for being shared worldwide may be more important to you than optimising it for efficiency, or even clarity and accuracy. This may sound like heresy, and it may even touch on ethical issues, but I suspect a survey of the most well-known visualisations outside of the data community would reveal a discontinuity with the ideals of Stephen Few et al!

But it may also be intimidating to the weary data voyager when deciding whether to participate in these sort of Tableau community activities if it seems like everyone else produces Da Vinci masterpieces on demand.

Now, I can't prove this with data right now, sorry, but I just don't think it can be the case. You may see a lot of fancy and amazing things on the internet – but that's the nature of how things get shared around; it's a key component of virality. If you create a default line chart, it may actually be the best answer to a given question, but outside the small community actively interested in the subject domain at hand, it's not necessarily going to get much notice. I mean, you could probably find someone who made a Very Good Decision based even on those ghastly Excel 2003 default charts with the horrendous grey background, if you try hard enough.

excel2003

Never forget…


So, anyway, time to put my money where my mouth is and actually participate in MakeoverMonday. I don't need to spend even an hour making something if I don't want to, right? (After all, I've used up all my time writing the above!)

Tableau is sold with an emphasis on its speed of data sense-making, with claims of producing something reasonably intelligible 10-100x faster than other tools. If we buy into that hype, then spending 10 minutes of Tableau time (necessitating making one less cup of tea, perhaps) should let me produce something that could have taken up to 17 hours in Excel (10 minutes × 100 ≈ 1,000 minutes, or roughly 17 hours).

OK, that might be taking the marketing rather too literally, but the point is hopefully clear. For #MakeoverMonday, some people may concentrate on how far they can push Tableau outside of its comfort zone, others may focus on how to integrate the latest best practice in visual design, whereas here I will concentrate on whether I can make anything intelligible in the time it takes to wait for a coffee in Starbucks (on a bad day) – the "10 minute" viz.

So here's my first "baked in just 10 minutes" viz on the latest MakeoverMonday topic – the growth of the population of Bermuda. Nothing fancy here, but hey, it's a readable chart that tells you something about the population change in Bermuda over time. Click through for the slightly interactive version – although it does, for instance, still have the nasty default tooltips, thanks to the 10 minutes running out just as I was changing the font for the chart titles…

Bermuda population growth.png
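(For the curious, the same ten-minute spirit translates to code too. Below is a minimal sketch using Python with pandas and matplotlib – not how the Tableau version above was made, and the file and column names are assumptions rather than the actual MakeoverMonday download – just to illustrate that a plain, readable line chart is a few lines of work in almost any tool.)

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- adjust to whatever the actual download contains
bermuda = pd.read_csv("bermuda_population.csv")

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(bermuda["year"], bermuda["population"])
ax.set_title("Population of Bermuda over time")
ax.set_xlabel("Year")
ax.set_ylabel("Population")
fig.tight_layout()
plt.show()
```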


The EU referendum: voting intention vs voting turnout

Next month, the UK is having a referendum on the question of whether it should remain in the European Union or leave it. All of us citizens have the opportunity to pop down to the ballot box and register our views. And in the meantime we're subjected to a fairly horrendous mishmash of "facts" and arguments as to why we should stay or go.

To get the obvious question out of the way, allow me to volunteer that I believe remaining in the EU is the better option, both conceptually and practically. So go tick the right box please! But I can certainly understand the level of confusion amongst the undecided when, to pick one example, one side says things like “The EU is a threat to the NHS” (and produces a much ridiculed video to “illustrate” it) and the other says “Only staying in Europe will protect our NHS”.

So, what’s the result to be? Well, as with any such election, the result depends on both which side each eligible citizen actually would vote for, and the likelihood of that person actually bothering to turn out and vote.

Although overall polling is quite close at the moment, different sub-groups of the population have been identified that are more positive or more negative towards the prospect of remaining in the EU. Furthermore, these groups also differ in how likely they are to say they will go out and vote (which, it must be said, is a radically different proposition to actually going out and voting – talk is cheap – but one has to start somewhere).
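To make that mechanism concrete: the quantity that ultimately decides the result is, roughly, each subgroup's size multiplied by its probability of turning out multiplied by its level of support, summed over all subgroups. Here's a minimal sketch of that arithmetic in Python, using entirely invented figures that bear no relation to the real polling:

```python
# Entirely invented subgroup figures, purely to illustrate the mechanism --
# a large remain lead in a low-turnout group can be outweighed by a
# smaller leave lead in a high-turnout group.
groups = [
    # (name,   eligible voters, turnout probability, remain share)
    ("young",        10_000_000,                0.50,         0.65),
    ("older",        10_000_000,                0.80,         0.40),
]

remain = sum(n * turnout * share for _, n, turnout, share in groups)
leave = sum(n * turnout * (1 - share) for _, n, turnout, share in groups)

# With these invented numbers: remain ~6.45m, leave ~6.55m --
# the higher-turnout group decides it despite the other group's stronger lead.
print(f"Expected remain votes: {remain:,.0f}, leave votes: {leave:,.0f}")
```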

YouGov recently published some figures that allow one to connect certain subgroups, in terms of the % of them that are in favour of remaining (or leaving, if you prefer to think of it that way around), with the rank order of how likely they are to say they'll actually go and vote. Below, I've taken the liberty of incorporating that data into a dashboard that allows exploration of the populations they segmented by, their relative likelihood to vote "remain" (invert it if you prefer "leave"), and how likely they are to turn out and vote.

Click here or on the picture below to go and play. And see below for some obvious takeaways.

Groups in favour of remaining in the EU vs referendum turnout intention

So, a few thoughts:

First we should note that the ranks on the slope chart perhaps over-emphasise differences. The scatterplot helps bring in the actual percentage of each population that might vote to remain in Europe, as opposed to the simple ranking. Although there is substantial variation, there's no mind-blowing trend relating the % who would vote remain to the turnout rank (1 = most likely to claim they will turn out to vote).

Remain support % vs turnout rank

I've highlighted the extremes on the chart above. Those most in favour of remaining are Labour supporters; those least in favour are UKIP supporters. We might note, though, that there's apparently 3% of UKIP fans who would vote to remain. This is possibly a 3% that should get around to changing party affiliation, given that UKIP was largely set up to campaign to get the UK out of Europe, and its current manifesto rants against "a political establishment that wants to keep us enslaved in the Euro project".

Those claiming to be most likely to vote are those who say they have a high interest in politics; those least likely are those who say they have a low interest. This makes perfect sense – although it should be noted that having little personal interest in politics does nothing to lessen the impact of other people's political decisions that will then be imposed upon you.

So what? Well, at a conference I went to recently, I was told that a certain US object of ridicule, Donald Trump, has made effective use of data in his campaign (or at least his staff have). To paraphrase, they apparently realised rather quickly that no amount of data science would make people who do not already like Donald Trump's senseless, dangerous, awful policies become fans of him (can you guess my feelings?). That would take more magic than even data could bring.

But they realised that they could target quite precisely where the sort of people who do already tend to like him live, and hence harangue them to get out and vote. And whether that is the reason that this malevolent joker is still in the running or not I wouldn’t like to say – but it looks like it didn’t hurt.

So, righteous Remainers, let's do likewise. Let's look for some populations that are already very favourable to remaining in the EU, and see whether they're likely to turn out unaided.

Want to remain

Well, unfortunately all of the top "in favour of remain" groups seem to be ranked lower in terms of turnout than in terms of pro-remain feeling, but one variable sticks out like a sore thumb: age. People at the lower end of the age groups, here 18-39, are both some of the most likely subsections of people to be pro-Remain and some of the least likely to say they'll go and vote. So, citizens, it is your duty to go out and accost some youngsters; drag 'em to the polling booth if necessary. It's also of interest to note that if leaving the EU is a "bad thing" then, long term, it's the younger members of society who are likely to suffer the most (assuming the decision isn't overturned any time soon).

But who do we need to nobble (sorry, educate)? Let's look at the subsections of the population that are most eager to leave the EU:

Want to leave.png

OK, some of the pro-leavers also rank quite low in terms of turnout, all good. But a couple of lines rather stand out.

One is age-based again; here the opposite end of the spectrum, 60+ year-olds, are some of the least likely to want to remain in Europe and some of the most likely to say they'll go and vote (historically, the latter has indeed been true). And, well, UKIP supporters don't like Europe pretty much by definition – but they seem worryingly likely to claim they're going to turn up and vote. Time to go on a quick affiliation-conversion mission – or at least plan a big purple-and-yellow distraction of some kind…?


There's at least one obvious critical measure missing from this analysis: the respective sizes of the subpopulations. The population of UKIP supporters, for instance, is very likely, even now, to be smaller than the number of 60+ year olds, thankfully – a fact you'd have to take into account when deciding how to have the biggest impact.

Whilst the YouGov data published did not include these volumes, they did build a fun interactive "referendum simulator" that, presumably taking this into account, lets you simulate the likely result based on your view of the likely turnout and age & class skew, using their latest polling numbers.

Unsafe abortions: visualising the “preventable pandemic”

In the past few weeks, I was appalled to read that a UK resident was given a prison sentence for the supposed "crime" of having an abortion. This happened because she lives in Northern Ireland, where having an abortion is in theory punishable by a life sentence in jail – unless the person in need happens to be rich enough to arrange an overseas appointment for the procedure, in which case it's OK.

Abortion rights have been a hugely contentious issue over time, but for those of us who reside in a wealthy country with relatively progressive laws on the matter, and the medical resources needed to perform such procedures efficiently, it’s not always easy to remember what the less fortunate may face in other jurisdictions.

In 2016, can it really still be the case that any substantial number of women face legal or logistical barriers to choosing what happens to their own body, in circumstances where the overwhelming scientific consensus is that no other being suffers? How often do abortions occur – over time, or in different parts of the world? Is there a connection between more liberal laws and abortion rates? And what are the downsides of illiberal, or medically challenged, environments? These, and more, are questions that data analysis surely has a part in answering.

I found useful data in two key places: a 2012 paper published in the Lancet, titled "Induced abortion: incidence and trends worldwide from 1995 to 2008", and various World Health Organisation publications on the subject.

It should be noted that abortion incidence data is notoriously hard to gather accurately. Obviously, medical records are not sufficient, given the existence of the illegal or self-administered procedures noted above. Nor has every woman been interviewed about this subject. Worse yet, even where they have been, abortion remains a topic subject to discomfort, prejudice, fear, exclusion, secrecy or even punishment. This occurs in some situations more than others, but the net effect is that it's the sort of question where straightforward, honest responses to basic survey questions cannot always be expected.

I would suggest reading the 2012 paper above, and its appendices, to understand more about how the figures I used were modelled by the researchers who obtained them. But the results have been peer reviewed, and they show enough variance that I believe they tell a useful, indeed vital, story about the unnecessary suffering of women.

It's time to look at the data. Please click through below and explore the story points to investigate those questions and more. And once you've done that – or if you don't have the inclination to do so – I have some more thoughts to share below.

ua

Thanks for persisting. No need to read further if you were just interested in the data or what you can do with it in Tableau. What follows is simply commentary.

This blog is ostensibly about "data", to which some attribute notions of cold objectivity; a Spock-like detachment that comes from seeing an abstract number rather than understanding events in the real world. But, in my view, most good uses of data necessarily result in the emergence of a narrative; this is a (the?) key skill of a data analyst. The stories data tells may raise emotions, positive or negative. And seeing this data did so in me.

For those that didn’t decide to click through, here is a brief summary of what I saw. It’s largely based on data about the global abortion rate, most often defined here as the number of abortions divided by the number of women aged 15-44. Much of the data is based on 2008. For further source details, please see the visualisation and its sources (primarily this one).

  • The abortion rate in 2008 is pretty similar to that in 2003, which in turn followed a significant drop from 1995. Globally it's around 28 abortions per 1,000 women aged 15-44, which equates to nearly 44 million abortions per year (a rough arithmetic check of these figures follows after this list). This is a process that affects a very large number of women, and also the network of people who love, care for or simply know them.
  • Abortions can be safe or unsafe. The World Health Organisation defines unsafe abortions as being those that consist of:

a procedure for terminating an unintended pregnancy either by individuals without the necessary skills or in an environment that does not conform to minimum medical standards, or both.

  • In reality, this translates to a large variety of sometimes disturbing methods, from ingestion of toxic substances, inappropriate use of medicines, and physical trauma to the uterus (the coathanger is the archetypal image for this, so much so that protesters against the criminalisation of abortion have used it as a symbol) – to less focussed physical damage, such as throwing oneself down stairs, or off roofs.
  • Appallingly, the proportion of abortions that were unsafe went up in 2008 compared with previous years.
  • Any medical procedure is rarely 100% safe, but a safe, legal, medically controlled abortion carries a pretty negligible chance of death. Unsafe abortions are hundreds of times more likely to be fatal to the recipient. And for those that aren't fatal, literally millions of women suffer consequences so severe that they have to seek hospital treatment afterwards – and these are the "lucky" ones for whom hospital treatment is even available. This is to say nothing of the damaging psychological effects.
  • Therefore, societies that enforce or encourage unsafe abortions should do so in the knowledge that their position is killing women.
  • Some may argue that abortion, which few people of any persuasion could think of as a happy or desirable occurrence, is encouraged where it is freely legally available. They are wrong. There is no suggestion in this data that stricter anti-abortion laws decrease the incidence of abortions.

    A WHO report concurs:

Making abortion legal, safe, and accessible does not appreciably increase demand. Instead, the principal effect is shifting previously clandestine, unsafe procedures to legal and safe ones.

  • In fact, if anything, in this data the association runs the other way. Geopolitical regions with a higher proportion of people living in areas where abortions are illegal actually, on the whole, see a higher rate of abortion. I am not suggesting here that more restrictive laws cause more abortions directly, but it is clearly not the case that making abortion illegal necessarily makes it happen less frequently.
  • But stricter laws do, more straightforwardly, lead to a higher proportion of the abortions that take place anyway being unsafe. And thus, on average, to more women dying.
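As a quick back-of-the-envelope check on the headline figures in the first bullet above (this is just arithmetic on the numbers as quoted, not new data): a rate of about 28 per 1,000 women aged 15-44 and a total of roughly 44 million abortions per year together imply around 1.57 billion women in that age band worldwide, which seems about the right ballpark for 2008 – so the two figures hang together.

```python
# Back-of-the-envelope consistency check of the quoted figures (no new data):
rate_per_1000_women = 28            # abortions per 1,000 women aged 15-44 (2008)
total_abortions_per_year = 44_000_000

implied_women_15_44 = total_abortions_per_year / (rate_per_1000_women / 1000)
print(f"Implied women aged 15-44 worldwide: {implied_women_15_44 / 1e9:.2f} billion")  # ~1.57 billion
```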

Abortion is a contentious issue and it will no doubt remain so, perhaps mostly for historic, religious or misogynistic reasons. There are nonetheless valid physical and psychological reasons why abortion is, and should be, controlled to some extent. No mainstream view treats the topic lightly or wants to see the procedure become a routine event. As the BBC notes, even ardent "pro-choice" activists generally see it as the least bad of a set of bad courses of action available in a situation that no-one wanted to occur in the first place, and surely no-one who goes through it is happy that it happened. But it does happen, it will happen, and we know how to save thousands of lives.

Seeing this data may well not change your mind if you’re someone who campaigns against legal abortion. It’s hard to shift a world-view that dramatically, especially where so-called moral arguments may be involved.

But – to paraphrase the Vizioneer, paraphrasing William Wilberforce, in his superb writeup after visualising essential data on the atrocities of modern-day human trafficking – once you see the data, you can no longer say you do not know.

The criminalisation-of-abortion lobby are often termed “pro-lifers”. To me, it now seems that that nomenclature has been seized in a twisted, inappropriate way. Once you know that the policies you campaign for will unquestionably lead to the harming and death of real, conscious, living people – then you no longer have the right to label yourself pro-life.

Stress, depression and anxiety in the workplace – what does the data show?

Stress, depression and anxiety are all disorders that can have extremely serious effects on the sufferer. The Health and Safety Executive lists quite a few, of varying severity and scope.

It’s acknowledged that in some cases these can be brought on by problems in the workplace; an issue that desperately needs addressing and resolving given the criticality of paid work in most people’s lives.

Most years, a Labour Force Survey is carried out within the UK, to gain information as to the prevalence and characteristics of people reporting suffering from these conditions in the workplace. Please click through below and explore the tabs to see what the latest edition’s data showed.

Some example questions to consider:

  • how many people have suffered stress, anxiety or depression as a result of their work in the UK?
  • are some types of people more often affected than others?
  • are certain types of jobs more prone to inducing stress than others? Are there any obvious patterns?
  • does the industry one works in make any difference?
  • how many working days are lost due to these conditions?

Capture

The persuasiveness of dataviz

The intrinsic power of the chart is highlighted nicely in a recent Harvard Business Review post.

In an experiment (*), Aner Tal et al. had a couple of groups read about a new medication that supposedly reduced the incidence of illness by 40%. This was clearly stated in the text the readers were given.

The only difference between the two groups was that one of them read a document with a very simple bar chart below the text. The chart showed exactly the same claim: that the incidence of illness went down 40% with the new medication.

Capture

When they measured it, the presence of the chart didn't seem to increase the understanding or information retention of the people viewing it compared with the other group.

However, it did make a difference to what the readers believed.

97% of those who had seen the chart believed the medication would reduce illness, vs just 68% of those who had read (and understood) the exact same information in purely text form.

This is clearly irrational, but shows the impact dataviz can have – even when it technically delivers no new information and does not make the existing information easier to understand.

But don't think rationality will save you! The researchers found that the people who agreed most strongly with "I believe in science" (a strange thing to ask, to be fair, but I assume at least somewhat positively associated with considering oneself rational) saw the largest post-chart boost in belief in the medication. They hypothesise some sort of "I believe in science -> charts mean science -> I believe whatever the chart says" halo phenomenon.

I then wondered what would happen if one popped an entirely irrelevant chart under the article. Would that work as well? Not even delving into the wonderful world of how to lie with charts, but simply presenting a chart on an entirely different subject. Perhaps one from those hilarious faux-science hair product adverts would be a good choice (85% neutrilliam! 5.32948% aquaspheres!).

It doesn't seem like the researchers tried that in the above study, but there is a similar-ish study from 2012. No charts involved, sadly, but it looked at the impact of putting a totally irrelevant mathematical equation in a document. That they called it "The Nonsense Math Effect" gives a clue as to the results.

Here they had groups of highly educated post-grads look at abstracts of two (real) journal papers, one on anthropology and one on sociology. They then had to rate their perception of the quality of the research on a scale from 1 to 100.

Both groups read the same abstracts, except one group would see the following sentence added on to the end of the abstract.

A mathematical model (TPP = T0 − fT0d2f −fTP d2f ) is developed to describe sequential effects.

This sentence was taken from a totally different paper, which concerned reaction times. There was no relationship between this sentence and the two abstracts the readers were given. In fact:

none of the original abstracts mention any sequential effects or anything that the symbols in the equation could reasonably correspond to

Can you guess what happened?

Of course, the net effect was that the group that read the abstracts with this meaningless sentence pasted on at the end rated the quality of the research significantly higher than the group that didn't (**). The research was indeed more highly regarded when a string of contextually meaningless characters that looks a bit like complicated maths was written below it.

Remember, datavizzers, with great power comes great responsibility. Be sure to never abuse your profession.

Capture


(*) It’s not listed in the article, but I believe the published article they refer to is this one, although you’ll need a subscription to the “Public Understanding of Science” journal to get to the full paper.

(**) When broken down, there was one group of readers who didn't fall into that trap: those who were experts in maths, science and technology (those who had studied medicine were perhaps not statistically significantly different). Most of the world doesn't hold post-graduate degrees in mathematics though.