Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies the history of, or best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These particular transformations of data-into-diagram have endured through the mists of time to become the examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.

[Image: John Snow’s cholera map]

(High-resolution versions available here).

Adding the geospatial element to the visualisation revealed geographic clusters, providing evidence to suggest that use of a specific local drinking-water source – the now-famous Broad Street public well – was the key common factor among sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence persuaded the authorities to remove the Broad Street pump handle; people could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears rather less about is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause -> action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? – and had found what appeared to be an answer. It later turned out that his conclusion wasn’t correct, but that certainly wasn’t obvious at the time. In fact, his explanation likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: here, then, is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by the people who lived on it: the lower the land lay (relative to sea level, for example), the more cholera deaths there seemed to be.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera”. It included this table:

[Image: Farr’s table of cholera mortality by elevation]

The relationship seems quite clear: cholera deaths per 10,000 persons go up dramatically as the elevation of the land goes down.
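(If you wanted to play with the shape of that relationship yourself, here’s a minimal Python sketch. The figures are hypothetical stand-ins in the spirit of Farr’s table, not his actual published numbers, and the curve is just one simple inverse-elevation form consistent with his claim.)

```python
# Hypothetical figures in the spirit of Farr's table, NOT his published
# numbers: cholera deaths per 10,000 at various elevations.
import numpy as np
from scipy.optimize import curve_fit

elevation_ft   = np.array([10, 30, 50, 70, 90, 110, 350])  # band midpoints
deaths_per_10k = np.array([100, 65, 34, 27, 22, 17, 8])

# One simple inverse-elevation form consistent with Farr's claim:
# mortality falls off as k / (elevation + a)
def elevation_law(e, k, a):
    return k / (e + a)

(k, a), _ = curve_fit(elevation_law, elevation_ft, deaths_per_10k, p0=(2000, 10))
for e, observed in zip(elevation_ft, deaths_per_10k):
    print(f"{e:>4} ft: observed {observed:>3}, fitted {elevation_law(e, k, a):.0f}")
```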

Here’s the same data, this time visualised as a line chart, from a 1961 keynote address on “the epidemiology of airborne infection” published in Bacteriological Reviews. Note the “observed mortality” line.

[Image: line chart of cholera mortality by elevation – observed vs calculated under miasma theory – from the 1961 address]

Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line showing the death rate calculated according to “miasma theory”, which tracks the actual cholera death rate on this metric remarkably closely. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later displaced by germ theory, but at the time miasma was a strong contender for explaining the distribution of disease. Its persistence was probably helped by the fact that some of the actions one might take to reduce “miasma” would evidently overlap with those that deal with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we now know, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer for addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed it in a reasonable way. He was testing a hypothesis based on the accepted wisdom of the time he was working in, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here, though, we can think of Karl Popper’s view that scientific knowledge is derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume the dominant one is correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your own findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I were to ask a present-day analyst (unfamiliar with this case) to take a look at Farr’s data and provide a view as to what explains the differences in cholera death rates, it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even using precisely the same analytical approach, they would suggest miasma theory as the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, occur in day-to-day business analysis. Pre-existing knowledge changes over time, and differs between groups. Who hasn’t seen (or had the misfortune of being) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data that was later revealed to have been affected by something else entirely?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to adding value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that overnight 100% of the public stopped buying anything at all from your previously highly successful store.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer dataviz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.
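For the curious, here’s a minimal sketch of that style of analysis in Python’s statsmodels, with invented district-level numbers; the variables and figures are stand-ins of mine, not Bingham et al.’s actual data or model specification:

```python
# A minimal sketch in the spirit of the re-analysis described, using invented
# district-level numbers; Bingham et al.'s actual data and variables differ.
import pandas as pd
import statsmodels.api as sm

districts = pd.DataFrame({
    "deaths":       [120, 85, 40, 15, 8],                 # hypothetical cholera deaths
    "population":   [20000, 22000, 18000, 21000, 19000],  # hypothetical populations
    "elevation_ft": [5, 20, 50, 90, 150],                 # hypothetical elevations
    "poor_water":   [1, 1, 0, 0, 0],                      # hypothetical water-supply flag
})
districts["survivors"] = districts["population"] - districts["deaths"]

# Binomial GLM with a logit link: models each resident's chance of dying
# as a function of the district's elevation and water supply.
X = sm.add_constant(districts[["elevation_ft", "poor_water"]])
fit = sm.GLM(districts[["deaths", "survivors"]], X,
             family=sm.families.Binomial()).fit()
print(fit.summary())
```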

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, including a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854”, published in 1867. The chart shows, quite vividly, that by the date the pump handle was removed, the local cholera epidemic it drove was likely largely over.

[Image: daily cholera deaths around the removal of the Broad Street pump handle]
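(If you fancied sketching this kind of epidemic-curve chart yourself, a few lines of Python suffice. The daily counts below are invented placeholders rather than Whitehead’s actual table; the shape – a sharp peak followed by a long decline before the handle came off – is the point.)

```python
# Invented daily death counts standing in for Whitehead's table; the shape
# (sharp peak, long decline) is the point, not the exact numbers.
import matplotlib.pyplot as plt

days   = list(range(1, 15))   # days from the start of the outbreak (hypothetical)
deaths = [3, 70, 127, 76, 71, 45, 37, 32, 30, 24, 18, 15, 6, 5]

fig, ax = plt.subplots()
ax.plot(days, deaths, marker="o")
ax.axvline(x=11, linestyle="--",
           label="pump handle removed (hypothetical day 11)")
ax.set_xlabel("day of outbreak")
ax.set_ylabel("cholera deaths")
ax.legend()
plt.show()
```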

As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse something urgently, then it’s likely just as important to take action on the findings equally fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult here, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom of the time. At least modern-day analysts can take some solace in the knowledge that even our most highly regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to diminish Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened – it may well contribute towards great things in the future.

The Tableau #MakeoverMonday doesn’t need to be complicated

For a while, a couple of key members of the insatiably effervescent Tableau community, Andy Cotgreave and Andy Kriebel, have been running a “Makeover Monday” activity. Read more and get involved here – but a simplistic summary would be that they distribute a nicely processed dataset on a topic of the day that relates to someone else’s existing visualisation, and all the rest of us Tableau fans can have a go at making our own chart, dashboard or similar to share back with the community so we can inspire and learn from each other.

It’s a great idea, and generates a whole bunch of interesting entries each week. But Andy K noticed that each Monday’s dataset was getting way more downloads than the number of charts later uploaded, and opened a discussion as to why.

There are of course many possible reasons, but one that came through strongly was that, whilst people were interested in the principle, they didn’t think they had the time to produce something comparable to some of the masterpieces that frequent the submissions. That’s a sentiment I wholeheartedly agree with, and, in retrospect – albeit subconsciously – it’s why I never gave it a go myself.

Chris Love, someone who likely interacts with far more Tableau users than most of us do, makes the same point in his post on the benefits of Keeping It Simple, Stupid. I believe it was written before the current MakeoverMonday discussions began in earnest, but it was certainly prescient in its application to this question.

Despite this awesome community many new users I speak to are often put off sharing their work because of the high level of vizzes out there. They worry their work simply isn’t up to scratch because it doesn’t offer the same level of complexity.


To be clear, the original Makeover Monday guidelines did make it plain that it was quite proper to just spend an hour fiddling around with the data. But firstly, after a hard day battling against the dark forces of poor data quality and data-free decisions at work, it can be a struggle to keep on trucking for another hour, however fun it would be in other contexts.

And that’s if you can persuade your family that they should let you keep tapping away for another hour doing what, from the outside, looks kind of like you forgot to finish work. In fact, a lot of the worship I have for the Zen Masters relates to how they fit what they do into their lives.

But, beyond that, an hour is not going to be enough to “compete” with the best of what you see other people doing in terms of presentation quality.

I like to think I’m quite adept with Tableau (hey, I have a qualification and everything :-)), but I doubt I could create and validate something like this beauty using an unfamiliar dataset on an unfamiliar topic in under an hour.


It’s beautiful; the authors of this and many other Monday Makeovers clearly have an immense amount of skill and vision. It is fascinating to see both the design ideas and technical implementation required to coerce Tableau into doing certain non-native things. I love seeing this stuff, and very much hope it continues.

But if one is not prepared to commit the sort of time needed to do that regularly, then one has to get over the psychological difficulty of sharing a piece of work which one perceives is likely to be thought of as “worse” than what’s already there. This is through no fault of the MakeoverMonday chiefs, who make it very clear that producing a NYT-quality infographic each week is not the aim here – but I can certainly see why it deters more of the data-downloaders from uploading their work. And it’s great to see that topic being directly addressed.

After all, for those of us who use Tableau for the day-to-day joys of business, we probably don’t rush off and produce something like this wonderful piece every time some product owner comes along to ask us an “urgent” question.

Instead, we spend a few minutes making a line chart that gives them some insight into the answer to their question. We upload an interactive bar chart, with default Tableau colours and fonts, to let them explore a bit deeper, and so on. We sit in a meeting and dynamically provide an answer to enable live decision-making that, before we had tools like this, would have had to wait a couple of weeks for a csv report. Real value is generated, and people are sometimes even impressed, despite the fact that we didn’t include hand-drawn iconography, gradient-filled with the company colours.

Something like this perhaps:

Yes, it’s “simple”, and it’s unlikely to go Tableau-viral, but it makes a key story held within that data very clear to see. And it’s far more typical of the day-to-day Tableau use I see in the workplace.

For the average business question, we probably do not spend a few hours researching and designing a beautiful colour scheme, or performing the underlying maths needed to make a dashboard combining a hexmap, a Sankey chart and a network graph in a tool that is not primarily designed to do any of those things directly.

No-one doubts that you can cajole Tableau into such artistry, that there is sometimes real value obtainable by doing so, or that those who carry it out may be creative geniuses – but unless they have a day job very different from mine and my colleagues’, I suspect it’s not their day-to-day either. It’s probably more an expression of their talent and passion for the Tableau product.

Pragmatically, if I need to make, for instance, a quick network chart for “business” then, all other things being equal, I’m afraid I’m more likely to get out a tool that’s designed to do that, rather than take extra time to work out how to implement it in Tableau, no matter how much I love it (by the way, Gephi is my tool of choice for that – it is nowhere near as user-friendly as Tableau, but it is specifically designed for that sort of graph visualisation; recent versions of Alteryx can also do the basics). Honestly, it’s rare for me that these more unusual charts need to be part of a standard dashboard; our organisation is simply not at a level of viz-maturity where these diagrams are the most useful for most people in the intended audience, if indeed they are for many organisations.

And if you’re a professional whose job is creating awesome newspaper-style infographics, then I suspect that, more often than not, you’re not using Tableau as the tool that provides the final output either. That’s not its key strength in my view; that’s not how they sell it – although they are justly proud of the design-thought that does go into the software in general. But if the paper WSJ is your target audience, you might be better off using a more custom design-focused tool, like Adobe Illustrator (and Coursera will teach you that specific use-case, if you’re interested).

I hope nothing here will cause offence. I do understand the excitement, and admire anyone’s efforts to push the boundaries of the tool – I have done so myself, spending way more time than is strictly necessary in terms of a theoretical metric of “insights generated per hour” to make something that looks cool, whether in or out of work. For a certain kind of person it’s fun, it’s a nice challenge, it’s a change from a blue line on top of an orange line, and sometimes it might even produce a revelation that really does change the world in some way.

This work surely needs to be done; adherents to (a bastardised version of) Thomas Kuhn’s theory of scientific revolutions might even claim this “pushing to the limits” as one of the ways of engendering the mini-crisis necessary to drive forward real progress in the field. I’m sure some of the valuable Tableau “ideas”, that feed the development of the software in part, have come from people pushing the envelope, finding value, and realising there should be an easier way to generate it.

There’s also the issue of engagement: depending on your aim, optimising your work for being shared worldwide may be more important to you than optimising it for efficiency, or even clarity and accuracy. This may sound like heresy, and it may even touch on ethical issues, but I suspect a survey of the most well-known visualisations outside of the data community would reveal a discontinuity with the ideals of Stephen Few et al!

But it may also be intimidating to the weary data voyager when deciding whether to participate in these sorts of Tableau community activities if it seems like everyone else produces Da Vinci masterpieces on demand.

Now, I can’t prove this with data right now, sorry, but I just think it cannot be the case. You may see a lot of fancy and amazing things on the internet – but that’s the nature of how stuff gets shared around; it’s a key component of virality. If you create a default line chart, it may actually be the best answer to a given question, but outside a small community that is actively interested in the subject domain at hand, it’s not necessarily going to get much notice. I mean, you could probably find someone who made a Very Good Decision based even on those ghastly Excel 2003 default charts with the horrendous grey background, if you try hard enough.

[Image: a default Excel 2003 chart, grey background and all]

Never forget…


So, anyway, time to put my money where my mouth is and actually participate in MakeoverMonday. I don’t need to spend even an hour making something if I don’t want to, right? (After all, I’ve used up all my time writing the above!)

Tableau is sold with emphasis on its speed of data sense-making, claiming to enable producing something reasonably intelligible 10-100x faster than other tools. If we buy into that hype, then spending 10 minutes of Tableau time (necessitating making one less cup of tea, perhaps) should let me produce something that could have taken up to 17 hours in Excel (10 minutes × 100 ≈ 17 hours).

OK, that might be pushing the marketing rather too literally, but the point is hopefully clear. For #MakeoverMonday, some people may concentrate on how far they can push Tableau outside of its comfort zone, others may focus on how they can integrate the latest best practice in visual design, whereas here I will concentrate on whether I can make anything intelligible in the time it takes to wait for a coffee in Starbucks (on a bad day) – the “10 minute” viz.

So here’s my first “baked in just 10 minutes” viz, on the latest MakeoverMonday topic – the growth of the population of Bermuda. Nothing fancy, and time ran out just as I was changing the fonts, but hey, it’s a readable chart that tells you something about the population change in Bermuda over time. Click through for the slightly interactive version – though it does, for instance, retain the nasty default tooltips, thanks to the 10 minutes running out just as I was changing the font for the chart titles…

[Image: Bermuda population growth viz]


The EU referendum: voting intention vs voting turnout

Next month, the UK is having a referendum on the question of whether it should remain in the European Union or leave it. All of us citizens are having the opportunity to pop down to the ballot box to register our views. And in the meantime, we’re subjected to a fairly horrendous mishmash of “facts” and arguments as to why we should stay or go.

To get the obvious question out of the way, allow me to volunteer that I believe remaining in the EU is the better option, both conceptually and practically. So go tick the right box please! But I can certainly understand the level of confusion amongst the undecided when, to pick one example, one side says things like “The EU is a threat to the NHS” (and produces a much-ridiculed video to “illustrate” it) and the other says “Only staying in Europe will protect our NHS”.

So, what’s the result to be? Well, as with any such election, the result depends on both which side each eligible citizen would actually vote for, and the likelihood of that person actually bothering to turn out and vote.

Although overall polling is quite close at the moment, different sub-groups of the population have been identified that are more positive or more negative towards the prospect of remaining in the EU. Furthermore, these groups differ in how likely they are to say they will go out and vote (which, it must be said, is a radically different proposition to actually going out and voting – talk is cheap – but one has to start somewhere).

YouGov recently published some figures they collected that allow one to connect certain subgroups in terms of the % of them that are in favour of remaining (or leaving, if you prefer to think of it that way around) with the rank order of how likely they are to say they’ll actually go and vote. Below, I’ve taken the liberty of incorporating that data into a dashboard that allows exploration of the populations they segmented by, their relative likelihood to vote “remain” (invert it if you prefer “leave”), and how likely they are to turn out and vote.

Click here or on the picture below to go and play. And see below for some obvious takeaways.

[Dashboard: Groups in favour of remaining in the EU vs referendum turnout intention]

So, a few thoughts:

First, we should note that the ranks on the slope chart perhaps over-emphasise differences. The scatterplot helps by showing the actual percentage of each population that might vote to remain in Europe, as opposed to the simple ranking. Although there is substantial variation, there’s no mind-blowing trend between the % who would vote remain and the turnout rank (1 = most likely to claim they will turn out to vote).

[Chart: Remain support % vs turnout rank]

I’ve highlighted the extremes on the chart above. Those most in favour of remaining are Labour supporters; those least in favour are UKIP supporters – although we might note that there’s apparently 3% of UKIP fans who would vote to remain. This is possibly a 3% that should get around to changing party affiliation, given that UKIP was largely set up to campaign to get the UK out of Europe, and its current manifesto rants against “a political establishment that wants to keep us enslaved in the Euro project”.

Those claiming to be most likely to vote are those who say they have a high interest in politics; those least likely are those who say they have a low interest. This makes perfect sense – although it should be noted that one’s personal interest in politics does nothing to lessen the impact of other people’s political decisions, which will be imposed upon you regardless.
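(As an aside, this kind of %-vs-rank scatterplot is also easy to mock up outside of Tableau. Here’s a minimal Python sketch, with invented subgroup values standing in for YouGov’s actual figures:)

```python
# Invented example values standing in for the YouGov figures, purely to show
# the chart construction.
import matplotlib.pyplot as plt

subgroups    = ["Labour", "UKIP", "18-24", "60+", "High interest", "Low interest"]
remain_pct   = [75, 3, 70, 40, 55, 48]   # hypothetical % in favour of remaining
turnout_rank = [6, 3, 9, 2, 1, 10]       # hypothetical rank; 1 = most likely to vote

fig, ax = plt.subplots()
ax.scatter(turnout_rank, remain_pct)
for name, x, y in zip(subgroups, turnout_rank, remain_pct):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 3))
ax.set_xlabel("turnout rank (1 = most likely to say they'll vote)")
ax.set_ylabel("% in favour of remaining")
plt.show()
```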

So what? Well, at a conference I attended recently, I was told that a certain US objet d’ridicule, Donald Trump, has made effective use of data in his campaign (or at least his staff did). To paraphrase, they apparently realised rather quickly that no amount of data science would make people who do not already like Donald Trump’s senseless, dangerous, awful policies become fans of him (can you guess my feelings?). That would take more magic than even data could bring.

But they realised that they could target quite precisely where the sort of people who do already tend to like him live, and hence harangue them to get out and vote. And whether that is the reason this malevolent joker is still in the running or not, I wouldn’t like to say – but it looks like it didn’t hurt.

So, righteous Remainers, let’s do likewise. Let’s look for some populations that are already very favourable to remaining in the EU, and see whether they’re likely to turn out unaided.

[Chart: Want to remain]

Well, unfortunately all of the top “in favour of remaining” groups seem to be ranked lower in terms of turnout than in terms of pro-remain feeling, but one variable sticks out like a sore thumb: age. People at the lower end of the age groups, here 18-39, are both some of the most likely subsections of the population to be pro-Remain and some of the least likely to say they’ll go and vote. So, citizens, it is your duty to go out and accost some youngsters; drag ’em to the polling booth if necessary. It’s also of interest to note that if leaving the EU is a “bad thing” then, long term, it’s the younger members of society who are likely to suffer the most (assuming it’s not overturned any time soon).

But who do we need to nobble – sorry, educate? Let’s look at the subsections of the population that are most eager to leave the EU:

[Chart: Want to leave]

OK, some of the pro-leavers also rank quite low in terms of turnout, all good. But a couple of lines rather stand out.

One is age-based again; here the opposite end of the spectrum, 60+ year-olds, are some of the least likely to want to remain in Europe and some of the most likely to say they’ll go and vote (historically, the latter has indeed been true). And, well, UKIP people don’t like Europe pretty much by definition – but they seem worryingly likely to claim they’re going to turn up and vote. Time to go on a quick affiliation conversion mission – or at least plan a big purple-and-yellow distraction of some kind…?


There’s at least one obvious critical measure missing from this analysis, and that is the respective sizes of the subpopulations. The population of UKIP supporters, for instance, is very likely, even now, to be smaller than the number of 60+ year-olds, thankfully – a fact that you’d have to take into account when deciding how to have the biggest impact.

Whilst the YouGov data published did not include these volumes, they did build a fun interactive “referendum simulator” that, presumably taking this into account, lets you simulate the likely result given your view of the likely turnout and the age and class skew, using their latest polling numbers.

Unsafe abortions: visualising the “preventable pandemic”

In the past few weeks, I was appalled to read that a UK resident was given a prison sentence for the supposed “crime” of having an abortion. This happened because she lives in Northern Ireland, a country where having an abortion is in theory punishable by a life sentence in jail – unless the person in need happens to be rich enough to arrange an overseas appointment for the procedure, in which case it’s OK.

Abortion rights have been a hugely contentious issue over time, but for those of us who reside in a wealthy country with relatively progressive laws on the matter, and the medical resources needed to perform such procedures efficiently, it’s not always easy to remember what the less fortunate may face in other jurisdictions.

In 2016, can it really still be the case that any substantial number of women face legal or logistic obstacles to their right to choose what happens to their body, under conditions where the overwhelming scientific consensus is against the prospect of any other being suffering? How often do abortions occur – over time, or in different parts of the world? Is there a connection between more liberal laws and abortion rates? And what are the downsides of illiberal, or medically challenged, environments? These, and more, are questions data analysis surely has a part in answering.

I found useful data in two key places: a 2012 paper published in the Lancet, titled “Induced abortion: incidence and trends worldwide from 1995 to 2008”, and various World Health Organisation publications on the subject.

It should be noted that abortion incidence data is notoriously hard to gather accurately. Obviously, medical records are not sufficient, given the existence of the illegal or self-administered procedures noted above. Nor has every woman been interviewed about the subject. Worse yet, even where they have been, abortion remains a topic subject to discomfort, prejudice, fear, exclusion, secrecy or even punishment. This occurs in some situations more than others, but the net effect is that it’s the sort of question where straightforward, honest responses to basic survey questions cannot always be expected.

I would suggest reading the 2012 paper above and its appendices to understand more about how the figures I used were modelled by the researchers who obtained them. But the results they show have been peer reviewed, and show enough variance that I believe they tell a useful, indeed vital, story about the unnecessary suffering of women.

It’s time to look into the data. Please click through below and explore the story points to investigate those questions and more. And once you’ve done that – or if you don’t have the inclination to do so – I have some more thoughts to share below.

[Visualisation: unsafe abortions story points]

Thanks for persisting. No need to read further if you were just interested in the data or what you can do with it in Tableau. What follows is simply commentary.

This blog is ostensibly about “data”, to the use of which some attribute notions of cold objectivity: a Spock-like detachment that comes from seeing an abstract number rather than understanding events in the real world. But, in my view, most good uses of data necessarily result in the emergence of a narrative; this is a (the?) key skill of a data analyst. The stories data tells may raise emotions, positive or negative. And seeing this data did so in me.

For those who didn’t click through, here is a brief summary of what I saw. It’s largely based on data about the global abortion rate, most often defined here as the number of abortions divided by the number of women aged 15-44. Much of the data relates to 2008. For further source details, please see the visualisation and its sources (primarily this one).

  • The abortion rate in 2008 was pretty similar to that in 2003, which itself followed a significant drop from 1995. Globally, it’s around 28 abortions per 1,000 women aged 15-44. This equates to nearly 44 million abortions per year (for a quick consistency check of those two numbers, see the arithmetic sketch just after this list). This is a process that affects not only the very many women who go through it, but also the network of people who love, care for or simply know them.
  • Abortions can be safe or unsafe. The World Health Organisation defines an unsafe abortion as:

a procedure for terminating an unintended pregnancy either by individuals without the necessary skills or in an environment that does not conform to minimum medical standards, or both.

  • In reality, this translates to a large variety of sometimes disturbing methods: from ingestion of toxic substances, inappropriate use of medicines, and physical trauma to the uterus (the coathanger is the archetypal image for this, so much so that protesters against the criminalisation of abortion have used it as a symbol), to less focussed physical damage, such as throwing oneself down stairs or off roofs.
  • Appallingly, the proportion of abortions that were unsafe in 2008 has gone up from previous years.
  • No medical procedure is ever 100% safe, but a safe, legal, medically controlled abortion carries a pretty negligible chance of death. Unsafe abortions are hundreds of times more likely to be fatal to the recipient. And for those that aren’t, literally millions of people suffer consequences so severe they have to seek hospital treatment afterwards – and these are the “lucky” ones for whom hospital treatment is even available. This is to say nothing of the damaging psychological effects.
  • Therefore, societies that enforce or encourage unsafe abortions should do so in the knowledge that their position is killing women.
  • Some may argue that abortion, which few people of any persuasion could think of as a happy or desirable occurrence, is encouraged where it is freely legally available. They are wrong. There is no suggestion in this data that stricter anti-abortion laws decrease the incidence of abortions.

    A WHO report concurs:

Making abortion legal, safe, and accessible does not appreciably increase demand. Instead, the principal effect is shifting previously clandestine, unsafe procedures to legal and safe ones.

  • In fact, if anything, in this data the association runs the other way. Geopolitical regions with a higher proportion of people living in areas where abortions are illegal actually, on the whole, see a higher rate of abortion. I am not suggesting here that more restrictive laws cause more abortions directly, but it is clearly not the case that making abortion illegal necessarily makes it happen less frequently.
  • But stricter laws do, more straightforwardly, lead to a higher proportion of the abortions that take place anyway being unsafe. And thus, on average, to more women dying.
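As flagged above, here’s a quick back-of-envelope check of how the headline rate and count relate to each other; the implied population figure is my own arithmetic, not a number from the paper:

```python
# Back-of-envelope check, assuming the definitions above; the implied
# population of women aged 15-44 is my own estimate, not the paper's figure.
abortion_rate_per_1k = 28        # abortions per 1,000 women aged 15-44 (2008)
total_abortions = 44_000_000     # approximate annual total quoted above

implied_women = total_abortions / abortion_rate_per_1k * 1000
print(f"implied women aged 15-44 worldwide: {implied_women / 1e9:.2f} billion")
# -> about 1.57 billion, a plausible global figure, so the quoted rate and
#    the headline count are consistent with each other
```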

Abortion is a contentious issue and it will no doubt remain so, perhaps mostly for historic, religious or misogynistic reasons. There are nonetheless valid physical and psychological reasons why abortion is, and should be, controlled to some extent. No mainstream view holds that one should treat the topic lightly or wants to see the procedure become a routine event. As the BBC notes, even ardent “pro-choice” activists generally see it as the least bad of a set of bad courses of action available in a situation that no-one wanted to occur in the first place, and surely no-one who goes through it is happy it happened. But it does happen, it will happen, and we know how to save thousands of lives.

Seeing this data may well not change your mind if you’re someone who campaigns against legal abortion. It’s hard to shift a world-view that dramatically, especially where so-called moral arguments may be involved.

But Рto paraphrase the Vizioneer, paraphrasing William Wilberforce, with his superb writeup after visualising essential data on the atrocities of modern-day human trafficking Рonce you see the data then you can no longer say you do not know.

The criminalisation-of-abortion lobby are often termed “pro-lifers”. To me, it now seems that that nomenclature has been seized in a twisted, inappropriate way. Once you know that the policies you campaign for will unquestionably lead to the harming and death of real, conscious, living people – then you no longer have the right to label yourself pro-life.

The 2016 UK Budget – what does the data say?

On March 16th 2016, our Chancellor George Osborne set out the cavalcade of new policies that contribute towards this year’s UK budget. Each results in either a cost or a saving to the public funds, which has to be forecast as part of the budget release.

Given the constant focus on “austerity”, seeing what this Government chooses to spend its money on and where it makes cuts can be instructive in understanding the priorities of elected (?) representatives.

Click through this link (or the image below) to access a visualisation to help understand and explore what the budget contains – what money George spends on which policies, how he saves funds, and who it affects most.

[Visualisation: the 2016 UK Budget]

Stress, depression and anxiety in the workplace – what does the data show?

Stress, depression and anxiety are all disorders that can have extremely serious effects for the sufferer. The Health and Safety Executive lists quite a few, of varying severity and scope.

It’s acknowledged that in some cases these can be brought on by problems in the workplace; an issue that desperately needs addressing and resolving, given the criticality of paid work in most people’s lives.

Most years, a Labour Force Survey is carried out within the UK to gain information as to the prevalence and characteristics of people reporting suffering from these conditions in the workplace. Please click through below and explore the tabs to see what the latest edition’s data showed.

Some example questions to consider:

  • how many people have suffered stress, anxiety or depression as a result of their work in the UK?
  • are some types of people more often affected than others?
  • are certain types of jobs more prone to inducing stress than others? Are there any obvious patterns?
  • does the industry one works in make any difference?
  • how many working days are lost due to these conditions?

[Dashboard: stress, depression and anxiety in the workplace]

Characteristics of England’s secondary school teachers

In exploring the data behind England’s teacher supply model, it became apparent that the split of teachers by gender and age shows certain patterns by subject. Click through and use the viz below interactively to answer questions such as:

  • How many secondary school teachers are there in the UK?
  • What percentage of all teachers are female?
  • Are there certain subjects where females are over-represented among teachers vs others where males are over-represented? Have we overcome the historic gender stereotypes?
  • What proportion of teachers are below the age of 25? What subjects do they tend to teach?
  • Which age-groups are particularly over-represented among females teaching art and design?
  • …and many more.

Use the first tab for a general overview and ranking of subjects on these indices, and the second tab for an easy comparison of your chosen subject vs all others.

[Dashboard: characteristics of England’s secondary school teachers]


Are station toilets profitable?

After being charged 50p for the convenience of using a station convenience, I became curious as to whether the owners were making much money from this most annoying expression of a capitalistic monopoly hold on the needs of many humans.

It turns out data on those managed by Network Rail is available in the name of transparency Рso please click through and enjoy interacting with a quick viz on the subject.

[Visualisation: train station toilets]

New chart types coming in Excel 2016

As far as I can recall, it has been many, many years since a new chart type of significance found its way into an Excel update. However, for the 2016 release we’re getting some new treats, as seen in this presentation from Scott Ruble.

[Image: new chart types coming in Excel 2016, from Scott Ruble’s presentation]

Several of those chart types are one-click versions of what could already be constructed, with some effort, in previous versions.

However, few of us had the time or inclination to do that, so moving these to a one-press-ish method is to be welcomed.

Treemaps and sunbursts are new to Excel to my knowledge (outside of various purchasable add-ins). I’m not sure they will, or should, become the most-used charts in the toolbox, but including them makes Excel a little more competitive with its data visualisation peers, on paper at least.

It’s early days yet, but from the demo video there don’t seem to be any amazing changes to the workflow involved in building an Excel chart. I suspect the likes of Tableau will therefore still be a more pleasurable and faster experience for the serious analyst in a hurry.

But anything that makes invaluable visualisations like histograms and box-plots easier to produce on the software most companies install on most computers is a big positive. I look forward to giving it a go. Perhaps it will even tempt some business managers to learn what a box-plot is.