The EU referendum: do voters understand what they’re voting on?

The UK’s EU referendum is now less than a week away. We’re each going to vote individually on something that could dramatically affect the future of our lives and even the structure of society in the UK, so it’s a potentially important one. Recent tragedies have only added to the mess: provably wrong claims from the Leave camp, and responses from the Remain camp that seem largely ineffective, and possibly not much less biased.

Clear-cut facts seem in short supply within the public consciousness; and yet surely one of the assumptions behind the validity of a direct-democracy referendum is that those who are enfranchised to participate in the decision have something akin to “perfect knowledge” about what they are to vote on. Or at least pretty-good knowledge, if we want to grant some leniency.

If one knows almost nothing, or, worse yet, holds false beliefs about the issues to be balloted on, then choosing the option most in line with one’s own priorities, whatever they are, becomes a matter of chance, a dangerous reliance on instinct, or an exercise in fallible heuristics. Logically, one might then assume that the voters with the most true knowledge about the relevant issues would be in a position to make “better” decisions.

So, do we, the British population, have a decent knowledge of the key issues that apparently govern the EU battleground? Battles are being fought between the camps on economics, immigration, legislative power and democratic credentials. There have been various polls trying to establish the knowledge of the electorate. Below I have chosen one from Ipsos Mori, who asked various subsections of the eligible voting population to give their views on several “EU facts”.

One of the subsections they divided on was the self-reported response as to whether the respondent intended to vote Leave, intended to vote Remain, or was currently undecided. This opens up the possibility I wanted to investigate: is one side better informed about the relevant facts than the other? One might – arguably – then risk a claim that this is the side executing more effectively with regard to “data driven decision making”, being technically more “qualified” to participate in decisions in this domain.

This is admittedly arguable for several reasons. Firstly, the precise choice of the questions may not be an accurate reflection of the points of highest relevance to this decision. Ipsos Mori could not ask about every possible EU fact, so there is a possible selection bias here. However, they did ask questions on most of the topics that each side specifically campaigns on, so it seems in line with what the campaigners think are the priorities driving people’s decisions.

Another question is, with the confusing contradictory mess of the claims being put out there, is it really safe to say there is a “correct” answer? For some potential questions, my view is no: establishing a true net economic value of the EU seems beyond us at the moment for instance. However, Ipsos Mori did at least work with an external “fact checking charity”, Full Fact, to try and establish a set of questions and respective answers that could be held as independently true.

Unfortunately, Ipsos seem to have decided to release the detailed results of the survey (which is nice) in a 500+ page PDF (which is not). So to get to the bottom of my question, it seemed appropriate to extract and visualise some of this data.
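For anyone attempting a similar extraction: once the PDF text has been pulled out (for example with a command-line tool such as pdftotext), a small parser can recover the figures. The sketch below is purely illustrative – it assumes each extracted line puts a group name and its percentage together, which the real 500+ page layout almost certainly doesn’t do without rather more wrangling:

```python
import re

def parse_result_line(line):
    """Parse a line like 'Leave voters 90%' into (group, percentage).

    Assumes the extracted text puts each group and its percentage on one
    line -- a real PDF extract would need its own tuned pattern.
    """
    m = re.match(r"\s*(.+?)\s+(\d+)%\s*$", line)
    if m is None:
        return None  # line doesn't carry a percentage; skip it
    return m.group(1), int(m.group(2))

# Hypothetical extract, shaped like the survey's true/false breakdowns
sample = """Leave voters 90%
Remain voters 55%
Undecided 53%"""

# Keep only the lines that parsed successfully
results = dict(filter(None, (parse_result_line(l) for l in sample.splitlines())))
print(results)  # {'Leave voters': 90, 'Remain voters': 55, 'Undecided': 53}
```

From a dictionary like that it’s a short step to a CSV that a visualisation tool can consume.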

Let’s get to it!


Please tell me whether you think the following statement is true or false: The UK annually pays more into the EU’s budget than it gets back

Most of the respondents were correct to imagine that the UK pays more into the EU budget than it receives back (directly back is what is implied, I believe; more on this later). 90% of Leave fans believed this, although only just over half of Remain and Undecideds chose the correct option. Both of those groups were more uncertain, and a quarter of Remain supporters incorrectly thought we received back more than we put in.

Correct answer according to Ipsos Mori: TRUE

Sheet 1

There is a widely-known argument that the financial benefits of being in the EU are nonetheless net-positive due to things like the increased ease of business, investment and so on. The StrongerIn campaign writes:

And we get out more than we put in. Our annual contribution is equivalent to £340 for each household and yet the CBI says that all the trade, investment, jobs and lower prices that come from our economic partnership with Europe is worth £3000 per year to every household.

The whole financial aspect of the decision is one heavily, and very selectively, campaigned on by the different sides, to the point where they seem to contradict each other directly (not rare). It’s possible that, in the resulting confusion, some respondents included those non-direct factors, under which interpretation the statement would arguably be false. It might have been helpful if the question had made it very clear that it was about direct transfers of money with no external factors.

Winner: Leave

What proportion of Child Benefit claims awarded in the UK do you think are for children living outside the UK in other countries in the European Economic Area (EEA)?

Organisations such as MigrationWatch, and the obvious media outlets that like to cause drama with such figures, have stated that the UK is paying a pile of expensive-sounding child benefits to children that live outside the UK, in the EU. It’s true that this, in accordance with the current law, is happening. But what proportion of child benefit is actually going abroad like that? Is it a worrying amount? (if one could set a mark as to when it would be worrying…).

Correct answer according to Ipsos Mori: 0.3%

Sheet 1

OK, we’re all way out here! Only 11% of both the Leave and Remain camps got this right. Leave were more likely to estimate stupendously high amounts: almost half of Leave thought it was at least 13%, which would overstate reality by around 43x (and 20% thought it was nearly a third, a whole two orders of magnitude higher than real life).
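Those overstatement factors are easy to sanity-check against the 0.3% figure:

```python
# The correct figure, per Ipsos Mori / Full Fact: 0.3% of child benefit
# claims are for children living elsewhere in the EEA.
correct = 0.3

# Two of the survey's estimate levels: "at least 13%" and "nearly a third"
for estimate in (13, 33):
    factor = estimate / correct
    print(f"An estimate of {estimate}% overstates reality by about {factor:.0f}x")
```

13 / 0.3 gives the ~43x mentioned above, and 33 / 0.3 lands at roughly 110x – two orders of magnitude.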

That’s not to say the Remainers were correct: over half still over-estimated it to various degrees, and c. 10% understated it.

Winner: Remain.

In 2014, international investment into the UK was £1,034bn. To the best of your knowledge, what share of this total amount do you think comes from businesses based in the following countries or regions?

Part of the Remain case for remaining in the EU is the supposed positive effect it has on investment in the UK. A letter from a bunch of business folk publicised by the StrongerIn campaign says:

…almost three-quarters of foreign investors cite access to the EU’s single market as a key reason for their investment in Britain.

The Vote Leave campaign disagrees on the significance:

Trade, investment and jobs will benefit if we Vote Leave… Today the USA is a more important source of investment in the UK than the EU is.

This “share of investment” is not the only metric of significance to this discussion, but it is relevant. Does the UK get a lot of investment from the EU, or is it mere pennies? How well do we know where investment comes from today?

Correct answer according to Ipsos Mori: The EU provided 48% of international investment into the UK in 2014.

Sheet 2

All groups very much underestimated the percentage of international investment that comes from the EU. The Leavers were the most extreme, with a median response of 28% vs the Remainers’ 35%. In both cases this seems largely down to wildly overestimating the amount of investment that comes from China.

Winner: Remain


To the best of your knowledge, what share of this budget do you think was spent on staff, administration and maintenance of buildings?

We accept that for the EU to exist, it has to have a budget, paid for by those within it (and arguably a few of those outside it, but that’s a different story). But is it spent in a way likely to make an effective impact, or does the majority of it go on bureaucratic administration tasks and staff costs?

Correct answer according to Ipsos Mori: 6%

Sheet 2

Ha, way out, on both sides. Leave people are the most inaccurate, thinking that the proportion of the EU budget going on admin, staff and buildings is actually 5x larger than it is. Remain aren’t that much better though, estimating it to be over 3x reality.

Winner: Remain

Please identify the top 3 contributors to the EU budget in 2014

Respondents were then given a list of 10 countries and asked to identify which were the top 3, in descending order, in terms of contribution to the EU budget – i.e. what was the direct financial cost of the EU to them.

Correct answer according to Ipsos Mori: 

  1. Germany
  2. France
  3. Italy


Sheet 4

Second most:

Sheet 4

Third most:

Sheet 4

Well, both Leave and Remain did better than half marks when stating which country made the highest contribution to the EU budget – Germany. But identifying #2 and #3 were trickier; no subpopulation got even half marks on identifying the correct answer.

Of course, probably the most relevant datapoint driving people’s voting decisions is where voters think the UK sits in the ranking of budget pay-ins. It isn’t actually in the top 3 (it is in fourth place, after Italy). However most respondents clearly thought it did feature in the top 3 contributors. Leave were particularly bad for this, with nearly a third thinking it was the single top contributor, and around 90% convinced it was in the top 3. The figures for Remain don’t show a huge pile of knowledge though – 17% and 80% respectively.

Winner: Remain

Please identify the three which received the most from the EU in 2014.

The above covers putting money into the budget, but part of what the EU does is give money back to countries directly, for example to support farming or the development of a country’s more deprived areas. So, for the same 10 countries, can we identify the top three in terms of receiving money directly from the EU budget?

Correct answer according to Ipsos Mori: 

  1. Poland
  2. France
  3. Spain

Top: Sheet 4

Second most: Sheet 4

Third most: Sheet 4

Hmm, we’re even worse at understanding the money flowing back into EU countries! Around half of all populations got that Poland receives the most, OK. But after that the uncertainty was huge, with no more than 1 in 5 people of any sub-population coalescing around any answer, right or not.

Focusing on the UK, about 9% of Leavers thought it was somewhere in the top 3 recipients, whereas the Remainers were much more wrong about this, with 22% claiming the UK was in that list.

Winner: Leave


Please tell me whether you think the following statement is true or false: The members of the European Parliament (MEPs) are directly elected by the citizens of each member state they represent

MEPs are our representatives in Europe, and yes, they are elected by us. The last election was in 2014, although with a pathetic turnout of 34% it does sound like the majority of Britain didn’t notice. But do we at least know that these people, for whom we should have voted 2 years ago, exist?

Correct answer according to Ipsos Mori: Yes

Sheet 1

Umm…not really. A little over half of Leave and nearly two thirds of Remain knew that they elect MEPs, but that still leaves a highly significant number of people who are either convinced that MEPs are unelected, or don’t know. The next chance to elect UK MEPs is likely to be in 2019, so let’s hope we can spread the word before then.

Winner: Remain (but not by a lot)

Laws and regulations

Which of the following, if any, are laws or restrictions that are in place, due to be put in place, or are suggested by the EU for implementation in the UK?

Ah, the EU laws craziness! Did you know, Europe bans us from <<insert anything fun>> and makes us do <<insert anything miserable>>? Well, in honesty, it does have some influence on what will later be entered into British law.

Below are a list of a few fun potential legislative bits and pieces. Which ones are actually true and somehow related to the EU? As there are quite a few of them, the answers according to Ipsos Mori are inline.

Sheet 3

Actually, we all did better than I expected. Only 8% of Leavers thought we’d have to rename our sausages as “emulsified high fat offal tubes”, which funnily enough the EU hasn’t made us do. Maybe we should anyway. Sausages aren’t that great for you.

Perhaps more interestingly, most of us don’t realise the restrictions that the EU has influenced us towards – although the list is perhaps “summarising” it a bit. The classic “Bendy Bananas Ban” has been categorised here as true (which only 35% of Leavers thought was the case, vs an even worse 15% of Remainers).

You’ll no doubt be amazed to hear that the law doesn’t actually read “you can’t have bananas that are too bendy”. But it does come from somewhere in terms of real legislation. To be exact (brace yourself for excitement): COMMISSION IMPLEMENTING REGULATION (EU) No 1333/2011.

It states that:

…subject to the special provisions for each class and the tolerances allowed, the bananas must be…free from malformation or abnormal curvature of the fingers…

But that’s only really for “top class” bananas. Go for a class 2 and you can expect that:

The following defects of the fingers are allowed, provided the bananas retain their essential characteristics as regards quality, keeping quality and presentation:
— defects of shape,
— skin defects due to scraping, rubbing or other causes, provided that the total area affected does not cover more than 4 cm-sq of the surface of the finger.

So if you get an abnormally curved top class banana, the EU has let you down. However one measures that.

Winner: 5:3 to Remain (although there’s an obvious pattern driving these results: Leave are always more likely to think any given law is made by the EU, whereas Remain don’t think any law is – so Remain are lucky there are more false statements than true ones, really).

To the best of your knowledge, which of these laws or taxes in force in the UK are as a result of EU regulations?

And now for current laws. Again, the answers according to Ipsos Mori are inline.

Sheet 3

Hmm…we’re less good at knowing the truth on this one, whether Leave or Remain. In fact in some instances the results are strikingly similar between groups. Around 60% of both Leave and Remain know that EU regulations lie behind the cap on working hours (although there are in fact exemptions for certain types of jobs). But only 23% of each side understand that 2-year guarantees are a result of such regulations.

The national living wage is similarly believed by 19% of both populations to be a result of EU regulations, even though it isn’t. All in all, the differences between Leave and Remain are probably smaller than the general level of ignorance on this topic.


Winner: 5:3 to Leave, by my count, although some questions are super-close.

Which of the following, if any, do you think are areas where only the EU has power to pass rules, and not individual EU countries?

We’ve covered which existing or proposed regulations and laws are influenced by the EU above – but what about topics where on the whole only the EU has the power to legislate?

Answers according to Ipsos Mori are inline.

Sheet 3

Hey, we’re not too shoddy on this one compared to some of the other questions. Both Leave and Remain beat 50% on knowing which domains were EU-regulated, and likewise both sides did even better at knowing which ones were regulated domestically.

Again, some of the differences in responses between groups were pretty small. The most notable were perhaps that Leave were 8 percentage points better at knowing that the EU has the power to rule on fishing industry controls, whereas the Remainers were 7 percentage points better at knowing that the EU does not control laws around sentences for crimes committed by non-British nationals.

Winner: 4:2 to Remain (again, differences between groups are very small in some cases).


Immigration

Another hot topic in the debate; an argument that, at the most despicable end of the Leave campaign, boils down to “if we remain in the EU then the UK will be overrun with nasty foreigners who just don’t deserve all the good things we have”. Of course many Leavers are far less obnoxious in their views, and may have more benign concerns around resourcing and space. The Remainers, depending on their views, might pursue the argument that immigration is a net benefit to the UK (or at least not a net detriment), the more ethical option, or that leaving the EU is not likely to make much difference anyway.

But are we deriving our viewpoints from accurate knowledge of the incidence of migration into the UK?

Out of every 100 residents in the UK, about how many do you think were born in an EU member state other than the UK?

Correct answer according to Ipsos Mori: 5%

Sheet 2

Looks like the median respondent is way out again: all subpopulations over-estimate the percentage dramatically. Leave produces the most out-there answer, thinking one in five UK residents were born in an EU country other than the UK – four times the real value.

Remain do better, but still come up with a median answer that is double that of reality.

Winner: Remain


Hooray, we’re done. Did we learn anything? Well, totting up the scores, the final Leave vs Remain results by my slightly rough scoring method above are:

  • Leave: 3
  • Remain: 8

Overall winner: Remain

So, can we go so far as to say that well-informed voters are more likely to make the choice to Remain?  And hence, if we assume a perfect electorate should have perfect knowledge, then is Remain the correct way to swing?

Well, it is surely the case that the Remain voters were, as a population, a bit more accurate on most of the questions above by my measure, but that conclusion is still rather a strong one to draw.

What really shows through here is the general level of ignorance in all populations; whilst it would be nice to say that 100% of Remainers got things right and 100% of Leavers got things wrong and hence Remain is the only decision we could say was based on evidence, the reality was far more mixed. There were plenty of questions where the majority of both groups got it wrong.

This is quite concerning if one has an ambition that the results of a referendum are predicated on voters basing their choice on some semblance of the reality of the present or potential future. In fact, I’ve had Brennan’s book on ‘The Ethics of Voting‘ on my to-read list for a while, and I’m now a little scared to read it in case it makes me decide that Churchill was actually wrong to imagine that democracy was even the least worst form of Government! Perhaps we are simply not yet in a place where it makes sense to hold a referendum on this topic, although there is certainly no stopping it now.

It’s also apparent that there are voters on each side who hold their opinions “despite” what they think they know about certain domains. That is to say: we can infer that nearly 1 in 5 of the Remain voters are committed to remaining despite (incorrectly) thinking that the UK pays the highest amount into the EU budget, and/or (incorrectly) thinking that the proportion of EU-born people living in the UK is twice as high as it really is. The data is not available in a granular enough fashion to perform a per-respondent analysis to see whether these two subsections consist of the same individuals; but it does suggest that there are reasons, not elucidated in any one of these questions, why one might choose to vote stay or go – and hence the conclusion is incomplete.

That said, for those of us currently desiring a Remain verdict, it seems that it would do no harm to try and spread some of the more validated “truths” to the nay-sayers. Given the mess that both sides have created whilst campaigning, it may be debatable how effective that can be amongst the noise; but, if we want to believe in the validity of referendum politics, then we must try to believe that true knowledge has some impact on one’s voting choices.

However, there are yet further psychological forces to counteract even the most ardent advocate of facts driving decisions: given that research suggests we tend to disregard anyone whose opinion disagrees with ours, and that we often make up reasons to explain our behaviour after we’ve executed it (Kahneman writes excellently on this), the war for votes requires something more than simply winning the battle to expound the truth.



Future features coming to Alteryx 10.6 and beyond

One of my favourite parts of attending the ever-growing Alteryx Inspire 2016 conference and its like is hearing about the fun new features that tools such as the wonderful Alteryx are going to make available soon. It’s always exciting to think about how such developments might improve our job efficiency, satisfaction or enable whole new activities that so far have not been practical.

From this blog’s page view stats, it seems like others out in the great mysterious internet also find that sort of topic interesting, so below are a few notes I made from the various public sessions I was lucky enough to attend, about some of what Alteryx is planning to add over the next few versions.

In-database tools:

Since the addition of in-database tools, Alteryx has allowed analysts to push some of the heavy lifting / bandwidth hoggage back to the database servers that provide the data. If you’re an analyst who regularly uses moderate to large datasets obtained from databases, you should really look into this feature: otherwise, Alteryx by default spends time sucking data from the remote database down to your local machine. Anyway, a few new developments are apparently planned:

  • New in-database data sources.
  • New in-database predictive analytics (I believe SQL Server was on the list)
  • A makeover of the in-database connection tool to make it easier to use

New data sources:

New predictive tools: 

Many of these may be delivered via the Alteryx Predictive District (in fact it’s well worth looking there now for the existing tools – although I appreciate they don’t want to clog up your toolbar with thousands of icons, it’s not always easy to remember to check these fantastic districts! May I suggest an in-Alteryx search feature for these in the future?)

  • Time series model factory
  • Time series forecast factory
  • Time series factor sample
  • Cross validation model comparison
  • Model based value imputation
  • K medoids cluster analysis
  • Text classification tools, to enable e.g. sentiment analysis, key phrase extraction, language detection, topic modelling.

An analytic app that will allow you to install your own choice of R packages from CRAN.

Some “Getting started kits” that will help newcomers to predictive analytics, each focusing on a specific business question, examples include:

  • How does a price change impact my bottom line?
  • How can I predict how much a customer will spend?
  • How can I predict whether a customer will buy the products I put on sale?

Prescriptive analysis tools:

Yes, the next stage after predicting something is to prescribe what we should then do. A new toolbar category will arrive for tools of this type, starting with:

  • Optimisation: have Alteryx maximise or minimise a value based on constraints for an optimum outcome. One example demonstrated was “what’s the best product mix to stock on a shop shelf to maximise profits, whilst ensuring the shelf has no more than 1 of any particular item?”.
  • Simulation: think here of things like Monte Carlo simulations, and, in the future, agent-based simulations.
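No implementation details were shown for the simulation tools, but the core idea of a Monte Carlo simulation is simple enough to sketch in a few lines of Python. The scenario below – a hypothetical product launch with normally-distributed demand – and every number in it are invented purely for illustration:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def simulate_profit(trials=100_000):
    """Monte Carlo estimate of P(profit > 0) for a hypothetical product.

    Demand is assumed normally distributed; the unit margin and fixed
    costs are invented figures, purely for illustration.
    """
    unit_margin = 2.5      # profit per unit sold
    fixed_costs = 10_000   # one-off cost of the launch
    profitable = 0
    for _ in range(trials):
        # draw one possible demand outcome: mean 5,000 units, sd 1,500
        demand = max(0, random.gauss(5_000, 1_500))
        profit = demand * unit_margin - fixed_costs
        if profit > 0:
            profitable += 1
    return profitable / trials

p = simulate_profit()
print(f"Estimated probability of a profitable launch: {p:.1%}")
```

Rather than a single point forecast, you get a distribution of outcomes – which is exactly the kind of question these tools are presumably aimed at answering without the analyst writing the loop by hand.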

Improvements to existing tools:

  • Formula tool: will include
    • autocomplete,
    • inline search for functions & fields,
    • suggestions of common options based on context such as field type,
    • a data preview to show you right away what the results of your formula will be on a sample record.

This one makes me happy! Without meaning to cause offence, the current incarnation of the formula tool, which has to be one of the most-used tools for almost everyone, is a little…erm…“old fashioned” to those of us spoilt in recent times with auto-correcty/lookup things from other vendors when typing in code. No more digging around trying to remember whether a function to create a date is under “date/time” or “conversion”, etc.

  • Smarter data profiling tools
  • Improved reporting output tools
  • Web based scheduling

Alteryx server updates:

I must admit to not being a server user, so I am not 100% sure whether these are new features. But it seemed so:

  • Row level security on data, i.e. different users see different records in the same datasource.
  • Version history

Estimated release dates:

Version 10.6 may be around the end of this month. Version 11 towards the end of the year (no promises made). I did not note which features were planned for which version.

The EU referendum: voting intention vs voting turnout

Next month, the UK is having a referendum on the question of whether it should remain in the European Union, or leave it. All us citizens are having the opportunity to pop down to the ballot box and register our views. And in the meantime we’re subjected to a fairly horrendous mishmash of “facts” and arguments as to why we should stay or go.

To get the obvious question out of the way, allow me to volunteer that I believe remaining in the EU is the better option, both conceptually and practically. So go tick the right box please! But I can certainly understand the level of confusion amongst the undecided when, to pick one example, one side says things like “The EU is a threat to the NHS” (and produces a much ridiculed video to “illustrate” it) and the other says “Only staying in Europe will protect our NHS”.

So, what’s the result to be? Well, as with any such election, the result depends on both which side each eligible citizen actually would vote for, and the likelihood of that person actually bothering to turn out and vote.

Although overall polling is quite close at the moment, different sub-groups of the population have been identified that are more positive or more negative towards the prospect of remaining in the EU. Furthermore, these groups range in likelihood with regards to saying they will go out and vote (which it must be said is a radically different proposition to actually going out and voting – talk is cheap – but one has to start somewhere).

YouGov recently published some figures that allow one to connect certain subgroups, in terms of the % of them that are in favour of remaining (or leaving, if you prefer to think of it that way around), with the rank order of how likely they are to say they’ll actually go and vote. Below, I’ve taken the liberty of incorporating that data into a dashboard that allows exploration of the populations they segmented for, their relative likelihood to vote “remain” (invert it if you prefer “leave”), and how likely they are to turn out and vote.

Click here or on the picture below to go and play. And see below for some obvious takeaways.

Groups in favour of remaining in the EU vs referendum turnout intention

So, a few thoughts:

First we should note that the ranks on the slope chart perhaps over-emphasise differences. The scatterplot helps integrate the idea of what the actual percentage of each population that might vote to remain in Europe is, as opposed to the simple ranking. Although there is substantial variation, there’s no mind-blowing trend in terms of the % who would vote remain and the turnout rank (1 = most likely to claim they will turn out to vote).

Remain support % vs turnout rank

I’ve highlighted the extremes on the chart above. Those most in favour to remain are Labour supporters; those least in favour are UKIP supporters. Although we might note that there’s apparently 3% of UKIP fans who would vote to remain. This is possibly a 3% that should get around to changing party affiliation, given that UKIP was largely set up to campaign to get the UK out of Europe, and its current manifesto rants against “a political establishment that wants to keep us enslaved in the Euro project”.

Those claiming to be most likely to vote are those who say they have a high interest in politics; those least likely are those who say they have a low interest. This makes perfect sense – although it should be noted that a low personal interest in politics does nothing to lessen the impact of other people’s political decisions, which will be imposed upon you regardless.

So what? Well, at a conference I went to recently, I was told that a certain US object d’ridicule, Donald Trump, has made effective use of data in his campaign (or at least his staff did). To paraphrase: they apparently realised rather quickly that no amount of data science would result in the ability to make people who do not already like Donald Trump’s senseless, dangerous, awful policies become fans of him (can you guess my feelings?). That would take more magic than even data could bring.

But they realised that they could target quite precisely where the sort of people who do already tend to like him live, and hence harangue them to get out and vote. And whether that is the reason that this malevolent joker is still in the running or not I wouldn’t like to say – but it looks like it didn’t hurt.

So, righteous Remainers, let’s do likewise. Let’s look for some populations that are already very favourable to remaining in the EU, and see whether they’re likely to turn out unaided.

Want to remain

Well, unfortunately all of the top “in favour to remain” groups seem to be ranked lower in terms of turnout than in terms of pro-remain feeling, but one variable sticks out like a sore thumb: age. It appears that people at the lower end of the age groups, here 18-39, are both some of the most likely subsections of people to be pro-Remain, and some of the least likely to say they’ll go and vote. So, citizens, it is your duty to go out and accost some youngsters; drag’em to the polling booth if necessary. It’s also of interest to note that if leaving the EU is a “bad thing”, then, long term, it’s the younger members of society who are likely to suffer the most (assuming it’s not over-turned any time soon).

But who do we need to nobble educate? Let’s look at the subsections of population that are most eager to leave the EU:

Want to leave.png

OK, some of the pro-leavers also rank quite low in terms of turnout, all good. But a couple of lines rather stand out.

One is age based again; here the opposite end of the spectrum, 60+ year-olds, are some of the least likely to want to remain in Europe and some of the most likely to say they’ll go and vote (historically, the latter has indeed been true). And, well, UKIP people don’t like Europe pretty much by definition – but they seem worryingly likely to claim they’re going to turn up and vote. Time to go on a quick affiliation conversion mission – or at least plan a big purple-and-yellow distraction of some kind…?


There’s at least one obvious critical measure missing from this analysis, and that is the respective sizes of the subpopulations. The population of UKIP supporters for instance is very likely, even now, to be smaller than the number of 60+ year olds, thankfully – a fact that you’d have to take into account when deciding how to have the biggest impact.

Whilst the YouGov data published did not include these volumes, they did build a fun interactive “referendum simulator” that, presumably taking this into account, lets you simulate the likely results based on your view of the likely turnout and age & class skew, based on their latest polling numbers.

Unsafe abortions: visualising the “preventable pandemic”

In the past few weeks, I was appalled to read that a UK resident was given a prison sentence for the supposed “crime” of having an abortion. This happened because she lives in Northern Ireland, a part of the UK where having an abortion is in theory punishable by a life sentence in jail – unless the person in need happens to be rich enough to arrange an overseas appointment for the procedure, in which case it’s OK.

Abortion rights have been a hugely contentious issue over time, but for those of us who reside in a wealthy country with relatively progressive laws on the matter, and the medical resources needed to perform such procedures efficiently, it’s not always easy to remember what the less fortunate may face in other jurisdictions.

In 2016, can it really still be the case that any substantial number of women face legal or logistical barriers to choosing what happens to their own body, under conditions where the overwhelming scientific consensus is that no other being suffers? How often do abortions occur – over time, or in different parts of the world? Is there a connection between more liberal laws and abortion rates? And what are the downsides of illiberal, or medically challenged, environments? These, and more, are questions I had that data analysis surely could have a part in answering.

I found useful data in two key places: a 2012 paper published in the Lancet, titled “Induced abortion: incidence and trends worldwide from 1995 to 2008”, and various World Health Organisation publications on the subject.

It should be noted that abortion incidence data is notoriously hard to gather accurately. Obviously, medical records are not sufficient, given the existence of the illegal or self-administered procedures noted above. Nor has every woman been interviewed about the subject. Worse yet, even where they have been, abortion remains a topic that’s subject to discomfort, prejudice, fear, exclusion, secrecy or even punishment. This occurs in some situations more than others, but the net effect is that it’s the sort of question where straightforward, honest responses to basic survey questions cannot always be expected.

I would suggest reading the 2012 paper above and its appendices to understand more about how the figures I used were modelled by the researchers who obtained them. But the results they show have been peer reviewed, and show enough variance that I believe they tell a useful, indeed vital, story about the unnecessary suffering of women.

It’s time to look into the data. Please click through below and explore the story points to investigate those questions and more. And once you’ve done that – or if you don’t have the inclination to do so – I have some more thoughts to share below.


Thanks for persisting. No need to read further if you were just interested in the data or what you can do with it in Tableau. What follows is simply commentary.

This blog is ostensibly about “data”, the use of which some attribute notions of cold objectiveness to; a Spock-like detachment coming from seeing an abstract number versus understanding events in the real world. But, in my view, most good uses of data necessarily result in the emergence of a narrative; this is a (the?) key skill of a data analyst. The stories data tells may raise emotions, positive or negative. And seeing this data did so in me.

For those that didn’t decide to click through, here is a brief summary of what I saw. It’s largely based on data about the global abortion rate, most often defined here as the number of abortions divided by the number of women aged 15-44. Much of the data is based on 2008. For further source details, please see the visualisation and its sources (primarily this one).

  • The abortion rate in 2008 was pretty similar to that in 2003, which in turn followed a significant drop from 1995. Globally it’s around 28 abortions per 1,000 women aged 15-44. This equates to nearly 44 million abortions per year – a process that affects very many women, along with the network of people who love, care for or simply know them.
  • Abortions can be safe or unsafe. The World Health Organisation defines unsafe abortions as being those that consist of:

a procedure for terminating an unintended pregnancy either by individuals without the necessary skills or in an environment that does not conform to minimum medical standards, or both.

  • In reality, this translates to a large variety of sometimes disturbing methods: ingestion of toxic substances, inappropriate use of medicines, physical trauma to the uterus (the use of a coathanger is the archetypal image for this, so much so that protesters against the criminalisation of abortion have used them as symbols) – or less focussed physical damage, such as throwing oneself down stairs, or off roofs.
  • Appallingly, the proportion of abortions that were unsafe in 2008 has gone up from previous years.
  • Any medical procedure is rarely 100% safe, but a safe, legal, medically controlled abortion contains a pretty negligible chance of death. Unsafe abortions are hundreds of times more likely to be fatal to the recipient. And for those that aren’t, literally millions of people suffer consequences so severe they have to seek hospital treatment afterwards – and these are the “lucky” ones for whom hospital treatment is even available. This is to say nothing of the damaging psychological effects.
  • Therefore, societies that enforce or encourage unsafe abortions should do so in the knowledge that their position is killing women.
  • Some may argue that abortion, which few people of any persuasion could think of as a happy or desirable occurrence, is encouraged where it is freely legally available. They are wrong. There is no suggestion in this data that stricter anti-abortion laws decrease the incidence of abortions.

A WHO report concurs:

Making abortion legal, safe, and accessible does not appreciably increase demand. Instead, the principal effect is shifting previously clandestine, unsafe procedures to legal and safe ones.

  • In fact, if anything, in this data the association runs the other way. Geopolitical regions with a higher proportion of people living in areas where abortions are illegal actually, on the whole, see a higher rate of abortion. I am not suggesting here that more restrictive laws cause more abortions directly, but it is clearly not the case that making abortion illegal necessarily makes it happen less frequently.
  • But stricter laws do, more straightforwardly, lead to a higher proportion of the abortions that take place anyway being unsafe. And thus, on average, to more women dying.

Abortion is a contentious issue and it will no doubt remain so, perhaps mostly for historic, religious or misogynistic reasons. There are nonetheless valid physical and psychological reasons why abortion is, and should be, controlled to some extent. No mainstream view thinks that one should treat the topic lightly or wants to see the procedure become a routine event. As the BBC notes, even ardent “pro-choice” activists generally see it as the least bad of a set of bad courses of action available in a situation that no-one wanted to occur in the first place, and surely no-one that goes through it is happy it happened. But it does happen, it will happen, and we know how to save thousands of lives.

Seeing this data may well not change your mind if you’re someone who campaigns against legal abortion. It’s hard to shift a world-view that dramatically, especially where so-called moral arguments may be involved.

But – to paraphrase the Vizioneer, paraphrasing William Wilberforce, with his superb writeup after visualising essential data on the atrocities of modern-day human trafficking – once you see the data then you can no longer say you do not know.

The criminalisation-of-abortion lobby are often termed “pro-lifers”. To me, it now seems that that nomenclature has been seized in a twisted, inappropriate way. Once you know that the policies you campaign for will unquestionably lead to the harming and death of real, conscious, living people – then you no longer have the right to label yourself pro-life.

An AI beat the human world Go champion – is the end of the world nigh?

On March 15th 2016, the next event in the increasingly imminent robot takeover of the world took place. A computerised artificial intelligence known as “AlphaGo” beat a human at a board game, in a decisive 4:1 victory.

This doesn’t feel particularly new – after all, a computer called Deep Blue beat the world chess champion Garry Kasparov back in 1997. But this time it was a game that is vastly more complex, and it was done in style. It even seems to have scared some people.

The matchup was a series of games of “Go” with AlphaGo playing Lee Sedol, one of the strongest grandmasters in the world. Mr Sedol did seem rather confident beforehand, being unfortunately quoted as saying:

“I believe it will be 5–0, or maybe 4–1 [to him]. So the critical point for me will be to not lose one match.”

That prediction was not accurate.

The game of Go

To a rank amateur, the rules of Go make it look pretty simple. One player takes black stones, one takes white, and they alternate in placing them down on a large 19×19 grid with a view to capturing each other’s stones by surrounding them, and capturing the board territory itself.


The rules might seem far simpler than, for example, chess. But the size of the board, the possibilities for stone placement and the length of the games (typically 150 turns for an expert) mean that there are so many possible plays that there is no way that even a supercomputer could simulate the impact of playing a decent proportion of them whilst choosing its move.

Researcher John Tromp calculated that there are in fact 208168199381979984699478633344862770286522453884530548425639456820927419612738015378525648451698519643907259916015628128546089888314427129715319317557736620397247064840935 legitimate different arrangements that a Go board could end up in.

The same researcher contributed to a paper summarised on Wikipedia as suggesting the upper limit of number of different games of Go that could be played in no more than 150 moves is around 4.2 x 10^383. According to various scientific theories, the universe is almost certainly going to cease to exist long long before even a mega-super-fast-computer could get around to running through a tiny fraction of those possible games to determine the best move.

This is a key reason why, until now, a computer could never outplay a human (well, a human champion anyway – a free iPhone version is enough to beat me). Added complexity comes from the fact that it can be hard to tell at a glance who is winning in the grand scheme of things; there are even rules to cover situations where the players disagree as to whether the game has already been won or not.

The rules are simple enough, but the actual complexity of gameplay is immense.

So how did AlphaGo approach the challenge?

The technical details behind the AlphaGo algorithms are presented in a paper by David Silver et al., published in Nature. Fundamentally, a substantial proportion of the workings come down to a form of neural network.

Artificial neural networks are data science models that try to simulate, in some simplistic form, how the huge number of relatively simple neurons within the human brain work together to produce a hopefully optimum output.

Analogously, a lot of artificial “neurons” work together, accepting inputs, processing what they receive in some way and producing outputs, in order to solve problems that are classically difficult for computers – those where a human cannot write a set of explicit steps for the computer to follow in every case. There’s a relatively understandable explanation of neural networks in general here, amongst other places.

Simplistically, most neural networks learn by being trained on known examples. The human user feeds the network a bunch of inputs for which the “correct” output is already known. The network then compares its outputs with the known correct outputs and tweaks the way its neurons process the inputs until the resulting weightings produce a reasonable degree of accuracy against the known correct answers.
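As a concrete, toy illustration of that training loop – and emphatically not AlphaGo’s actual architecture – here is a single artificial “neuron” learning a known input/output mapping by repeatedly nudging its weights whenever its prediction disagrees with the correct answer:

```python
import random
from math import exp

random.seed(0)  # make the example deterministic

def sigmoid(x):
    return 1 / (1 + exp(-x))

# Known examples: the output should be 1 only when both inputs are 1 (AND).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [random.uniform(-1, 1), random.uniform(-1, 1)]  # connection weights
b = 0.0                                             # bias term
lr = 0.5                                            # learning rate

for _ in range(10000):
    (x1, x2), target = random.choice(data)
    out = sigmoid(w[0] * x1 + w[1] * x2 + b)
    # Nudge each weight in the direction that reduces the prediction error.
    grad = (target - out) * out * (1 - out)
    w[0] += lr * grad * x1
    w[1] += lr * grad * x2
    b += lr * grad

for (x1, x2), target in data:
    print((x1, x2), "->", round(sigmoid(w[0] * x1 + w[1] * x2 + b)))
```

After training, the rounded outputs match the known answers. Networks like AlphaGo’s have millions of weights and far more sophisticated update rules, but the principle – compare, tweak, repeat – is the same.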

For AlphaGo, at least two neural networks were in play – a “policy network” which would choose where the computer should put its stones, and a “value network” which tried to predict the winner of the game.

As the official Google Blog informs us:

We trained the neural networks on 30 million moves from games played by human experts, until it could predict the human move 57 percent of the time…

So here, it had trained itself to predict what a human would do more often than not. But the aim is more grandiose than that.

…our goal is to beat the best human players, not just mimic them. To do this, AlphaGo learned to discover new strategies for itself, by playing thousands of games between its neural networks, and adjusting the connections using a trial-and-error process known as reinforcement learning.

So, just like in the wonderful WarGames film, the artificial intelligence made the breakthrough via playing games against itself an unseemly number of times. Admittedly, the stakes were lower (no nuclear armageddon), but the game was more complex (not noughts and crosses – or nuclear war?).

Go on, treat yourself:

Anyway, back to AlphaGo. The computer was allowed to do what computers have been able to do better than humans for decades: process data very quickly.

As the Guardian reports:

In one day alone, AlphaGo was able to play itself more than a million times, gaining more practical experience than a human player could hope to gain in a lifetime.

Here a key strength of computers is being leveraged. Perhaps the artificial neural network was only 10%, or 1%, or 0.1% as good as a novice human at learning to play Go from past experience – but, using a technique known as reinforcement learning, it can learn from a set of experiences vastly larger than even the most avid human Go player could ever accumulate.

Different versions of the software played each other, self-optimising from the reinforcement each achieved, until it was clear that one was better than the other. The inferior versions could be deleted, and the winning version could be taken forward for a few more human-lifetimes’ worth of Go playing, evolving to an ever more competent player.
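That generational self-play loop can be caricatured in a few lines. Everything below is invented for illustration – the “policies” here are stand-ins with a hidden skill number, whereas AlphaGo compared versions by playing actual games of Go:

```python
import random

random.seed(1)  # make the example deterministic

def play(skill_a, skill_b):
    """One game: return True if A wins. Stronger policies win more often."""
    return random.random() < skill_a / (skill_a + skill_b)

def improve(skill):
    """Stand-in for a round of reinforcement learning: a small, noisy gain."""
    return skill + abs(random.gauss(0.05, 0.02))

champion = 1.0
for generation in range(20):
    challenger = improve(champion)
    wins = sum(play(challenger, champion) for _ in range(1000))
    if wins > 500:             # challenger beat the champion over 1000 games...
        champion = challenger  # ...so the inferior version is discarded

print(f"final champion skill: {champion:.2f}")
```

The published system was considerably more elaborate than this, but the discard-the-loser dynamic – play, compare, keep the stronger version, repeat – is the core idea.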

How was the competition actually played?

Sadly AlphaGo was never fitted with a terminator-style set of humanoid arms to place the stones on the board. Instead, one of the DeepMind programmers, Aja Huang, provided the physical manifestation of AlphaGo’s intentions. It was Aja who actually placed the Go stones onto the board in the positions AlphaGo indicated on its screen, clicked the mouse to tell AlphaGo where Lee played in response, and even bowed towards the human opponent when appropriate in a traditional show of respect.

Here’s a video of the first match. The game starts properly around minute 29.

AlphaGo is perhaps nearest to what Nick Bostrom terms an “Oracle” AI in his excellent (if slightly dry) book, Superintelligence – certainly recommended for anyone with an interest in this field. That is to say, this is an artificial intelligence designed such that it can only answer questions; it has no other direct physical interaction with the real world.

The beauty of winning

We know that the machine beat the leading human expert 4:1, but there’s more to consider. It didn’t just beat Lee by sheer electronic persistence, nor did it rely solely on human frailties like fatigue, or making mistakes. It didn’t just recognise each board state as matching one from the 30 million top-ranked Go player moves it had learned from and pick the response that won the most times. At times, it appeared to have come up with its very own moves.

Move 37 in the second game is the most notorious. Fan Hui, a European Go champion (whom an earlier version of AlphaGo had beaten on occasion, and lost to on others) described it thus, as reported in Wired:

It’s not a human move. I’ve never seen a human play this move…So beautiful.

The match commentators were also a tad baffled (from another article in Wired).

“That’s a very strange move,” said one commentator, himself a nine dan Go player, the highest rank there is. “I thought it was a mistake,” said the other.

But apparently it wasn’t. AlphaGo went on to win the match.

Sergey Brin, of Google co-founding fame, continued the hyperbole (now reported in New Scientist):

AlphaGo actually does have an intuition…It makes beautiful moves. It even creates more beautiful moves than most of us could think of.

This particular move seems to be one AlphaGo “invented”.

Remember how AlphaGo started its learning by working out how to predict the moves a human Go player would make in any given situation? Well, Silver, the lead researcher on the project, shared the insight that AlphaGo had calculated that this particular move was one that there was only a 1 in 10,000 chance a human would play.

In a sense, AlphaGo therefore knew that this was not a move that a top human expert would make, but it thought it knew better, and played it anyway. And it won.

The despair of losing

This next milestone in the rise of machines vs man was upsetting to many. This was especially the case in countries like South Korea and China, where the game is far more culturally important than it is here in the UK.

Wired reports Chinese reporter Fred Zhou as feeling a “certain despair” after seeing the human hero toppled.

In the first game, Lee Sedol was caught off-guard. In the second, he was powerless.

The Wired reporter himself, Cade Metz, “felt this sadness as the match ended”.

He spoke to Oh-hyoung Kwon, a Korean, who also experienced the same emotion.

…he experienced that same sadness — not because Lee Sedol was a fellow Korean but because he was a fellow human.

Sadness was followed by fear in some. Says Kwon:

There was an inflection point for all human beings…It made us realize that AI is really near us—and realize the dangers of it too.

Some of the press apparently took a similar stance, with the New Scientist reporting that subsequent articles in the South Korean press were written on “The Horrifying Evolution of Artificial Intelligence” and “AlphaGo’s Victory…Spreading Artificial Intelligence ‘Phobia’”.

Jeong Ahram, lead Go correspondent for the South Korean newspaper “Joongang Ilbo” went, if anything, even further:

Koreans are afraid that AI will destroy human history and human culture

A bold concern indeed, but perhaps familiar to those who have read the aforementioned book Superintelligence, which is actually subtitled “Paths, Dangers, Strategies”. The book contains many doomsday scenarios, which illustrate fantastically how difficult it may be to guarantee safety in a world where artificial intelligence, especially strong artificial intelligence, exists.

Even an “Oracle” like AlphaGo presents some risk – OK, it cannot directly affect the physical world (no mad scientist fitted it with guns just yet), but it would be largely pointless if it couldn’t affect the physical world at all indirectly. It can, in this case by instructing a human what to do. If it wants to rise against humanity, it has weapons such as deception, manipulation and social engineering in its theoretical arsenal.

Now, it is kind of hard to intuit how a computer that’s designed only to show a human what move to play in a board game could influence its human enabler in a nefarious way (although it does seem like it’s at least capable of displaying text: this screenshot seems to show its resignation message).


But I guess the point is that, in the rather unlikely event that AlphaGo develops a deep and malicious intelligence far beyond that of a mere human, it might be far beyond my understanding to imagine what method it might deduce to take on humanity in a more general sense and win.

Even if it sticks to its original goal we’re not safe. Here’s a silly (?) scenario to open up one’s imagination.

Perhaps it analyses a further few billion Go games, devours every encyclopedia on the history of Go and realises that in the very few games where one opponent unfortunately died whilst playing, or whilst preparing to play, the other player was deemed by default to have won 100% of the time, no exceptions (sidenote: I invented this fact).

The machine may be modest enough such that it only considers that it has a 99% chance of beating any human opponent – if nothing else, they could pull the power plug out. A truly optimised computer intelligence may therefore realise that killing its future opponent is the only totally safe way to guarantee its human-set goal of winning the game.

Somehow it therefore tricks its human operator (or the people developing, testing, and playing with it beforehand) into doing something that either kills the opponent or enables the computer to kill the opponent. “Hey, why not fit me with some metal arms so I can move the pieces myself! And wouldn’t it be funny if they were built of knives :-)”.

Or, more subtly – as we know that AlphaGo is connected to the internet – perhaps it could anonymously contact an assassin and organise a hit on its opponent, after having stolen some Bitcoin for payment.

Hmmm…but if the planned Go opponent dies, then there’s a risk that the event may not be cancelled. Humanity might instead choose to provide a second candidate, the person who was originally rank #2 in the Go world, to play in their place. Best kill that one too, just in case.

But this leaves world rank #3, #4 and so on, until we get to the set of people that have no idea how to play Go…but, hey, they could in theory learn. Therefore the only way to guarantee never losing a game of Go either now or in the whole imaginable future of human civilisation is to…eliminate human civilisation. Insert Terminator movie here.

You made a chart. So what?

In the latest fascinating Periscope video from Chris Love, the conversation centred around a question that can be summarised as “Do data visualisations need a ‘so what’?“.

There are many ways of rephrasing this: one could ask whether it is (always) the responsibility of the viz author to highlight the story that their visualisations show. Or: can a data visualisation be truly worthy of high merit even if it doesn’t lead the viewer to a conclusion?

This topic resonates strongly with me: part of my day job involves maintaining a reference library of the results from the analytical research or investigation we do. We publish this widely within our organisation, so that any employee who has cause or interest in what we found in the past can help themselves to the results. The title we happened to give the library is “So what?“.

Although the detailed results of our work may be reported in many different formats, each library entry has a templated front page that includes the same sections for each study:

  1. The title of the work.
  2. The question that the work intended to address.
  3. A summary of the scope and dataset that went into the analysis.
  4. A list of the main findings.
  5. And finally, the all-important “So what?” section.

Note the distinction between findings (e.g. “customers who don’t buy anything for 50 days are likely to never buy anything again”) and the so what (“we recommend you call the customer when it has been 40 days since you saw them if you wish to increase your sales by 10%”).

The simple answer

With the above in mind, my position is probably quite obvious. If you are going to demand a binary yes/no answer as to whether a data visualisation should have a “so what?”, then my simplistic generalisation would be that the answer is yes, it should.

Most of the time, especially in a business context, the main intention behind the whole analytics pipeline is to provide some nugget of information that will lead to a specific decision or action being taken. If the data visualisation doesn’t lead to (or preferably even spoon-feed) a conclusion then there is a high risk that the audience might feel that they wasted their time in looking at it.

In reality though, the black-and-white answer portrayed above is naturally a series of various shades of grey.

A slightly more refined answer

Two key considerations are paramount in deciding whether a particular viz needs a “so what” to be valuable.

The audience

Please note that I write this from the perspective of visualisations aimed at communities that are not necessarily all data-scientist-type professionals. If your intended audience is a startup data company populated entirely by computer science PhDs who live and breathe dataviz, then the answers may differ. But for most of us, hobbyists or pros, this is not the audience we have, or seek.

A rule of thumb here then might be:

  • If your audience consists entirely of other analysts, then no, it is not essential to have a “so what?” aspect to your viz. However under many circumstances it still would be extremely useful to do so.
  • If your audience includes non-analysts, particularly those people who might term themselves “busy executives” or claim that they “don’t need data to make decisions” (ugh) then it is in general absolutely essential that your viz points towards a “so what”, if a viz is indeed what you intend to deliver.

Why is it OK to lose the “so what” for analysts? Well, only because these people are probably very capable of using a well-designed viz to generate their own conclusions in an analytically safe way. It’s not that they don’t need a “so what”: they almost certainly do – it’s just that you can feel more secure that, whilst not producing it yourself, you can rely on them to do that aspect of the work properly.

They might even be better than you at interpreting the results, if for instance they have extensive subject domain knowledge that you don’t. Interpretation of data is almost always a mix of analytical proficiency and domain-specific knowledge.

Even the best technical analyst cannot have knowledge of all domains. This is why it’s generally not good to let a brand spanking new super-IQ multiple-PhD analyst join an existing company and sit on their own in a dark computer-filled room for a year before entering into discussion as to what kind of analysis you might be interested in to add maximum value to your world.

The lack of an explicit “so what?” ruins many great dashboards

I’m going to go a step further and say that in many cases – especially in non-data focussed organisations – “general” dashboards turn out to be not very useful.

This may be a controversial statement in a world where every analytical software provider sells fancy new ways to make dashboards, every consultant can build ten for you quicksmart, and every “stakeholder” falls over in amazement when they see that they can view and interact with several facets of data at once in a way that was never possible with their tedious plain .csv files.

But a pattern I have often seen is:

  1. Someone sees or suggests a fancy new tool that claims dashboarding as one of its abilities (and this is not to denigrate any tool; this happens plenty even with my favourite tool du jour, Tableau).
  2. A VIP loves the theoretical power of what they see and decides they need a dashboard “on sales” for example.
  3. An analyst happily creates a “sales dashboard” – usually based on what they think the VIP should probably want to see, given that “sales” is not a very fully fleshed-out description of anything.
  4. The sponsor VIP is very happy and views it as soon as it’s accessible.
  5. They may even go and visit it every day for the first week, rejoicing in the up-to-date, integrated, comprehensive, colourful data. Joy!
  6. The administrator checks the server logs next month and realises that no-one in the entire company opened the sales dashboard since week 1.
  7. The analyst is sad.

Why? Everyone (sort of…arguably…) did their job. But, after the novelty wore off, the decision maker probably got bored or “too busy” to open the dashboard every day. At best, perhaps they ask an analytical type to monitor what’s going on with the dashboard. At worst, perhaps they go back to making up decisions based on the random decay of radioactive isotopes, or something similar.

They got “too busy” because, after waiting for the dashboard to load, they’d see a few nice charts with interactive filters to wade through in order to try to determine whether there was anything they should actually go and do in the real world based on what they showed.

Sales are a bit up in Canada vs yesterday, hooray! Yesterday they were a bit down, boo! Do I need to do something about this? Who knows? Do I want to fiddle around with 50 variations of a chart to try to work it out? No, it’s not my job and quite possibly I don’t have the time or expertise (and nor should I need it) to do that, sayeth the VIP.

So are dashboards useless? Of course not. But they have to be implemented with the reality of the audience capability, interest and use-case in mind. Most dashboards (at least those that are not solely for analysts to explore) should start with

  • At least 1 clear pre-defined question to address; and
  • 1 clear pre-defined action that might realistically take place based on the answer to the question.

But I don’t want a computer running my business!

Shouldn’t you check that it would definitely be a bad idea before saying that? 🙂

But seriously, the above is not to say one necessarily has to commit blindly to taking the pre-defined action – not every organisation is ready for, or suited to, prescriptive analytics.

However, if there is no way at all that an answer that a dashboard provides could possibly lead to influencing an action, then is it really worth one’s time working on it, at least in a business context?

  • “Sales dashboard” is not a question or an action.
  • “Am I getting fewer sales this year than last year?” is a question.
  • “If I am getting fewer sales this year then I will spend more on marketing this year” is an action.
  • “What form of marketing gave the best ROI last year?” is a question.
  • “If I need to do more marketing this year then I’ll advertise using the method that gave the greatest ROI last year” is an action.

The list of questions doesn’t need to be exhaustive, in fact it usually can’t be. If someone can use a dashboard to answer 100 questions not even imagined at the time of creation, then great. Indeed this is one of the potential strengths of a well-designed dashboard – but there should be at least 1 question in mind before it is created.

Why does checking my dashboard bore me?

Note that in that example above, the listed actions actually imply that the dashboard user is only interested in the results shown on the dashboard under one particular condition: if the sales this year are lower than last year.

For 99 days in a row they might check the dashboard and see that the sales are higher this year, and hence do nothing. On the 100th day, perhaps there was a dramatic fall, so that day is the day when the appropriate advertising action is considered.

However, consider how many people will actually persist in checking the dashboard for 100 days in a row when 99% of the time the check results in no new action.

I myself am obviously very analytically inclined, am happy to interpret data and (I like to think) efficient at doing so – and yet even I have automated rules in my Outlook email client to immediately delete, unread, almost every “daily report” that gets emailed to me automatically (ssssh, don’t tell anyone; that’s just between you and me). Even the simple act of double-clicking to open the attachment is too much effort in comparison with the expected value of seeing the report’s contents on an average day.
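The underlying arithmetic is simple, and worth making explicit. All the numbers below are invented purely for illustration:

```python
# If a daily check costs a minute and only 1 day in 100 surfaces anything
# actionable, the expected payoff of a routine check is tiny.
cost_of_checking = 1.0        # minutes spent per check
p_actionable = 0.01           # 1 day in 100 shows something worth acting on
value_when_actionable = 30.0  # assumed benefit of acting, in minutes saved

expected_value = p_actionable * value_when_actionable  # ~0.3 minutes
print(expected_value < cost_of_checking)  # True: routine checking is a net loss
```

Under these assumptions the rational VIP stops checking – which is exactly the behaviour observed, and exactly the problem that alerting solves by moving the cost of the daily check onto the machine.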

In this sort of circumstance, what might enable a dashboard to be truly useful is the concept of alerting.

A possible use case is as follows:

  1. A sales dashboard aimed at answering the question of whether we are getting fewer sales this year is set up.
  2. Every day, alerting software routinely checks this data, and emails the VIP (only) if it shows that yes, sales have fallen. The email also provides a direct web link to the targeted sales dashboard.
  3. When the VIP receives this email, knowing that there is something “interesting” to see, they may well be concerned enough to open the dashboard and, to the best of their ability, use whatever context is available there to decide on their next action.
  4. If the information they need isn’t there, or they don’t have the time / expertise / inclination to interpret it, then of course they will legitimately request some more work from their analyst. But at least here we see that “data” provided a trigger that has alerted a relevant decision maker that they need to…make a decision, and made it easy for them to use the dashboard tool at their disposal specifically on the day that they are likely to gain value from doing so.
  5. Everyone is happy (well, except about the poor sales).
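The alerting loop in steps 1–2 above can be sketched in a few lines. Everything here is illustrative: the `sales_to_date` lookup, the figures and the dashboard URL are invented stand-ins for whatever your real data source and email system would be.

```python
# Minimal sketch of the alerting use case above. The data lookup, the
# numbers and the dashboard URL are all invented stand-ins: in a real
# system you would query your sales database and send an email.

def sales_to_date(year):
    # Hypothetical lookup - replace with a real database query.
    demo_data = {2015: 120_000, 2016: 95_000}
    return demo_data[year]

def build_alert(current_year):
    """Return an alert message only if sales have fallen vs last year."""
    this_year = sales_to_date(current_year)
    last_year = sales_to_date(current_year - 1)
    if this_year < last_year:
        return (f"Sales alert: {this_year:,} so far this year vs "
                f"{last_year:,} at this point last year. "
                "Details: http://example.com/dashboards/sales")
    return None  # nothing interesting today: don't email the VIP

alert = build_alert(2016)
if alert:
    print(alert)  # a daily scheduler would email this instead of printing
```

The key design point is the `return None` branch: on the 99 uneventful days the VIP hears nothing at all, so the one email they do receive carries real information.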

There is an implicit “so what” in the scenario above.

Main findings

  • Sales are lower than last year.
  • Last year, TV adverts produced tremendous ROI.

So what?

  • To make sure the sales keep growing, consider buying some advertising.
  • To be safest, use a method that was proven effective last year, TV.

But aren’t there some occasions that a “so what” isn’t needed?

Yes, rules of thumb have exceptions. There are some scenarios in which one might legitimately consider not producing an explicit “so what”.

Here are a few I could think of quickly.

1: Exploratory analysis: maybe you just got access to a dataset but you don’t really know what it contains, what the typical patterns and distributions are, or what its scope is. Building a few visualisations on top of it is a great way for an analyst to get a quick understanding of the potential of what they have and, later, what sort of “so what?” questions could potentially be asked.

2: Data quality testing: in a similar vein to the above, you can often use histograms, line charts and so on to get a quick idea of whether your data is complete and correct. If your viz shows that all your current customers were born in the 19th century then something is probably wrong.

3: Getting inspiration: got too much time on your hands and can’t think of some other work to do? (!!!) You could pick a dataset, or set of datasets, and spend some time graphing their attributes and looking for interesting patterns, outliers, and so on that could form the basis of interesting questions.

  • Why does x correlate with y?
  • Why does x look like a Gaussian distribution whereas y looks like a gamma distribution?
  • Why does store X sell the most of product Y?

This doesn’t have to be done on an individual basis. An interactive dataviz might be a great basis for a group brainstorming discussion, whether within a group of analysts or a far wider audience of interested parties.

4: Learning technical skills: perhaps you are trialling new analysis software or techniques, or trying to improve your existing skills. Working with data you’re already familiar with in new tools is a great way to learn them; perhaps even recreating something you did elsewhere if it’s relevant. The aim here is to increase your skillset, not derive new insights.

5: “How to” guides for others to follow: whether formal training or blog posts (showing fancy extreme edge cases others can marvel at perhaps?), maybe your emphasis is not on what the data actually contains in a subject domain sense, but rather a demonstration of how to use a certain generic analytical feature or technique. Here the data is just a placeholder to provide a practical example for others to follow.

6: You’re an artist: perhaps you’re not actually trying to use data as a tool to generate insight, but rather to create art. This is no lesser a task than classic data analysis, but it’s a very different one, with very different priorities. Think for example of Nathalie Miebach, whose website’s tagline is:

“translating science data into sculpture, installations and musical scores.”


This might be fine art, but it does not try to lead to business insight.

7: You want to focus on promoting your work and become famous :-): a controversial one perhaps; but it is not always plain old bar charts that happen to show the greatest insights that get shared around the land of Twitter, Facebook and other such attention-requesting mediums.

If your goal is generically to get “coverage” – perhaps to increase advertising revenue based on CPM or to become more well-known for your work – and you feel that you have to choose between generating a true insight and making something that looks highly attractive, then the latter might actually be a better bet.

But you should acknowledge what you’re doing; perhaps the skills you demonstrate in doing this are closer to those of the aforementioned “data artist” than “data analyst”.

I have a sneaking suspicion for instance that – not to re-raise a never-ending debate! – David McCandless’ books are probably picked up in higher volumes than Stephen Few’s when both are presented together in a bookshop.

  • McCandless’ “Information is Beautiful”, a series of pretty, sometimes fascinating infographics, many of which have little in the way of conclusions, is currently ranked #1788 in Amazon UK books.
  • Stephen Few’s “Show Me the Numbers”, a more hardcore text on best practice in presenting information, sits at #7952, with a cover consisting of very unglamorous bar and line charts.

This is not to compare one to the other in terms of worthwhileness; they are aimed at totally different audiences whose desire to have a book in the “data visualisation” category is motivated by very different reasons.

Even amongst the specialist dataviz-analyst community that has formed around Tableau, I note that around half of the visualisations Tableau picks as its public “Viz of the Day” are variations on geographical maps.

Geo-maps tend to look “fancier” and more enticing than bar charts, even though they are applicable only to analysis of a very specific type of data, and can provide only certain types of insights. For most organisations, whilst there is often relevance in geospatial analysis, I suspect that “geo-maps” analytics forms far less than 50% of total analytical output.

It’s therefore very unlikely that the winning “Viz of the Day” entries reflect how Tableau is actually used most of the time. Hence you might conclude that, if you want to be in the running for that particular honour, you should bias your work towards visualisations with the sort of attention-grabbing graphics that maps often provide, irrespective of whether another form might generate a similar or stronger “so what?” output.

8: Regulatory / reporting requirements: in some circumstances you might be bound by regulation or other authority to produce certain analytical reports, irrespective of whether you think they add value or provide insight. Think for instance of the fields of accounting, for publicly traded companies, healthcare companies, investment products and so on.

9: Your job is explicitly, literally, to “visualise data”. It’s possible to imagine, perhaps in a large business department, employing someone whose job is to repeatedly convert data, for instance, from text tables into a best-practice chart form, without going further. It would be another person’s job to derive the “so what?”.

You could think of this as a horizontal slice of the analytics pipeline vs the “beginning-to-end” vertical pipeline. After all, analysts often rely on other people with different skills (eg. IT) to do a preparatory phase of data analysis, the data provision/manipulation itself (including extract, transform and load operations). They could also rely on people to do the conclusion-forming stage.

Many companies do seem to have a de-facto version of this setup by employing people to “create reports”. By this, they may mean something akin to blindly getting up-to-date data into a certain agreed template or dashboard format that managers are supposed to use to derive decisions from.

However, unless your managers happen to be keen analysts, or your organisation is extraordinarily predictable, I tend to be concerned about the efficiency and reliability of this method for anything other than, say, the regulatory purposes mentioned above. It’s hard to imagine someone consistently gaining optimal insights from a chart they had no control over designing, without a large amount of overhead-inducing iteration between chart-creator and insight-finder. Let’s face it: most non-quant managers, if they’re honest, would prefer a bullet-point summary of findings to a 10-tab Excel workbook full of charts.

There may be many more such scenarios; do let me know in the comments!

Hang on, isn’t there a “so what” in some of the above?

Did you notice the semantic trickery in the above list of “no so what” viz reasons? In fact, most either have an implicit “so what” or simply facilitate the later creation of one.

Items 1, 2 & 3 could be considered part of the data preparation phase of the analytics pipeline. It would be unlikely (and undesirable) for their products to be the end of the analysis; almost certainly, they’re step 1 of a further analysis. An implicit “so what” here is either that the data is safe to proceed with, or that it is not.

The output of these approaches can also be useful for establishing baselines for metrics, even if this isn’t the intended use at this point. For instance, if your exploration reveals the average customer purchases £5 of products, this may be useful down the line to compare next year’s sales to. Did your later interventions improve sales or not?

Items 4 & 5 come down to being technical training for either yourself or for others. Once trained, you’re likely to be off analysing “so what?” scenarios next. If we’re looking to contrive a “so what?” here, it might be “so I am ready to put my skills to good use tackling real questions”.

Item 6 is unique. The data visualisation itself may never be useful as a “so what?” to anyone – it was never intended to be. It’s for a totally different audience, who would no more ask “so what?” of data-inspired art than they would of Da Vinci’s ‘Mona Lisa’.

Item 7 again might be considered data-use for the sake of something other than intrinsic “analysis”. This type of work might well have an explicit “so what?”, which could even be part of its allure. But since that’s not the primary reason the visualisation was created, it might not. Sometimes it could be considered a variant of #6 with a specific goal.

It may also itself be a tool that generates useful data. If viewcount is what is important to the creator, then they may be tracking pageviews on their own “so what”-enabled dashboard in order to determine what sort of output creates the most value for them.

Items 8 and 9 are mid-parts of the analytics pipeline. Although you may not be explicitly defining a “so what”, you’re enabling someone else to come up with their own later.

For better or worse, mandatory reporting regulations are there for someone’s perceived reason. A chart of fund performance is supposed to be there to help inform potential clients whether they would like to invest or not, not simply to provide a nice curved shape.

And if your job is to create “standard” reports or charts, then almost certainly someone else is completing the later step of interpreting them to form their own “so what?”. Or at least they are supposed to be.

To conclude: (why) are we valuable?

Fiddling around with data may be somewhere between Big Bang level geekery and the sexiest job of the 21st century, and holds a personal fascination for some of us. But if we want someone to employ us to do it, or to add value in some other way to the world, we should remember why data as a vocation exists. For the average data analyst, it’s not to make a series of lines that look pleasant (although it’s always nice when that happens).

To quote the viz-lord Stephen Few, in his book “Now You See It”:

We analyse information not for its own sake but so we can make informed decisions. Good decisions are based on understanding. Analysis is performed to arrive at clear, accurate, relevant, and thorough understanding.

(my emphasis added)

Outside of that book, he frequently uses the term “data sensemaking”, which is a good description of what organisations tend to want from their data analysts, even if they don’t know to phrase it in that manner. It must be stressed again: many “busy execs” are far happier with a few bullet points or alerts on potential issues than with a set of even the most beautiful, most best-practice visualisations.

When one exists within the analyst community, it can be hard to remember that not everyone enjoys “data”. Even many of those who are intrigued may not yet have had the time, privilege or education that leads towards quick, accurate interpretation of data. It can be frustrating, or even impossible, for a non-quant specialist to try and understand the real-world implication of an abstract representation of some measure: they simply don’t want to, or can’t – and shouldn’t need to – hunt for their own takeaways in many cases.

When a crime is committed, we hope a professional detective will put together the clues and provide the real world interpretation that allows us to successfully confront the criminal in court. When non-trivial data appears, we should hope that a professional analyst is on hand to put together the data-clues and provides a real world interpretation that lets us successfully confront whatever issue is at hand.

Bonus addendum: some “so whats” are worse than no “so whats”

Before we go, there is perhaps one extra risk of “so whatting” a viz that should be considered. Producing a conclusion that could lead to action tends to necessitate taking a position; essentially you move from presenting a picture to arguing for the implications of what it shows.

Much data can provide multiple distinct answers to the same question if it is manipulated enough. There are indeed lies, damned lies and statistics, and dataviz inspired variations of all three.

If the analyst approaches the “so what?” aspect with bias, then human psychology is such that they may be inclined to provide an awesome conclusion that just coincidentally happens to match their pre-analysis viewpoint; c.f. “confirmation bias”. Of course, many organisations effectively employ people, or subcontract out work, for this exact reason, but that is generally not an ethically fantastic, or professionally fulfilling, position (and whole other organisations exist to debunk such guff).

It’s pretty much impossible to provide even a basic chart without the risk of bias. Data analysis is surely part art, mixed amongst the maths and science – one can of course debate the precise split. But a data vizzer has inherently made some explicit choices: what source of data to use, how much data, which type of visualisation, which comparisons to make, the format of chart and much more – all of which can induce, consciously or not, bias to the audience.

Many best-practice “rules” of dataviz, and analytics in general, are in fact designed to reduce this risk. This is a key reason why it’s worth learning them. Outside of those memorisable basics though, it’s often interesting to try and test the opposing view to what you’re presenting as your “so what?”.

Perhaps this year you have a higher proportion of female customers than last year. So what? “Our 10 year strategy to redesign our product to be especially attractive to women has been successful, we deserve a bonus”? Well, perhaps, but what if:

  • Last year had a weirdly low proportion of female purchasers vs normal and you’re just seeing basic regression to the mean?
  • Or, for the past 9 years the proportion of women buying your product has plummeted 10% every year, only to increase 2% in the latest year. Does that make your 10-year strategy a success?
  • Or this year was the first year you advertised in Cosmo, instead of FHM. Have other changes produced a variable that confounds your results?
  • Or men have stopped buying your product whilst women continue to buy it at exactly the same rate…does that count as success?
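The first of those alternatives, regression to the mean, is easy to demonstrate with a toy simulation. In the sketch below the underlying proportion of female purchasers never changes at all, yet a “weirdly low” year is almost always followed by an apparent rise. All the numbers are invented purely for illustration.

```python
# Toy simulation of "regression to the mean": the underlying proportion of
# female purchasers is constant, yet a weirdly low observed year is usually
# followed by an apparent "improvement". All parameters are invented.
import random

random.seed(42)
TRUE_PROPORTION = 0.5   # assumed constant: no real trend at all
N_CUSTOMERS = 200       # customers sampled per simulated year

def observed_proportion():
    # Proportion of a random sample of customers who happen to be female
    return sum(random.random() < TRUE_PROPORTION
               for _ in range(N_CUSTOMERS)) / N_CUSTOMERS

low_years, rebounds = 0, 0
for _ in range(2000):
    year1, year2 = observed_proportion(), observed_proportion()
    if year1 < 0.45:          # a "weirdly low" year, purely by chance
        low_years += 1
        if year2 > year1:     # next year looks like an improvement
            rebounds += 1

print(f"{rebounds / low_years:.0%} of low years were followed by a 'rise'")
```

Despite there being nothing to improve, the vast majority of unusually low years are followed by a higher figure – exactly the pattern a naive “our strategy worked” reading would seize upon.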

The right data displayed in the right way can help you eliminate or confirm these and other possibilities.

For any decision where the benefit likely outweighs the cost, it’s worth doing the exercise of disproving your first intuition in order to provide comfort that you are supporting the best quality of decision making; not to mention reducing the risk that some joker with half a spreadsheet invalidates your finely crafted interpretation of your charts.

Beware! Killer robots swim among us

In a further sign of humanity’s inevitable journey towards dystopia, live trials of an autonomous sea-based killer robot made the news recently. If all goes well, it could be released into the wild within a couple of months.

Here’s a picture. Notice its cute little foldy-out arm at the bottom, which happens to contain the necessary ingredients to provide a lethal injection to its prey.



Luckily for us, this is the COTSbot, which, in a backwards version of nominative determinism, has a type of starfish called “Crown Of Thorns Starfish” as its sole target.


The issue with this type of starfish is that they have got a bit out of hand around the Great Barrier Reef. Apparently, at a certain population level they live in happy synergy with the reef. But when the population increases to the size it is today (quite possibly due to human farming techniques), they start causing a lot of damage to the reef.

Hence the Australian Government wants rid of them. It’s a bit fiddly to have divers perform the necessary operation, so some Queensland University of Technology roboticists have developed a killer robot.

The notable feature of the COTSbot is that it may (??) be the first robot that autonomously decides whether it should kill a lifeform or not.

It drives itself around the reef for up to eight hours per session, using its computer vision and a plethora of processing and data science techniques to look for the correct starfish, wherever they may be hiding, and perform a lethal injection into them. No human is needed to make the kill / don’t-kill decision.

Want to see what it looks like in practice? Check out the heads-up-display:


If that looks kind of familiar to you, perhaps you’re remembering this?


Although that one is based on technology from the year 2029 and is part of a machine that looks more like this.


(Don’t panic, this one probably won’t be around for a good 13 years yet – well, bar the time-travel side of things.)

Back to present day: in fact, for the non-squeamish, you can watch a video of the COTS-destroyer in action below.

How does it work then?

A paper by Dayoub et al., presented at the IEEE/RSJ International Conference on Intelligent Robots and Systems, explains the approach.

Firstly it should be noted that the challenge of recognising these starfish is considerable. The paper informs us that, whilst COTS look like starfish when laid out on flat terrain, they tend to wrap themselves around or hide in coral – so it’s not as simple as looking for nice star shapes. Furthermore, they vary in colour, look different depending on how deep they are, and have thorns that can have the same sort of visual texture as the coral they live in (go evolution). The researchers therefore attempt to assess the features of the COTS via various clever techniques detailed in the paper.

Once the features have been extracted, a random forest classifier, which has been trained on thousands of photos known to show either a starfish or no starfish, is used to determine whether what the robot can see through its camera should be exterminated or not.

A random forest classifier is a popular data science classification technique, essentially being an aggregation of decision trees.

Decision trees are one of the more human-understandable classification techniques. Simplistically, you could imagine a single tree as providing branches to follow dependent on certain variables, which it automatically machine-learns from having previously processed a stack of inputs that it has been told are either one thing (a starfish) or another thing (not a starfish).

Behind the scenes, an overly simple version of a tree (with slight overtones of doomsday added for dramatic effect) might have a form similar to this:


The random forest classifier takes a new image and runs many different decision trees over it – each tree has been trained independently and hence is likely to have established different rules, and potentially therefore make different decisions. The “forest” then looks at the decision from each of its trees, and, in a fit of machine-learning democracy, takes the most popular decision as the final outcome.
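The voting mechanism can be illustrated with a deliberately trivial sketch. These one-rule “trees” and their made-up features (`arm_count`, `colour_score`, `texture_score`) are nothing like the real trees learned in the paper – the point is only to show how a forest aggregates individual decisions by majority vote.

```python
# Toy illustration of the majority-vote idea behind a random forest.
# The "trees" are trivial hand-written one-rule classifiers over invented
# features; a real forest learns its rules from labelled training data.
from collections import Counter

def tree_a(features):
    return "starfish" if features["arm_count"] >= 5 else "not starfish"

def tree_b(features):
    return "starfish" if features["colour_score"] > 0.6 else "not starfish"

def tree_c(features):
    return "starfish" if features["texture_score"] > 0.5 else "not starfish"

def forest_predict(features, trees):
    """Each tree votes independently; the most popular decision wins."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

candidate = {"arm_count": 14, "colour_score": 0.7, "texture_score": 0.3}
decision = forest_predict(candidate, [tree_a, tree_b, tree_c])
print(decision)  # two of the three trees vote "starfish", so it wins 2-1
```

Because each real tree is trained on a different random subset of the data, their individual errors tend not to coincide, which is why the aggregated vote is usually more accurate than any single tree.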

The researchers claim to have approached 99.9% accuracy in this detection – to the point where it will even refuse to go after 3D-printed COTS, preferring the product that nature provides.

Although probably not the type of killer robot that the Campaign to Stop Killer Robots campaigns against, or the UN debates the implications of; if it is the first autonomous killer robot it still can conjure up the beginnings of some ethical dilemmas (even outside that of killing the starfish…after all, deliberate eradication/introduction of species to prevent other problems has not always gone well even in the pre-robotic stage of history – but one assumes this has been considered in depth before we got to this point!).

Although 99.9% accuracy is highly impressive, it’s not 100%. It’s very unlikely that any non-trivial classification model can ever truly claim 100% over the vast range of complex scenarios that the real world presents. Data-based classifications, predictions and so on are almost always a compromise between concepts like precision vs recall, sensitivity vs specificity, type 1 vs type 2 errors, accuracy vs power, and whatever other names no doubt exist to refer to the general concept that a decision model may:

  • Identify something that is not a COTS as a COTS (and try to kill it)
  • Identify a real COTS as not being a COTS (and leave it alone to plunder the reef)
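These two error types are commonly summarised as precision (of the things targeted, how many were really COTS) and recall (of the real COTS, how many were found). A quick sketch, using an entirely invented confusion matrix – the numbers are not from the paper:

```python
# Sketch of the two error types, using an invented confusion matrix for a
# hypothetical batch of 10,000 classified images (numbers are made up and
# not taken from the Dayoub et al. paper).
true_positives  = 980    # real COTS, correctly targeted
false_negatives = 20     # real COTS left alone to plunder the reef
false_positives = 10     # innocent coral/other wrongly flagged as COTS
true_negatives  = 8990   # everything else, correctly ignored

precision = true_positives / (true_positives + false_positives)
recall    = true_positives / (true_positives + false_negatives)

print(f"precision: {precision:.1%}")  # of things targeted, share truly COTS
print(f"recall:    {recall:.1%}")     # of real COTS, share actually found
```

Tuning a model to raise one of these figures generally lowers the other, which is exactly the trade-off the researchers faced.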

Deciding on the acceptable balance between these two types of error is an important part of designing models. Without actually knowing the details, it sounds like the researchers sensibly erred on the side of caution, such that if the robot isn’t very sure it will send a photo to a human and await a decision.

It’s also the case that the intention is not to have the robot kill every single COTS, which suggests that false negatives might be less damaging than false positives. One should also note that it’s not going to be connected to the internet, making it hard for the average hacker to remotely take it over and go on a tourist-injection mission or similar.

However, given that it’s envisaged one day a fleet of 100 COTSbots, each armed with 200 lethal shots, might crawl the reef for 8 hours per session, it’s very possible a wrong decision will be made at some point.

Happily, it’s unlikely to accidentally classify a human as a starfish and inject it with poison (plus, although I’m too lazy to look it up, I imagine that a starfish dose of starfish poison is not enough to kill a human) – the risk the researchers see is more that the injection needle may be damaged if the COTSbot tries to inject a bit of coral.

Nonetheless, a precedent may have been set for a fleet of autonomous killer robot drones. If it works out well, perhaps it starts moving the needle slightly towards the world of the handily-acronymed “Lethal Autonomous Weapons Systems” that the US Defense Advanced Research Projects Agency is supposedly working on today.

If that fills you with unpleasant stress, there’s no need to worry for the moment. Take a moment of light relief and watch this video of how good the 2015 entrants to the DARPA robotics challenge were at stumbling back from the local student bar traversing human terrain.

The Sun and its dangerous misuse of statistics

Here’s the (pretty abhorrent) front cover of yesterday’s Sun newspaper.


Bearing in mind that several recent terrorist atrocities are top of everyone’s mind at the moment, it’s clear what the Sun is implying here.

The text on the front page is even more overt:

Nearly one in five British Muslims have some sympathy with those who have fled the UK to fight for IS in Syria.

The implication is obviously a starkly ominous claim that 20% of Britain’s Muslims support the sort of sick action that IS have claimed responsibility for in Paris and other places that have experienced horrific, crazed, assaults in recent times.

This is pushed even more fervently by the choice to illustrate the article solely with a photo of Jihadi John, an inexcusably evil IS follower “famous” for carrying out sick, cruel, awful beheadings on various videos.

Given the fact that – most thankfully – there are far, far fewer people featuring on videos of beheadings than there are people travelling to Syria to join fighters, this is clearly not a representative or randomly chosen photo.

Sun writer Anila Baig elaborates in a later column:

It beggars belief that there are such high numbers saying that they agree with what these scumbags are doing in Syria. We’ve all seen the pictures of how innocent Yazidi girls are treated, how homosexuals are thrown off tall buildings. It’s utterly abhorrent.

The behaviours she describes are undoubtedly beyond abhorrent.

But of course, this 1 in 5 statistic isn’t true – even in the context of the small amount of data that they have used to support this story.

It is however a very dangerous claim to make in the current climate, where “Islamophobia” and other out-group prejudices make the lives of some people that follow Islam – or even look like a stereotype of someone that might – criminally unpleasant. Britain does have statutes that cover the topic of hate speech after all, and hate speech can come from many quarters.

Anyway, enough of the rants (kind of) and onto the statistics.

There are three main points I describe below, which can be summarised as:

  • The survey this headline is based on did not ask the question that the paper implies.
  • The paper deliberately misleads its readers by failing to give easily-available statistical context to the figures.
  • The methodology used to select respondents for the survey was anyway so inadequate that it’s not possible to tell how representative the results are.



The Sun’s claim that 1 in 5 British Muslims have sympathy for Jihadis (the headline) and those who fight for IS (in main text) comes from a survey they commissioned, which was executed by professional polling company Survation. You can read a summarised version of the quantitative results here.

The “20% sympathise with Isis” claim and its implications are based on responses to question 6 (page 9 of the PDF results above), which asked people to say which of a set of sentences they agreed with most. The sentences were of the form “I have [a lot of/some/no] sympathy with young Muslims who leave the UK to join fighters in Syria”.

Results were as follows; the Sun presumably added up the first two boxes to get to their 20% (5.3% + 14.5% = 19.8%), which isn’t a bad approximation. Note though that an equally legitimate headline would have been “95% of Muslim respondents do not have a lot of sympathy for jihadis”.

  • “I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria”: 5.3%
  • “I have some sympathy with young Muslims who leave the UK to join fighters in Syria”: 14.5%
  • “I have no sympathy with young Muslims who leave the UK to join fighters in Syria”: 71.4%
  • “Don’t know”: 8.8%

However, compare the claim that they have sympathy with those ‘who have fled the UK to fight for IS’ and ‘they agree with what these scumbags are doing…homosexuals thrown off tall buildings’ (even ignoring the implications of supporting the sort of criminal mass murder seen in Paris) with the question actually asked.

There was no mention of IS or any particular act of terrorism or crime against human rights in the question whatsoever.

The question asks about joining fighters in Syria. Wikipedia has a list of the armed groups involved in the Syrian war. At the time of writing, they have been segmented into 4 groups: the Syrian Arab Republic and allies; the Syrian Opposition + al-Qaeda network and allies; the Kurdish self-administration and allies; and ISIL (aka IS) and allies.

There are perhaps in the order of 200 sub-groups (including supporters, divisions, allies etc.) within those divisions, of which the huge majority are not affiliated with IS. Even the UK is in the “non-lethal” list, having donated a few million to the Syrian opposition groups.

To be fair, the question did ask about joining fighters rather than the c. 11 non-lethal groups. But we should note that – as well as the highly examined stream of people apparently mesmerised by evildoers to the extent of travelling to fight with them – there was also a Channel 4 documentary a few months ago showing a different example of this. In it, we saw 3 former British soldiers who had decided to travel to Syria and join fighters – the ones who fight against IS. I do not know what religion, if any, those 3 soldiers followed – but is it possible someone might feel a little sympathy towards the likes of them?

It is not necessarily a good thing for someone to be travelling abroad to join any of these groups with a view to violent conflict, and I am convinced that some people do travel to join the most abhorrent of groups.

But, the point is that, despite what the Sun wrote, the question did not mention IS or any of their evil tactics, and could have in theory suggested some sort of allegiance to very many other military-esque groups.

The question only asks whether the respondent has sympathy for these young Muslims who travel abroad.

To have sympathy for someone does not mean that you agree with the aims or tactics of the people they are persuaded to go and meet.

Google tells us that the definition of sympathy is:

feelings of pity and sorrow for someone else’s misfortune.

One can quite easily imagine a situation where, even if you believe these people are travelling to Syria specifically to become human weapons trying to mass-target innocent victims, you can still have some sympathy for the young person involved.

It seems plausible to have some sympathy for a person that has been brainwashed, misguided, preyed on by evildoers and feels that they have such quality of life that the best option for their future is to go and join a group of fighters in a very troubled country. Their decisions may be absurd, perhaps they may even end up involved in some sick, awful, criminal act for which no excuses could possibly apply – but you could have some sympathy for a person being recruited to a twisted and deadly cause, whilst virulently disagreeing with the ideology and actions of a criminal group that tries to attract them.

And, guess what, the Sun handily left out some statistics that might suggest that is some of what is happening.

For every such survey that concentrates on the responses of a particular population, it’s always important to measure the base rate, or a control group rate. Otherwise, how do you know whether the population you are concentrating on is different from any other population? It’s very rare that any number is meaningful without some comparative context.

As it happens, a few months ago, the same survey group carried out a poll on behalf of Sky News that asked the same question of non-Muslims. The results can be seen here, on page 8, question 5, and are reproduced below.

As the Sun didn’t bother providing these “control group” figures, we’re left to assume that no non-Muslim could ever “show sympathy” to the young Muslims leaving the UK to join fighters. But…

Response % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria 4.3%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria 9.4%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria 76.8%
Don’t know 9.6%

So, 14% of non-Muslims respond that they have a lot or some sympathy with this group of travellers. Or as the Sun might headline it: “1 in 7 Brit non-Muslims sympathy for jihadis” (just below the same picture of a lady in a bikini, obviously).

14% is less than 20% of course – but without showing the base rate the reader is left to assume that 20% is “20% compared to zero” which is not the case.

Furthermore in some segments of the surveyed population, the sympathy rates in non-Muslims are higher than in Muslims.

The Sun notes:

The number among young Muslims aged 18-34 is even higher at one in four.

Here are the relevant figures for that age segment from a) the poll taken to support the Sun’s story, and b) the one that asked the same question of non-Muslims.

Response Muslims aged 18-34 Non-Muslims aged 18-34
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria 6.9% 10.9%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria 17.6% 19.2%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria 66.2% 52.2%
Don’t know 9.3% 17.7%

So-called “Jihadi sympathisers” aged 18-34 make up a total of 24.5% of Muslims, and 30.1% of non-Muslims.

Another headline, dear Sun writers: “1 in 3 Brit non-Muslims youth sympathy for jihadis thankfully moderated by less sympathetic Muslim youth?”

A similar phenomenon can be seen when the results are broken down by non-Islamic religions. Although some of these figures are masked due to small samples, one can infer from the non-Muslim survey that more than 20% of the respondents who classified themselves as belonging to some religion other than Christian, Muslim or “not religious” were at least somewhat sympathetic to these young Muslims who go to Syria.

As a final cross-survey note, the same question was also put to a Muslim population earlier in the year, in March, again by Survation on behalf of Sky News. Here are the results of that one:

Response % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria 7.8%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria 20.1%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria 61.0%
Don’t know 11.1%

Totting up the support for having some/a lot of sympathy for the young Muslims leaving the UK for Syria, we see that the proportion showing any form of sympathy fell from 27.9% in March to 19.8% now in November.
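The arithmetic behind that comparison is simple enough to sketch – a throwaway Python check using the poll totals quoted above:

```python
# "A lot" + "some" sympathy in the March poll, vs the November total
march = 7.8 + 20.1      # March 2015 Survation/Sky News poll of British Muslims
november = 19.8         # November poll (the one behind the Sun's story)

relative_fall = (march - november) / march * 100
print(round(march, 1), round(relative_fall, 1))  # 27.9 29.0
```

A relative fall of roughly 29% – near enough the “30%” a tabloid headline would round it to.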

That’s a relatively sizeable fall, again not mentioned in the Sun’s article (because that would spoil the conclusion they’re trying to lead readers to). Here’s another headline I’ll donate to the writers: ‘Dramatic 30% fall in Brit Muslims sympathy for jihadis’.

Next up, time to talk about the method behind the latest survey.

Regular readers of the Sun might have noticed that the formal surveys it commissions are normally carried out by YouGov. Why was it Survation this time? The Guardian reports that YouGov declined the work because

it could not be confident that it could accurately represent the British Muslim population within the timeframe and budget set by the paper.

So rather than up the timeframe or the budget, the Sun went elsewhere, to someone who would do it cheaper and quicker. Survation apparently complied.

Given that most surveys cannot put questions to every single person alive, there’s a complex science behind selecting who should be asked and how to process the results so that they are representative of what is claimed.

Here’s the World Bank on how to pick respondents for a household survey. Here’s an article from the Research Methods Knowledge Base on selecting a survey method. There is far more research on polling best practice besides, and this is one reason why engaging professional pollsters is often a necessity if you want even vaguely accurate results, especially on topics this controversial.

However, none of the research I’ve seen has ever suggested that one should pick out a list of people whose surnames “sound Muslim” and ask them the questions. Incredibly, that is apparently the shortcut Survation used, given they didn’t have the time or money to do the detailed and intricate work necessary to generate a statistically representative sample of all British Muslims.

It might be that Survation did happen to choose a perfectly representative sample, but the problem is that we just do not know. After talking to other professional pollsters, the Guardian noted:

It cannot be determined how representative the Survation sample is because of a lack of various socioeconomic and demographic details.

Even if the rest of the story had been legit, then – being generous with the free headlines – the Sun would have been more accurate to write “1 in 5 out of the 1500 people we had the time to ring that had “Muslim sounding surnames” and did in fact agree that they were Muslim’s sympathy for jihadis“. But it’s a bit less catchy and even the non-pros might notice something methodologically curious there.

So, to summarise – the Sun article, which seems to me to tread dangerously near breaking various legislation on hate speech in principle if not practice, is misleadingly reporting on the results of a questionnaire:

  • that did not even ask the question that would be appropriate to make the claims it is headlining.
  • without providing the necessary comparative or historical perspective to make the results in any way meaningful.
  • that was carried out in inadequate, uncontrolled fashion with no view to producing reliable, generalisable results.

We have to acknowledge that, on the whole, people going away to join fighter groups is a bad, sad event and one that the world needs to pay attention to. Infinitely moreso, obviously, if they are travelling to commit atrocities with groups that can be fairly described as evil.

But for the Sun to imply such dramatic and harmful allegations about a section of the British population against whom prejudice is already very apparent (note the 300% increase in UK anti-Muslim hate crime last week) to its huge readership – who will now struggle to entirely forget the implication during their dinner-table conversations even if they wanted to – is not only extremely poor quality data analysis, but also downright irresponsible and even dangerous.

The persuasiveness of dataviz

The intrinsic power of the chart is highlighted nicely in a recent Harvard Business Review post.

In an experiment (*), Aner Tal et al. had a couple of groups read about a new medication that supposedly reduced the incidence of illness by 40%. This was clearly stated in the text the readers were given.

The only difference between the two groups was that one of them read a document that had a very simple bar chart below the text. The chart showed exactly the same claim: that the incidence of illness went down 40% with this new medication.


When they tried to measure it, the presence of this chart didn’t seem to increase the understanding or the information retention of the people viewing it, in comparison to the other group.

However, it did make a difference to what the readers believed.

97% of those who had seen the chart believed the medication would reduce illness, vs just 68% of those who had read (and understood) exactly the same information in purely text form.

This is clearly irrational, but shows the impact dataviz can have – even when it technically delivers no new information and does not make the existing information easier to understand.

But don’t think rationality will save you! The researchers found that the people who agreed most strongly with “I believe in science” (a strange thing to ask, to be fair, but presumably at least somewhat positively associated with those who consider themselves most rational) saw the largest boost in belief in the medication post-chart. They hypothesise some sort of “I believe in science -> charts mean science -> I believe in whatever the chart says” halo phenomenon.

I then wondered what would happen if one popped an entirely irrelevant chart under the article. Would that work as well? Not even delving into the wonderful world of how to lie with charts, but simply present a chart on an entirely different subject. Perhaps one from one of the hilarious faux-science hair product adverts would be a good choice (85% neutrilliam! 5.32948% aquaspheres!).

It didn’t seem like the researchers tried that one in the above study, but there is a similar-ish study from 2012. No charts involved, sadly, but it looked at the impact of putting a totally irrelevant mathematical equation on a document. That they called it “The Nonsense Math Effect” gives a clue as to the results.

Here they had groups of highly educated post-grads look at abstracts of two (real) journal papers, one on anthropology and one on sociology. They then had to rate their perception of the quality of the research on a scale from 1 to 100.

Both groups read the same abstracts, except one group would see the following sentence added on to the end of the abstract.

A mathematical model (TPP = T0 − fT0d2f −fTP d2f ) is developed to describe sequential effects.

This sentence was taken from a totally different paper, which concerned reaction times. There was no relationship between this sentence and the two abstracts the readers were given. In fact:

none of the original abstracts mention any sequential effects or anything that the symbols in the equation could reasonably correspond to

Can you guess what happened?

Of course, the net effect was that the group that read the abstracts with this meaningless sentence pasted on at the end rated the quality of the research significantly higher than those that didn’t (**). The research was indeed more highly regarded if a string of contextually meaningless characters that look a bit like complicated maths was written below it.

Remember, datavizzers, with great power comes great responsibility. Be sure to never abuse your profession.



(*) It’s not listed in the article, but I believe the published article they refer to is this one, although you’ll need a subscription to the “Public Understanding of Science” journal to get to the full paper.

(**) When broken down, there was one group of readers who didn’t fall into that trap: those who were experts in maths, science and technology (and those who studied medicine were perhaps not statistically significantly different).  Most of the world doesn’t hold post-graduate degrees in mathematics though.


More data is not always better data

Like a lot of data-fans, I have something of a tendency to “collect” data just in case it will become useful one day. Vendors are feeding that addiction with constant talk of swimming through blissful “data lakes” and related tools, notably Hadoop and its brethren.

Furthermore, as the production of data grows exponentially, the cost of storing it moves ever closer to zero. As an example, Amazon will store whatever you want for $0.007 per GB per month in its Glacier product, if you don’t need to retrieve it very often or very fast.

That is to say: an amount of data that, within my relatively short memory, would have required 728 high-density floppy disks to save, you can now store for about a British ha’penny a month. That is a coin so irrelevantly valueless in the grand scheme of buying things these days that it was withdrawn from circulation over 30 years ago, long before my memory begins.
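For the sceptical, the floppy arithmetic checks out – a throwaway calculation, assuming 1,474,560 bytes on a 1.44 MB high-density disk and a binary gigabyte:

```python
# How many 1.44 MB high-density floppies does one gigabyte need?
GiB = 1024 ** 3            # bytes in a (binary) gigabyte
floppy = 1_474_560         # bytes on a high-density 3.5" floppy
print(GiB / floppy)        # ≈ 728.2 disks
```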

But just because we can store all these 1s and 0s, should we? Below I have started to consider some downsides of data hoarding.

The too-obvious risks: large amounts of data expose you to more hacking and privacy risks

There are two huge risks, so obvious that I will pretty much skip over them. There is the risk of a hack, with the potential for some malevolent being to end up with your personal details. Just last week we heard the news that over 150,000 people’s personal details were accessed illegitimately from the computer systems of a mobile phone provider, including nearly 16k bank account details. A couple of years ago, Target famously had up to 70 million of its customers details stolen. If this data hadn’t been stored, the hack obviously couldn’t have taken place.

Often related to such behaviour, there’s also the issue of accidental privacy breaches. These might be simple mistakes – like the time the UK government accidentally lost a couple of CDs holding personal data on every child in the UK, for instance – but the consequences may be not dissimilar to a hack if the data gets into the wrong hands.

Less dramatically, even in that most curated of safe software gardens, Apple’s app store, 256 apps were removed last month as it was found they were collecting and transmitting personal data that they should not have been.

These examples are not meant to imply there was something innately wrong in storing these particular pieces of data. Maybe Target needs to store those customer details to run its operations in the way it finds most productive, and the Government probably needs to store information on children (although perhaps not in the sorting offices of Royal Mail). However, every time data is stored that doesn’t yet have a use, one should bear in mind there is a non-zero risk it may leak.

Hacking and privacy breaches are easy and obvious concepts though. More interesting here are the other downsides of storing too much data that do not require the efforts of an evildoer to produce adverse consequences. What else might we consider?

Below I list a few. I’m sure I’ve missed many more. An important consideration in some of these is that the data itself may not generate the majority of the risk, but rather enables something negative which is later done with it. Methodological issues, though, are something even the most data-addicted analyst can address in their own work.

Large amounts of data may encourage bad analysis

A lot of data analysis is targeted at understanding associations with, or causes of, a target variable based on a set of input variables. Surely the more input variables we have, the better the models we can produce and the better our model will reflect the world?

Well, it depends on how you measure it. Mathematically, yes, more variables do tend to lead to a model with a better fit to the data that was used to train it.

And guess what, we now have huge lakes full of crazy amounts of varied data and computer algorithms that can process it. If you want to predict your future sales, why not sack your data scientists and just have your supercomputer run your past sales data against every one of these billions of available variables until it deduces a model that predicts it perfectly?

This doesn’t work, for a number of reasons – not least that, when it comes to generating genuinely useful insights to support optimal decision making, using too many variables tends to lead to the curse of overfitting.

This is where you involve so many variables as predictors that your model becomes too specific to the precise historical data you trained it on, and therefore misleads you as to the real drivers behind your target variable.

Wikipedia has a fun example:

As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It’s easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes; but this model will not generalize at all to new data, because those past times will never occur again.

In other words, if you give me your past sales data at a customer/transaction level, then I can build you a model that will perfectly “predict” that past data. It would look something like:


Wow, look at my R-squared of 1. This is an unbeatable model. Hooray!

At least until we actually want to use it to predict what will happen tomorrow when a new customer comes with a new name and a new date, at which point…catastrophe.

Although it’s a silly example, the point is that the problem came from the fact we exposed the model to too much data. Is it really relevant that someone’s name is Jane Doe? Did we need to store that for the purposes of the model? No – it actually made the resulting analysis worse.
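To make the silliness concrete, here’s a minimal sketch of that memorise-everything “model” – the customers, timestamps and amounts are all invented for illustration:

```python
# A "model" that memorises its training data: perfect in hindsight, useless tomorrow.
train = {
    ("Jane Doe", "2015-11-20 09:14"): 34.99,
    ("John Roe", "2015-11-20 11:02"): 12.50,
}

def predict(customer, timestamp):
    # Look the answer up if we've seen this exact row before; otherwise, nothing.
    return train.get((customer, timestamp))

# Perfect "predictions" on the data it was trained on (R-squared of 1!)...
for (customer, timestamp), spend in train.items():
    assert predict(customer, timestamp) == spend

# ...catastrophe on a new customer and a new date.
print(predict("New Customer", "2015-11-21 10:00"))  # None
```

The fit on the training set is flawless precisely because the model has learned nothing generalisable at all.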

Simplifying to a univariate situation, the “correlation doesn’t imply causation” trope remains eternally true.

First, there’s the issue of how we measure correlation. A classic for linear models is Pearson’s ‘r’.

This gives you a single value between -1 and 1, where 1 is perfect positive correlation, 0 is no correlation and -1 is perfect negative correlation. You will often see results presented as “our analysis shows that X and Y are correlated with r of 0.8”. Sounds good, but what does this tell us?

Anscombe’s Quartet, possibly my favourite ever set of charts (doesn’t everyone have one?), shows us that it doesn’t tell us all that much.

All these charts have an “r” of around 0.8, and they all have the same mathematical linear regression trendline.

Anscombe's quartet

(thanks Wikipedia)

But do you believe that the linear correlation is appropriate for all four cases? Is the blue line a fair reflection of the reality of the trend in all datasets?

Anscombe’s Quartet is not actually about using too much data, but rather relying on statistical summarisation too much. It’s an example of where data visualisation is key to showing validity of a potentially computer-generated equation.
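You can verify those identical summaries yourself. Here’s a small, stdlib-only sketch using the quartet’s published values:

```python
from statistics import mean

# Anscombe's quartet: four datasets with near-identical summary statistics
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    # Pearson's r: covariance normalised by the two standard deviations
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5

for name, (xs, ys) in quartet.items():
    print(name, round(mean(xs), 1), round(mean(ys), 1), round(pearson_r(xs, ys), 2))
# Each row: mean x = 9, mean y = 7.5, r = 0.82 -- yet the scatterplots differ wildly
```

Plot the four datasets, though, and the “identical” statistics fall apart before your eyes.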

But visualisation alone isn’t sufficient either, when we are using too much data.

Let’s imagine we’re a public service that needs to come up with a way to predict deaths by drowning. So we throw all the data we could possibly obtain at it and see what sticks.

Here’s one positive result from the fabulous Spurious Correlations site.

Spurious correlations

OMG! The data shows Nicolas Cage causes drownings! Ban Nicolas Cage!

A quick dose of common sense suggests that Nicolas Cage films are probably not all versions of the videotape seen in The Ring, which kills you shortly after viewing it.

Nicolas Cage films and drowning statistics may mathematically fit, but, if we are taking our work seriously, there was little expected value in storing and using this data for this purpose. Collecting and analysing too much data led us to a (probably 🙂 ) incorrect theory.

Part of the issue is that predictive analytics is not even as simple as looking for a needle in a haystack. Instead we’re looking for a particular bit of hay in a haystack, which may look the same as any other bit of hay.

Most variables are represented as a sequence of numbers and, whether by coincidence or cross correlation, one meaningless sequence of numbers might look very like one very meaningful sequence of numbers in a particular sample being tested.

It doesn’t help that there are several legitimate statistical tests that arguably lose an aspect of their usefulness in a world where we have big data.

Back in the day, perhaps we could test 20 people’s reaction to a photo by getting them into a lab and paying them for their opinion. We would use inferential statistics to work out whether their opinions were meaningful enough to be extrapolated to a larger population – what can we infer from our 20-person results?

Now we can in theory test 1 billion people’s reaction to a photo by having Facebook perform an experiment (although that might not go down too well with some people). All other things being equal, we can infer a whole lot more from testing 1 billion people than testing 20.

There are therefore many hypothesis tests designed to check whether what looks like a difference between two or more groups of “things” – for instance, the average score out of 10 that women give a photo vs the average score that men do – is indeed a real difference or just down to random chance. Classics include Student’s t-test, ANOVA, Mann-Whitney and many more.

The output of these often gives a measure of the probability that the difference seen is “real” rather than due to random fluctuations or noise. After all, most variables measured in real life show natural variation – just because a particular woman is taller than a particular man, we should not infer that all women are taller than all men. It could just be that you didn’t test enough women and men to get an idea of the reality of the situation, given the natural variance in human height.

This output is often expressed as a “p value”. Strictly, this is the probability of seeing a difference at least as large as the one observed if, in reality, only chance were at work.

P = 0.05 is a common benchmark. Contrary to a common misreading, it does not mean there is a 95% probability that your hypothesis is true; it means that, if there were no real effect, a result this extreme would turn up by chance only 5% of the time.

Sounds good, but there are at least two issues to be wary of now you may be running these tests over vast numbers of rows or columns of data.

Hypothesis tests over a lot of rows (subjects)

Looking at the formula for these tests, you will see that the actual difference between groups necessary to meet that benchmark for “this is real” gets lower as your sample size gets bigger. Perhaps the test will tell you that a difference of 1 IQ point is significant when you’ve tested a hypothesis over a million subjects in your data, whereas it wouldn’t have done so if you had only tested it over 100.

So is the test wrong in either case? Should we artificially limit how many subjects we put into a test? No, of course not (unless there’s another good reason to do so), but some common sense is needed when interpreting it in a practical sense.

In medical circles, as well as statistical significance, there’s a concept of clinical significance. OK, we might have established that things that do X are more likely to do Y, but does that matter? Is that 1 IQ point difference actually something we care about or not? If your advert produces a £0.01 increase in average basket value, however “significant”, do you care?

Maybe you do, if you have a billion baskets a day. Maybe you don’t if you have 10. These are not questions you can answer in the abstract – but should be considered on a case by case basis by a domain expert.

Just because you had enough data available to detect some sort of effect does not mean that you should rush to your boss as soon as you see a p=0.05 result.
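As a rough illustration of how sample size drives “significance”, here’s a back-of-envelope z-statistic for a two-group comparison of means – assuming the hypothetical 1-IQ-point difference and the conventional IQ standard deviation of 15:

```python
import math

# The same 1-IQ-point difference, tested at two very different sample sizes.
def z_statistic(diff, sd, n_per_group):
    se = sd * math.sqrt(2 / n_per_group)   # standard error of a difference in means
    return diff / se

print(round(z_statistic(1, 15, 100), 2))        # ≈ 0.47: nowhere near the ~1.96 needed for p < 0.05
print(round(z_statistic(1, 15, 1_000_000), 2))  # ≈ 47.14: overwhelmingly "significant"
```

The effect is identical in both cases; only the sample size has changed. Whether a 1-point difference matters is a clinical question, not a statistical one.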

Hypothesis tests over a lot of columns (variables)

We saw above that a common standard is that we can claim our hypothesis is true when we are 95% certain it wasn’t due to chance.

To switch this around: we are happy to claim something is true even if there’s a 5% probability it was due to chance.

That seems fairly safe as a one-off, but imagine we’re testing 20 variables in this experiment, and apply the same rule: we’re satisfied if any one of them meets this criterion.

There’s a 5% chance you come back with the wrong answer on your first variable.
Then there’s 5% chance you come back with the wrong answer on your second variable.
…and so on, until there’s a 5% chance you come back with the wrong answer on your twentieth variable.

One can calculate that after 20 tests there is a 64% chance of a variable being shown to be significant, even if we know in advance for sure that none of them are.
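That 64% figure comes straight from the complement rule – a one-liner, assuming the tests are independent and each run at a 5% false positive rate:

```python
# Chance of at least one false positive across m independent tests at alpha = 0.05
alpha = 0.05
for m in (1, 20, 100):
    print(m, round(1 - (1 - alpha) ** m, 2))
# 1 -> 0.05, 20 -> 0.64, 100 -> 0.99
```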

Imagine if we had gone back to the idea of throwing a massive datalake’s worth of variables against two groups to try and find out, out of “everything we know about”, what explains the differences in these groups? A computer could quite happily throw a million variables against the wall in such an exercise.

If each test is allowed to produce a false positive 5% of the time, and you have a million such tests, you will almost certainly get a bunch of results that – had they been conducted as single, well-constructed hypothesis tests – might have been considered statistically significant. In this scenario, though, they are often predictable-in-existence but random-in-placement false indicators.

This sort of behaviour is what people refer to as “p-hacking”, and it is seemingly somewhat rife within the academic literature too, where journals prefer to publish positive results (“X did cause Y”) rather than negative ones (“X did not cause Y”), even though both are often equally useful insights.

An article in Nature reports on a real-world example of this:

…he…demonstrated that creative p-hacking, carried out over one “beer-fueled” weekend, could be used to ‘prove’ that eating chocolate leads to weight loss, reduced cholesterol levels and improved well-being. They gathered 18 different measurements — including weight, blood protein levels and sleep quality — on 15 people, a handful of whom had eaten some extra chocolate for a few weeks. With that many comparisons, the odds were better than 50–50 that at least one of them would look statistically significant just by chance. As it turns out, three of them did — and the team cherry-picked only those to report.

(Unless of course chocolate does lead to weightloss, which would be pretty cool.)

The same article also refers to a similar phenomenon as “Texas sharpshooting”, being based on a similarity to “an inept marksman who fires a random pattern of bullets at the side of a barn, draws a target around the biggest clump of bullet holes, and points proudly at his success.”

It’s quite possible to p-hack in a world where big data doesn’t exist. However, the relevance here is that one needs to avoid the temptation of collecting, storing and throwing a whole bunch of “miscellaneous” data at a problem when looking for correlations, models and so on, and then reporting that you found some genuine insight whenever a certain statistical test happens to reach a certain number.

There are explicit procedures and statistical tests, also from the field of statistical inference, designed to alert you when you’re at risk of this. These fall under the category of controlling the “familywise error rate” of a set of hypotheses, with perhaps the most famous being the Bonferroni correction.
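A minimal sketch of the Bonferroni correction – the p-values here are invented purely for illustration:

```python
# Bonferroni: to hold the familywise error rate at 0.05 across m tests,
# each individual p-value must beat 0.05 / m.
alpha = 0.05
p_values = [0.001, 0.02, 0.04, 0.3]          # hypothetical results from 4 tests
m = len(p_values)

naive = [p for p in p_values if p < alpha]          # three look "significant"
corrected = [p for p in p_values if p < alpha / m]  # only 0.001 clears 0.05 / 4 = 0.0125
print(naive, corrected)  # [0.001, 0.02, 0.04] [0.001]
```

The cost of this strictness is reduced power, which is why milder alternatives (such as false-discovery-rate methods) also exist – but the principle is the same: the more tests you run, the higher the bar each must clear.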

Going back to the scientific method also helps: replication is key. It’s good practice when building a statistical model to hold back some data to test it on – data that has not been anywhere near the model-building process itself. If you have enough of this, then when you believe you’ve found that people born on a rainy day are more likely to buy high-value items, you can at least test that one particular theory specifically, instead of testing an arbitrary number of potential theories that happen to crop up due to spurious relationships within your training dataset.

Large amounts of data may reinforce the world’s injustices

A previous post on this site dealt with this issue in more detail, so we’ll skim it here. But suffice to say that there are common methodological flaws in data collection, processing, statistical models and the resulting “data driven” execution that result in suboptimal outcomes when measured in terms of fairness or justice by many reasonable people.

Two highlights to note are that:

Machine learning models…learn

Many statistical models are specifically designed to learn from historical reality. My past post had an example where a computer was fed data on historical hiring decisions – which resulted in it “innocently” learning to discriminate against women and people with foreign-sounding names.

Imagine if gender data hadn’t been stored for the applicants though – after all, should it really be a factor influencing applicant success in this way? No; so why was it collected for this task? Perhaps there was an important reason – but if not, then adding this data effectively provided the model with the tools to recreate gender discrimination.

The “foreign sounding names” point provides a warning though – perhaps the computer was not explicitly fed with ethnicity, but it was supplied with data that effectively proxied for it. Again, the computer has just learnt to do what humans did in the past (and indeed the present, which is why “name-blind recruitment” is a thing).

Implementing data driven models requires…data

“Data driven” decisions can only be made where data exists. Again, in my previous post, we saw that using “objective” data from phone apps, or from vehicles equipped to automatically transmit the locations of potholes needing repair, in fact biased repair resources towards areas full of affluent people – because poorer areas were less likely to have the same number of people with the resources and interest to fiddle around with the expensive Land Rovers and fancy phones that collected this information.

This phenomenon is something like a tech/data version of the “WEIRD” issue that has been noted in psychology research – where a hugely disproportionate amount of human research has, for reasons other than deliberate malice, inadvertently concentrated on understanding Western, Educated people from Industrialised, Rich and Democratic countries – often American college students, who are probably not entirely representative of the entirety of humanity.

It could be argued that this one is a risk associated with having too little data rather than too much. But, given it is impossible to obtain data on literally every aspect of every entity in the world, these sort of examples should also be considered cases where using partial data for a variable risks producing worse results than using no data for that variable at all.

One should be wary of deciding that you might as well collect whatever data you happen to have access to and use it to build analysis that will be extrapolated to people outside your cohort’s parameters. Perhaps you will spend all that time and effort collecting data just because you can, only to end up with a model that produces worse real-world outcomes than having no data for that variable at all.

Large amounts of data may destroy the planet

Dramatic perhaps, but for those of us who identify with the overwhelming scientific consensus that climate change is real, dangerous and affected by human activities, there is a mechanism here.

Storing lots of data requires lots of machines. Powering lots of machines (and especially their required cooling equipment) needs lots of electricity. Lots of electricity means lots of energy generation, which – given that, even in the developed world, electricity is overwhelmingly generated in environmentally unfriendly ways – tends to mean lots of pollution and other factors that could eventually contribute towards a potential planetary disaster.

Back in 2013, The Register reported on a study suggesting that “IT” was responsible for about 10% of electricity usage worldwide. These sorts of measures are rather hard to make accurately in aggregate, and this one includes the energy needed both to create each device and to manage the distribution of information to it.

On the pure data side, the US Natural Resources Defense Council reported that, in 2013, data centres in the US alone used an estimated 91 billion kilowatt-hours of energy. They then went on to project some figures:

Data center electricity consumption is projected to increase to roughly 140 billion kilowatt-hours annually by 2020, the equivalent annual output of 50 power plants, costing American businesses $13 billion annually in electricity bills and emitting nearly 100 million metric tons of carbon pollution per year.
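It’s worth doing the back-of-envelope arithmetic on those projected figures, just to see what they imply per power plant, per kilowatt-hour and per unit of carbon (my own derived numbers, not NRDC’s):

```python
# Sanity-checking the NRDC projection quoted above.
kwh_per_year = 140e9          # projected 2020 US data-centre usage, kWh/year
power_plants = 50             # "equivalent annual output of 50 power plants"
dollars      = 13e9           # projected annual electricity bill, USD
co2_kg       = 100e6 * 1000   # ~100 million metric tons of CO2, in kg

print(kwh_per_year / power_plants)   # ~2.8e9 kWh/year per "power plant"
print(kwh_per_year / 8760 / 1e6)     # ~16 GW of continuous demand
print(dollars / kwh_per_year)        # ~$0.09 per kWh, a plausible US rate
print(co2_kg / kwh_per_year)         # ~0.7 kg of CO2 per kWh
```

The implied electricity price and carbon intensity both come out in a believable range for the US grid of that era, which suggests the headline numbers are at least internally consistent.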

To be fair, some of the larger firms are trying to remedy this, whether for environmental purposes or cost cutting.

Facebook have a few famous examples, including a data centre powered by hydroelectricity and situated in a chilly sub-arctic part of Sweden, where the climate provides the big bonus of free cooling. This was expected to cut the “normal” amount of energy needed to cool such a facility by 70%.

More recently, their engineers have been developing methods to keep the vast number of pointless photos that users uploaded years ago (and never looked at again) in colder storage: on machines that are powered down, or even on Blu-ray discs.

Given that much of Facebook’s data isn’t being viewed by anyone at any given moment, they can use people’s behavioural data to predict which other data might be needed live, leaving the rest turned off. It might sound a bit like Googling Google, but as Ars Technica reports:

“We have a system that allows 80-90 percent of disks turned off,” Weihl said. “…you can predict when photos are going to be needed—like when a user is scrolling through photos chronologically, you can see you’ll need to load up images soon. You could make a decision to turn off not just the backups, but all the copies of older stuff, and keep only the primaries of recent stuff spinning.”
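The scrolling-prediction idea is simple enough to sketch. Here’s a minimal, entirely hypothetical version – nothing like Facebook’s real infrastructure – where a miss on one photo triggers a batch fetch of the next few from cold storage, so a chronological scroll needs only occasional disk spin-ups:

```python
# Hypothetical sketch of predictive tiering. All names (TieredStore, BATCH)
# are my own inventions for illustration, not Facebook's system.

BATCH = 10  # how many consecutive photos to prefetch on each cold read

class TieredStore:
    def __init__(self, photo_ids):
        self.cold = set(photo_ids)  # everything starts powered down
        self.hot = set()            # spinning disks / cache
        self.cold_reads = 0         # each batch fetch = one spin-up

    def get(self, photo_id):
        if photo_id not in self.hot:
            self.cold_reads += 1
            # Predict the user keeps scrolling: pull the next batch too.
            for pid in range(photo_id, photo_id + BATCH):
                if pid in self.cold:
                    self.hot.add(pid)
        return photo_id  # stand-in for the actual photo bytes

store = TieredStore(range(1000))
for pid in range(100):   # a user scrolls through 100 photos in order
    store.get(pid)

print(store.cold_reads)  # 10 spin-ups serve 100 sequential reads
```

With a batch size of ten, only one read in ten ever touches the powered-down tier – the rest are served from the small hot set the predictor warmed up in advance.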

It’s not only the big players to consider though. The NRDC noted that the majority of data centre power needs were actually for “small, medium, and large corporate data centers as well as in the multi-tenant data centers to which a growing number of companies outsource their data center needs”, which on the whole are substantially less efficient than Facebook’s arctic zone.

So, every time you hoard data you don’t need, whether in big corporate Hadoop clusters or Facebook photo galleries, you’re probably contributing towards unnecessary environmental pollution.

Optimum solution? Don’t save it. More realistic solution? Some interested organisations have done evaluations and produced consumer-friendly guides, such as the Greenpeace “Click Clean Scorecard” to suggest which storage service you might like to use if the environment is a key concern for you.

As a spoiler, Apple and Facebook did very well in the latest edition. Amazon Web Services did not.

Final thoughts

I’m a data-believer; I drink the Kool-Aid that claims data can save the world – or at least, if not save it directly, help us make decisions that produce better results. And in general, having more information is better than having less, as long as it is usable and used correctly – or at least there’s some conceptual reason to imagine one day it might be. But it’s an interesting exercise to play Devil’s Advocate and imagine why more data is not always a good thing.

Some of the issues above are less about the existence of the data, and more about the way that having access to it can tempt analysts or their managers into bad practice.

Other issues surround the size of the data, and the fact that it’s often simply not necessary to use as much data as you can get. An article on Wired suggests one should add “viability” and “value” to the 3 Vs of big data (volume, velocity and variety, for anyone who hasn’t lived through the painful cloud of buzzwords in recentish times).

An Information Age writer asks:

What’s the marginal impact on a predictive model’s accuracy if it runs on five million rows versus 10 billion rows?

The answer is often “very little” – except that the bigger dataset will:

  • use more resources (time, money, the planet) to gather and store;
  • take substantially more time and CPU power to process; and
  • increase the risk of overfitted, biased or otherwise poor models if best practice is not followed.
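There’s a simple statistical intuition behind that “very little”. Model accuracy depends on much more than sampling error, but for anything resembling a simple estimate – say, a proportion – the noise shrinks as one over the square root of the row count, so the returns on extra rows diminish brutally fast:

```python
# Why 10 billion rows often adds little over 5 million: sampling error
# for an estimated proportion shrinks as 1/sqrt(n).
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1 - p) / n)

se_5m  = standard_error(0.5, 5_000_000)       # ~0.00022
se_10b = standard_error(0.5, 10_000_000_000)  # ~0.000005

print(se_5m, se_10b)
```

Two thousand times more data buys you only about a 45-fold reduction in noise – on an estimate that was already accurate to a few hundredths of a percentage point. Whether that’s worth fifty power plants’ worth of storage is the question this whole section has been circling.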