The Sun and its dangerous misuse of statistics

Here’s the (pretty abhorrent) front cover of yesterday’s Sun newspaper.

The Sun's front page

Bearing in mind that several recent terrorist atrocities are top of everyone’s mind at the moment, it’s clear what the Sun is implying here.

The text on the front page is even more overt:

Nearly one in five British Muslims have some sympathy with those who have fled the UK to fight for IS in Syria.

The implication is a starkly ominous one: that 20% of Britain’s Muslims support the sort of sick actions that IS has claimed responsibility for in Paris and other places that have suffered horrific, crazed assaults in recent times.

This is pushed even more fervently by the choice to illustrate the article solely with a photo of Jihadi John, an inexcusably evil IS follower “famous” for carrying out sick, cruel, awful beheadings on various videos.

Given the fact that – most thankfully – there are far, far fewer people featuring on videos of beheadings than there are people travelling to Syria to join fighters, this is clearly not a representative or randomly chosen photo.

Sun writer Anila Baig elaborates in a later column:

It beggars belief that there are such high numbers saying that they agree with what these scumbags are doing in Syria. We’ve all seen the pictures of how innocent Yazidi girls are treated, how homosexuals are thrown off tall buildings. It’s utterly abhorrent.

The behaviours she describes are undoubtedly beyond abhorrent.

But of course, this 1 in 5 statistic isn’t true – even in the context of the small amount of data that they have used to support this story.

It is however a very dangerous claim to make in the current climate, where “Islamophobia” and other out-group prejudices make the lives of some people that follow Islam – or even look like a stereotype of someone that might – criminally unpleasant. Britain does have statutes that cover the topic of hate speech after all, and hate speech can come from many quarters.

Anyway, enough of the rants (kind of) and onto the statistics.

There are three main points I describe below, which can be summarised as:

  • The survey this headline is based on did not ask the question that the paper implies.
  • The paper deliberately misleads its readers by failing to give easily-available statistical context to the figures.
  • The methodology used to select respondents for the survey was anyway so inadequate that it’s not possible to tell how representative the results are.

 

Onwards:

The Sun’s claim that 1 in 5 British Muslims have sympathy for Jihadis (the headline) and those who fight for IS (in main text) comes from a survey they commissioned, which was executed by professional polling company Survation. You can read a summarised version of the quantitative results here.

The “20% sympathise with Isis” claim and its implications are based on responses to question 6 (page 9 of the PDF results above), which asked people to say which of a set of sentences they agreed with the most. The sentences were of the form “I have [a lot of/some/no] sympathy with young Muslims who leave the UK to join fighters in Syria”.

Results were as follows; the Sun presumably added up the first two boxes to get to their 20%, which isn’t a bad approximation. Note, though, that an equally legitimate headline would have been “95% of Muslim respondents do not have a lot of sympathy for jihadis”.

Response | % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria | 5.3%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria | 14.5%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria | 71.4%
Don’t know | 8.8%
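
Just to show the working – here is a minimal sketch in Python of how the headline figure, and an equally valid alternative framing, fall out of that table (the percentages are simply transcribed from the results above):

# Percentages transcribed from Survation's published results (question 6)
a_lot, some, none, dont_know = 5.3, 14.5, 71.4, 8.8
any_sympathy = a_lot + some            # 19.8% -> rounded up to "1 in 5"
not_a_lot = some + none + dont_know    # 94.7% -> "95% do not have a lot of sympathy"
print(f"'A lot' or 'some' sympathy: {any_sympathy:.1f}%")
print(f"Anything other than 'a lot' of sympathy: {not_a_lot:.1f}%")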

However, compare the question actually asked with the claims made: that respondents have sympathy with those ‘who have fled the UK to fight for IS’, and that ‘they agree with what these scumbags are doing…homosexuals thrown off tall buildings’ (even ignoring the implication of supporting the sort of criminal mass murder seen in Paris).

There was no mention of IS or any particular act of terrorism or crime against human rights in the question whatsoever.

The question asks about joining fighters in Syria. Wikipedia has a list of the armed groups involved in the Syrian war. At the time of writing, they have been segmented into 4 groups: the Syrian Arab Republic and allies; the Syrian Opposition + al-Qaeda network and allies; the Kurdish self-administration and allies; and ISIL (aka IS) and allies.

There are perhaps in the order of 200 sub-groups (including supporters, divisions, allies etc.) within those four categories, of which the huge majority are not affiliated with IS. Even the UK is on the “non-lethal” list, having donated a few million to Syrian opposition groups.

To be fair, the question did ask about joining fighters rather than the c. 11 non-lethal groups. But we should note that – as well as the highly examined stream of people apparently mesmerised by evildoers to the extent of travelling to fight with them – there was also a Channel 4 documentary a few months ago showing a different example of this. In it, we saw 3 former British soldiers who had decided to travel to Syria and join fighters – the ones who fight against IS. I do not know what religion, if any, those 3 soldiers followed – but is it possible someone might feel a little sympathy towards the likes of them?

It is not necessarily a good thing for someone to be travelling abroad to join any of these groups with a view to violent conflict, and I am convinced that some people do travel to join the most abhorrent of groups.

But the point is that, despite what the Sun wrote, the question did not mention IS or any of their evil tactics, and could in theory have suggested some sort of allegiance to very many other military-esque groups.

The question only asks whether the respondent has sympathy for these young Muslims who travel abroad.

To have sympathy for someone does not mean that you agree with the aims or tactics of the people they are persuaded to go and meet.

Google tells us that the definition of sympathy is:

feelings of pity and sorrow for someone else’s misfortune.

One can quite easily imagine a situation where, even if you believe these people are travelling to Syria specifically to become human weapons targeting innocent victims en masse, you can still have some sympathy for the young person involved.

It seems plausible to have some sympathy for a person who has been brainwashed, misguided or preyed on by evildoers, and who feels their quality of life is so poor that the best option for their future is to go and join a group of fighters in a very troubled country. Their decisions may be absurd; perhaps they may even end up involved in some sick, awful, criminal act for which no excuses could possibly apply – but you could have some sympathy for a person being recruited to a twisted and deadly cause, whilst vehemently disagreeing with the ideology and actions of a criminal group that tries to attract them.

And, guess what, the Sun handily left out some statistics that might suggest this is some of what is happening.

For every such survey that concentrates on the responses of a particular population, it’s always important to measure the base rate, or a control group rate. Otherwise, how do you know whether the population you are concentrating on is different from any other population? It’s very rare that any number is meaningful without some comparative context.

As it happens, a few months ago the same polling company carried out a survey on behalf of Sky News that asked non-Muslims the same question. The results can be seen here, on page 8, question 5, reproduced below.

As the Sun didn’t bother providing these “control group” figures, we’re left to assume that no non-Muslim could ever “show sympathy” to the young Muslims leaving the UK to join fighters. But…

Response | % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria | 4.3%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria | 9.4%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria | 76.8%
Don’t know | 9.6%

So, 14% of non-Muslims respond that they have a lot or some sympathy with this group of travellers. Or as the Sun might headline it: “1 in 7 Brit non-Muslims sympathy for jihadis” (just below the same picture of a lady in a bikini, obviously).

14% is less than 20% of course – but without showing the base rate the reader is left to assume that 20% is “20% compared to zero” which is not the case.

Furthermore, in some segments of the surveyed population, the sympathy rates among non-Muslims are higher than among Muslims.

The Sun notes:

The number among young Muslims aged 18-34 is even higher at one in four.

Here are the relevant figures for that age segment from a) the poll taken to support the Sun’s story, and b) the one that asked the same question of non-Muslims.

Response | Muslims aged 18-34 | Non-Muslims aged 18-34
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria | 6.9% | 10.9%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria | 17.6% | 19.2%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria | 66.2% | 52.2%
Don’t know | 9.3% | 17.7%

So-called “Jihadi sympathisers” aged 18-34 make up a total of 24.5% of Muslims, and 30.1% of non-Muslims.

Another headline, dear Sun writers: “1 in 3 Brit non-Muslims youth sympathy for jihadis thankfully moderated by less sympathetic Muslim youth?”

A similar phenomenon can be seen when broken down by non-Islamic religions. Although some of these figures are masked due to small samples, one can infer from the non-Muslim survey that more than 20% of the sample who classified themselves as belonging to some religion other than Christian, Muslim or “not religious” were at least somewhat sympathetic to these young Muslims who go to Syria.

As a final cross-survey note, the same question was also put to a Muslim population earlier in the year, in March, again by Survation for Sky News. Here are the results of that one:

Response | % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria | 7.8%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria | 20.1%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria | 61.0%
Don’t know | 11.1%

Totting up the support for having some/a lot of sympathy for the young Muslims leaving the UK for Syria, we see that the proportion showing any form of sympathy fell from 27.9% in March to 19.8% now in November.

That’s a relatively sizeable fall, again not mentioned in the Sun’s article (because that would spoil the conclusion they’re trying to lead the reader to). Here’s another headline I’ll donate to the writers: ‘Dramatic 30% fall in Brit Muslims’ sympathy for jihadis‘.
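
To be clear about what that “30%” means – it’s a relative fall, not a fall of 30 percentage points. A minimal sketch of the arithmetic, using the totals from the two polls above:

march = 7.8 + 20.1     # 27.9% had "a lot" or "some" sympathy in March
november = 5.3 + 14.5  # 19.8% in November
absolute_fall = march - november                  # 8.1 percentage points
relative_fall = 100 * (march - november) / march  # ~29%, i.e. roughly a 30% relative fall
print(f"Absolute fall: {absolute_fall:.1f} percentage points")
print(f"Relative fall: {relative_fall:.0f}%")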

Next up, time to talk about the method behind the latest survey.

Regular readers of the Sun might have noticed that the formal surveys it commissions are normally carried out by Yougov. Why was it Survation this time? The Guardian reports that Yougov declined the work because

it could not be confident that it could accurately represent the British Muslim population within the timeframe and budget set by the paper.

So rather than up the timeframe or the budget, the Sun went elsewhere to someone that would do it cheaper and quicker. Survation apparently complied.

Given that most surveys cannot put their questions to every single person alive, there’s a complex science behind selecting who should be asked and how to process the results so that they are representative of the population being claimed.

Here’s the World Bank on how to pick respondents for a household survey. Here’s an article from the Research Methods Knowledge Base on selecting a survey method. There is far more research on polling best practice besides, and this is one reason why commissioning professional pollsters is often a necessity if you want vaguely accurate results, especially on topics this controversial.

However, none of the research I’ve seen has ever suggested that one should pick out a list of people whose surname “sounds Muslim” and ask them the questions. Incredibly, that is apparently the shortcut Survation used, given they didn’t have the time or money to do the detailed and intricate work that would be necessary to generate a statistically representative sample of all British Muslims.

It might be that Survation did happen to choose a perfectly representative sample, but the problem is we just do not know. After talking to other pro-pollsters, the Guardian noted:

It cannot be determined how representative the Survation sample is because of a lack of various socioeconomic and demographic details.

Even if the rest of the story had been legit, then – being generous with the free headlines – the Sun would have been more accurate to write “1 in 5 of the 1,500 people we had the time to ring who had ‘Muslim-sounding surnames’ and did in fact agree that they were Muslims’ sympathy for jihadis“. But it’s a bit less catchy, and even the non-pros might notice something methodologically curious there.

So, to summarise – the Sun article, which seems to me to tread dangerously near breaking various legislation on hate speech in principle if not practice, is misleadingly reporting on the results of a questionnaire:

  • that did not even ask the question that would be appropriate to make the claims it is headlining.
  • without providing the necessary comparative or historical perspective to make the results in any way meaningful.
  • that was carried out in inadequate, uncontrolled fashion with no view to producing reliable, generalisable results.

We have to acknowledge that, on the whole, people going away to join fighter groups is a bad, sad event and one that the world needs to pay attention to. Infinitely more so, obviously, if they are travelling to commit atrocities with groups that can be fairly described as evil.

But for the Sun to imply such dramatic and harmful allegations, to its huge readership, about a section of the British population against whom prejudice is already very apparent (note the 300% increase in UK anti-Muslim hate crime last week) – a readership who will now struggle to entirely forget the implication during their dinner-table conversations even if they wanted to – is not only extremely poor-quality data analysis, but also downright irresponsible and even dangerous.

The persuasiveness of dataviz

The intrinsic power of the chart is highlighted nicely in a recent Harvard Business Review post.

In an experiment (*), Aner Tal et al. had a couple of groups read about a new medication that supposedly reduced the incidence of illness by 40%. This was clearly stated in the text the readers were given.

The only difference between the two groups was that one of them was reading a document that had a very simple bar chart below it. The chart just showed exactly the same claim; that the incidence of illness went down 40% with this new medication.

The simple bar chart shown to one of the groups

When they tried to measure it, the presence of this chart didn’t seem to increase the understanding or the information retention of the people viewing it, in comparison to the other group.

However, it did make a difference to what the readers believed.

97% of those who had seen the chart believed the medication would reduce illness, vs just 68% of those who had read (and understood) the exact same information, but in purely text form.

This is clearly irrational, but shows the impact dataviz can have – even when it technically delivers no new information and does not make the existing information easier to understand.

But don’t think rationality will save you! The researchers found that the people who agreed most strongly with “I believe in science” (a strange thing to ask, to be fair, but I assume at least somewhat positively associated with people who consider themselves most rational) saw the largest boost in belief in the medication post-chart. They hypothesise some sort of “I believe in science -> charts mean science -> I believe in whatever the chart says” type halo phenomenon.

I then wondered what would happen if one popped an entirely irrelevant chart under the article. Would that work as well? Not even delving into the wonderful world of how to lie with charts, but simply presenting a chart on an entirely different subject. Perhaps one from one of the hilarious faux-science hair product adverts would be a good choice (85% neutrilliam! 5.32948% aquaspheres!).

It didn’t seem like the researchers tried that one in the above study, but there is a similar-ish study from 2012. No charts involved sadly, but it looked at the impact of putting a totally irrelevant mathematical equation in a document. That they called it “The Nonsense Math Effect” gives a clue as to the results.

Here they had groups of highly educated post-grads look at abstracts of two (real) journal papers, one on anthropology and one on sociology. They then had to rate their perception of the quality of the research on a scale from 1 to 100.

Both groups read the same abstracts, except one group would see the following sentence added on to the end of the abstract.

A mathematical model (T_PP = T_0 − fT_0d_f² − fT_Pd_f²) is developed to describe sequential effects.

This sentence was taken from a totally different paper, which concerned reaction times. There was no relationship between this sentence and the two abstracts the readers were given. In fact:

none of the original abstracts mention any sequential effects or anything that the symbols in the equation could reasonably correspond to

Can you guess what happened?

Of course, the net effect was that the group that read the abstracts with this meaningless sentence pasted on at the end rated the quality of the research significantly higher than those that didn’t (**). The research was indeed more highly regarded if a string of contextually meaningless characters that look a bit like complicated maths was written below it.

Remember, datavizzers, with great power comes great responsibility. Be sure to never abuse your profession.


 

(*) It’s not listed in the article, but I believe the published article they refer to is this one, although you’ll need a subscription to the “Public Understanding of Science” journal to get to the full paper.

(**) When broken down, there was one group of readers who didn’t fall into that trap: those who were experts in maths, science and technology (and those who studied medicine were perhaps not statistically significantly different).  Most of the world doesn’t hold post-graduate degrees in mathematics though.

 

More data is not always better data

Like a lot of data-fans, I have something of a tendency to “collect” data just in case it will become useful one day. Vendors are feeding that addiction with constant talk of swimming through blissful “data lakes” and related tools, notably Hadoop and its brethren.

Furthermore, as the production of data grows exponentially, the cost of storing it moves incrementally closer to zero. As an example, Amazon will store whatever you want for $0.007 per GB in its Glacier product, if you don’t need to retrieve it very often or very fast.

That is to say, an amount of data that, within my relatively short memory, would have needed 728 high-density floppy disks to save, you can now store for about a British ha’penny. This is a coin so irrelevantly valueless in the grand scheme of buying things these days that it was withdrawn from circulation over 30 years ago, long before my memory begins.
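
For the curious, here’s a rough sketch of where those numbers come from (assuming a binary gigabyte, the 1,474,560-byte formatted capacity of a “1.44 MB” floppy, and a purely illustrative exchange rate – and noting that Glacier’s headline price is per GB per month):

GIB = 2 ** 30              # bytes in one binary gigabyte
FLOPPY = 1_474_560         # formatted bytes on a high-density 3.5" floppy
glacier_usd_per_gb = 0.007 # Amazon Glacier's quoted monthly price per GB
usd_to_gbp = 0.66          # assumed exchange rate, for illustration only
print(f"Floppies per GiB: {GIB / FLOPPY:.0f}")                     # ~728 disks
print(f"Monthly cost: ~£{glacier_usd_per_gb * usd_to_gbp:.4f}/GB") # ~half a penny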

But just because we can store all these 1s and 0s, should we? Below I have started to consider some downsides of data hoarding.

The too-obvious risks: large amounts of data expose you to more hacking and privacy risks

There are two huge risks, so obvious that I will pretty much skip over them. There is the risk of a hack, with the potential for some malevolent being to end up with your personal details. Just last week we heard the news that over 150,000 people’s personal details were accessed illegitimately from the computer systems of a mobile phone provider, including nearly 16k bank account details. A couple of years ago, Target famously had up to 70 million of its customers’ details stolen. If this data hadn’t been stored, the hacks obviously couldn’t have taken place.

Often related to such behaviour, there’s also the issue of accidental privacy breaches. These might be simple mistakes – like the time the UK government accidentally lost a couple of CDs holding personal data relating to every child in the UK, for instance – but the consequences may be not dissimilar to those of a hack if the data gets into the wrong hands.

Less dramatically, even in that most curated of safe software gardens, Apple’s app store, 256 apps were removed last month as it was found they were collecting and transmitting personal data that they should not have been.

These examples are not meant to imply there was something innately wrong in storing these particular pieces of data. Maybe Target needs to store these customer details to run its operations in the way it finds most productive, and the Government probably needs to store information on children (although perhaps not in the sorting offices of Royal Mail). However, every time data is stored that doesn’t yet have a use, one should bear in mind there is a non-zero risk it may leak.

Hacking and privacy breaches are easy and obvious concepts though. More interesting here are the other downsides of storing too much data that do not require the efforts of an evildoer to produce adverse consequences. What else might we consider?

Below I list a few. I’m sure I’ve missed many more. An important consideration in some of these is that the data itself may not generate the majority of the risk directly; rather, it enables something negative to be done with it later. Methodological issues, though, are something even the most data-addicted analyst can address in their own work.

Large amounts of data may encourage bad analysis

A lot of data analysis is targeted at understanding the associations with, or causes of, a target variable based on a set of input variables. Surely the more input variables we have, the better the models we can produce and the better our model will reflect the world?

Well, it depends on how you measure it. Mathematically, yes, more variables do tend to lead to a model with a better fit to the data that was used to train it.

And guess what, we now have huge lakes full of crazy amounts of varied data and computer algorithms that can process it. If you want to predict your future sales, why not sack your data scientists and just have your supercomputer run your past sales data against every one of these billions of available variables until it deduces a model that predicts it perfectly?

This doesn’t work for a number of reasons, not least because, when it comes to generating genuinely useful insights to support optimal decision making, using too many variables tends to lead to the curse of overfitting.

This is where you involve so many variables as predictors that your model is too specific to the precise historical data you trained it on, and therefore misleads you as to the real drivers behind your target predicted variables.

Wikipedia has a fun example:

As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It’s easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes; but this model will not generalize at all to new data, because those past times will never occur again.

In other words, if you give me your past sales data at a customer/transaction level, then I can build you a model that will perfectly “predict” that past data. It would look something like:

IF CUSTOMER_NAME = 'Jane Doe' AND TRANSACTION_DATE = '1/1/2015' THEN SALES = £100
ELSEIF CUSTOMER_NAME = 'John Smith' AND TRANSACTION_DATE = '2/1/2015' THEN SALES = £500
...
...

Wow, look at my R-squared of 1. This is an unbeatable model. Hooray!

At least until we actually want to use it to predict what will happen tomorrow when a new customer comes with a new name and a new date, at which point…catastrophe.

Although it’s a silly example, the point is that the problem came from the fact we exposed the model to too much data. Is it really relevant that someone’s name is Jane Doe? Did we need to store that for the purposes of the model? No – it actually made the resulting analysis worse.
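
As a minimal sketch of that effect (not the author’s example – it assumes numpy and scikit-learn are available, and uses synthetic data), watch what happens to the fit on training data versus held-out data when a perfectly sensible dataset is padded with 200 columns of pure random noise:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 250
X_real = rng.normal(size=(n, 5))  # 5 genuinely informative features
y = X_real @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=2.0, size=n)
X_noisy = np.hstack([X_real, rng.normal(size=(n, 200))])  # plus 200 columns of junk

for name, X in [("5 real features", X_real), ("5 real + 200 noise features", X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"{name}: train R^2 = {model.score(X_tr, y_tr):.2f}, test R^2 = {model.score(X_te, y_te):.2f}")

The noise-padded model fits its training data near-perfectly, yet typically does far worse on the 40% of data it never saw – exactly the trap described above.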

Simplifying to a univariate situation, the “correlation doesn’t imply causation” trope remains as true as ever.

First, there’s the issue of how we measure correlation. A classic for linear models is Pearson’s ‘r’.

This gives you a single value between -1 and 1, where 1 is perfect positive correlation, 0 is no correlation and -1 is perfect negative correlation. You will often see results presented as “our analysis shows that X and Y are correlated with r of 0.8”. Sounds good, but what does this tell us?

Anscombe’s Quartet, possibly my favourite ever set of charts (doesn’t everyone have one?), shows us that it doesn’t tell us all that much.

All these charts have an “r” of around 0.8, and they all have the same mathematical linear regression trendline.

Anscombe's quartet

(thanks Wikipedia)

But do you believe that the linear correlation is appropriate for all four cases? Is the blue line a fair reflection of the reality of the trend in all datasets?

Anscombe’s Quartet is not actually about using too much data, but rather relying on statistical summarisation too much. It’s an example of where data visualisation is key to showing validity of a potentially computer-generated equation.
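
If you want to check that claim yourself, here’s a minimal sketch (assuming scipy is available; the x/y values are transcribed from the standard published quartet) computing r and the fitted line for each dataset:

from scipy import stats

x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}
for name, (x, y) in quartet.items():
    r, _ = stats.pearsonr(x, y)
    fit = stats.linregress(x, y)
    print(f"Dataset {name}: r = {r:.3f}, trendline y = {fit.slope:.2f}x + {fit.intercept:.2f}")

All four should come out with r ≈ 0.82 and the same y = 0.5x + 3 trendline – identical summaries of four very different pictures.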

But visualisation alone isn’t sufficient either, when we are using too much data.

Let’s imagine we’re a public service that needs to come up with a way to predict deaths by drowning. So we throw all the data we could possibly obtain at it and see what sticks.

Here’s one positive result from the fabulous Spurious Correlations site.

Spurious correlations

OMG! The data shows Nicolas Cage causes drownings! Ban Nicolas Cage!

A quick dose of common sense suggests that Nicolas Cage films are probably not all versions of the videotape seen in The Ring, which kills you shortly after viewing it.

Nicolas Cage films and drowning statistics may mathematically fit, but, if we are taking our work seriously, there was little expected value in storing and using this data for this purpose. Collecting and analysing too much data led us to a (probably 🙂 ) incorrect theory.

Part of the issue is that predictive analytics is not even as simple as looking for a needle in a haystack. Instead we’re looking for a particular bit of hay in a haystack, which may look the same as any other bit of hay.

Most variables are represented as a sequence of numbers and, whether by coincidence or cross correlation, one meaningless sequence of numbers might look very like one very meaningful sequence of numbers in a particular sample being tested.

It doesn’t help that there are several legitimate statistical tests that arguably lose an aspect of their usefulness in a world where we have big data.

Back in the day, perhaps we could test 20 people’s reaction to a photo by getting them into a lab and paying them for their opinion. We would use inferential statistics to work out whether their opinions were meaningful enough to be extrapolated to a larger population – what can we infer from our 20-person results?

Now we can in theory test 1 billion people’s reaction to a photo by having Facebook perform an experiment (although that might not go down too well with some people). All other things being equal, we can infer a whole lot more from testing 1 billion people than testing 20.

There are therefore many hypothesis tests designed to check whether what looks like a difference between two or more groups of “things” – for instance, the average score out of 10 that women give a photo versus the average score that men do – is indeed a real difference or just down to random chance. Classics include Student’s T Test, ANOVA, Mann-Whitney and many more.

The output of these tests is typically a measure of how likely it would be to see a difference as big as the one observed if, in reality, there were no true difference at all and everything were down to random fluctuation or noise. After all, most variables measured in real life have some natural variation – just because a particular woman is taller than a particular man, we should not infer that all women are taller than all men. It could just be that you didn’t test enough women and men to get an idea of the reality of the situation, given the natural variance in human height.

This output is often expressed as a “p value”: the probability, as a decimal, of seeing a result at least as extreme as yours purely by chance, if no real difference existed.

P = 0.05 is a common benchmark: if the probability of your result arising by chance alone is 5% or less, the convention is to declare it “statistically significant” and treat the hypothesis as supported.

Sounds good, but there are at least two issues to be wary of now you may be running these tests over vast numbers of rows or columns of data.

Hypothesis tests over a lot of rows (subjects)

Looking at the formula for these tests, you will see that the actual difference between groups necessary to meet that benchmark for “this is real” gets lower as your sample size gets bigger. Perhaps the test will tell you that a difference of 1 IQ point is significant when you’ve tested a hypothesis over a million subjects in your data, whereas it wouldn’t have done so if you had only tested it over 100.
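
A minimal simulation of that point (assuming scipy is available; IQ is modelled as normal with a standard deviation of 15 and a genuine 1-point difference between groups – exact p-values will vary with the random seed, but the pattern holds):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (100, 1_000_000):
    group_a = rng.normal(loc=100, scale=15, size=n)  # IQ ~ N(100, 15)
    group_b = rng.normal(loc=101, scale=15, size=n)  # a true 1-point difference
    t, p = stats.ttest_ind(group_a, group_b)
    print(f"n = {n:>9,} per group: p = {p:.3g}")

Typically the 1-point gap is nowhere near significant with 100 people per group, but sails past p < 0.05 with a million per group – the true effect hasn’t changed at all, only the sample size has.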

So is the test wrong in either case? Should we artificially limit how many subjects we put into a test? No, of course not (unless there’s another good reason to do so), but some common sense is needed when interpreting it in a practical sense.

In medical circles, as well as statistical significance, there’s a concept of clinical significance. OK, we might have established that things that do X are more likely to do Y, but does that matter? Is that 1 IQ point difference actually something we care about or not? If your advert produces a £0.01 increase in average basket value, however “significant”, do you care?

Maybe you do, if you have a billion baskets a day. Maybe you don’t if you have 10. These are not questions you can answer in the abstract – but should be considered on a case by case basis by a domain expert.

Just because you had enough data available to detect some sort of effect does not mean that you should rush to your boss as soon as you see a p=0.05 result.

Hypothesis tests over a lot of columns (variables)

We saw above that a common standard is to flag a result when there is no more than a 5% probability of it arising by chance alone.

To switch this around: we are happy to claim something is real even though, if there were no true effect at all, there would still be a 5% chance of a result like this showing up anyway.

That seems fairly safe as a one-off, but imagine we’re testing 20 variables in this experiment, and apply the same rule to say we’re satisfied if any of them meets this criterion.

There’s a 5% chance you come back with the wrong answer on your first variable.
Then there’s a 5% chance you come back with the wrong answer on your second variable.
…and so on until there’s a 5% chance you come back with the wrong answer on your twentieth variable.

One can calculate that after 20 such tests there is a 64% chance that at least one variable is shown to be significant, even if we know in advance for sure that none of them are.
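
Here’s both that calculation and a small simulation of it (assuming numpy and scipy; 20 candidate variables of pure noise are tested against a target that is also pure noise):

import numpy as np
from scipy import stats

print(f"P(at least one false positive in 20 tests) = {1 - 0.95 ** 20:.2f}")  # ~0.64

rng = np.random.default_rng(7)
n_experiments, n_subjects, n_variables = 2_000, 50, 20
hits = 0
for _ in range(n_experiments):
    target = rng.normal(size=n_subjects)
    noise = rng.normal(size=(n_subjects, n_variables))
    p_values = [stats.pearsonr(noise[:, j], target)[1] for j in range(n_variables)]
    hits += any(p < 0.05 for p in p_values)
print(f"Simulated share of 'studies' finding at least one significant variable: {hits / n_experiments:.2f}")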

Imagine if we had gone back to the idea of throwing a massive datalake’s worth of variables against two groups to try and find out, out of “everything we know about”, what explains the differences in these groups? A computer could quite happily throw a million variables against the wall in such an exercise.

If each test is allowed to produce a false positive 5% of the time, and you have a million such tests, one can quickly see that you will (almost) certainly get a bunch of results that – had they been conducted as single, well-constructed hypothesis tests – might have been considered statistically significant, but that in this scenario are often just false indicators: predictable in their existence, random in their placement.

This sort of behaviour is what people refer to as “p-hacking”, and it is seemingly somewhat rife within the academic literature too, where journals prefer to publish positive results (“X did cause Y”) rather than negative ones (“X did not cause Y”), even though both are often equally useful insights.

An article in Nature reports on a real-world example of this:

…he..demonstrated that creative p-hacking, carried out over one “beer-fueled” weekend, could be used to ‘prove’ that eating chocolate leads to weight loss, reduced cholesterol levels and improved well-being (see go.nature.com/blkpke). They gathered 18 different measurements — including weight, blood protein levels and sleep quality — on 15 people, a handful of whom had eaten some extra chocolate for a few weeks. With that many comparisons, the odds were better than 50–50 that at least one of them would look statistically significant just by chance. As it turns out, three of them did — and the team cherry-picked only those to report.

(Unless of course chocolate does lead to weightloss, which would be pretty cool.)

The same article also refers to a similar phenomenon as “Texas sharpshooting”, being based on a similarity to “an inept marksman who fires a random pattern of bullets at the side of a barn, draws a target around the biggest clump of bullet holes, and points proudly at his success.”

It’s quite possible to p-hack in a world where big data doesn’t exist. However, the relevance here is that one needs to avoid the temptation of collecting, storing and throwing a whole bunch of “miscellaneous” data at a problem when looking for correlations, models and so on, and then reporting that you found some genuine insight whenever a certain statistical test happens to reach a certain number.

There are explicit procedures and statistical tests, also coming from the field of statistical inference, designed to alert you when you’re at risk of this. These fall under the category of controlling the “familywise error rate” of a set of hypotheses, with perhaps the most famous being the Bonferroni correction.
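
The Bonferroni correction itself is almost embarrassingly simple – here’s a sketch of the idea (the familywise figure below assumes independent tests, though the correction’s guarantee does not require that):

m, alpha = 20, 0.05
per_test_threshold = alpha / m                       # 0.0025: each test must clear a stricter bar
familywise_rate = 1 - (1 - per_test_threshold) ** m  # chance of >=1 false positive across all 20
print(f"Per-test threshold: {per_test_threshold}")
print(f"Familywise false-positive rate: {familywise_rate:.3f}")  # back down to ~0.05

Libraries such as statsmodels implement this, and less conservative alternatives, off the shelf.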

Going back to the scientific method also helps; replication can be key. It’s good practice when building a statistical model to ensure you have some data to test it on which has not been anywhere near the model-building process itself. If you have enough of this, then, if you believe you’ve found that people born on a rainy day are more likely to buy high-value items, you can at least test that one particular theory specifically, instead of testing for an arbitrary number of potential theories that happen to crop up due to spurious relationships within your training dataset.

Large amounts of data may reinforce the world’s injustices

A previous post on this site dealt with this issue in more detail, so we’ll skim it here. But suffice it to say that there are common methodological flaws in data collection, processing, statistical models and the resulting “data driven” execution that result in suboptimal outcomes when measured in terms of fairness or justice by many reasonable people.

Two highlights to note are that:

Machine learning models…learn

Many statistical models are specifically designed to learn from historical reality. My past post had an example where a computer was fed data on historical hiring decisions – which resulted in it “innocently” learning to discriminate against women and people with foreign-sounding names.

Imagine if gender data hadn’t been stored for the applicants though – after all, should it really be a factor influencing applicant success in this way? No; so why was it collected for this task? Perhaps there was an important reason – but if not, then adding this data effectively provided the model with the tools to recreate gender discrimination.

The “foreign-sounding names” point provides a warning though – perhaps the computer was not explicitly fed with ethnicity, but it was supplied with data that effectively proxied for it. Again, the computer has just learnt to do what humans did in the past (and indeed the present, which is why “name-blind recruitment” is a thing).

Implementing data driven models requires…data

“Data driven” decisions can only be made where data exists. Again, in my previous post, we saw that using “objective” data from phone apps or vehicles equipped to automatically transmit the location of potholes that needed fixing in fact produced a bias towards focussing repair resources on areas full of affluent people – because poorer areas were less likely to have the same number of people with the resources and interest to fiddle around with the expensive Land Rovers and fancy phones that collected this information.

This phenomenon is something like a tech/data version of the “WEIRD” issue that has been noted in psychology research – where a hugely disproportionate amount of human research has, for reasons other than deliberate malice, inadvertently concentrated on understanding Western, Educated people from Industrialised, Rich and Democratic countries – often American college students, who are probably not entirely representative of the entirety of humanity.

It could be argued that this one is a risk associated with having too little data rather than too much. But, given that it is impossible to obtain data on literally every aspect of every entity in the world, these sorts of examples should also be considered cases where using partial data for a variable risks producing worse results than using no data for that variable at all.

One should be wary of deciding that you might as well collect whatever data you happen to have access to and then extrapolate analysis built on it to people outside your cohort’s parameters. Perhaps you will spend all that time and effort collecting data just because you can, only to end up with a model that produces worse real-world outcomes than not having that data at all.

Large amounts of data may destroy the planet

Dramatic perhaps, but for those of us who identify with the overwhelming scientific consensus that climate change is real, dangerous and affected by human activities, there is a mechanism here.

Storing lots of data requires lots of machines. Powering lots of machines (and especially their required cooling equipment) needs lots of electricity. Lots of electricity means lots of energy generation, which – given that, even in the developed world, electricity is overwhelmingly generated in environmentally unfriendly ways – tends to mean lots of pollution and other such factors that could eventually contribute towards a potential planetary disaster.

Back in 2013, The Register reported on a study that suggested that “IT” was responsible for about 10% of electricity usage worldwide. These sorts of measures are rather hard to make accurately in aggregate, and here include the energy needed both to create the device and to manage the distribution of information to it.

On the pure data side, the US Natural Resources Defense Council reported that, in 2013, data centres in the US alone used an estimated 91 billion kilowatt-hours of energy. They then went on to project some figures:

Data center electricity consumption is projected to increase to roughly 140 billion kilowatt-hours annually by 2020, the equivalent annual output of 50 power plants, costing American businesses $13 billion annually in electricity bills and emitting nearly 100 million metric tons of carbon pollution per year.

To be fair, some of the larger firms are trying to remedy this, whether for environmental purposes or cost cutting.

Facebook have a few famous examples, including building a data centre powered by hydroelectricity and situated in a chilly sub-arctic area of Sweden, which provides the big bonus of free cooling. This was expected to save 70% of the “normal” amount of energy for cooling such a place.

More recently, their engineers have been developing methods to keep the vast number of pointless photos that users uploaded years ago (and never looked at again) largely offline, on machines that are powered down or even on Blu-ray disks.

Given that a lot of Facebook’s data isn’t being viewed by anyone at any single point in time, they can use people’s behavioural data in order to predict which other types of data might be needed live, leaving the rest turned off. It might sound a bit like Googling Google, but as Ars Technica reports:

“We have a system that allows 80-90 percent of disks turned off,” Weihl said. …you can predict when photos are going to be needed—like when a user is scrolling through photos chronologically, you can see you’ll need to load up images soon. You could make a decision to turn off not just the backups, but all the copies of older stuff, and keep only the primaries of recent stuff spinning.”

It’s not only the big players to consider though. The NRDC noted that the majority of data centre power needs were actually for “small, medium, and large corporate data centers as well as in the multi-tenant data centers to which a growing number of companies outsource their data center needs”, which on the whole are substantially less efficient than Facebook’s arctic zone.

So, every time you hoard data you don’t need, whether in big corporate Hadoops or Facebook photo galleries, you’re probably contributing towards unnecessary environmental pollution.

Optimum solution? Don’t save it. More realistic solution? Some interested organisations have done evaluations and produced consumer-friendly guides, such as the Greenpeace “Click Clean Scorecard” to suggest which storage service you might like to use if the environment is a key concern for you.

As a spoiler, Apple and Facebook did very well in the latest edition. Amazon Web Services did not.

Final thoughts

I’m a data-believer: I drink the Kool-Aid that claims data can save the world – or at least, if not save it directly, then help us make decisions that produce better results. And in general, having more information is better than having less, as long as it is usable and used correctly – or at least there’s some conceptual reason to imagine that one day it might be. But it’s an interesting exercise to play Devil’s Advocate and imagine why more data is not always a good thing.

Some of the issues above are less about the existence of the data, and more about the way that having access to it can tempt analysts or their managers into bad practice.

Other issues surround the size of the data, and the fact that it’s often simply not necessary to use as much data as you can. An article on Wired suggests one should add “viability” and “value” to the 3 Vs of big data (velocity, volume and variety, for anyone who hasn’t lived through the painful cloud of buzzwords in recentish times).

An Information Age writer asks:

What’s the marginal impact on a predictive model’s accuracy if it runs on five million rows versus 10 billion rows?

The answer is often “very little” – see the short sketch after this list – other than that the extra data will:

  • use more resources (time, money, the planet) to gather and store,
  • take exponentially more time and CPU power to process; and
  • risk allowing an inordinate amount of overfitting, biased or otherwise poor models if best practice is not followed.
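
As a minimal illustration of that “very little” (not from the Information Age article – just a synthetic classification problem built with scikit-learn, so the exact numbers are illustrative only), held-out accuracy barely moves once the training set is reasonably large:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200_000, n_features=20, n_informative=5, random_state=0)
X_test, y_test = X[100_000:], y[100_000:]  # hold out half the data for testing

for n in (1_000, 10_000, 100_000):
    model = LogisticRegression(max_iter=1000).fit(X[:n], y[:n])
    print(f"Trained on {n:>7,} rows: held-out accuracy = {model.score(X_test, y_test):.3f}")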

Lib Dem leaflet chart fail

Coming up to the election, there’s no shortage of misleading statistics, charts and downright quantitative lies being flung around. One even made it through our letterbox today. It’s far from the worst available online, but such statistical sleights always feel more personal when they get physically pushed into one’s abode.

Here go the Liberal Democrats, being honest enough to admit that their main selling point around here is that they got more votes in our area last time than the next largest party did.

Lib Dem chart

For the avoidance of doubt – my research indicates that 28% is not usually more than twice the amount of 16% on a linear scale, so I have taken the liberty of correcting the chart proportions below for a somewhat more realistic look.

Lib Dem chart improved

Although the point that our constituency is traditionally very Conservative-with-a-big-C remains [sadly] true, the Yougov Nowcast is suggesting a very different result for place #2 at present, as shown here.

nowcast

Not that I (or Yougov) would claim that that’s a done deal – but what the Lib Dem leaflet fails to mention is that the last result does not always predict the next result.

The poor Liberal Democrats were apparently recently polling at a 25-year low, behind even the previously pretty numerically insignificant UKIP and Green parties. I think it’s safe to say that the Cleggmania-fuelled 2010 election is not the best model for the current Lib Dem performance, bless them…