Using R to run many hypothesis tests (or other functions) on subsets of your data in one go

It’s easy to run a basic hypothesis test in R, once you know how. For example, if you have a nice set of data that you know meets the relevant assumptions, you can run a t test in the following sort of way. Here we’ll assume that you’re interested in comparing the difference in the number of apples eaten between people, based on which group they’re in, from an imaginary “foods eaten” experiment dataset.

t.test(apples_eaten ~ group, data = foods_eaten)

The above uses the default values for hypothesis type, expected difference, confidence level and so on, but these are all configurable via various parameters should you need them to be.
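For example, if you wanted a one-sided test at a 99% confidence level instead, the call might look like this – a sketch, using a tiny made-up stand-in for the imaginary foods_eaten data:

```r
# Tiny made-up stand-in for the imaginary foods_eaten dataset
foods_eaten <- data.frame(
  group        = rep(c("a", "b"), each = 4),
  apples_eaten = c(10, 8, 9, 11, 9, 7, 10, 8)
)

# One-sided alternative at a 99% confidence level,
# overriding the defaults of "two.sided" and 0.95
t.test(apples_eaten ~ group,
       data        = foods_eaten,
       alternative = "greater",
       conf.level  = 0.99)
```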

The default output looks something like this:

Welch Two Sample t-test

data: apples_eaten by group
t = 0.033192, df = 76.123, p-value = 0.9736
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.950128 3.050128
sample estimates:
mean in group a mean in group b 
9.05 9.00

Here we can see that the difference in means does not seem significant by pretty much any definition; the most obvious indicator being that the p-value is way above the traditional – if increasingly controversial – “significance” cutoff of 0.05.

This output is nice enough for interactive use if you’re analyzing a single variable from an experiment or two. But let’s say your dataset actually includes the results of several experiments, several outcome variables, or some other scenario where you want to conduct multiple hypothesis tests in one go. Do you really want to run the above command many times and manually collate the results into a nice table to give you a quick overview of the results? Well, of course not! R, being your servant rather than your master, should run them all for you and output the results into a nice clean table, right?

First up though, you should probably consider why you’re doing this. Running multiple unadjusted t-tests for different groups within the same data is often a very bad idea. Per usual, xkcd provides an awesome explanation as to why.


If you’re looking at the differences between multiple subgroups in one experiment, you should often prefer, for example, an ANOVA to several t tests. This helps avoid inflating the chance of a type 1 error – falsely concluding you found an effect when in reality there isn’t one. I have also seen instances of people running multiple (unadjusted) t-tests after an ANOVA has reported an effect, in order to determine between which of the ANOVA’s groups the difference is significant – but this holds the same sort of danger. In that case, there are plenty of standard post-hoc tests it would be more legitimate to use instead – Tukey’s HSD test, for instance, is similar to a t test, but contains adjustments to reduce the error rate.
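As a sketch of what that alternative might look like in R – on a hypothetical single-experiment dataset with three groups – you can run one ANOVA via aov(), then let TukeyHSD() identify which pairs of groups actually differ:

```r
# Hypothetical data: one outcome measured across three groups
set.seed(42)
veg <- data.frame(
  group   = rep(c("a", "b", "c"), each = 10),
  carrots = c(rnorm(10, mean = 3), rnorm(10, mean = 3.5), rnorm(10, mean = 5))
)

# One ANOVA in place of three pairwise t tests
fit <- aov(carrots ~ group, data = veg)
summary(fit)

# Tukey's HSD post-hoc test: all pairwise comparisons,
# with the family-wise error rate adjusted for
TukeyHSD(fit)
```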

Anyway, enough with the theoretical warnings. Perhaps you do have a valid use case. Or perhaps you realise that the below technique really allows you to run many different kinds of functions on subsets of a dataset, not just t tests. These may be other types of hypothesis test, or functions entirely unrelated to such testing. If so, you can think of this as a template.

Let’s think through the mechanics of what we want to do. Let’s take a contrived (and fake) scenario in which we conducted 2 experiments, each involving 2 sets of people: a control and a test group. We measured how many of various different types of fruits and vegetables they ate. We’re interested in whether the groups within an individual experiment differed in terms of the mean number of each type of food eaten.

Here’s a few rows of the sample data this non-existent experiment generated.

experiment_number  group  person_number  apples  bananas  carrots  dandelions  eggplant
Experiment 1       a      1              1       3        3        3           1
Experiment 1       a      3              1       1        3        3           5
Experiment 1       b      103            2       5        3        5           2
Experiment 1       b      104            8       1        3        1           2
Experiment 2       c      7              19      5        8        0           3
Experiment 2       c      12             4       5        6        1           0
Experiment 2       d      111            16      3        6        0           5
Experiment 2       d      113            12      3        4        4           0

Our goal here, for the sake of argument, is to perform a t test for each experiment (identified by the “experiment number” column), comparing the 2 relevant groups of participants (the “group” column) in terms of how many apples they ate. Then, separately, on how many bananas they ate, and so on through the other edibles. Then we want to move on to experiment 2 and repeat the same exercise, until we have a hypothesis test result for each experiment on each of the foodstuffs. Yes, I know, if this was a real analysis, we’d need to revisit the above xkcd cartoon and think hard about whether this is the appropriate design for analysing the experiment.

But, that aside, this means we need to:

  1. Select only the group of data that has the first experiment_number, i.e. anything with experiment_number = Experiment 1 here.
  2. Select the first outcome variable of interest (apples)
  3. Run a t test on it, comparing participant groups against each other (here group a vs group b).
  4. Record the results of that t test (let’s say the t statistic, the p value, and the estimated difference in means) as a row in a results table
  5. Go back to step 2, but moving to the next outcome variable, here examining bananas instead of apples, until we’ve tested all the outcome variables.
  6. Go back to step 1, moving to the next experiment number, i.e. Experiment 2, and repeating the whole process again.

In this dataset – 2 experiments with 5 outcomes each – that’s 10 t-tests. I don’t want to look at 10 of the default outputs from R’s t.test function, thank you. I’d prefer a result that looks something like this…

experiment_number  food_type   estimated mean of first group  estimated mean of second group  t statistic  p value
Experiment 1       apples      9.05                           9                               0.03         0.97
Experiment 1       bananas     2.5                            2.875                           -1.02        0.31
Experiment 2       dandelions  2.16                           2.84                            -1.76        0.08
Experiment 2       eggplant    3.2                            3.66                            -0.86        0.39

…where we’ve a table, or “dataframe” as R calls it, showing 1 row per t test with the most relevant statistics.
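For comparison, those six steps could be hand-rolled as a naive nested loop – a sketch, using a made-up miniature version of the wide data with just 2 foods:

```r
# A literal, loop-based sketch of the steps above, on a made-up
# miniature version of the wide-format data (2 experiments, 2 foods)
foods_data <- data.frame(
  experiment_number = rep(c("Experiment 1", "Experiment 2"), each = 4),
  group   = c("a", "a", "b", "b", "c", "c", "d", "d"),
  apples  = c(1, 2, 2, 8, 19, 4, 16, 12),
  bananas = c(3, 1, 5, 1, 5, 4, 3, 2)
)

results <- data.frame()
for (exp in unique(foods_data$experiment_number)) {           # steps 1 and 6
  exp_data <- foods_data[foods_data$experiment_number == exp, ]
  for (food in c("apples", "bananas")) {                      # steps 2 and 5
    test <- t.test(exp_data[[food]] ~ exp_data$group)         # step 3
    results <- rbind(results, data.frame(                     # step 4
      experiment_number = exp,
      food_type         = food,
      t_statistic       = unname(test$statistic),
      p_value           = test$p.value
    ))
  }
}
results
```

It works, but it’s verbose and brittle – so let’s build something tidier.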

OK, let’s go, building it up step by step. Being R, of course there’s a multitudinous variety of ways we could do this. Being a tidyverse fan, I’m going with what could be described as a dplyr approach. We’re going to need to install and load the following libraries to use this method.

  • tidyr: to get the data into a “tidy” format.
  • dplyr: to manipulate the data into the groups you wish to hypothesis test on.
  • broom: to tidy up the results of the default R t.test output into a nice tabular format

So, load up these libraries, remembering to install the packages first if you’ve not already done so, as below.

install.packages("tidyr") # only needed if you haven't already installed tidyr
install.packages("dplyr") # only needed if you haven't already installed dplyr
install.packages("broom") # only needed if you haven't already installed broom

library(tidyr) # to get data into "tidy" format
library(dplyr) # to manipulate data
library(broom) # to tidy up the output of the default t test function.

Step 2 is to transform the data table of experiment results from a “wide” format to a “long” format.

Many tidyverse functions, and much of R in general, work better with long data. We know our lowest level of grouping – i.e. the granularity we want to conduct our tests at – is at a per experiment, per outcome variable level. For example, we’ll need to group all of the “apples” data for “Experiment 1” together so we can analyse it with an individual test.

If we’re going to group this way, then the experiment number and the outcome variable type should probably each be individual columns. The experiment_number already is. But the outcome variables aren’t – they each have their own column. We thus want to get from this situation: 1 row per person who participated in the experiment…

experiment_number  person_number  group  apples  bananas  carrots
Experiment 1       1              a      1       2        3
Experiment 2       7              c      19      24       6

…to this format – 1 row per outcome variable per person. Note how the experiment number, group, and person number are repeated down the rows, in order to keep track of whose apples outcome the row concerns.

experiment_number  group  person_number  food_type  count_eaten
Experiment 1       a      1              apples     1
Experiment 1       a      1              bananas    2
Experiment 1       a      1              carrots    3
Experiment 2       c      7              apples     19
Experiment 2       c      7              bananas    24
Experiment 2       c      7              carrots    6

We’re using a “food_type” column to record which outcome variable we’re talking about (apples vs bananas), and the “person_number” column to ensure we keep track of which entries belong to the same person.

This “unpivoting” is a perfect use-case for tidyr’s gather() function. Here we go:

results <- gather(data, key = "food_type", value = "count_eaten", -experiment_number, -person_number, -group)

The "key" parameter of the command lets us set the name of the column that will store the name of the outcome variable we're looking at. The "value" parameter lets us set the name of the column that will store the actual result obtained by the participant for the specified outcome variable.

Where we see "-experiment_number, -person_number, -group" at the end of the function call, that's to tell it that – because experiment number, person number and group are not outcome variables – we don’t want to collapse their contents into the same column as the count of apples, bananas etc. We need these variables to be replicated down the list to know which experiment, person and group the respective outcome variable was obtained from.

Note the minuses in front of those variable names. This means “don’t unpivot these columns”. If you only had a few columns to unpivot, you could instead write the names of the columns you do want to unpivot there, without the minuses.
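So, in this case, naming the food columns directly would be equivalent – a sketch, on a one-row toy version of the wide data:

```r
library(tidyr)

# One-row toy version of the wide-format data
data <- data.frame(
  experiment_number = "Experiment 1",
  group             = "a",
  person_number     = 1,
  apples            = 1,
  bananas           = 3,
  carrots           = 3
)

# Name only the columns you DO want to unpivot; no minuses needed
long <- gather(data, key = "food_type", value = "count_eaten",
               apples, bananas, carrots)
long
```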

Let’s check what “results” now looks like, by typing head(results).


OK, that looks very much like what we were aiming for. Onto the next step!

Now we need to ensure our analysis is only carried out using the data from a single experiment for a single outcome. Our outcomes here are the food type counts. For example, we want to consider the data from Experiment 1’s apples outcome separately from both Experiment 1’s bananas outcome, and Experiment 2’s apples outcome. In other words, we want to group the data by which experiment it’s from and which food type it’s measuring.

Handily, dplyr has a group_by function. Pass in the data you’re working with, and the names of the columns you want to group it by.

results <- group_by(results, experiment_number, food_type)

Checking the head of the results now shows output pretty similar to the previous step, only this time it notes, quite faintly, near the top, that we’ve grouped by the appropriate variables.


Now, it’s time to do the actual t tests!

First of all, remember that we don’t want a bunch of those verbose default t.test outputs to work with. We want a nice table summarising the most important parts of the test results. We thus need to make the key parts of t test output “tidy“, so it can be put into a table, aka dataframe.

How to tidy the results of a hypothesis test? Use broom. Hooray for R package names that attempt some kind of pun.

Let’s first take a moment to see how tidying works with a single t test, run in the conventional way.

Here’s the output of the default t.test function looking at the apples outcome of experiment 1. First I constructed a dataframe “exp1_apples_data”, which contains only data about apples, from experiment 1. Then I ran the standard t.test – being explicit about many of the options available, although if you like the defaults you don’t really need to be.

exp1_apples_data <- data %>%
  filter(experiment_number == "Experiment 1") %>%
  select(group, person_number, apples)

t_test_results <- t.test(apples ~ group,
                         data = exp1_apples_data,
                         alternative = "two.sided",
                         mu = 0,
                         paired = FALSE,
                         var.equal = FALSE,
                         conf.level = 0.95)


This outputs:


Now, if we apply the tidy() function from the broom package to the results of the t test, what do we get?

Yes, a nice tidy 1-row summary for the results of the t test. Comparing it to the default output above, you can see how:

  • field “estimate1” corresponds to the mean in group a
  • “estimate2” corresponds to the mean in group b (and the “estimate” field corresponds to the difference between the mean in group a and b)
  • “statistic” corresponds to the t statistic
  • “p.value” corresponds to the p-value
  • “parameter” corresponds to the degrees of freedom
  • “conf.low” and “conf.high” correspond to the confidence interval
  • “method” and “alternative” record the type & nature of the test options selected
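Since tidy() hands these back as columns of an ordinary one-row dataframe, each field can be pulled out directly. A quick sketch, using R’s built-in sleep dataset as a stand-in two-group example:

```r
library(broom)

# A two-sample t test on the built-in "sleep" dataset
t_test_results <- t.test(extra ~ group, data = sleep)
tidied <- tidy(t_test_results)

tidied$estimate1  # mean of the first group
tidied$p.value    # the p value, ready for filtering or sorting
```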

So now we have a set of grouped data, and a way of producing t test results in a tabular format. Let’s combine!

For this, we can use the dplyr function “do“. This applies what the documentation refers to as an “arbitrary computation” (i.e. almost any function) to a set of data, treating each of the groups we already set up separately.

The first parameter is the name of the (grouped) dataset to use. The second is the function you want to perform on it.

results <- do(results, tidy(t.test(.$count_eaten ~ .$group,
                                   alternative = "two.sided",
                                   mu = 0,
                                   paired = FALSE,
                                   var.equal = FALSE,
                                   conf.level = 0.95)))


Note that you have to express the t test formula as .$count_eaten ~ .$group rather than the more usual count_eaten ~ group. Inside the do() function, the . represents the current group of data, and the $ is the standard R way of accessing the relevant column of a dataframe.

What do we have now? Here's the head of the table.


Success! Here we see a row for each of the tests we did –  1 per experiment per food type –  along with the key outputs from the t tests themselves.

This is now just a regular data frame, so you can sort, filter, manipulate or chart it as you would with any other dataset. Figuring out which of your results have a p-value of 0.05 or lower, for instance, could be done via:

filter(results, p.value <= 0.05)
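Given the earlier warnings about running many unadjusted tests, this is also a natural place to correct the p values for multiple comparisons before filtering. A sketch using base R’s p.adjust, on a toy stand-in for the results table:

```r
# Toy stand-in for the results table produced above
results <- data.frame(
  food_type = c("apples", "bananas", "carrots"),
  p.value   = c(0.001, 0.040, 0.200)
)

# Holm-adjust the p values for the number of tests run,
# then keep only the comparisons that survive the correction
results$p.adjusted <- p.adjust(results$p.value, method = "holm")
subset(results, p.adjusted <= 0.05)
```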

Below is a version where the commands are all pieced together using the pipe operator. That’s likely easier to copy and paste should you or I need to do so. Do remember that this technique is applicable – and perhaps more often theoretically valid – outside the world of t tests.

The tidy() command itself is capable of tidying up many types of hypothesis tests and models so that their results appear in a useful table format. Beyond that, many other functions that return something coercible into a row of a dataframe, or an item in an R list, will also work.
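For instance, if your data didn’t meet the t test’s assumptions, you could swap in a Wilcoxon rank-sum test and leave the rest of the pipeline untouched – a sketch, on a hypothetical single-experiment slice of the long-format data:

```r
library(dplyr)
library(broom)

# Hypothetical long-format data for one experiment and one food
long_data <- data.frame(
  experiment_number = rep("Experiment 1", 8),
  group             = rep(c("a", "b"), each = 4),
  food_type         = rep("apples", 8),
  count_eaten       = c(1, 2, 3, 1, 8, 2, 5, 9)
)

# Same shape of pipeline, different test
wilcox_results <- long_data %>%
  group_by(experiment_number, food_type) %>%
  do(tidy(wilcox.test(.$count_eaten ~ .$group)))
wilcox_results
```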

results <- data %>%
  gather(key = "food_type", value = "count_eaten",
         -experiment_number, -person_number, -group) %>%
  group_by(experiment_number, food_type) %>%
  do(tidy(t.test(.$count_eaten ~ .$group,
                 alternative = "two.sided",
                 mu = 0,
                 paired = FALSE,
                 var.equal = FALSE,
                 conf.level = 0.95)))

My favourite R package for: correlation

R is a wonderful, flexible, if somewhat arcane tool for analytics of all kinds. Part of its power, yet also its ability to bewilder, comes from the fact that there are so many ways of doing the same, or similar, things. Many of these ways are instantly available thanks to many heroes of the R world creating and distributing free libraries to refine existing functionality and add new abilities.

Looking at a list from one of the most popular sources for these packages, CRAN, shows that its collection gets new entries on a several-times-per-day basis; there are 11,407 of them at the time of writing.

With that intimidating stat in mind, I will keep a few notes on this blog as to my current favourite base or package-based methods for some common analytics tasks. Of course these may change regularly, based on new releases or my personal whims. But for now, let’s tackle correlations. Here I mean simple statistical correlations between 2 sets of data, the most famous one of which is likely the Pearson correlation coefficient, aka Pearson’s R.

What would I like to see in my ideal version of a correlation calculator? Here’s a few of my personal preferences in no particular order.

  1. Can deal with multiple correlation tests at once. For example, maybe I have 5 variables and I’d like to see the correlation between each one of them and each of the other 4 variables.
  2. Visualises the results nicely, for example in a highlighted correlation matrix. Default R often produces informative but somewhat uninspiring text output. I have got spoiled with the luxury of data visualisation tools so after a heavy day’s analysis I prefer to take advantage of the many ways dataviz can make analytic output easier to decipher for humans.
  3. If the output is indeed a dataviz, I have a slight preference for it to use ggplot charting all other things being equal. Ending up with a proper ggplot object is nice both in terms of the default visual settings vs some other forms of R chart, and also that you can then in theory use ggplot methods to adapt or add to it.
  4. Can produce p values, confidence intervals, or some other way of suggesting whether any correlations found are statistically significant or not.
  5. Can handle at least a couple of types of correlation calculations, the most common of which are probably Pearson correlation coefficient and Spearman’s rank correlation coefficient.

Default R has a couple of correlation commands built in to it. The most common is probably “cor“. Here’s an example of what it produces, using a test dataset named test_data of 5 variables, named a, b, c, d and e (each in its own column).



So, it does multiple tests at once, and can handle Pearson, Spearman and Kendall correlation calculations, via changing the “method” parameter (which defaults to Pearson if you don’t specify, as in my example). But it doesn’t show the statistical significance of the correlations, and a potentially large text table of 8-place decimal numbers is not the sort of visualised output that would help me interpret the results in a particularly efficient way.
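Switching calculation type is just that one-parameter change. For instance, for Spearman’s rank correlation – sketched here on a small made-up test_data:

```r
# Small made-up test_data: b tracks a closely, c is unrelated noise
set.seed(1)
test_data <- data.frame(a = 1:20)
test_data$b <- test_data$a + rnorm(20)
test_data$c <- rnorm(20)

cor(test_data)                       # Pearson, the default
cor(test_data, method = "spearman")  # Spearman's rank correlation
```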

A second relevant default R command is “cor.test“. This one only allows you to make a single correlation comparison, for example between variable a and variable b.

cor.test(test_data$a, test_data$b)


So here we see it does return both a p value and a confidence interval to help us judge the significance of the correlation size noted. You can change the alternative hypothesis and confidence interval range via parameters. It can also do the same 3 types of correlation that “cor” supports. But, as noted, it can only compare two variables at once without further commands. And the output is again a bunch of text. That is really OK here, as you are focusing only on one comparison. But it’s going to be pretty tedious to run and decipher if you want to compare each one of a few variables against each of the others.
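As with cor(), those parameters are easy to override. For example, a one-sided test at a 99% confidence level, on the same sort of made-up test_data:

```r
# Small made-up test_data with two related columns
set.seed(1)
test_data <- data.frame(a = 1:20)
test_data$b <- test_data$a + rnorm(20)

# One-sided alternative hypothesis, 99% confidence level
cor.test(test_data$a, test_data$b,
         alternative = "greater",
         conf.level  = 0.99)
```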

So, is there a package solution that makes me happy? As you might guess, yes, there’s actually a few contenders. But my current favourite is probably “ggcorrplot“. The manual is here, and there’s a handy usage guide here.

Suffice to say:

  1. It allows you to compare several variables against each other at the same time.
  2. It visualises the variables in a colour-coded correlation matrix
  3. The visualisation is a ggplot
  4. It can produce p values, using the accompanying function cor_pmat(), which can then be shown on the visualisation in various ways.
  5. It uses the results from the built in cor() function, so can handle the same 3 types of correlation.

There’s a bunch of options to select from, but here’s the default output:

# calculate correlations using cor and store the results
corrdata <- cor(test_data)

# use the package's cor_pmat function to calculate p-values for the correlations
p.mat <- cor_pmat(test_data)

# produce a nice highlighted correlation matrix
ggcorrplot(corrdata, title = "Correlation matrix for test data")

The results look like:


You can see it produces a correlation matrix, colour coded as to the direction and strength of the correlations. It doesn’t show anything about the statistical significance. Kind of pretty for an overview glance, but it could be rather more informative.

I much prefer to use a couple of options that show the actual correlation values and the significance; the ones I most commonly use probably being this set.

ggcorrplot(corrdata, title = "Correlation matrix for test data", lab = TRUE, p.mat = p.mat, sig.level = 0.05)


Here, the correlation coefficients are superimposed on the grid, so you can check immediately the strength of the correlation rather than try and compare to the colour scale.

You can also see that some of the cells are crossed out (for example the correlation between variable c and variable e in the above). This means that the correlation detected is not considered to be significant at the 0.05 level. That level can be changed, or the insignificant correlations be totally hidden if you prefer to not get distracted by them in the first place.
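Hiding those insignificant cells entirely is one more parameter: ggcorrplot’s insig option. A sketch, reconstructing corrdata and p.mat on made-up data:

```r
library(ggcorrplot)

# Made-up test data to demonstrate with
set.seed(1)
test_data <- data.frame(a = rnorm(30), b = rnorm(30), c = rnorm(30))
test_data$d <- test_data$a + rnorm(30, sd = 0.3)

corrdata <- cor(test_data)
p.mat    <- cor_pmat(test_data)

# insig = "blank" hides insignificant correlations entirely,
# rather than crossing them out
ggcorrplot(corrdata,
           lab = TRUE,
           p.mat = p.mat,
           sig.level = 0.05,
           insig = "blank",
           title = "Correlation matrix for test data")
```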

The Datasaurus: a monstrous Anscombe for the 21st century

Most people trained in the ways of data visualisation will be very familiar with Anscombe’s Quartet. For the uninitiated, it’s a set of 4 fairly simple looking X-Y scatterplots that look like this.

Anscombe's Quartet

What’s so great about those then? Well, the reason data vizzers get excited starts to become clear when you realise that the dotted grey lines I have superimposed on each small chart are in fact the mean average of X and Y in each case. And they’re basically the same for each chart.

The identikit summary stats go beyond mere averages. In fact, the variance of both X and Y (and hence the standard deviation) is also pretty much the same in every chart. As is the correlation coefficient of X and Y, and the regression line that would be the line of best fit were you to generate a linear model based on each of those 4 datasets.

The point is to show the true power of data visualisation. There are a bunch of clever-sounding summary stats (r-squared is a good one) that some nefarious statisticians might like to baffle the unaware with – but they are oftentimes so summarised that they can lead you to an entirely misleading perception, especially if you are not also an adept statistician.

For example, if someone tells you that their fancy predictive model demonstrates that the relationship between x and y can be expressed as “y = 3 + 0.5x”, then you have no way of knowing whether the dataset the model was trained on was that from Anscombe 1, for which it may possibly be a good model, or Anscombe 2, for which it is not, or Anscombe 3 and 4, where outliers make that model sub-par in reality – to the point where a school child issued with a sheet of graph paper could probably make a better one.
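Incidentally, you don’t have to take any of this on trust: the quartet ships with base R as the anscombe dataset, so the identical summary stats are easy to verify.

```r
# Anscombe's quartet is built into base R as "anscombe",
# with columns x1..x4 and y1..y4
data(anscombe)

# Means: every x averages 9, every y averages ~7.5
round(sapply(anscombe, mean), 2)

# Correlation and regression line per dataset:
# ~0.816 and roughly y = 3 + 0.5x every time
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(correlation = cor(x, y), coef(lm(y ~ x)))
})
```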

Yes analytics end-users, demand pictures! OK, there are so many possible summary stats out there that someone expert in collating and mentally visualising the implication of a combination of a hand-picked collection of 30 decimal numbers could perhaps have a decent idea of the distribution of a given set of data – but, unless that’s a skill you already have (clue: if the word “kurtosis” isn’t intuitive to you, you don’t, and it’s nothing to be ashamed of), then why spend years learning to mentally visualise such things, when you could just go ahead and actually visualise it?

But anyway, the quartet was originally created by Mr Anscombe in 1973. Now, a few decades later, it’s time for an even more exciting scatterplot collection, courtesy of Justin Matejka and George Fitzmaurice, taken from their paper “Same Stats, Different Graphs“.

They’ve taken the time to create the Datasaurus Dozen. Here they are:

Datasaurus Dozen.png

What what? A star dataset has the same summary statistics as a bunch of lines, an X, a circle or a bunch of other patterns that look a bit like a migraine is coming on?

Yes indeed. Again, these 12 charts all have the same (well, extremely similar) X & Y means, the same X & Y standard deviations and variances, and also the same X & Y linear correlations.

12 charts are obviously more dramatic than 4, and the Datasaurus dozen certainly has a bunch of prettier shapes, but why did they call it Datasaurus? Purely click-bait? Actually no (well, maybe, but there is a valid reason as well!).

Because the 13th of the dozen (a baker’s dozen?) is the chart illustrated below. Please note that if you found Jurassic Park to be unbearably terrifying you should probably close your eyes immediately.

Datasaurus Main

Raa! And yes, this fearsome vision from the distant past also has an X mean of 54.26, a Y mean of 47.83, an X standard deviation of 16.76, a Y standard deviation of 26.93 and a correlation coefficient of -0.06, just like his twelve siblings above.

If it’s hard to believe, or you just want to play a bit, then the individual datapoints that I put into Tableau to generate the above set of charts is available in this Google sheet – or a basic interactive viz version can be found on Tableau Public here.
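Alternatively, if you’d rather check the numbers in R itself, the same data has (I believe) been packaged up on CRAN as datasauRus – treat the package and column names here as assumptions:

```r
library(datasauRus)  # assumed CRAN package bundling the Datasaurus Dozen
library(dplyr)

# One row of summary stats per dataset: the numbers should be
# near-identical for the dino and all twelve siblings
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean_x = mean(x), mean_y = mean(y),
            sd_x   = sd(x),   sd_y   = sd(y),
            corr   = cor(x, y))
```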

Full credit is due to Wikipedia for the Anscombe dataset and Autodesk Research for the Datasaurus data.

Simpson’s paradox and the importance of segmentation

Here’s a classic business analysis scenario, which I’d like to use to illustrate one of my favourite mathematical curiosities.

Your marketers have sent out a bunch of direct mail to a proportion of your previous customers, and deliberately withheld the letters from the rest of them so that they can act as a control group.

As analyst extraordinaire, you get the job of totting up how many of these customers came back and bought something. If the percentage is higher in the group that received the mail, then the marketers will be very happy to take credit for increasing this year’s revenue.

However, once the results are back, it’s not looking great. Aggregating and crosstabulating, you realise that actually the people who were not sent the mail were a little more likely to return as customers.

Sent marketing?  Count of previous customers in group  Count of customers returning to store  Success rate
No               300                                   40                                     13%
Yes              300                                   32                                     11%

You go to gently break the bad news to the marketing team, only to find them already in fits of joy. Some other rival to your analytical crown got there first, and showed them that their mailing effort in fact attracted a slightly higher proportion of both men and women to come back to your shop. A universally appealing marketing message – what could be better? Bonuses all round!

Ha, being the perfect specimen of analytical talent that you are, you’ve got to assume that your inferior rival messed up the figures. That’s going to embarrass them, huh?

Let’s take a look at your rival’s scrawlings.

Gender  Sent marketing?  Count of previous customers in group  Count of customers returning to store  Success rate
Female  No               200                                   37                                     19%
Female  Yes              100                                   21                                     21%
Male    No               100                                   3                                      3%
Male    Yes              200                                   11                                     6%

  • So, a total of 100+200 people were sent the letter. That matches your 300. Same for the “not-sent” population.
  • 21 + 11 people who were sent the letter returned, that matches your 32.
  • 37 + 3 people who were not sent the letter returned, again that matches your 40.

What what what??!

This is an example of “Simpson’s Paradox”, which luckily has nothing to do with Springfield’s nuclear reactor blowing up and rupturing a hole in the very fabric of mathematical logic.

Instead, the disparities in the sample size and propensity of the two gender populations to return as customers are coming into play.

Whether they received the mailing or not, the results show that women were much more likely to return and shop at the store again anyway. Thus, whilst the marketing mail may have had a little effect, the “gender effect” here was much stronger.

Gender could therefore be considered a confounding variable. This could have been controlled for when setting up the experiment, had it been tested and known how important gender was with regard to the rate of customer return beforehand.

But apparently no-one knew about that, or thought to test the basic demographic hypotheses beforehand. As it happened, with whatever sample selection method was employed, only 100 of the 300 women were sent the mailing, versus 200 of the 300 men.

So, whilst the mailing was marginally successful in increasing the propensity of both men and women to return to the store, the aggregate results of this experiment hid it, because men – who were intrinsically far less likely to return to the store than women – were over-represented among the people chosen to receive the mailing.
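The whole paradox is small enough to reproduce in a few lines of R, using the figures from the tables above:

```r
# The four segments from the tables above
mail <- data.frame(
  gender   = c("Female", "Female", "Male", "Male"),
  sent     = c("No", "Yes", "No", "Yes"),
  n        = c(200, 100, 100, 200),
  returned = c(37, 21, 3, 11)
)

# Aggregated over gender, the un-mailed group looks better...
aggregate_rate <- tapply(mail$returned, mail$sent, sum) /
                  tapply(mail$n, mail$sent, sum)
round(aggregate_rate, 2)  # No: 0.13, Yes: 0.11

# ...yet within each gender, the mailed group did better
mail$rate <- round(mail$returned / mail$n, 3)
mail
```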

New website launch from the Office for National Statistics

Yesterday, the UK Office for National Statistics, the institution that is “responsible for collecting and publishing statistics related to the economy, population and society”, launched its new website.

As well as a new look, they’ve concentrated on improving the search experience and making it accessible to mobile device users.

The front page is a nice at-a-glance collection of some of the major time series one sees in the news (employment rate, CPI, GDP growth etc.). And there’s plenty of help-yourself downloadable data; they claim to offer 35,000 time series, which you can explore and download with their interactive time series explorer tool.

The Sun and its dangerous misuse of statistics

Here’s the (pretty abhorrent) front cover of yesterday’s Sun newspaper.


Bearing in mind that several recent terrorist atrocities are top of everyone’s mind at the moment, it’s clear what the Sun is implying here.

The text on the front page is even more overt:

Nearly one in five British Muslims have some sympathy with those who have fled the UK to fight for IS in Syria.

The implication is obviously a starkly ominous claim that 20% of Britain’s Muslims support the sort of sick action that IS have claimed responsibility for in Paris and other places that have experienced horrific, crazed, assaults in recent times.

This is pushed even more fervently by the choice to illustrate the article solely with a photo of Jihadi John, an inexcusably evil IS follower “famous” for carrying out sick, cruel, awful beheadings on various videos.

Given the fact that – most thankfully – there are far, far fewer people featuring on videos of beheadings than there are people travelling to Syria to join fighters, this is clearly not a representative or randomly chosen photo.

Sun writer Anila Baig elaborates in a later column:

It beggars belief that there are such high numbers saying that they agree with what these scumbags are doing in Syria. We’ve all seen the pictures of how innocent Yazidi girls are treated, how homosexuals are thrown off tall buildings. It’s utterly abhorrent.

The behaviours she describes are undoubtedly beyond abhorrent.

But of course, this 1 in 5 statistic isn’t true – even in the context of the small amount of data that they have used to support this story.

It is however a very dangerous claim to make in the current climate, where “Islamophobia” and other out-group prejudices make the lives of some people that follow Islam – or even look like a stereotype of someone that might – criminally unpleasant. Britain does have statutes that cover the topic of hate speech after all, and hate speech can come from many quarters.

Anyway, enough of the rants (kind of) and onto the statistics.

There are three main points I describe below, which can be summarised as:

  • The survey this headline is based on did not ask the question that the paper implies.
  • The paper deliberately misleads its readers by failing to give easily-available statistical context to the figures.
  • The methodology used to select respondents for the survey was anyway so inadequate that it’s not possible to tell how representative the results are.



The Sun’s claim that 1 in 5 British Muslims have sympathy for Jihadis (the headline) and those who fight for IS (in main text) comes from a survey they commissioned, which was executed by professional polling company Survation. You can read a summarised version of the quantitative results here.

The “20% sympathise with Isis” claim and its implications are based on responses to question 6 (page 9 of the PDF results above), which asked people to say which of a set of sentences they agreed with the most. The sentences were of the form “I have [a lot of/some/no] sympathy with young Muslims who leave the UK to join fighters in Syria”.

Results were as follows, and the Sun presumably added up the first 2 boxes to get to their 20%, which isn’t a bad approximation. Note though that an equally legitimate headline would have been “95% of Muslim respondents do not have a lot of sympathy for jihadis”.

Response % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria 5.3%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria 14.5%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria 71.4%
Don’t know 8.8%

However, compare the claim that they have sympathy with those ‘who have fled the UK to fight for IS’ and ‘they agree with what these scumbags are doing…homosexuals thrown off tall buildings’ (even ignoring the implications of supporting the sort of criminal mass murder seen in Paris) with the question actually asked.

There was no mention of IS or any particular act of terrorism or crime against human rights in the question whatsoever.

The question asks about joining fighters in Syria. Wikipedia has a list of the armed groups involved in the Syrian war. At the time of writing, they have been segmented into 4 groups: the Syrian Arab Republic and allies; the Syrian Opposition + al-Qaeda network and allies; the Kurdish self-administration and allies; and ISIL (aka IS) and allies.

There are perhaps in the order of 200 sub-groups (including supporters, divisions, allies etc.) within those divisions, of which the huge majority are not affiliated with IS. Even the UK is in the “non-lethal” list, having donated a few million to the Syrian opposition groups.

To be fair, the question did ask about joining fighters rather than the c. 11 non-lethal groups. But we should note that – as well as the much-examined stream of people apparently mesmerised by evildoers to the extent of travelling to fight with them – there was also a Channel 4 documentary a few months ago showing a different example of this. In it, we saw 3 former British soldiers who had decided to travel to Syria and join fighters – the ones who fight against IS. I do not know what religion, if any, those 3 soldiers followed – but is it possible someone might feel a little sympathy towards the likes of them?

It is not necessarily a good thing for someone to be travelling abroad to join any of these groups with a view to violent conflict, and I am convinced that some people do travel to join the most abhorrent of groups.

But, the point is that, despite what the Sun wrote, the question did not mention IS or any of their evil tactics, and could have in theory suggested some sort of allegiance to very many other military-esque groups.

The question only asks whether the respondent has sympathy for these young Muslims who travel abroad.

To have sympathy for someone does not mean that you agree with the aims or tactics of the people they are persuaded to go and meet.

Google tells us that the definition of sympathy is:

feelings of pity and sorrow for someone else’s misfortune.

One can quite easily imagine a situation where, even if you believe these people are travelling to Syria specifically to become human weapons targeting innocent victims en masse, you can still have some sympathy for the young person involved.

It seems plausible to have some sympathy for a person who has been brainwashed, misguided, or preyed on by evildoers, and who feels their quality of life is such that the best option for their future is to go and join a group of fighters in a very troubled country. Their decisions may be absurd – perhaps they may even end up involved in some sick, awful, criminal act for which no excuses could possibly apply – but you could have some sympathy for a person being recruited to a twisted and deadly cause, whilst vehemently disagreeing with the ideology and actions of the criminal group that tries to attract them.

And, guess what, the Sun handily left out some statistics that might suggest that is some of what is happening.

For every such survey that concentrates on the responses of a particular population, it’s always important to measure the base rate, or a control group rate. Otherwise, how do you know whether the population you are concentrating on is different from any other population? It’s very rare that any number is meaningful without some comparative context.

As it happens, a few months ago the same polling company ran a survey on behalf of Sky News that asked the same question to non-Muslims. The results can be seen here, on page 8, question 5, reproduced below.

As the Sun didn’t bother providing these “control group” figures, we’re left to assume that no non-Muslim could ever “show sympathy” to the young Muslims leaving the UK to join fighters. But…

Response % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria 4.3%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria 9.4%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria 76.8%
Don’t know 9.6%

So, 14% of non-Muslims respond that they have a lot or some sympathy with this group of travellers. Or as the Sun might headline it: “1 in 7 Brit non-Muslims sympathy for jihadis” (just below the same picture of a lady in a bikini, obviously).

14% is less than 20% of course – but without showing the base rate, the reader is left to assume that 20% means “20% compared to zero”, which is not the case.

Furthermore, in some segments of the surveyed population, the sympathy rates among non-Muslims are higher than among Muslims.

The Sun notes:

The number among young Muslims aged 18-34 is even higher at one in four.

Here’s the relevant figures for that age segment from a) the poll taken to support the Sun’s story, and b) the one that asked the same question to non-Muslims.

Response Muslims aged 18-34 Non-Muslims aged 18-34
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria 6.9% 10.9%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria 17.6% 19.2%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria 66.2% 52.2%
Don’t know 9.3% 17.7%

So-called “Jihadi sympathisers” aged 18-34 make up a total of 24.5% of Muslims, and 30.1% of non-Muslims.

Another headline, dear Sun writers: “1 in 3 Brit non-Muslims youth sympathy for jihadis thankfully moderated by less sympathetic Muslim youth?”

A similar phenomenon can be seen when the results are broken down by non-Islamic religions. Although some of these figures are masked due to small samples, one can infer from the non-Muslim survey that more than 20% of the sample who classified themselves as belonging to some religion other than Christian, Muslim or “not religious” were at least somewhat sympathetic to these young Muslims who go to Syria.

As a final cross-survey note, the same question was also put to a Muslim population earlier in the year, in March, again by Survation for Sky News. Here are the results of that one:

Response % selected
I have a lot of sympathy with young Muslims who leave the UK to join fighters in Syria 7.8%
I have some sympathy with young Muslims who leave the UK to join fighters in Syria 20.1%
I have no sympathy with young Muslims who leave the UK to join fighters in Syria 61.0%
Don’t know 11.1%

Totting up the support for having some/a lot of sympathy for the young Muslims leaving the UK for Syria, we see that the proportion showing any form of sympathy fell from 27.9% in March to 19.8% now in November.

That’s a relatively sizeable fall, again not mentioned in the Sun’s article (presumably because it would spoil the conclusion they want the reader to draw). Here’s another headline I’ll donate to the writers: ‘Dramatic 30% fall in Brit Muslims’ sympathy for jihadis’.
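As a quick arithmetic sanity check of that donated headline, here’s a sketch in R using only the percentages quoted from the two polls above:

```r
march    <- 7.8 + 20.1   # 27.9% showing a lot/some sympathy in March
november <- 5.3 + 14.5   # 19.8% in the November (Sun-commissioned) poll

relative_fall <- (march - november) / march
round(100 * relative_fall)  # about 29, i.e. close to a "30% fall"
```

So the relative fall is around 29%, which the headline rounds to 30%.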

Next up, time to talk about the method behind the latest survey.

Regular readers of the Sun might have noticed that normally the formal surveys they do are carried out by Yougov. Why was it Survation this time? The Guardian reports that Yougov declined the work because

it could not be confident that it could accurately represent the British Muslim population within the timeframe and budget set by the paper.

So rather than up the timeframe or the budget, the Sun went elsewhere to someone that would do it cheaper and quicker. Survation apparently complied.

Given most surveys cannot ask questions to every single person alive, there’s a complex science behind selecting who should be asked and how to process the results to make it representative of what is claimed.

Here’s the World Bank on how to pick respondents for a household survey. Here’s an article from the Research Methods Knowledge Base on selecting a survey method. There is far more research on best practice in polling besides, and this is one reason why engaging professional pollsters is often a necessity if you want vaguely accurate results, especially on topics as controversial as this.

However, none of the research I’ve seen has ever suggested that one should pick out a list of people whose surname “sounds Muslim” and ask them the questions. Incredibly, that is apparently the shortcut Survation used, given they didn’t have the time or money to do the detailed and intricate work necessary to generate a statistically representative sample of all British Muslims.

It might be that Survation did happen to choose a perfectly representative sample, but the problem is we just do not know. After talking to other pro-pollsters, the Guardian noted:

It cannot be determined how representative the Survation sample is because of a lack of various socioeconomic and demographic details.

Even if the rest of the story had been legit, then – being generous with the free headlines – the Sun would have been more accurate to write “1 in 5 of the 1,500 people we had the time to ring who had ‘Muslim-sounding surnames’, and did in fact agree that they were Muslim, sympathise with jihadis”. But it’s a bit less catchy, and even the non-pros might notice something methodologically curious there.

So, to summarise – the Sun article, which seems to me to tread dangerously near breaking various legislation on hate speech in principle if not practice, is misleadingly reporting on the results of a questionnaire:

  • that did not even ask the question that would be appropriate to make the claims it is headlining.
  • without providing the necessary comparative or historical perspective to make the results in any way meaningful.
  • that was carried out in inadequate, uncontrolled fashion with no view to producing reliable, generalisable results.

We have to acknowledge that, on the whole, people going away to join fighter groups is a bad, sad event and one that the world needs to pay attention to. Infinitely more so, obviously, if they are travelling to commit atrocities with groups that can be fairly described as evil.

But for the Sun to imply such dramatic and harmful allegations about a section of the British population against whom prejudice is already very apparent (note the 300% increase in UK anti-Muslim hate crime last week), to its huge readership – who will now struggle to entirely forget the implication during their dinner-table conversations even if they want to – is not only extremely poor quality data analysis, but also downright irresponsible and even dangerous.

Kruskal Wallis significance testing with Tableau and R

Whilst Tableau has an increasing number of advanced statistical functions – a case in point being the newish analytics pane from Tableau version 9 – it is not usually the easiest tool to use to calculate any semi-sophisticated function that hasn’t yet been included.

Various clever people have tried to work some magic around this, for instance by attempting “native” Z-testing. But in general, it’s sometimes a lot of workaroundy effort for little gain. However, a masterstroke of last year’s Tableau version was that it included the ability to interface with R.

R is a statistical programming language that really does have every statistical or modelling function you’ll ever need. When someone invents new ones in the future, I have no doubt an R implementation will be forthcoming. Its flexibility and comprehensiveness (and price…£0) are incredible. However, it is not easy for the average person to use. For a start, the default interface is a command line, and – believe it or not – there are people alive today who never had the singular joy of the command line being the only way to use a computer.

It must be said that in some ways Tableau has not really made R super easy to use. You still need to know the R language in order to use R functionality inside Tableau. But using the Tableau interface into R has 2 huge benefits.

  1. It integrates into your existing Tableau workspace. The Tableau interface will push whatever data you have in Tableau – with all your filters, calculations, parameters and so on – to R and wait for a result. There’s no need to export several files with different filters and read.csv(“myfile1.csv”) them all into R. Once you’ve set up the R script, it reruns automatically whenever your data changes in Tableau.
  2. It visualises the results of the R function immediately and dynamically, just like any other Tableau viz. R has a lot of cool advanced graphics capability, but it can be hard to remember quite how to use it, and most of the basic components offer no interactivity. No drag-and-drop pills or “Show Me” to be seen there!

So recently when I needed to perform a Kruskal Wallis significance test over data I was already analysing in Tableau, this seemed the obvious way to do it.

Kruskal-Wallis is fully explained on Wikipedia. To over-simplify for now: it allows you to detect whether there is a difference in data that comes from multiple groups – as the more famous ANOVA does, but without requiring the data to be parametric.

This is classic hypothesis testing. With my questionnaire, we start with the hypothesis that there is no difference in the way that different groups answer the same question, and see if we can disprove that.

In honesty, the fact it’s a KW test is not all that important to this post. Using R to perform a certain test in Tableau is pretty much the same as it would be for any other test, aside from needing to know the name and syntax of the test in R. And Google will never let you down on that front.

In this case, I wanted to analyse questionnaire results scored on a Likert-type scale to determine, on several dimensions, whether the answering patterns were significantly different or not. For instance, is there a real difference in how question 1 is answered based on the respondent’s age? Is there a real difference in how question 2 is answered based on the respondent’s country of origin? How confident can we be that any difference in averages represents a real difference, and is not the result of random noise?

I wanted to do this for about 30 questions over 5 different dimensions – a total of 150 Kruskal-Wallis tests – and I didn’t overly fancy doing it manually.

Had I got only a copy of pure R to hand, this might be how I would have tackled the question – assuming that I had first exported the responses to question 1, together with each respondent’s group, to a CSV file in a format like this:

RespondentID Group Score
R1 Group A 4
R2 Group B 5
R3 Group B 5
R4 Group A 3
R5 Group C 1

q1_data <- read.csv("question1.csv")  # the exported data above; filename illustrative
# Column 3 holds the score, column 2 the group; rename them for the test
q1_dataframe <- setNames(data.frame(q1_data[3], q1_data[2]), c("scores", "groups"))
kruskal.test(scores ~ groups, data = q1_dataframe)

This would give me output like the below.

Kruskal-Wallis rank sum test
data: scores by groups
Kruskal-Wallis chi-squared = 2.8674, df = 2, p-value = 0.2384

To understand the syntax and what it returns, see the man page for kruskal.test here.

But to pick out a single point, it shows that the “p-value” – the probability of seeing differences in scores at least this large between the groups purely through the random fluctuations of chance – is 0.2384, aka 24%. Numbers above 5-10% are conventionally held to mean that any difference seen has not been statistically established. Here there’s about a 1 in 4 chance that differences like these in the answers would arise purely through random chance.

There are endless discussions/controversies as to the use of P values, and statistical vs “clinical” differences – but this is the end point we are aiming for right now.
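Incidentally, the object kruskal.test returns is a standard R list, so the p value can be pulled out programmatically – which matters later when we only want that single number back. A quick sketch using the 5 example rows from the table above (note that with only these 5 rows the p value naturally differs from the full dataset’s 0.2384):

```r
# Recreate the toy data from the example table
scores <- c(4, 5, 5, 3, 1)
groups <- factor(c("Group A", "Group B", "Group B", "Group A", "Group C"))

result <- kruskal.test(scores ~ groups)

# The result is a list whose names include "statistic", "parameter",
# "p.value", "method" and "data.name"
result$p.value  # just the p value, as a single decimal number
```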

I could then have done the above process for the other 149 combinations I wanted to test and constructed a table of P results (or scripted to do the same), but it seemed more fun to use Tableau.
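For completeness, scripting it in pure R might look something like the sketch below – assuming, hypothetically, that all responses sit in one long-format data frame called responses with a Question column, a Score column, and one column per grouping dimension (all names here are invented for illustration; in my real case there were 30 questions and 5 dimensions):

```r
# Hypothetical long-format data: one row per respondent per question
responses <- data.frame(
  Question = rep(c("Q1", "Q2"), each = 6),
  Score    = c(4, 5, 5, 3, 1, 2,  5, 5, 4, 1, 2, 1),
  AgeBand  = rep(c("18-34", "35+"), times = 6),
  Country  = rep(c("UK", "UK", "FR"), times = 4)
)

dimensions <- c("AgeBand", "Country")  # the grouping dimensions to test

# Run one Kruskal-Wallis test per question x dimension combination
results <- do.call(rbind, lapply(unique(responses$Question), function(q) {
  sub <- responses[responses$Question == q, ]
  data.frame(
    Question  = q,
    Dimension = dimensions,
    p.value   = sapply(dimensions, function(d)
      kruskal.test(sub$Score ~ factor(sub[[d]]))$p.value)
  )
}))

results  # a tidy table: one p value per question/dimension pair
```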

First you have to set up Tableau to interface with R correctly. This means running an R server (called Rserve, handily enough), which is of course also free, and can be started from within the R interface if you have it open. Here are Tableau’s handy instructions, but the setup can be boiled down to:

install.packages("Rserve")
library(Rserve)
Rserve()

A few notes:

  • It’s fine to run the Rserve on the same computer as Tableau is on.
  • You only have to enter the first line the first time you use Rserve.
  • Capitalisation of Rserve matters for some commands!
  • If you are going to use Rserve a lot more than you use R you can start it from an exe file instead, as detailed in the above Tableau article.
  • Workbooks using R will not work in Tableau Reader or Tableau Online.

Then carry on the process of setting up R as per the Tableau article. Notably you have to look in the Help menu (a strange place for it perhaps) for “Settings and performance” then “Manage R connection” and then type in the name of your Rserve host. If you’re running it on your computer then choose “localhost” with no user/password.

R connection

Next, assuming you have your Tableau workbook already set up with your questionnaire data, create a new worksheet which filters to show only a single question, and has each unique person that answered that question and the grouping variable on the rows. This layout is important, as will be discussed later.


Now it’s time to write the formula that interfaces between Tableau and R.

This Tableau blog post explains in detail how to do that in general – but the essence is to use a calculated field with one of Tableau’s SCRIPT_ functions, chosen to correspond to the type of result (string, integer etc.) that you want to retrieve from R. You pass it literal R code, using placeholder variables .arg1, .arg2 … .argN to tell it which data you want Tableau to send to R.

The calculated field will then send the R script – with the relevant data inserted into the placeholders as R vectors – to the R server, and wait for a result. The result can then be used inside Tableau just as any other table calculation can.

We saw above that the p value retrieved from a Kruskal-Wallis test comes back as a decimal number. This is what Tableau calls a “real” number, so we use the SCRIPT_REAL function.

Within that function we enter R code resembling the last line in my R-only attempt, which was the one that actually did the calculations.

You can put more than one R code line in here if you want to. However note that a Tableau R function can only handle returning a single value or vector by default. Here I have therefore specifically asked it to return just the p value as it would be hard to handle the general table of results and associated text. Read the R documentation for any function to understand which other types of value are possible to return from that function.

So, in Tableau, here’s the calculated field to create.

Kruskal Wallis p score:

SCRIPT_REAL("kruskal.test(.arg1 ~ .arg2, data = data.frame(.arg1, .arg2))$p.value", AVG([Score]), ATTR([Group]))

If you compare it to my first R script, you can see that I want it to replace “.arg1” with the first argument given after the R script (AVG([Score])) and “.arg2” with the group (ATTR([Group])). It’s safe, and necessary, to use ATTR() as you have to pass an aggregation. Because we set the layout to be 1 row per respondent, we know each row will have only a single group associated with that respondent (assuming the relationship between respondents and groups is 1:many, which it needs to be for a successful KW test).

You could use .arg3, .arg4… etc. if you needed to pass more data to R, but that’s not necessary for the Kruskal-Wallis test.

You can then double click your new calculated field and it’ll appear in the text pill and display the p value we saw above (remember that 0.2384 number? Hooray, it matches!) repeatedly, once per respondent. The nature of the KW test is that it returns one value that describes the whole population.

KW test
Being Tableau, we can filter to show another single question (see below), change the respondent pool, highlight, exclude and do all sorts of other things to the data, and the new p value for the given breakdown is immediately shown (subject to the rules of table calculations… remember that SCRIPT_ functions are table calculations!).

But what if we want the Kruskal P value for all the questions in the survey in one table, and not have to iterate manually through every question?

We can’t just unfilter or add questions willy-nilly, because a single p value will be calculated for the whole dataset of ( question 1 + question 2 ) responses, instead of one being calculated for ( question 1) and another for ( question 2 ). This is unlikely to be useful.

KW whole population

See how there’s no 0.2384 value in the above. Instead the same number is repeated for every single question and respondent. It compares the scores as though they were all for the same question.

By virtue of the SCRIPT_ Tableau functions being a table calculation though, this one is relatively easy to solve.

Right clicking on the R calculation field gives the normal Edit Table Calculation option. Pick that, and then select “Advanced” under Compute Using.

Simplistically, we want to partition the P values that are calculated and returned to 1 per survey question, so put “Question” in the partition box and the other fields into the addressing box.

Advanced compute by

Hit OK, Tableau refreshes its R stuff, and all is good. That 0.2384 is back (and the 0 for question two is the result of rounding a number so small that the p value shows significance at all practical levels.)
1 row per response

However, we’re getting 1 row for every respondent to every question instead of a nice summary of one row per question.

We can’t just remove the respondent ID from the Tableau view because, whilst this gives the layout we want, the SCRIPT_REAL calculation will pass only the aggregated values of the respondent data to R, i.e. one average value per question in this case, not 1 per person answering the question.

This will give either an error or a wrong answer, depending on which R function you’re trying to call. In this case it would be passing the average response to the question to R to perform the KW test on, whereas R needs each individual response to the question to provide its answers.

The solution to this issue is the same as the classic trick they teach in Tableau Jedi school that allows data to be dynamically hidden yet remain within scope of a table calculation. Namely that when you use a table calculation as a filter, the calculations are mostly done before a filter is applied, in contrast to a non-table calculation where the calculations are done on the post-filtered data.

So, create a simple dummy table calculation that returns something unique for the first row. For instance, create a calculated field just called “Index” that contains the following innocuous function:

INDEX()
Now if you put that on filters and set it to only show result “1” (the first result) you will indeed get 1 result shown – and the 0.2384 P value remains correct.

1 row

But we actually wanted the first result per question, right, not just the single first result?

Luckily our Index function is also a table calculation, so you can apply exactly the same compute-by operation we did above to our R-based field.

Right click the Index filter pill, go back to “Edit table calculation”, and fill in the Compute Using -> Advanced box in the same way as we previously did for our R calculation function, putting “Question” into the partition box and the rest into addressing.


1 row per question

In the interests of neatness we can hide the now-redundant respondent ID and group columns, as they don’t tell us anything useful here. Right click the respective Rows pill and untick “Show header” to turn the display off, whilst leaving the pill in place and hence in the level of detail.

Clean table

Bonus tips:

  • You can send the values of Tableau parameters through to the R function, controlling either the function itself or the data it sends to R – so, for instance, allowing a novice user to change the questions scored or the groups compared without having to mess with any calculated fields or filters.
  • The R-enabled calculation can be treated like any other table calculation in Tableau – so you can use it in other fields, modify it and so on. In this example, you could create a field that highlights questions where the test showed the differences to be significant with 95% certainty, like this.

Highlight 95% significance:

IF [Kruskal Wallis test p value] <= 0.05 THEN "Significant" ELSE "Not significant" END

Pop it on the colours card and you get:

Highlight significance

Now it’s easy to see that, according to Kruskal-Wallis, there was a definite difference in the way that people answered question 2 depending on which group they were in, whereas at this stringent level, there was no proven difference in how they answered question 1.

You can download the complete workbook here.