Do Millennials have no friends?

I recently read an article claiming that 22% of Millennials say they have no friends. And then many other articles with the same figure. This made me feel sad. Some of the articles further distinguished between “close” and “best” friends, so here we’re presumably talking about just any friend of any level at all.

Sure, being a human I have felt peaks and troughs of loneliness over my life-history to date, but I’m not sure I’d cope well if I felt like I had no-one in the ‘friend’ category at all. The thought that nearly a quarter of the young-but-definitely-adult generation feel that way today was quite shocking and depressing to me.

Outside of my personal feelings, it is an increasingly well-established fact that loneliness – which I have to imagine strongly associates with having no friends – is not only extremely unpleasant for most folk, but actually harmful to one’s physical and mental health. People with stronger social connections may literally live longer. One can go too far with the ‘you may as well take up smoking’ type headlines, but there does seem to be something potentially life-and-death within this subject.

But before getting too upset for the local young adults, I did want to check in on the data itself. Millennials do get a bad rap. Most famously perhaps, we’re all supposed to believe that the reason young people don’t own houses is nothing to do with the fact houses cost an insane amount of money, and everything to do with their high expenditure on avocado toast. Somehow a stereotype seems to have developed in some quarters that the reason not every millennial has a job is because they’re lazy (nothing to do with the supply side of the job market, naturally), they’re selfish, narcissistic and constantly going around maliciously killing various industries and destroying other venerable and much-loved institutions, including DUIs, divorce and porn. After all of that, headlines that involve the word “Millennials” do tend to induce a slight level of skepticism in me.

Reassuringly, it turns out the friendship data was from a survey conducted by a reputable enough survey company, YouGov. And they were good enough to publish the full, albeit heavily aggregated, results of the survey itself. OK, in a horrible PDF format, but it didn’t take too long to extract the details of the folk who responded ‘zero’ to the “How many friends do you have?” question in a way conducive to constructing a few breakdowns of these folks below, and satisfying a bit of personal curiosity.

TLDR: Yes, 22% of Millennials did say they had no friends, the highest of all surveyed generations. But it’s not clear to me at all that it’s because they’re Millennials. For example, 27% of black people said the same. And what does ‘friend’ even mean in this survey?

Have some reading time on your hands? Well, in accordance with the guidance given in the original data file, any groups where the number of participants surveyed was less than 50 will not be shown, because these very small samples are considered by YouGov to be statistically unreliable. I will however note what the ‘missing’ categories are, in case it helps clarify who is or isn’t in each category.

Unfortunately there didn’t seem to be a whole lot of other statistical significance info in the data file; no confidence intervals or the like. So it’s not clear to me to what extent small percentage differences should be considered “real”. But they are a reputable enough company who have at least taken the time to re-weight the respondents to represent a base of all US adults and talk about the limitations of too-small samples, so I’m going to go wild and assume that we might care about at least the larger differences.
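To give a rough feel for the uncertainty involved, here is a minimal sketch of a 95% confidence interval around a 22% figure. The subgroup size of 300 is purely hypothetical (it is not taken from the YouGov file), and the calculation ignores the survey weighting, so treat it as an illustration rather than the survey’s real margin of error.

```r
# Hypothetical inputs: n_respondents is an assumed subgroup size, not a YouGov figure
n_respondents <- 300
p_no_friends <- 0.22

# Normal-approximation (Wald) 95% interval for a proportion from a simple random sample
standard_error <- sqrt(p_no_friends * (1 - p_no_friends) / n_respondents)
margin <- qnorm(0.975) * standard_error
ci <- c(lower = p_no_friends - margin, upper = p_no_friends + margin)

round(ci, 3)
```

Under those assumptions the interval runs from roughly 17% to 27%, so a gap of a couple of percentage points between two groups of that size would be hard to tell apart from noise.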

By generation

Missing categories:
– Gen Z (people born in the year 2000 and later)
– Pre-Silent generation (1927 and earlier)

So this is the data the articles focussed on. Sure enough, Millennials were more likely to report having no friends than the other groups. Are we seeing a uniquely lonely generation? Well, it’s possible. However, to be honest, it’s not possible to tell if Millennials are “special” here from this data.

There are other potential explanations, including that – by definition – each generation here must have been a different age when they were surveyed.

The survey was carried out in 2019, and YouGov here defines a Millennial as being someone born between 1982 and 1999 (the exact definition varies depending on who you ask – so always best to check the data source!). So these folk were between 20 and 37 years old. Compare that to the seemingly more friend-enabled ‘Silent Generation’, who in this analysis would have been between 74 and 91 years old.

Perhaps – and I’m not presenting any evidence here to suggest you should believe this over any other hypothesis – it’s just normal that older people are less likely to report having no friends than younger people.

Are changes in the number of people reporting having no friends really a ‘cohort effect’, which is what a lot of the headlines about this survey imply? More data-digging would be needed to determine that, as opposed to whether this is, for instance, an aging effect.

An aging effect is a change in variable values which occurs among all cohorts independently of time period, as each cohort grows older.

A cohort effect is a change which characterizes populations born at a particular point in time, but which is independent of the process of aging.

A period effect is a change which occurs at a particular time, affecting all age groups and cohorts uniformly.

Source: Distinguishing aging, period and cohort effects in longitudinal studies of elderly populations
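One way to see why a single survey struggles to separate these effects: in a one-off 2019 snapshot, a respondent’s age is fully determined by their birth year. The sketch below uses the generation boundary years mentioned above plus one made-up birth year in between.

```r
# A single cross-section: every respondent is observed in the same survey year
survey_year <- 2019
birth_year <- c(1928, 1945, 1965, 1982, 1999)  # illustrative birth years only
age <- survey_year - birth_year

data.frame(birth_year, age)

# Age and birth year are perfectly negatively correlated, so one survey wave
# alone cannot tell an aging effect apart from a cohort effect
cor(birth_year, age)
```

Only repeated surveys, where the same birth cohorts are observed at different ages, would break that link.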

By gender

Not a whole lot to say here. Males seem slightly more likely to report having no friends than females, but the gender differences are much less than between generations. Without knowing the confidence intervals of the responses it’s also hard to know how significant these differences are.

By region

Again, only relatively small differences are seen when the respondents are split up into what region of the US they live in.

By race

OK, here are some large differences again!

The difference between Black and White respondents – 16 percentage points – is actually the same level of difference as between Millennials and the generation with the very lowest % of people reporting having no friends.

It’s interesting that many of the articles reporting on this survey focused on the generation as opposed to the race. There may be a legitimate reason why, but it’s not self-evident to me. It seems if we’re worried that Millennials may be lonely, the same concern might be needed for non-white folk too.

By education level

The big differences keep on coming! At a glance, this looks like a strong positive link between having higher levels of education and having a friend.

By income

Can money buy you friends? Traditionally we tend to say no. But having a higher income sure does seem to reduce the likelihood of you feeling like you have no friends at all.

(I am not sure exactly what income was asked for – from the values, I’d assume this is something like annual household income, but should verify before stating that to be the case!)

By urbanity of area lived in

The traditionalist’s view that city living includes being surrounded by hordes of other people, but feeling personally lonely, seems directionally borne out in these results. Univariately at least, urban dwellers are more likely to report having no friends.

By marital status

Missing categories
– Civil partnership
– In a relationship, not living together
– Separated
– Other
– Prefer not to say

So is getting married the end of all friendships, with the happy couple dumping their pals so as to get on with journeying their way through mortgages, careers and other misc adulting? Seemingly not. People who are, or once were, married were a lot less likely to report having no friends than those who were not.

By whether being a parent or guardian of any children

Missing categories:
– Don’t know / Prefer not to say

Likewise, parenting doesn’t appear to remove your entire friendship circle (or at least if it does, maybe you end up replacing them all with new parenty-friends over the years). Having kids, especially ones who are now adults, seems to make you less likely to report having no friends.

So, to summarise:

Are Millennials more likely than other well-represented generations to report having no friends in this survey? Yes, they are. But we need more data to understand if this is a “Millennial generation” phenomenon vs a “being in your 20-30s” phenomenon. After all, a 40-year-old has had longer to find a friend out there in the wilderness!

Millennials aren’t the only group to report having no friends

Applying the same analysis to the rest of the survey results shows the existence of several other ‘risk factors’ for reporting no friends.

Excluding the variables that show only a couple of percentage point differences between categories, these added-risk groups include:

  1. not being white
  2. having a low level of education
  3. having a low income
  4. living in an urban area
  5. not being or having been married
  6. not having children, especially of adult age

So there’s potential for confounding here. Let’s imagine a world where being born in the 1980-90s did not actually affect your friend count. If any of the above factors are over-represented in Millennials in comparison to other groups, we could still see the same overall effect.
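A toy calculation may make this concrete. Every number below is invented for illustration: marital status alone drives the “no friends” rate, identically in both generations, and the generations differ only in what share of them are married.

```r
# Invented rates: the chance of reporting no friends depends only on marital status
rate_no_friends <- c(married = 0.10, unmarried = 0.25)

# Invented composition: share of each generation in each marital status
share_younger <- c(married = 0.30, unmarried = 0.70)
share_older   <- c(married = 0.70, unmarried = 0.30)

# Each generation's overall rate is a weighted average of the same underlying rates
overall_younger <- sum(rate_no_friends * share_younger)
overall_older   <- sum(rate_no_friends * share_older)

c(younger = overall_younger, older = overall_older)
```

Here the younger group comes out at 20.5% against 14.5% for the older group, even though generation has no effect of its own in this made-up world.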

I don’t intend to dig up all the stats correlating age with the 6 bullet points above for this post, but even a modicum of websearching reveals sensible-sounding sources with claims like:

Relative to members of earlier generations, millennials are more racially diverse, more educated, and more likely to have deferred marriage; these comparisons are continuations of longer-run trends in the population. Millennials are less well off than members of earlier generations when they were young, with lower earnings, fewer assets, and less wealth.

Source: Are Millennials Different?

So that’s risk factors 1, 3 and 5 confounding away, mitigated perhaps by reverse-risk factor 2.

For reasons of age alone, it’s unlikely many millennials have adult children yet, and they don’t seem to be in a particular hurry to have any children at all. All this, whilst enjoying urban life, if they’re able to.

So, how to differentiate the root cause? Well, with the level of data published – and don’t get me wrong YouGov, I’m grateful any was! – it’s not really possible to. A more complex analysis using data at the individual person level, allowing us to look at the effect of generation controlling for other variables, and ideally comparing also with previous time series, would be the obvious start. Whilst that type of observational study is usually not able to prove causation beyond doubt, we might get closer towards understanding the likely fundamentals.
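As a sketch of what “controlling for other variables” could look like if respondent-level data were available, here is a logistic regression on invented counts. The data frame, its column names and all the numbers are hypothetical; they are constructed so that marital status, not generation, drives the outcome.

```r
# Invented counts, for illustration only: within each marital status the
# no-friends rate is identical for both generations (10% married, 25% unmarried)
toy_survey <- data.frame(
  millennial  = c(1, 1, 0, 0),
  married     = c(1, 0, 1, 0),
  no_friends  = c(30, 175, 70, 75),
  has_friends = c(270, 525, 630, 225)
)

# Unadjusted model: generation looks like a risk factor
unadjusted <- glm(cbind(no_friends, has_friends) ~ millennial,
                  family = binomial, data = toy_survey)

# Adjusted model: once marital status is included, the generation
# coefficient shrinks to (numerically) zero
adjusted <- glm(cbind(no_friends, has_friends) ~ millennial + married,
                family = binomial, data = toy_survey)

round(c(unadjusted = coef(unadjusted)[["millennial"]],
        adjusted   = coef(adjusted)[["millennial"]]), 3)
```

The unadjusted log-odds coefficient for being a Millennial is about 0.42, while the adjusted one is essentially zero. Real data would of course be messier, and a model like this still could not prove causation.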

It didn’t escape my notice – although of course I am not going to prove this to be true here – that many of the higher risk groups are those that society has often appeared to value less highly – poor people, non-white people, less-educated people, the unmarried; think of the sections of society sometimes defined by or over-represented in low “socioeconomic status” groups. Perhaps policies designed to assist those our current social structure apparently does not help so much may have a bonus side effect in the realm of strengthening social connections – and all the health and life benefits that go alongside that.

What is a friend anyway?

After discussing this statistic with a friend (see what I did there? True story though, and thank you, correspondent) I was reminded that the definition of a friend is itself rather woolly. One person’s friend is another person’s acquaintance, colleague or window-cleaner.

“Friendship is difficult to describe,” said Alexander Nehamas, a professor of philosophy at Princeton, who in his latest book, “On Friendship,” spends almost 300 pages trying to do just that. 

Source: Do Your Friends Actually Like You?

So another scenario in which millennials could be more likely than others to report having no friends – even if in reality they had the same level of social connections – would be if they define ‘friend’ differently, especially more stringently, than others. Some more qual-side digging into whether disparate generations define friends differently to each other would be useful to look into that hypothesis.

Related to definitions, there could also be something in how the question was asked. I did not see the original survey, but the YouGov write-up implies the question asked was

“Excluding your partner and any family members, how many of each of the following do you have?”

followed by a list comprising “acquaintances”, “friends”, “close friends” and “best friends”.

Most of the articles focus only on the “friends” result, where the 22% zero figure is seen. OK, fine – taken in isolation, “close friends” and “best friends” sound like subsets of friends, right?

But if presented with each of those options on the same screen, perhaps respondents might categorise each person they know into an exclusive category. So if you pop your pal Jimmy into the “best friends” box, perhaps you don’t also add him into the basic “friends” box.

Perhaps you feel close to all your friends, so have 10 close friends but no “non-close” friends, except those you categorise as acquaintances. In this way, a very close-friend-fulfilled person might be included in the no friends bucket when analysed one question at a time.

If this were the case, then whilst the articles aren’t reporting anything untrue, the figure may be misleading when taken in isolation. All this is only a theory of course, as I haven’t seen the precise flow of the original survey. Perhaps the questions were asked in a way less likely to cause this issue. But we can tell that the 4 categories aren’t being used entirely as subsets of each other, as 25% of Millennials report having no acquaintances, vs only 22% having no friends; i.e. more Millennials have friends than acquaintances.

(Out of interest, 27% of Millennials reported no close friends, and 30% having no best friends – and yes, pedants, some people did report having more than one ‘best’ friend).

Anyhow, wild hypothesising aside: More knowledge does give us a higher chance of developing effective remedies, if remedies are indeed needed. Which they likely are, in my opinion, no matter what the precise count of Millennials involved is or why, given the dramatic impact of loneliness on people’s lives. This is an important line of research that should likely be pursued with the full resources and rigour that serious issues around health and well-being deserve.

But in the meantime, associations between loneliness and all sorts of negative health and well-being effects have been repeatedly demonstrated. So if we’ve societal levers to pull, or personal practices to enact that have the potential to reduce any level of friendlessness, let’s get on and do it.

How to be happy: the data driven answer (part 1)

A fundamental goal for many people, explicit or otherwise, is to be maximally happy. Easily said, not always so easily done. So how might we set about raising our level of happiness? OK, at some level, we’re all individuals with our own set of wishes and desires. But, at a more macro level, there are underlying patterns in many human attributes that lead me to believe that learning what typically makes other people happy might be useful with regards to understanding what may help maintain or improve our own happiness.

With that in mind, let’s see if we can pursue a data driven answer to this question: what should our priorities or behaviours likely be if we want to optimise for happiness?

For a data driven answer, we’re going to need some data. I settled on the freely available HappyDB corpus (thanks, researchers!).

The corpus contains around 100,000 free text answers, crowdsourced from members of the public who had signed up to Mechanical Turk, to this question:

What made you happy today? Reflect on the past {24 hours|3 months}, and recall three actual events that happened to you that made you happy. Write down your happy moment in a complete sentence.

Write three such moments.

Examples of happy moments we are NOT looking for (e.g events in distant past, partial sentence):
– The day I married my spouse
– My Dog

It should be noted that the average Mechanical Turk user is not representative of the average person in the world and hence any findings may have limited generalisability. However, the file does contain some demographics which we may consider digging into later in case there’s any correspondence between one’s personal characteristics and what makes them happy. Perhaps most notably, around 86% of respondents were from the USA, so there will clearly be a geographic / cultural bias at play. However, being from the UK, a similarly “westernised country”, this may be less of a problem for me if I personally wish to act on the results. A more varied geographical or cultural comparison on drivers of happiness would however be a fascinating exercise.

I’m most interested in eliciting the potential drivers of happiness in a manner conducive to informing my mid-to-longer term goals. With that in mind, here I’m only looking at the longer term variant of the question – what made people happy in the past 3 months? A followup could certainly be whether or not the same type of things that people report having made them happy over a 3 month timeframe correspond to the things that people report making them happy on a day-to-day basis.

So, having downloaded the dataset, the first step is to read it into our analysis software. Here I will be using R. The researchers supply several data files, which are described on their GitHub page. Here I decided to use the file called “cleaned_hm.csv”, in which they’d generously cleaned up a few of the less valid looking entries, for instance if they were blank, had only a single word, or seemed to be misspelled.

Whilst some metadata is also included in that file, for the first part of this exercise I am only interested in these columns:

  • hmid: the unique ID number for the “happy moment”
  • wid: an ID number that tells us which “worker” i.e. respondent answered the question. One person may have responded to the question several times. It’s also a way to link up the demographic attributes of the respondent in future if we want to.
  • cleaned_hm: the actual (cleaned up) text of the response the user gave to the happiness question

We’ll also use the “reflection_period” field to filter down to only the responses to the 3 month version of the question. This field contains “3m” in the row for these 3-month responses.


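library(tidyverse)  # loads readr (read_csv) and dplyr (filter, select, %>%)

# read the cleaned happy moments file (the path is specific to my machine)
happy_db <- read_csv(".\\happy_db_data\\rit-public-HappyDB-b9e529e\\happydb\\data\\cleaned_hm.csv")

# keep only the 3-month responses and the three columns of interest
happy_data <- filter(happy_db, reflection_period == "3m") %>%
  select(hmid, wid, cleaned_hm)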

Let’s take a look at a few rows of the data to ensure everything’s looking OK.


OK, that looks good, and perhaps already gives us a little preview of what sort of things make people happy.

Checking the record count – nrow(happy_data) – shows we have 50,704 happy moments to look at after the filters have been applied. Reading them all one by one isn’t going to be much fun, so let’s load up some text analysis tools!

First, I wanted to tokenise the text. That is to say, extract each individual word from each response to the happiness question and put it onto its own row. This makes certain types of data cleaning and analysis way easier, especially if, like me, you’re a tidyverse advocate. I may want to be able to reconstruct the tokenised sentences in future, so will keep track of which word comes from which happy moment, by logging the relevant happy moment ID, “hmid”, on each row.
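To show what a one-word-per-row layout looks like, here is a minimal sketch using tidytext’s unnest_tokens on two made-up responses. The toy data frame is my own invention, and this is just an illustration of the shape of the result; the annotation step later in this post does its own tokenising.

```r
library(dplyr)
library(tidytext)

# Two made-up responses standing in for rows of the real response data
toy_responses <- tibble(
  hmid = c(1, 2),
  cleaned_hm = c("I went for a walk", "My dog learned a trick")
)

# One row per word, with the hmid of the originating response kept alongside.
# unnest_tokens lowercases the text by default.
toy_tokens <- unnest_tokens(toy_responses, word, cleaned_hm)
toy_tokens
```

Each of the two five-word sentences becomes five rows, and the hmid column says which response each word came from.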

Past experience (and common sense) tells me that some types of words may be more interesting to analyse than others. I won’t learn a lot from knowing how many respondents used the word “and”.

One approach could be to classify words as to their “part of speech” type – adjectives, verbs, nouns and so on. My intuition is that people’s happiness might be influenced by encountering certain objects (in the widest possible sense – awkwardly incorporating people into that classification) or engaging in specific activities. This feels like a potential close fit to the nouns and verb speech parts. So let’s begin by extracting each word and classifying it as to whether it’s a noun, verb or something else.

One R option for this is the cleanNLP library. cleanNLP includes functions to tokenise and “annotate” each word with respect to what part of speech it is. It can actually use a variety of natural language processing backends to do this, some more fancy than others. Some require installing non-R software like Python or Java, which, to keep it simple here, I’m going to avoid. So I’m going to use the udpipe backend, which is available as a regular R package needing no software outside R, and hence can be installed with the standard install.packages("udpipe") command if necessary.

Once cleanNLP is installed, we need to ask it to split up and tag our text as to whether each word is a noun, verb and so on. First we initialise the backend we want to use, udpipe. Then we use the cnlp_annotate command to perform the tokenisation and annotation, passing it:

  • the name of the dataframe containing the question responses: happy_data.
  • the name of the field which contains the actual text of the responses, as text_var.
  • the name of the field that identifies each individual unique response as doc_var.

This process can take a long time to complete, so don’t start it if you’re in a hurry.

Finally, we’ll take the resulting “annotation” object and extract the table of annotated words (aka tokens) from it, to make it easy to analyse with standard tidy tools.


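library(cleanNLP)

# initialise udpipe annotation engine
# (call missing from this listing; cleanNLP's cnlp_init_udpipe() is assumed)
cnlp_init_udpipe()

# annotate text (argument names as in the cleanNLP 2.x interface)

happy_data_annotated <- cnlp_annotate(happy_data, as_strings = TRUE, 
                                   text_var = "cleaned_hm", doc_var = "hmid")

# extract tokens into a data frame
# (final call missing from this listing; cnlp_get_token() from cleanNLP 2.x is assumed)

happy_terms <- happy_data_annotated %>% 
  cnlp_get_token()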

OK, let’s see what we have, comparing this output to the original response data for the first entry.




Lovely. In our annotated token data, we can see the id field matches the hmid field of the response file, allowing us to keep track of which words came from which response.

Each word of the response is now a row, in the field “word”. Furthermore, the word has also been converted to its lemma in the lemma field – a lemma being the base “dictionary” version of the word – for instance the words “studying” and “studies” both have the lemma “study”. This lemmatisation seems useful if I’m interested in a general overview of the subjects people mention in response to the happiness question, insomuch as it may group words together that have the same basic meaning.
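As a small illustration of why that grouping helps, here is a sketch on hand-written word and lemma pairs. The toy table only mimics the word and lemma columns of the annotated data; it is not output from the annotation itself.

```r
library(dplyr)

# Hand-written word/lemma pairs, mimicking the shape of the annotated token table
toy_terms <- tibble(
  word  = c("studying", "studies", "study", "friends", "friend"),
  lemma = c("study", "study", "study", "friend", "friend")
)

# Counting by lemma pools the different surface forms of a word together
lemma_counts <- count(toy_terms, lemma, sort = TRUE)
lemma_counts
```

Counting on the word column instead would give five separate rows of one each, which hides the fact that “study” in its various forms is the most common theme.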

Of course it can’t always be perfect – the above example in fact has a type of error, in that the word “nowing” has been lemmatised as “now”, whereas one has to assume the respondent actually meant “knowing”, with a lemma more like “know”. However, this is user error – they misspelt their response. I don’t fancy going through each response to fix up these types of errors, so for the purposes of this quick analysis I’ll just assume them to be relatively rare.

We can also see in the field “upos” that the words have been classified into their grammatical function – including the verbs and nouns we’re looking for. upos stands for “universal part of speech”, and you can see what the abbreviations mean here.

The “pos” field divides these categories up further, and can be deciphered as being Penn Treebank part-of-speech tags. In the above example, you can see some verbs are categorised as VBG and others as VB. This distinguishes the gerund or present participle form of a verb (VBG) from its base form (VB). I’m not going to concern myself with these differences, so will stick to the upos field.

Now we have this more usable format, there’s still a little more data cleansing I want to do.

  • One part of speech (hereafter “POS”) category is called PUNCT, meaning punctuation. I don’t really want to include punctuation marks in my analysis, so will remove any words that are classified as PUNCT.
  • Same goes for the category SYM, which are symbols.
  • I also want to remove stopwords. These are words like “the”, “a” and other very common words that are typically not too informative when it comes to analysing sentences at a per-word level. Here I used the snowball list to define which words are in that category. This list is included in the excellent tidytext package, the usage of which is documented in its companion book.
  • I’m also going to remove words that directly correspond to “happy”. Given the question itself is “what makes you happy?” I feel safe to assume that all the responses should relate to happiness. Knowing people used the word “happy” a lot won’t add much to my understanding.

Here’s how I did it:

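library(tidytext)  # provides the stop_words data frame

# drop punctuation and symbols, then snowball stopwords, then "happy"-type words
happy_terms <- filter(happy_terms, upos != "PUNCT" & upos != "SYM") %>%
  anti_join(filter(stop_words, lexicon == "snowball"), by = c("lemma" = "word")) %>%
  filter(!(lemma %in% c("happy", "happiness", "happier")))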

OK, now we have a nice cleanish list of the dictionary words from the responses. Let’s take a look at what they are! We’ll start with the most obvious type of analysis, a frequency count. What are the most common words people used in their responses?

Let’s start with “things”, i.e. nouns. Here are the top 20 nouns used in the answers to the happiness question.


# count noun lemmas and plot the 20 most frequent as a horizontal bar chart
filter(happy_terms, upos == "NOUN") %>%
  count(lemma, sort = TRUE) %>%  # sort = TRUE already orders by descending count
  slice(1:20) %>%
  ggplot(aes(x = reorder(lemma, n, mean), y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = "Most common nouns in responses to happiness question", x = "", y = "Count")

OK, there’s already some strong themes emerging there! Many nouns relate to people (friend, family, etc.), there’s a mention of jobs, and some more “occasion” based terms like birthday, event etc.

There’s also a lot of time frame indicators. Whilst it may eventually be interesting that there are more mentions of day than month, and more month mentions than year, for a simple first pass I’m not sure they add a lot. Let’s exclude them, and plot a larger sample of nouns. This time we’ll use a word cloud, where the size of each word is relative to its frequency of usage. Larger words are those that are used more often in the responses. For this, we can use the appropriately named library “wordcloud”.

It should be noted that, whilst wordclouds are more visually appealing than bar charts to many people, they are certainly harder to interpret in a precise way. However, here I’m more looking for the major themes, so I’ll live with that and go with the prettier option.


library(wordcloud)

# create a count of nouns, excluding some time based ones

happy_token_frequency_nlp_noun <- 
  filter(happy_terms, upos == "NOUN" & !(lemma %in% c("day", "time", "month", "week", "year", "today"))) %>%
  count(lemma, sort = TRUE)

# open a png graphics device so wordcloud gets saved to disk as a png

png("wordcloud_packages.png", width=12,height=8, units='in', res=300) 

# create the wordcloud

wordcloud(words = happy_token_frequency_nlp_noun$lemma, freq = happy_token_frequency_nlp_noun$n, max.words = 150, colors = c("grey60", "darkgoldenrod1", "tomato"))

# close the png device 

dev.off()

A wordcloud of happy nouns:


A cornucopia of happiness-inducing things! In this format, some commonalities truly stand out. It seems that friends make people particularly happy. Families are key too, although the number of times “family” is directly mentioned is lower than “friend”. That said, parts of family-esque structures such as son, daughter, wife, husband, girlfriend and boyfriend are also specifically mentioned with relatively high frequency. Basically, people are made happy by other people. Or even by our furry relatives – dog and cat are both in there.

Next most common, perhaps less intuitively, are words around what is probably employment – job and work.

There’s plenty of potentially “event” type words in there – event itself, but also birthday, date, dinner, game, movie et al. Perhaps these show what type of occasions are most associated with happy times. I can’t help but note that many of them sound again like opportunities for socialising.

There’s a few actual “things” in the more conventional sense too; insentient objects that you could own. They’re less prevalent in terms of mentions of any individual one, but car, computer, bike and phone are some relevant examples that appear.

And then some perhaps less controllable phenomena – weather, season and surprise.

It must be noted these interpretations have to rely on assumptions based on preexisting knowledge of how people usually construct sentences and the norms for answering questions. We’re looking at single words in isolation. There’s no context.

If people are actually writing “Everything except my friend made me happy” then the word “friend” would still appear in the wordcloud. Intuitively though, this seems unlikely to be the main driver of it featuring so strongly. We may however dig deeper into context later. For now, we can also increase our confidence in the most straightforward interpretation by noting that there’s a fair bit of external research out there that suggests a positive connection between friendship and happiness, and even health, that would support these results. Happify produced a nice infographic summarising the results of a few such studies.

Next up, let’s repeat the exercise using verbs, or as I vaguely recall learning at primary school, “doing words”. What kind of actions do people take that result in them having a happy memory?

The code is very similar to the nouns example. Firstly, let’s filter and plot the most common verbs we found in the responses.

# count verb lemmas and plot the 20 most frequent as a horizontal bar chart
filter(happy_terms, upos == "VERB") %>%
  count(lemma, sort = TRUE) %>%  # sort = TRUE already orders by descending count
  slice(1:20) %>%
  ggplot(aes(x = reorder(lemma, n, mean), y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = "Most common verbs in responses to happiness question", x = "", y = "Count")

The top 3 there – get, go and make – seem to be particularly prevalent compared to the rest. Getting things, going to places and the satisfaction of creating things are all things we might intuit are pleasing to the average human. However they’re somewhat vague, which makes me curious as to how exactly they’re being used in sentences. Are there particular things that show up as being got, gone to or made that generate happy memories?

To understand that, we’re going to have to look at a bit of, at least simplistic, context. We’ll start by looking at the most common phrases those words appear in. To do this, we’ll use the tidytext library to generate “bigrams” – that is to say, each 2-word combination within the response sentences that those specific words are used in.

For example, if there’s a sentence ‘I love to make cakes’, then the bigrams involving ‘make’ here are:

  • to make
  • make cakes

If we calculate and count every such bigram, then, if enough people enjoy making cakes, that might pop out in our data.
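The cake example can be sketched in a few lines with tidytext. The one-row data frame is made up for illustration, and the grepl filter is just one simple way of picking out the bigrams that contain a given word.

```r
library(dplyr)
library(tidytext)

# A single made-up response
toy_sentence <- tibble(hmid = 1, cleaned_hm = "I love to make cakes")

# Split the sentence into overlapping two-word sequences
toy_bigrams <- unnest_tokens(toy_sentence, bigram, cleaned_hm, token = "ngrams", n = 2)

# Keep only the bigrams that contain the whole word "make"
make_bigrams <- filter(toy_bigrams, grepl("\\bmake\\b", bigram))
make_bigrams$bigram
```

The sentence yields four bigrams in total (“i love”, “love to”, “to make”, “make cakes”), of which the last two involve “make”.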

First up though, remember that our analysis so far is actually showing the lemmas of the words used by the respondents, rather than the exact words themselves. So we’ll need to determine which words in our dataset are represented by the lemmas “get”, “go” and “make” in order to comprehensively find the appropriate bigrams in the text.

We can do that by looking for all the distinct combinations of lemma and word in our tokenised dataset. Here’s one way to see which unique real-world words in our dataset are represented by the lemma “get”:

# list each distinct (lower-cased) word form that was lemmatised to "get"
filter(happy_terms, lemma == "get") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct()

This means that we should look in the raw response text for any bigrams involving got, get, getting, gets or gotten if we want to see what the commonest contexts for the “get” responses are. To automate this a little, let’s create a vector for each of the 3 highly represented lemmas – get, go, and make – that include all the words that are grouped into that lemma.

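# one character vector per lemma, holding every word form that maps to it;
# pull() returns a plain vector, which the %in% tests further down need
get_words <- filter(happy_terms, lemma == "get") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct() %>%
  pull(word)

go_words <- filter(happy_terms, lemma == "go") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct() %>%
  pull(word)

make_words <- filter(happy_terms, lemma == "make") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct() %>%
  pull(word)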

The next step is to generate the bigrams themselves. Here I’m also removing the common, generally tedious, stopwords as we did before. This simplistic approach to stopwords does have some potentially problematic side-effects – including that if an interesting single word is surrounded entirely by stopwords then it will be excluded from this analysis. However, in terms of detecting a few interesting themes as opposed to creating a detailed linguistic analysis, I’m OK with that for now.

# create the bigrams

happy_bigrams <-  unnest_tokens(happy_data, bigram, cleaned_hm, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!(word1 %in% filter(stop_words, lexicon == "snowball")$word) & !(word2 %in% filter(stop_words, lexicon == "snowball")$word))

# create a count of how many times each bigram was found - sorting from most to least frequent

bigram_counts <- happy_bigrams %>%
  count(word1, word2, sort = TRUE)

# show the top 20 bigrams that use the 'get' lemma
filter(bigram_counts, word1 %in% get_words | word2 %in% get_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving the 'get' lemma", x = "", y = "Count")

What do people like to get?


“finally got” may not tell us much about what was gotten, but evidently achieving something one has waited for for some time is particularly pleasing in this context. Other obvious themes there are mentions again of other people – son got, wife got, friend got, get together etc. Likewise there’s a few life events there – getting married, and work stuff like getting promoted or hired.

But my curiosity remains unabated. What sort of things are people happy that they “finally got”? To get some idea, I decided to split the responses up into five-word n-grams –  think of these as being like bigrams, but looking at groups of 5 consecutive words rather than 2.

Groups of 5 words here will tend to produce low counts as people don’t necessarily express the same idea using precisely the same words – we may revisit this limitation in future! Nonetheless, it may give us a clue as to what some of the bigger topics are. We can use the unnest_tokens command from the tidytext library again. This is exactly what we did above to generate the bigrams, but this time with an n of 5, to specify that we want sequences of 5 words.
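For anyone who wants to see what a 5-gram looks like before running it over the whole dataset, here is a small base-R sketch. The ngrams() helper and the example sentence are both made up for illustration; they are not part of tidytext or of the survey data.

```r
# Hypothetical helper: every run of n consecutive words in a piece of text
ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# made-up response, not taken from the dataset
example_response <- "I finally got the job I wanted"
fivegrams <- ngrams(example_response, 5)
print(fivegrams)
```

A 7-word sentence yields only three 5-grams, and two respondents have to match on all five words for a phrase to be counted twice, which is why the counts drop so quickly as n grows.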

# make the 5-grams, keeping only those that were found at least twice
happy_fivegrams <- unnest_tokens(happy_data, phrase, cleaned_hm, token = "ngrams", n = 5) %>%
  separate(phrase, c("word1", "word2", "word3", "word4", "word5"), sep = " ") %>%
  group_by(word1, word2, word3, word4, word5) %>%
  filter(n() >= 2) %>%
  ungroup() # drop the grouping so the count below runs over the whole table

# count how many times each 5-gram was used
fivegram_counts <- happy_fivegrams %>%
  count(word1, word2, word3, word4, word5, sort = TRUE)

# show the top 15 that started with "finally got"
filter(fivegram_counts, word1 == "finally" & word2 == "got") %>%
  head(15)

Only a couple of people used most of the precise same phrases, but the most obvious feature that stands out here relates to employment. People “finally” getting new jobs, or progressing towards the jobs they want, are occasions that make them happy.

Back now to drilling down into the common verb lemma bigrams: where are people going when they make happy memories?

# show the top 20 bigrams that use the 'go' lemma   
filter(bigram_counts, word1 %in% go_words | word2 %in% go_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving 'go' lemma", x = "", y = "Count")  

Here, going shopping wins the frequency competition – although it’s certainly not usually an experience I personally enjoy! There’s a few references to family and friends again, and a predilection for going home. A couple of specific outdoor leisure activities – hiking and fishing – seem to suit some folk well. A less expected top 20 entry there was Pokemon Go, the “gotta-catch-em-all” augmented reality game, which, sure enough, does have the word “go” in its title. The human urge to collect apparently persists in the virtual world 🙂

Finally, what are people making that pleases them?

# show the top 20 bigrams that use the 'make' lemma 

filter(bigram_counts, word1 %in% make_words | word2 %in% make_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving 'make' lemma", x = "", y = "Count")

It’s not all about making in the sense of arts and crafts here. Making plans appeals to some – perhaps creating something that they can then look forward to occurring in the future. References to “moments” are pretty vague but perhaps describe happiness coming from specific events that could later be drilled down into.

There’s the ever-present social side of wives, sons, girlfriends et al featuring in the commonest bigrams. The hedonistic pleasures of money and sex also creep into the top 20.

OK, now we’ve dug a little into the 3 verb lemmas that dominate the most common verbs used, let’s take a look at the next most frequently used selection of verbs. Here again, we’ll use the (controversial) wordcloud, and hope it allows us to elucidate at least some common themes.

Removing “get”, “make” and “go” from the tokenised verb list, before wordclouding the most common of the remaining verb lemmas:

# filter the terms to show only verbs that are not get, make or go.

happy_token_frequency_nlp_verb_filtered <- filter(happy_terms, upos == "VERB" & !lemma %in% c("get", "make", "go")) %>%
  count(lemma, sort = TRUE)

# open a png graphics device so wordcloud gets saved to disk as a png

png("wordcloud_verb_filtered.png", width = 12, height = 8, units = "in", res = 300)

# create the wordcloud

wordcloud(words = happy_token_frequency_nlp_verb_filtered$lemma, freq = happy_token_frequency_nlp_verb_filtered$n, max.words = 150, colors = c("grey60", "darkgoldenrod1", "tomato"))

# close the png device

dev.off()

A wordcloud of happy verbs:


The highest volumes here are seen in experiential words – see and feel, vague as they might seem in this uni-word analysis. Perhaps though they hint towards the conclusions of  some external research that suggests, at least after a certain point of privilege, people tend to gain more happiness from spending their money on experiences as opposed to objects.

Beyond that, some hints towards – surprise, surprise – social interactions appear. Give, receive, take, visit, love, meet, tell, say, play, talk, share, participate, listen, speak, invite and call, amongst other words, may all potentially fall into that category.

There’s “spend”, a further analysis of which could differentiate between whether we’re typically talking about spending time, spending money, or both. A hint that spending money features at least to some extent is given by “buy” being a relatively common verb. “Work” also features, along with some almost-by-definition happiness inducers such as “win” and “celebrate”.

In terms of specific activities, we see hints at some intellectual pursuits: read, book, learn, know, graduate – together with some potentially fitness related activities: walk, run and move. Per the famous saying, eating and drinking also make some people merry.

So, in conclusion:

That’s the end of our first step here, essentially variations on a frequency analysis of the words used by respondents when they’re asked to recall what made them happy in the past 3 months. Did it produce any insights?

My main takeaway here is really the key criticality of the social. Happiness seekers should usually prioritise people, where there’s an option to do so.

Yes, shopping, cars and phones and a few other inanimate objects of potential desire popped up. We also saw a few recreational activities that may or may not be social. Here I’m thinking of walking, running, fishing and hiking. It’s perhaps also notable that those examples happen to be potentially physically active activities that often take place outside. There’s external research extolling the well-being benefits of both being outdoors and physical exercise.

Words associated with education, learning and employment also appeared with relatively high frequency, so a focus on optimising those aspects of life might also be worthwhile for improving satisfaction.

But references to friends, family and occasions that may involve other people dominated the discourse. Thus, when making life plans, it’s probably wise to consider the social – meaningful interaction with real live human beings – as a particularly high priority.

In our noun analysis above, the word “friend” stuck out. Whilst as life goes on, and often gets busier, it can sometimes be hard to find time to focus on these wider relationships, there’s external evidence that placing a high value on friendship is associated with improved well-being – potentially even more so than the variation in valuing family relationships.

These effects seem to associate with even the most dramatic measures of well-being, with several studies suggesting that people who maintain strong social relationships of any kind tend to be happier, healthier and even live longer.

Average age at menarche by country

A question came up recently about variations in the age at menarche – the first occurrence of menstruation for a female human –  with regards to the environment. A comparison by country seemed like a reasonable first step in noting whether there were in fact any significant, potentially environmental, differences in this age.

A quick Google search seemed to suggest a nice reliable chart summarising this data by country was surprisingly (?) hard to find. However, delving more into the academic publishing world, I did find a relevant paper called “International variability of ages at menarche and menopause: patterns and main determinants” (Thomas et al, in Hum Biol. 2001 Apr;73(2):271-90), which stated that the purpose of their study was:

to review published studies on the variability of age at menarche and age at menopause throughout the world, and to identify the main causes for age variation in the timing of these events.

Score! And sure enough, it does contain a lengthyish datatable showing the average age of menarche by country from a survey of prior literature that they did. However, it’s not super easy to immediately get a sense of the magnitude of any variation by looking at a datatable from a scanned PDF that was printed over 3 pages. Instead, I decided to extract the data from that table and visualise it in a couple of charts which are shown below, in case anyone else has a similar curiosity about the subject.

Firstly a simple bar chart. You may wish to click through to get a larger copy with more readable axes:

Bar chart

And now plotted geographically:

Age at menarche by country map


There’s also an interactive dashboard version available if you click here, where you can filter, highlight, hover and otherwise interact with the data.

If you have a use for the underlying data yourself, I uploaded it in a machine-readable format to, whereby you can connect directly to it from various analytics tools, use the querying or visualisation tools on the site itself, or download it for your own use.  Again, this data is all consolidated by the authors of the Human Biology study above, so full credit is due to them.

A couple of notes to be aware of:

Firstly, the study I took the data from was published in 2001 i.e. is not particularly recent. The menarche age averages shown were collected from other studies, which obviously took place before 2001 – some even decades before, and in countries that technically no longer exist! They therefore may not reflect current values.

It’s generally considered that menarche ages have changed over time to some extent – for example a study concerning self-reported age of menarche in the US “Has age at menarche changed? Results from the National Health and Nutrition Examination Survey (NHANES) 1999-2004.” (McDowell et al, J Adolesc Health. 2007 Mar;40(3):227-31) concluded that “Mean age of menarche declined by .9 year overall in women born before 1920 compared to women born in 1980-84” (and also that ages and changes differed by ethnicity, showing that country averages may mask underlying structural differences). A few other such studies I skimmed, usually focusing on individual countries, also tended to show trends in the same direction. One might hypothesize that countries where the general population has been through even more radical living-condition shifts than residents of the US over the past few decades may have seen larger changes than those reported by McDowell et al.

Secondly, some of the averages reported are means, others are medians. These may not be directly comparable. The bar chart I’ve shown above differentiates between those based on the colour of the bar. In the interactive version there’s the ability to filter to only show results based on a single one of those aggregation types.

So now onto the interesting theoretical part – OK, menarche age may differ between countries, but why? As noted, that was part of the question driving the authors of the source study, so you should certainly read their whole study to get their take. In summary though, they created a sequence of linear models.

The first one shows a negative association between life expectancy and menarche age.  OK, but what factors drive life expectancy in the first place, that also correlate with menarche?

They produced 2 alternative models to investigate that. The first was a combination of illiteracy rate (with a positive correlation) and vegetable calorie consumption (negative). The second kept illiteracy in, but switched in the country’s average gross national product for vegetable calorie consumption.

Vegetable calorie consumption is perhaps somewhat intuitive – it’s been previously found that what their paper describes as “good nutritional conditions” tend to lower the age of menarche, with the body’s fat:mass ratio being potentially involved. There are many papers on that topic – not being remotely a subject matter expert here I’ll pick one at random – “Body weight and the initiation of puberty” (Baker, Clin Obstet Gynecol. 1985 Sep;28(3):573-9.), which concluded that malnutrition may “retard the onset of menarche”.

But the larger influence in the model was the country’s illiteracy rate. Does reading books induce menstruation? Well, hey, probably not directly, so it likely proxies for something that may more plausibly affect menarche. The theory the researchers present here is that societies with higher illiteracy also tend to have a higher incidence of child labour. This labour is often physical in nature, implying a higher rate of energy expenditure. Citing previous evidence that excessive exercise may affect the fat balance, again altering menarche, they form a new hypothesis that it’s in fact the energy balance within a person that may affect menarche.

Books I read in 2017

Long term readers (hi!) may recall my failure to achieve the target I had of reading 50 books in 2016. I had joined the 2016 Goodreads reading challenge, logged my reading activity, and hence had access to the data needed to track my progress at the end of the year. It turns out that 41 books is less than 50.

Being a glutton for punishment, I signed up again in 2017, with the same cognitively terrifying 50 book target – basically one a week, although I cannot allow myself to think that way. It is now 2018, so time to review how I did.

Goodreads allows you to log which books you are reading and when you finished them. The finish date is what counts for the challenge. Nefarious readers may spot a few potential exploits here, especially if competing for only 1 year. However, I tried to play the game in good faith (but did I actually do so?  Perhaps the data will reveal!).

As you go through the year, Goodreads will update you on how you are doing with your challenge. Or for us nerd types, you can download a much more detailed and useful CSV. There’s also the Goodreads API to explore, if that floats your boat.

Similarly to last year, I went with the CSV.  I did have to hand-edit the CSV a little, both to fill in a little missing data that appears to be absent from the Goodreads dataset, and also to add a couple of extra data fields that I wanted to track that Goodreads doesn’t natively support. I then popped the CSV into a Tableau dashboard, which you can explore interactively by clicking here.

Results time!

How much did I read

Joyful times! In 2017 I got to, and even exceeded, my target! 55 books read.

In comparison to my 2016 results, I got ahead right from the start of the year, and widened the gap notably in Q2. You can see a similar boost to that witnessed in 2016 around the time of the summer holidays, weeks 33-35ish. Not working is clearly good for one’s reading obligations.

What were the characteristics of the books I read?

Although page count is a pretty vague and manipulable measure – different books have different physical sizes, font sizes, spacing, editions – it is one of the few measures where data is easily available so we’ll go with that. In the case of eBooks or audio books (more on this later) without set “pages” I used the page count of the respective paper version. I fully acknowledge the rigour of this analysis as falling under “fun” rather than “science”.

So the first revelation is that this year’s average pages per read book was 300, a roughly 10% decrease from last year’s average book. Hmm. Obviously, if everything else remains the same,  the target of 50 books is easier to meet if you read shorter books! Size doesn’t always reflect complexity or any other influence around time to complete of course.

I hadn’t deliberately picked short books – in fact, being aware of this incentive I had tried to be conscious of avoiding doing this, and concentrate on reading what I wanted to read, not just what boosts the stats. However, even outside of this challenge, I (most likely?) only have a certain number of years to live, and hence do feel a natural bias towards selecting shorter books if everything else about them was to be perfectly equal. Why plough through 500 pages if you can get the same level of insight about a topic in 150?

The reassuring news is that, despite the shorter average length of book, I did read 20% more pages in total. This suggests I probably have upped the abstract “quantity” of reading, rather than just inflated the book count by picking short books. There was also a little less variation in page count between books this year than last by some measures.
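As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of R. The book counts (41 and 55) and the 300-page average come from the text above; the 2016 average is not stated directly, so here it is backed out by taking the "roughly 10% decrease" literally, which is an assumption rather than a number from the Goodreads export.

```r
# figures stated in the post
books_2016 <- 41
books_2017 <- 55
avg_pages_2017 <- 300

# ASSUMPTION: treat "roughly 10% decrease" as exactly 10%
avg_pages_2016 <- avg_pages_2017 / 0.9

total_pages_2016 <- books_2016 * avg_pages_2016
total_pages_2017 <- books_2017 * avg_pages_2017

# percentage increase in total pages read, 2017 versus 2016
pct_more_pages <- (total_pages_2017 / total_pages_2016 - 1) * 100
print(round(pct_more_pages, 1))
```

Under that assumption the increase comes out at a little over 20%, which is at least in the same ballpark as the figure quoted above.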

In the distribution charts, you can see a spike of books at around 150 pages long this year which didn’t show up last year. I didn’t note a common theme in these books, but a relatively high proportion of them were audio books.

Although I am an avid podcast listener, I am not a huge fan of audio books as a rule. I love the idea as a method to acquire knowledge whilst doing endless chores or other semi-mindless activities. I would encourage anyone else with an interest in entering book contents into their brain to give them a whirl. But, for me, in practice I struggle to focus on them in any multi-tasking scenario, so end up hitting rewind a whole lot. And if I am in a situation where I can dedicate full concentration to informational intake, I’d rather use my eyes than my ears. For one, it’s so much faster, which is an important consideration when one has a book target!  With all that, the fact that audio books are over-represented in the lower page-counts for me is perhaps therefore not surprising. I know my limits.

I have heard tell that some people may consider audio books as invalid for the book challenge. In defence, I offer up that Goodreads doesn’t seem to feel this way in their blog post on the 2018 challenge. Besides, this isn’t the Olympics – at least no-one has sent me a gold medal yet – so everyone can make their own personal choice. For me, if it’s a method to get a book’s contents into my brain, I’ll happily take it. I just know I have to be very discriminating with regards to selecting audio books I can be sure I will be able to focus on. Even I would personally regard it cheating to log a book that happened to be audio-streaming in the background when I was asleep. If you don’t know what the book was about, you can’t count it.

So, what did I read about?

What did I read

Book topics are not always easy to categorise. The categories I used here are mainly the same as last year, based entirely on my 2-second opinion rather than any comprehensive Dewey Decimal-like system. This means some sort of subjectivity was necessary. Is a book on political philosophy regarded as politics or philosophy? Rather than spend too much time fretting about classification, I just made a call one way or the other. Refer to above comment re fun vs science.

The main changes I noted were indeed a move away from pure philosophical entries towards those of a political tone. Likewise, a new category entrant was seen this year in “health”. I developed an interest in improving one’s mental well-being via mindfulness and meditation type subjects, which led me to read a couple of books on this, as well as sleep, which I have classified as health.

Despite me continuing to subjectively feel that I read the large majority of books in eBook form, I actually moved even further away from that being true this year. Slightly under half were in that form. That decrease has largely been taken up by the afore-mentioned audio books, of which I apparently read (listened?) 10 this year. Similarly to last year, 2 of the audio entries were actually “Great Courses“, which are more like a sequence of university-style lectures, with an accompanying book containing notes and summaries.

My books have also been slightly less popular with the general Goodreads-rating audience this year, although not dramatically so.

Now, back to the subject of reading shorter books in order to make it easier to hit my target: the sheer sense of relief I felt when I finished book #50 and hence could go wild with relaxed, long and slow reading, made me concerned as to whether I had managed to beat that bias or not. I wondered whether as I got nearer to my target, the length of the books I selected might have risen, even though this was not my intention.

Below, the top chart shows the average page count by book completed on a monthly basis, year on year.

Book length over time


The 2016 data risks producing somewhat invalid conclusions, especially if interpreted without reference to the bottom “count of books” chart, mainly because of the existence of September 2016, a month where I read a single book that happened to be over 1,000 pages long.

I also hadn’t actually decided to participate in the book challenge at the start of 2016. I was logging my books, but just for fun (imagine that!). I don’t remember quite when it was suggested I should explicitly join the challenge, but before then it’s less likely I felt pressure to read faster or shorter.

Let’s look then only at 2017:

Book length over time 2

Sidenote: What happened in July?! I only read one book, and it wasn’t especially long. I can only assume Sally Scholz’s intro to feminism must have been particularly thought-provoking.

For reference, I hit book #50 in November this year. There does seem some suggestion in the data that I did indeed read longer books as time went on, despite my mental disavowal of doing such.

Stats geeks might like to know that the line of best fit shown in the top chart above could be argued to explain 30% of the variation in book length over time, with each month cumulatively adding on an estimate of an extra 14 pages above a base of 211 pages.  It should be stated that I didn’t spend too long considering the best model or fact-checking the relevant assumptions for this dataset. Instead I just pressed “insert trend line” in Tableau and let it decide :).

I’m afraid the regression should not be considered as being traditionally statistically significant at the 0.05 level though, having a p-value of – wait for it – 0.06. Fortunately, for my intention to publish the above in Nature :), I think people are increasingly aware of the silliness of uncontextual hardline p-value criteria and/or publication bias.

Nonetheless, as I participate in the 2018 challenge – now at 52 books, properly one a week – I shall be conscious of this trend and double-up my efforts to keep reading based on quality rather than length. Of course, I remain very open – some might say hopeful! – that one sign of a quality author is that they can convey their material in a way that would be described as concise. You generous readers of my ramblings may detect some hypocrisy here.

For any really interested readers out there, you can once more see the full list of the books I read, plus links to the relevant Goodreads description pages, on the last tab of the interactive viz.

The Datasaurus: a monstrous Anscombe for the 21st century

Most people trained in the ways of data visualisation will be very familiar with Anscombe’s Quartet. For the uninitiated, it’s a set of 4 fairly simple looking X-Y scatterplots that look like this.

Anscombe's Quartet

What’s so great about those then? Well, the reason data vizzers get excited starts to become clear when you realise that the dotted grey lines I have superimposed on each small chart are in fact the mean average of X and Y in each case. And they’re basically the same for each chart.

The identikit summary stats go beyond mere averages. In fact, the variance of both X and Y (and hence the standard deviation) is also pretty much the same in every chart. As is the correlation coefficient of X and Y, and the regression line that would be the line of best fit if you were to generate a linear model based on each of those 4 datasets.

The point is to show the true power of data visualisation. There are a bunch of clever-sounding summary stats (r-squared is a good one) that some nefarious statisticians might like to baffle the unaware with – but they are oftentimes so summarised that they can lead you to an entirely misleading perception, especially if you are not also an adept statistician.

For example, if someone tells you that their fancy predictive model demonstrates that the relationship between x and y can be expressed as “y = 3 + 0.5x” then you have no way of knowing whether the dataset the model was trained on was that from Anscombe 1, for which it’s possible that it may be a good model, or Anscombe 2, for which it is not, or Anscombe 3 and 4 where the outliers make that model sub-par in reality, to the point where a school child issued with a sheet of graph paper could probably make a better one.
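If you'd like to check the near-identical summary stats for yourself, R happens to ship with the quartet as the built-in anscombe data frame (columns x1 to x4 and y1 to y4). The sketch below is my own quick check rather than anything from Anscombe's paper: it computes the means, variances, correlation and fitted regression line for each of the 4 sets.

```r
# anscombe is part of R's built-in datasets package: columns x1..x4, y1..y4
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # simple linear regression of y on x
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    var_x     = var(x),
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
})
colnames(quartet_stats) <- paste0("set", 1:4)

# one column per dataset; each row should look nearly constant across columns
print(round(quartet_stats, 2))
```

Every column gives an intercept of about 3 and a slope of about 0.5, which is exactly why the equation alone can't tell you which of the four pictures you are dealing with.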

Yes analytics end-users, demand pictures! OK, there are so many possible summary stats out there that someone expert in collating and mentally visualising the implication of a combination of a hand-picked collection of 30 decimal numbers could perhaps have a decent idea of the distribution of a given set of data – but, unless that’s a skill you already have (clue: if the word “kurtosis” isn’t intuitive to you, you don’t, and it’s nothing to be ashamed of), then why spend years learning to mentally visualise such things, when you could just go ahead and actually visualise it?

But anyway, the quartet was originally created by Mr Anscombe in 1973. Now, a few decades later, it’s time for an even more exciting scatterplot collection, courtesy of Justin Matejka and George Fitzmaurice, taken from their paper “Same Stats, Different Graphs“.

They’ve taken the time to create the Datasaurus Dozen. Here they are:

Datasaurus Dozen.png

What what? A star dataset has the same summary statistics as a bunch of lines, an X, a circle or a bunch of other patterns that look a bit like a migraine is coming on?

Yes indeed. Again, these 12 charts all have the same (well, extremely similar) X & Y means, the same X & Y standard deviations and variances, and also the same X & Y linear correlations.

12 charts are obviously more dramatic than 4, and the Datasaurus dozen certainly has a bunch of prettier shapes, but why did they call it Datasaurus? Purely click-bait? Actually no (well, maybe, but there is a valid reason as well!).

Because the 13th of the dozen (a baker’s dozen?) is the chart illustrated below. Please note that if you found Jurassic Park to be unbearably terrifying you should probably close your eyes immediately.

Datasaurus Main

Raa! And yes, this fearsome vision from the distant past also has an X mean of 54.26, a Y mean of 47.83, an X standard deviation of 16.76, a Y standard deviation of 26.93 and a correlation co-efficient of -0.06, just like his twelve siblings above.

If it’s hard to believe, or you just want to play a bit, then the individual datapoints that I put into Tableau to generate the above set of charts are available in this Google sheet – or a basic interactive viz version can be found on Tableau Public here.

Full credit is due to Wikipedia for the Anscombe dataset and Autodesk Research for the Datasaurus data.

Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies any amount of the history of, or the best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These specific transformations of data-into-diagram have stuck with us through the mists of time in order to become examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.


(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence caused the authorities to remove the Broad Street pump handle. People could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about perhaps is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause ->  action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had actually already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? He’d actually already found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but it certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: Here, then, is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by people who live on it. In this case, when the land was lower (vs sea level for example) then cholera deaths seemed to increase.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera“. It included this table:


The relationship seems quite clear; cholera deaths per 10k persons go up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a linechart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriology Reviews. Note the “observed mortality” line.


Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line concerning the calculated death rate according to “miasma theory”, which seems to have an outcome very similar on this metric to the actual cholera death rate. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later replaced with the knowledge of germs, but at the time the miasma theory was a strong contender for explaining the distribution of disease. This was probably helped because some potential actions one might take to reduce “miasma” evidently would overlap with those of dealing with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems  quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we discover later, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed them in a reasonable way. He was testing a hypothesis that was based on the common sense at the time he was working, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here though we can think of Karl Popper’s views of scientific knowledge being derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume certainty that the dominant one is correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I was to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view with regards to the explanation of the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will probably place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, can happen in day-to-day business analysis. Preexisting knowledge changes over time, and differs between groups. Who hasn’t seen (or had the experience of being) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data which was later revealed to have been affected by something entirely different?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to add value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that, overnight, 100% of the public stopped buying anything at all from your previously highly successful store.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Although it’s now closed, back in the deep depths of computer data viz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).
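
To make the idea behind that kind of re-analysis a little more concrete, here is a minimal Python sketch of how a logistic regression can separate two overlapping explanations. To be clear about the assumptions: the counts below are invented purely for illustration (they are not Farr’s figures, nor those used by Bingham et al.), the names `CELLS` and `fit_logistic` are my own, and the fitting is done with a bare-bones gradient ascent rather than a proper statistics package. In this made-up district the death rate depends only on the water source, but low-lying homes are more likely to be on the bad supply, so elevation looks guilty until water is added to the model.

```python
import math

# Invented counts, purely illustrative: (low_elevation, bad_water, people, deaths).
# The death rate is 30% with bad water and 5% with good water at either elevation;
# low-lying homes are simply more likely to be on the bad supply.
CELLS = [
    (1, 1, 400, 120),
    (1, 0, 100, 5),
    (0, 1, 100, 30),
    (0, 0, 400, 20),
]


def fit_logistic(rows, steps=20000, rate=1.0):
    """Fit a logistic regression to grouped data by plain gradient ascent.

    rows: list of (features, people, deaths); an intercept is added for you.
    Returns the coefficients as [intercept, feature_1, feature_2, ...].
    """
    total = sum(people for _, people, _ in rows)
    coefs = [0.0] * (len(rows[0][0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(coefs)
        for features, people, deaths in rows:
            x = (1.0,) + tuple(features)
            z = sum(c * v for c, v in zip(coefs, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # Observed deaths minus the deaths the current model expects.
            residual = deaths - people * p
            for i, v in enumerate(x):
                grad[i] += residual * v / total
        coefs = [c + rate * g for c, g in zip(coefs, grad)]
    return coefs


# Model 1: elevation only, the kind of association Farr observed.
elevation_only = fit_logistic([((low,), n, d) for low, _, n, d in CELLS])

# Model 2: elevation and water supply together.
both = fit_logistic([((low, bad), n, d) for low, bad, n, d in CELLS])

print("low elevation, on its own:   %.2f" % elevation_only[1])
print("low elevation, given water:  %.2f" % both[1])
print("bad water, given elevation:  %.2f" % both[2])
```

On these invented numbers, the coefficient for low elevation comes out at roughly 1.1 when it is the only variable, and shrinks to essentially zero once water supply is included, whilst the bad-water coefficient comes out at about 2.1. Real data is never that tidy, of course, but it shows the general shape of how a newer technique can reweigh an older conclusion.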

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.


As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely important to take action on the findings equally fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that was counter to the prevailing medical wisdom at the time. At least any modern-day analysts can take some solace in the knowledge that even our highest regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened –  it may well contribute towards great things in the future.

The Tableau #MakeoverMonday doesn’t need to be complicated

For a while, a couple of  key members of the insatiably effervescent Tableau community, Andy Cotgreave and Andy Kriebel, have been running a “Makeover Monday” activity. Read more and get involved here – but a simplistic summary would be that they distribute a nicely processed dataset on a topic of the day that relates to someone else’s existing visualisation, and all the rest of us Tableau fans can have a go at making our own chart, dashboard or similar to share back with the community so we can inspire and learn from each other.

It’s a great idea, and generates a whole bunch of interesting entries each week. But Andy K noticed that each Monday’s dataset was getting way more downloads than the number of charts later uploaded, and opened a discussion as to why.

There are of course many possible reasons, but one that came through strongly was that, whilst they were interested in the principle, people didn’t think they had the time to produce something comparable to some of the masterpieces that frequent the submissions. That’s a sentiment I wholeheartedly agree with, and, in retrospect – albeit subconsciously – why I never gave it a go myself.

Chris Love, someone who likely interacts with far more Tableau users than most of us do, makes the same point in his post on the benefits of Keeping It Simple Stupid. I believe it was written before the current MakeoverMonday discussions began in earnest, but was certainly very prescient in its applications to this question.

Despite this awesome community many new users I speak to are often put off sharing their work because of the high level of vizzes out there. They worry their work simply isn’t up to scratch because it doesn’t offer the same level of complexity.


To be clear, the original Makeover Monday guidelines did include the guideline that it was quite proper to just spend an hour fiddling around with it. But firstly, after a hard day battling against the dark forces of poor data quality and data-free decisions at work, it can be a struggle to keep on trucking for another hour, however fun it would be in other contexts.

And that’s if you can persuade your family that they should let you keep tapping away for another hour doing what, from the outside, looks kind of like you forgot to finish work. In fact a lot of the worship I have for the zens is how they fit what they do into their lives.

But, beyond that, an hour is not going to be enough to “compete” with the best of what you see other people doing in terms of presentation quality.

I like to think I’m quite adept with Tableau (hey, I have a qualification and everything :-)), but I doubt I could create and validate something like this beauty using an unfamiliar dataset on an unfamiliar topic in under an hour.


It’s beautiful; the authors of this and many other Monday Makeovers clearly have an immense amount of skill and vision. It is fascinating to see both the design ideas and technical implementation required to coerce Tableau into doing certain non-native things. I love seeing this stuff, and very much hope it continues.

But if one is not prepared to commit the sort of time needed to do that regularly to this activity, then one has to try and get over the psychological difficulty of sharing a piece of work which one perceives is likely to be thought of as “worse” than what’s already there. This is through no fault of the MakeoverMonday chiefs, who make it very clear that producing a NYT infographic each week is not the aim here – but I certainly see why it’s a deterrent from more of the data-downloaders uploading their work. And it’s great to see that topic being directly addressed.

After all, for those of us who use Tableau for the day-to-day joys of business, we probably don’t rush off and produce something like this wonderful piece every time some product owner comes along to ask us an “urgent” question.

Instead, we spend a few minutes making a line chart, that gives them some insight into the answer to their question. We upload an interactive bar chart, with default Tableau colours and fonts, to let them explore a bit deeper and so on. We sit in a meeting and dynamically provide an answer to enable live decision-making that before we had tools like this would have had to wait a couple of weeks to get a csv report on. Real value is generated, and people are sometimes even impressed, despite the fact that we didn’t include hand-drawn iconography, gradient-filled with the company colours.

Something like this perhaps:

Yes, it’s “simple”, it’s unlikely to go Tableau-viral, but it makes a key story held within that data very clear to see. And it’s far more typical of the day-to-day Tableau use I see in the workplace.

For the average business question, we probably do not spend a few hours researching and designing a beautiful colour scheme in order to perform the underlying maths needed to make a dashboard combining a hexmap, a Sankey chart and a network graph in a tool that is not primarily designed to do any of those things directly.

No-one doubts that you can cajole Tableau into such artistry, and there is sometimes real value obtainable by doing so, or that those who carry it out may be creative geniuses – but unless they have a day job that is very different from mine and my colleagues’, then I suspect it’s not their day-to-day either. It’s probably more an expression of their talent and passion for the Tableau product.

Pragmatically, if I need to make, for instance, a quick network chart for “business”, then, all other things being equal, I’m afraid I’m more likely to get out a tool that’s designed to do that rather than take a bit more time to work out how to implement it in Tableau, no matter how much I love it (by the way, Gephi is my tool of choice for that – it is nowhere near as user friendly as Tableau, but it is specifically designed for that sort of graph visualisation; also recent versions of Alteryx can do the basics). Honestly, it’s rare for me that these more unusual charts need to be part of a standard dashboard; our organisation is simply not at a level of viz-maturity where these diagrams are the most useful for most people in the intended audience, if indeed they are for many organisations.

And if you’re a professional whose job is creating awesome newspaper style infographics, then I suspect that you’re not using Tableau as the tool that provides the final output either, more often than not. That’s not its key strength in my view; that’s not how they sell it – although they are justly proud of the design-thought that does go into the software in general. But if paper-WSJ is your target audience, you might be better off using a more custom design-focused tool, like Adobe Illustrator (and Coursera will teach you that specific use-case, if you’re interested).

I hope nothing here will cause offence. I do understand the excitement and admire anyone’s efforts to push the boundaries of the tool – I have done so myself, spending way more time than is strictly speaking necessary in terms of a theoretical metric of “insights generated per hour” to make something that looks cool, whether in or out of work. For a certain kind of person it’s fun, it is a nice challenge, it’s a change from a blue line on top of an orange line, and sometimes it might even produce a revelation that really does change the world in some way.

This work surely needs to be done; adherents to (a bastardised version of) Thomas Kuhn’s theory of scientific revolutions might even claim this “pushing to the limits” as one of the ways of engendering the mini-crisis necessary to drive forward real progress in the field. I’m sure some of the valuable Tableau “ideas“, that feed the development of the software in part, have come from people pushing the envelope, finding value, and realising there should be an easier way to generate it.

There’s also the issue of engagement: depending on your aim, optimising your work for being shared worldwide may be more important to you than optimising it for efficiency, or even clarity and accuracy. This may sound like heresy, and it may even touch on ethical issues, but I suspect a survey of the most well-known visualisations outside of the data community would reveal a discontinuity with the ideals of Stephen Few et al!

But it may also be intimidating to the weary data voyager when deciding whether to participate in these sorts of Tableau community activities if it seems like everyone else produces Da Vinci masterpieces on demand.

Now, I can’t prove this with data right now, sorry, but I just think it cannot be the case. You may see a lot of fancy and amazing things on the internet – but that’s the nature of how stuff gets shared around; it’s a key component of virality. If you create a default line chart, it may actually be the best answer to a given question, but outside a small community who is actively interested in the subject domain at hand, it’s not necessarily going to get much notice. I mean, you could probably find someone who made a Very Good Decision based even on those ghastly Excel 2003 default charts with the horrendous grey background if you try hard enough.


Never forget…


So, anyway, time to put my money where my mouth is and actually participate in MakeoverMonday. I don’t need to spend even an hour making something if I don’t want to, right?  (after all, I’ve used up all my time writing the above!)

Tableau is sold with emphasis on its speed of data sense-making, claiming to enable producing something reasonably intelligible 10-100x faster than other tools. If we buy into that hype, then spending 10 minutes of Tableau time (necessitating making 1 less cup of tea perhaps) should enable me to produce something that it could have taken up to 17 hours to produce in Excel.
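
Taking the marketing claim entirely at face value, the arithmetic looks something like this (a throwaway Python sketch; the variable names are mine, and the 10-100x range is simply the claim as quoted above):

```python
tableau_minutes = 10
speedup_low, speedup_high = 10, 100  # the claimed "10-100x faster" range

# Equivalent time in the slower tool, converted from minutes to hours.
hours_low = tableau_minutes * speedup_low / 60
hours_high = tableau_minutes * speedup_high / 60

print(f"{hours_low:.1f} to {hours_high:.1f} hours")  # prints "1.7 to 16.7 hours"
```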

OK, that might be pushing the marketing rather too literally, but the point is hopefully clear. For #MakeoverMonday, some people may concentrate on how far can they push Tableau outside of its comfort zone, others may focus on how they can integrate the latest best practice in visual design, whereas here I will concentrate on whether I can make anything intelligible in the time that it takes to wait for a coffee in Starbucks (on a bad day) – the “10 minute” viz.

So here’s my first “baked in just 10 minutes” viz on the latest MakeoverMonday topic – the growth of the population of Bermuda. Nothing fancy, time ran out just as I was changing fonts, but hey, it’s a readable chart that tells you something about the population change in Bermuda over time. Click through for the slightly interactive version – although of course, it, for instance, has the nasty default tooltips, thanks to the 10 minutes running out just as I was changing the font for the chart titles…

Bermuda population growth



The EU referendum: voting intention vs voting turnout

Next month, the UK is having a referendum on the question of whether it should remain in the European Union, or leave it. All us citizens are having the opportunity to pop down to the ballot box to register our views. And in the meantime we’re subjected to a fairly horrendous mishmash of “facts” and arguments as to why we should stay or go.

To get the obvious question out of the way, allow me to volunteer that I believe remaining in the EU is the better option, both conceptually and practically. So go tick the right box please! But I can certainly understand the level of confusion amongst the undecided when, to pick one example, one side says things like “The EU is a threat to the NHS” (and produces a much ridiculed video to “illustrate” it) and the other says “Only staying in Europe will protect our NHS”.

So, what’s the result to be? Well, as with any such election, the result depends on both which side each eligible citizen actually would vote for, and the likelihood of that person actually bothering to turn out and vote.

Although overall polling is quite close at the moment, different sub-groups of the population have been identified that are more positive or more negative towards the prospect of remaining in the EU. Furthermore, these groups range in likelihood with regards to saying they will go out and vote (which it must be said is a radically different proposition to actually going out and voting – talk is cheap – but one has to start somewhere).

Yougov recently published some figures they collected that allow one to connect certain subgroups in terms of the % of them that are in favour of remaining (or leaving, if you prefer to think of it that way around) with the rank order of how likely they are to say they’ll actually go and vote. Below, I’ve taken the liberty of incorporating that data into a dashboard that allows exploration of the populations they segmented by, their relative likelihood to vote “remain” (invert it if you prefer “leave”), and how likely they are to turn out and vote.

Click here or on the picture below to go and play. And see below for some obvious takeaways.

Groups in favour of remaining in the EU vs referendum turnout intention

So, a few thoughts:

First we should note that the ranks on the slope chart perhaps over-emphasise differences. The scatterplot helps integrate the idea of what the actual percentage of each population that might vote to remain in Europe is, as opposed to the simple ranking. Although there is substantial variation, there’s no mind-blowing trend in terms of the % who would vote remain and the turnout rank (1 = most likely to claim they will turn out to vote).

Remain support % vs turnout rank

I’ve highlighted the extremes on the chart above. Those most in favour to remain are Labour supporters; those least in favour are UKIP supporters. Although we might note that there’s apparently 3% of UKIP fans who would vote to remain. This is possibly a 3% that should get around to changing party affiliation, given that UKIP was largely set up to campaign to get the UK out of Europe, and its current manifesto rants against “a political establishment that wants to keep us enslaved in the Euro project”.

Those claiming to be most likely to vote are those who say they have a high interest in politics; those least likely are those that say they have a low interest. This makes perfect sense – although it should be noted that one’s personal level of interest in politics of course does nothing to reduce the impact of other people’s political decisions that will then be imposed upon you.

So what? Well, in a conference I went to recently, I was told that a certain US object d’ridicule Donald Trump has made effective use of data in his campaign (or at least his staff did). To paraphrase, they apparently realised rather quickly that no amount of data science would result in the ability to make people who do not already like Donald Trump’s senseless, dangerous, awful policies become fans of him (can you guess my feelings?). That would take more magic than even data could bring.

But they realised that they could target quite precisely where the sort of people who do already tend to like him live, and hence harangue them to get out and vote. And whether that is the reason that this malevolent joker is still in the running or not I wouldn’t like to say – but it looks like it didn’t hurt.

So, righteous Remainers, let’s do likewise. Let’s look for some populations that are already very favourable to remaining in the EU, and see whether they’re likely to turn out unaided.

Want to remain

Well, unfortunately all of the top “in favour to remain” groups seem to be ranked lower in terms of turnout than in terms of pro-remain feeling, but one variable sticks out like a sore thumb: age. It appears that people at the lower end of the age groups, here 18-39, are both some of the most likely subsections of people to be pro-Remain, and some of the least likely to say they’ll go and vote. So, citizens, it is your duty to go out and accost some youngsters; drag’em to the polling booth if necessary. It’s also of interest to note that if leaving the EU is a “bad thing”, then, long term, it’s the younger members of society who are likely to suffer the most (assuming it’s not over-turned any time soon).

But who do we need to nobble... sorry, educate? Let’s look at the subsections of population that are most eager to leave the EU:

Want to leave

OK, some of the pro-leavers also rank quite low in terms of turnout, all good. But a couple of lines rather stand out.

One is age based again; here the opposite end of the spectrum, 60+ year-olds, are some of the least likely to want to remain in Europe and some of the most likely to say they’ll go and vote (historically, the latter has indeed been true). And, well, UKIP people don’t like Europe pretty much by definition – but they seem worryingly likely to claim they’re going to turn up and vote. Time to go on a quick affiliation conversion mission – or at least plan a big purple-and-yellow distraction of some kind…?


There’s at least one obvious critical measure missing from this analysis, and that is the respective sizes of the subpopulations. The population of UKIP supporters for instance is very likely, even now, to be smaller than the number of 60+ year olds, thankfully – a fact that you’d have to take into account when deciding how to have the biggest impact.

Whilst the Yougov data published did not include these volumes, they did build a fun interactive “referendum simulator” that, presumably taking this into account, lets you simulate the likely results based on your view of the likely turnout, age & class skew based on their latest polling numbers.
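
To see why both turnout and group size matter, here is a rough Python sketch of the weighting involved. The group sizes, remain shares and turnout rates below are numbers I have made up for illustration; they are not Yougov’s figures, and `remain_share` and `remain_intention` are just hypothetical helper names.

```python
# Hypothetical groups: (name, size, share backing remain, turnout rate).
# All figures are invented for illustration only; sizes are in arbitrary units.
GROUPS = [
    ("18-39", 40, 0.70, 0.50),
    ("60+", 60, 0.40, 0.80),
]


def remain_share(groups):
    """Share of the votes actually cast that go to remain."""
    votes_cast = sum(size * turnout for _, size, _, turnout in groups)
    remain_votes = sum(size * remain * turnout for _, size, remain, turnout in groups)
    return remain_votes / votes_cast


def remain_intention(groups):
    """Share of the whole population that says it prefers remain."""
    population = sum(size for _, size, _, _ in groups)
    return sum(size * remain for _, size, remain, _ in groups) / population


print(f"Prefer remain:             {remain_intention(GROUPS):.1%}")
print(f"Remain share of votes:     {remain_share(GROUPS):.1%}")

# What if the younger group turned out at the same rate as the older one?
boosted = [(n, s, r, 0.80 if n == "18-39" else t) for n, s, r, t in GROUPS]
print(f"With higher youth turnout: {remain_share(boosted):.1%}")
```

In this made-up example, a population that prefers remain by 52% ends up casting only around 49% of its actual votes that way, purely because the pro-remain group is less likely to show up. Get the younger group to turn out at the same rate as the older one and the figure returns to 52%.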

Unsafe abortions: visualising the “preventable pandemic”

In the past few weeks, I was appalled to read that a UK resident was given a prison sentence for the supposed “crime” of having an abortion. This happened because she lives in Northern Ireland, a country where having an abortion is in theory punishable by a life sentence in jail – unless the person in need happens to be rich enough to arrange an overseas appointment for the procedure, in which case it’s OK.

Abortion rights have been a hugely contentious issue over time, but for those of us who reside in a wealthy country with relatively progressive laws on the matter, and the medical resources needed to perform such procedures efficiently, it’s not always easy to remember what the less fortunate may face in other jurisdictions.

In 2016, can it really still be the case that any substantial number of women face legal or logistical issues in their right to choose what happens to their body, under conditions where the huge scientific consensus is against the prospect of any other being suffering? How often do abortions occur – over time, or in different parts of the world? Is there a connection between more liberal laws and abortion rates? And what are the downsides of illiberal, or medically challenged, environments? These, and more, are questions I had that data analysis surely could have a part in answering.

I found useful data in two key places: a 2012 paper published in the Lancet, titled “Induced abortion: incidence and trends worldwide from 1995 to 2008”, and various World Health Organisation publications on the subject.

It should be noted that abortion incidence data is notoriously hard to gather accurately. Obviously, medical records are not sufficient given the existence of illegal or self-administered procedures noted above. It is also not the case that every woman has been interviewed about this subject. Worse yet, even where they have been, abortion remains a topic that’s subject to discomfort, prejudice, fear, exclusion, secrecy or even punishment. This occurs in some situations more than others, but the net effect is that it’s the sort of question where straightforward, honest responses to basic survey questions cannot always be expected.

I would suggest reading the 2012 paper above and its appendices to understand more about how the figures I used were modelled by the researchers who obtained them. But the results they show have been peer reviewed, and show enough variance that I believe they tell a useful, indeed vital, story about the unnecessary suffering of women.

It’s time to look into the data. Please click through below and explore the story points to investigate those questions and more. And once you’ve done that – or if you don’t have the inclination to do so – I have some more thoughts to share below.


Thanks for persisting. No need to read further if you were just interested in the data or what you can do with it in Tableau. What follows is simply commentary.

This blog is ostensibly about “data”, to which some attribute notions of cold objectivity; a Spock-like detachment that comes from seeing an abstract number rather than understanding events in the real world. But, in my view, most good uses of data necessarily result in the emergence of a narrative; this is a (the?) key skill of a data analyst. The stories data tells may raise emotions, positive or negative. And seeing this data did so in me.

For those that didn’t decide to click through, here is a brief summary of what I saw. It’s largely based on data about the global abortion rate, most often defined here as the number of abortions divided by the number of women aged 15-44. Much of the data is based on 2008. For further source details, please see the visualisation and its sources (primarily this one).

  • The abortion rate in 2008 is pretty similar to that in 2003, which followed a significant drop from 1995. Globally it’s around 28 abortions per 1,000 women aged 15-44. This equates to nearly 44 million abortions per year. This is a process that affects very many women, and also the network of people that love, care for or simply know them.
  • Abortions can be safe or unsafe. The World Health Organisation defines unsafe abortions as being those that consist of:

a procedure for terminating an unintended pregnancy either by individuals without the necessary skills or in an environment that does not conform to minimum medical standards, or both.

  • In reality, this translates to a large variety of sometimes disturbing methods, from ingestion of toxic substances and inappropriate use of medicines to physical trauma to the uterus (the use of a coathanger is the archetypal image for this, so much so that protesters against the criminalisation of abortion have used them as symbols) – or less focussed physical damage, such as throwing oneself down stairs, or off roofs.
  • Appallingly, the proportion of abortions that were unsafe in 2008 has gone up from previous years.
  • Medical procedures are rarely 100% safe, but a safe, legal, medically controlled abortion carries a pretty negligible chance of death. Unsafe abortions are hundreds of times more likely to be fatal to the recipient. And even where they aren’t fatal, literally millions of people suffer consequences so severe they have to seek hospital treatment afterwards – and these are the “lucky” ones for whom hospital treatment is even available. This is to say nothing of the damaging psychological effects.
  • Therefore, societies that enforce or encourage unsafe abortions should do so in the knowledge that their position is killing women.
  • Some may argue that abortion, which few people of any persuasion could think of as a happy or desirable occurrence, is encouraged where it is freely legally available. They are wrong. There is no suggestion in this data that stricter anti-abortion laws decrease the incidence of abortions.

    A WHO report concurs:

Making abortion legal, safe, and accessible does not appreciably increase demand. Instead, the principal effect is shifting previously clandestine, unsafe procedures to legal and safe ones.

  • In fact, if anything, in this data the association runs the other way. Geopolitical regions with a higher proportion of people living in areas where abortions are illegal actually, on the whole, see a higher rate of abortion. I am not suggesting here that more restrictive laws cause more abortions directly, but it is clearly not the case that making abortion illegal necessarily makes it happen less frequently.
  • But stricter laws do, more straightforwardly, lead to a higher proportion of the abortions that take place anyway being unsafe. And thus, on average, to more women dying.
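For anyone wanting to sanity-check the headline rate quoted at the top of this list, here is a small Python sketch of the arithmetic, using the definition given earlier (abortions divided by the number of women aged 15-44). The function names are my own, and the implied number of women is simply backed out from the two rounded figures quoted in the post; it is not a sourced population estimate.

```python
def abortion_rate_per_1000(abortions: float, women_15_44: float) -> float:
    """Abortions per 1,000 women aged 15-44."""
    return abortions / women_15_44 * 1000


def implied_women_15_44(abortions: float, rate_per_1000: float) -> float:
    """Back out the denominator implied by a count and a rate."""
    return abortions / rate_per_1000 * 1000


# Rounded figures as quoted in the post for 2008.
ABORTIONS_2008 = 44_000_000  # "nearly 44 million" per year
RATE_2008 = 28               # per 1,000 women aged 15-44

women = implied_women_15_44(ABORTIONS_2008, RATE_2008)
print(f"Implied women aged 15-44: roughly {women / 1e9:.2f} billion")

# Round trip: feeding the implied denominator back in recovers the quoted rate.
print(f"Rate per 1,000: {abortion_rate_per_1000(ABORTIONS_2008, women):.0f}")
```

Because both inputs are rounded, the implied denominator is only a rough consistency check on the two numbers, not a figure to quote.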

Abortion is a contentious issue and it will no doubt remain so, perhaps mostly for historic, religious or misogynistic reasons. There are nonetheless valid physical and psychological reasons why abortion is, and should be, controlled to some extent. No mainstream view thinks that one should treat the topic lightly or wants to see the procedure becoming a routine event. As the BBC notes, even ardent “pro-choice” activists generally see it as the least bad of a set of bad courses of action available in a situation that no-one wanted to occur in the first place, and surely no-one that goes through it is happy it happened. But it does happen, it will happen, and we know how to save thousands of lives.

Seeing this data may well not change your mind if you’re someone who campaigns against legal abortion. It’s hard to shift a world-view that dramatically, especially where so-called moral arguments may be involved.

But – to paraphrase the Vizioneer, paraphrasing William Wilberforce, with his superb writeup after visualising essential data on the atrocities of modern-day human trafficking – once you see the data then you can no longer say you do not know.

The criminalisation-of-abortion lobby are often termed “pro-lifers”. To me, it now seems that that nomenclature has been seized in a twisted, inappropriate way. Once you know that the policies you campaign for will unquestionably lead to the harming and death of real, conscious, living people – then you no longer have the right to label yourself pro-life.

The 2016 UK Budget – what does the data say?

On March 16th 2016, our Chancellor George Osborne set out the cavalcade of new policies that contribute towards this year’s UK budget. Each results in either a cost or saving to the public funds, which has to be forecast as part of the budget release.

Given the constant focus on “austerity”, seeing what this Government chooses to spend its money on and where it makes cuts can be instructive in understanding the priorities of elected (?) representatives.

Click through this link (or the image below) to access a visualisation to help understand and explore what the budget contains – what money George spends on which policies, how he saves funds and who it affects most.

Budget 2016