Analysing your 23andme genetic data in R part 2: exploring the traits associated with your genome

In part one of this mini-series, you heroically obtained and imported your 23andme raw genome data into R. Fun as that was, let’s see if we can learn something interesting from it.  After all, 23andme does automatically provide several genomic analysis reports, but – for many sensible reasons – it is certainly limited in what it can show when compared to the entire literature of this exciting field.

It would be tiresome for me to repeat all the caveats you can find in part one, and all over the more responsible parts of the internet. But do remember that in the grand scheme of things we likely know only somewhere between “nothing” and “a little” so far about the implications of most of the SNPs you imported into R in part one.

Whilst there are rare exceptions, for the most part it seems like many “interesting” real-world human traits are a product of many variations in different SNPs, each of which has a relatively tiny effect size. In these cases, if you happen to have a T somewhere rather than an A, it’s usually unlikely on its own to produce any huge “real world” outcome you’d notice. Or if it does, we don’t necessarily know it yet. Or perhaps it could, but only if you have certain other bases in certain other positions, many of which 23andme may not report on – even if we knew which ones were critical. That was a long and meandering sentence which could be summarised as “things are complicated”.

Note also that 23andme do provide a disclaimer on the raw data, as follows:

This data has undergone a general quality review however only a subset of markers have been individually validated for accuracy. As such, the data from 23andMe’s Browse Raw Data feature is suitable only for research, educational, and informational use and not for medical or other use.

So, all in all, what follows should be treated as a fun-only investigation. You should first seek the services of genetic medical professionals, including the all-important genetic counsellor, if you have any real concerns about what you might find.

But, for the data thrill-seekers, let’s start by finding some info as to which genotypes at which SNPs are thought to have some potential association with a known trait. This is part of what 23andme does for you in their standard reporting. However, they have legal obligations and true experts working on this. I, on the other hand, recently read “Genetics for Dummies”, so tread carefully.

How do we find out about the associations between your SNP results and any subsequent phenotypes? The obvious place, especially if you’re interested in a particular trait, is the ever-growing published scientific literature on genetics. Being privileged enough to count amongst my good friends a scientist who actually published papers involving this subject, I started there. Here, for example, is Dr Kaitlin Roke’s paper: “Evaluating Changes in Omega-3 Fatty Acid Intake after Receiving Personal FADS1 Genetic Information: A Randomized Nutrigenetic Intervention“. Extra credit is also very much due to her for being kind enough to help me out with some background knowledge.

Part of that fascinating study involved explaining to members of the public what the implications of the differing alleles possible at a specific SNP, rs174537, were with regards to the levels of Omega 3 fatty acids typically found in a person’s body, and how well the relevant conversion process proceeds. This may have dietary implications in terms of how much the person concerned should focus on increasing the amount of omega-3 laden food they eat – although it would be remiss of me to fail to mention the good doctor’s general advice that mostly everyone needs to up their levels anyway!

Anyway, to quote from their wonderfully clear description of the implications of the studied SNP:

…the document provided a brief overview of the reported difference in omega-3 FA levels in relation to a common SNP in FADS1 (rs174537)…individuals who are homozygote GG allele carriers have been reported to have more EPA in their bodies and an increased ability to convert ALA into EPA and DHA while individuals with at least one copy of the minor allele (GT or TT) were shown to have less EPA in their bodies and a reduced ability to convert ALA into EPA and DHA

Awesome, so here we have a definitive explanation as to which SNP was examined, and what the implications of the various genotypes are (which I’ve bolded for your convenience).

In part 1 of this post, we already saw how to filter your R-based 23andme data to view your results for a specific SNP, right? If you already completed all those import steps, you can do the same, just switching in the rsid of interest:

filter(genome_data_test, rsid == "rs174537")


Run that, and if you are returned a result of either GT or TT then you’ll know you should be especially careful to ensure you are getting a good amount of omega 3 in your diet.

OK, this is super cool, but what if you don’t happen to know a friendly scientist, or don’t know what traits you’re particularly interested in – how might you evaluate the SNP implications at scale?

Whilst there’s no substitute for actual expertise, luckily there is an R library called “gwascat”, which enables you to access the NHGRI-EBI Catalog of published genome-wide association studies via a data structure in R. It has a whole lot of info in it, descriptions of the fields you end up with mostly being shown here. The critical point is that it contains a list of SNPs, associated traits and relevant genotypes, together with the references to the publications that found the associations should you want to get more details.

The first thing to do is install gwascat. gwascat is a Bioconductor package, rather than one of the business-user-typical CRAN packages. So if you’re not a Bioconductor user, there’s a slightly different installation routine, which you can see here.

But essentially:
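A minimal sketch of the standard Bioconductor installation routine, assuming a reasonably recent version of R and an internet connection. Older setups used a different installer script, so treat this as the currently documented route rather than the only one:

```r
# Install the Bioconductor package manager from CRAN if it is not already present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Then install gwascat from Bioconductor and load it
BiocManager::install("gwascat")
library(gwascat)
```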


This took a while to install for me, but can just be left to get on with itself.

It sometimes feels like a new genomics study is released almost every few seconds, so the next step may be to get an up-to-date version of the catalog data –  I think the one that installs by default is a few years out of date.

Imagine that we’d like our new GWAS catalogue to end up as a data frame called “updated_gwas_data”:

updated_gwas_data <-

This might take a few minutes to run, depending on how fast your download speed is. You can get some idea of the recency once it’s done by checking the latest date that any publication was added to the catalogue.
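A minimal sketch of that recency check. The data frame here is a made-up stand-in (toy_gwas_data and its three rows are invented for illustration); with the real catalogue you would swap in updated_gwas_data, and if the date column is stored as text you may need to wrap it in as.Date() first:

```r
# Toy stand-in for the catalogue: invented rows, but using the real column names
toy_gwas_data <- data.frame(
  STUDY = c("Toy study A", "Toy study B", "Toy study C"),
  DATE.ADDED.TO.CATALOG = as.Date(c("2017-11-02", "2018-05-21", "2018-01-15")),
  stringsAsFactors = FALSE
)

# The most recent date on which anything was added to the catalogue
latest_date <- max(toy_gwas_data$DATE.ADDED.TO.CATALOG, na.rm = TRUE)
latest_date

# Which study (or studies) were added on that date
latest_studies <- toy_gwas_data$STUDY[toy_gwas_data$DATE.ADDED.TO.CATALOG == latest_date]
latest_studies
```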


At the time of writing, this date is May 21st 2018. And what was that study?

filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(STUDY) %>% distinct()

“Key HLA-DRB1-DQB1 haplotypes and role of the BTNL2 gene for response to a hepatitis B vaccine.”

Where can I find that study, if I want to read what it actually said?

filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(LINK) %>% distinct()

OK, now we have up-to-date data, let’s figure out how to join it to your personal raw genome data we imported in part 1 (or to be precise here, a mockup file in the same format, so as to avoid sharing anyone’s real genomic data here).

The GWAS data lists each SNP (e.g. “rs9302874”) in a field called SNPS. Our imported 23andme data has the same info in the rsid field. Hence we can do a simple join, here using the dplyr library.

Assuming the 23andme data you imported is called “genome_data”, then:


output_data <- inner_join(genome_data, updated_gwas_data, by = c("rsid" = "SNPS"))

Note the consequences of the inner join here. 23andme analyses SNPs that don’t appear in this GWAS database, and the GWAS database may contain SNPs that 23andme doesn’t provide for you. In either case, these will be removed in the above file result. There’ll just be rows for SNPs that 23andme does provide you, and that do have an entry in the GWAS database.

Also, the GWAS database may have several rows for a single SNP. It could be that several studies examined a single SNP, or that one study found many traits potentially associated with a SNP. This means your final “output_data” table above will have many rows for some SNPs.
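To make those join consequences concrete, here is a small sketch on invented data. The rsids, genotypes and trait names below are all made up, and toy_genome / toy_gwas are hypothetical stand-ins for genome_data and updated_gwas_data:

```r
library(dplyr)

# Invented rows purely for illustration
toy_genome <- tibble(
  rsid     = c("rs0000001", "rs0000002", "rs0000003"),
  genotype = c("AG", "TT", "CC")
)
toy_gwas <- tibble(
  SNPS          = c("rs0000001", "rs0000001", "rs0000004"),
  DISEASE.TRAIT = c("Trait X", "Trait Y", "Trait Z")
)

# Only SNPs present on both sides survive, and rs0000001 appears twice
# because the toy catalogue lists it against two traits
toy_joined <- inner_join(toy_genome, toy_gwas, by = c("rsid" = "SNPS"))
toy_joined

# What the inner join dropped from each side
genome_only <- anti_join(toy_genome, toy_gwas, by = c("rsid" = "SNPS"))
gwas_only   <- anti_join(toy_gwas, toy_genome, by = c("SNPS" = "rsid"))
```

Running the two anti_join calls against your real data is one way to see how many SNPs fall out of the analysis on each side.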

OK, so at the time of writing there are nearly 70,000 studies in the GWAS database, and over 600,000 SNPs in the 23andme data export. How shall we narrow down this data-mass to find something potentially interesting?

There are many fields in the GWAS database you might care about – the definitions being listed here. For us amateur folks here, DISEASE.TRAIT and STRONGEST.SNP.RISK.ALLELE might be of most interest.

DISEASE.TRAIT gives you a genericish name for the trait that a study tested for an association with a given SNP (e.g. “Plasma omega-3 polyunsaturated fatty acid levels”, or “Systolic blood pressure”). Note that the values are not actually all “diseases” by the common-sense meaning – unless you consider traits like being tall a type of illness anyway.

STRONGEST.SNP.RISK.ALLELE gives you the specific allele of the SNP that was “most strongly” associated with that trait in the study (or potentially a ? if unknown, but let’s ignore those for now). The format here is to show the name of the SNP first, then append a dash and the allele of interest afterwards e.g. “rs10787517-A” or “rs7977462-C”.

This can easily give the impression of greater specificity than the real world has – only 1 allele ever appears in this field, so if there are multiple associations then only the strongest will be listed. If there are associations in tandem with other alleles or other SNPs, then that information also cannot be fully represented here. Also, it’s not necessarily the case that alleles are additive; so without further research we shouldn’t assume that having 2 of the high risk bases gives increased risk over a single base.

That’s what the journal reference is for – another reason it’s critical you do the reading and seek the help of appropriate genetic professionals before any rejoicing or worrying about your results.

Taking the above example from Roke et al’s Omega 3 study, this GWAS database records the most relevant strongest SNP risk allele for the SNP they analysed as being “rs174537-T”. You’d want to read the study in order to know whether that meant that the TT genotype was the one to watch for, or whether GT had similar implications.

Back to an exploration of your genome – the two most obvious approaches that come to mind are either: 1) check whether your 23andme results suggest an association with a specific trait you’re interested in, or 2) check which traits your results may be associated with.

In either case, it’ll be useful to create a field that highlights whether your 23andme results indicate that you have the “strongest risk allele” for each study. This is one way to help narrow down towards the interesting traits you may have inherited.

The 23andme part of your dataframe contains your personal allele results in the genotype field. There you’ll see entries like “AC” or “TT”. What we really want to do here is, for every combination of SNP and study, check to see if either of the letters in your genotype match up with the letter part of the strongest risk allele.

One method would be to separate out your “genotype” data field into two individual allele fields (so “AC” becomes “A” and “C”). Next, clean up the strongest risk allele so you only have the individual allele (so “rs10787517-A”  becomes “A”). Finally check whether either or both of your personal alleles match the strongest risk allele. If they do, there might be something of interest here.

One method:


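library(stringr)

# The last character of e.g. "rs10787517-A" is the risk allele letter
output_data$risk_allele_clean <- str_sub(output_data$STRONGEST.SNP.RISK.ALLELE, -1)
# Split the two-letter genotype into its individual alleles
output_data$my_allele_1 <- str_sub(output_data$genotype, 1, 1)
output_data$my_allele_2 <- str_sub(output_data$genotype, 2, 2)
# Count how many of your two alleles (0, 1 or 2) match the risk allele
output_data$have_risk_allele_count <- if_else(output_data$my_allele_1 == output_data$risk_allele_clean, 1, 0) + if_else(output_data$my_allele_2 == output_data$risk_allele_clean, 1, 0)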

Now you have your two individual alleles stored in my_allele_1 and my_allele_2, and the allele for the “highest risk” stored in risk_allele_clean. Risk_allele_clean is the letter part of the GWAS STRONGEST.SNP.RISK.ALLELE field. And finally, the have_risk_allele_count is either 0, 1 or 2 depending on whether your 23andme genotype result at that SNP contains 0, 1 or 2 of the risk alleles.
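As a quick sanity check of that logic, here is the same calculation wrapped in a small helper and run on a handful of hand-made inputs. count_risk_alleles is a hypothetical function written for this illustration (it is not part of gwascat or any 23andme tooling), and the genotypes are invented:

```r
library(stringr)

# Hypothetical helper: how many of the two genotype letters match the risk allele?
count_risk_alleles <- function(genotype, strongest_snp_risk_allele) {
  risk <- str_sub(strongest_snp_risk_allele, -1)
  # Each comparison is TRUE/FALSE; adding them gives 0, 1 or 2
  (str_sub(genotype, 1, 1) == risk) + (str_sub(genotype, 2, 2) == risk)
}

example_counts <- count_risk_alleles(
  genotype = c("AC", "TT", "GG", "--"),
  strongest_snp_risk_allele = c("rs10787517-A", "rs174537-T", "rs174537-T", "rs7977462-?")
)
example_counts
```

Here “AC” against a risk allele of A gives 1, “TT” against T gives 2, “GG” against T gives 0, and a no-call against an unknown risk allele also gives 0.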

The previously mentioned DISEASE.TRAIT field contains a summary of the trait involved. So by filtering your dataset to only look for studies about a trait you care about, you can see a summary of the risk allele and whether or not you have it, and the relevant studies that elicited that connection.

I did notice that this trait field can be kind of messy to use. You’ll see several different entries for similar topics; e.g. some studied traits around Body Mass Index are indeed classified as “Body mass index”, others as “BMI in non-smokers” or several other BMI-related phrases. So you might want to try a few different search strings in the below to access everything on the topic you care about.

For example, let’s assume that by now we also inevitably developed a strong interest in omegas and fatty acids. Which SNPs may relate to that topic, and do we personally have the risk allele for any of them?

We can use the str_detect function of the stringr library in order to search for any entries that contain the substring “omega” or “fatty acid”.

select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_genotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct()


The full table this outputs is actually 149 rows long. That’s a fair few for an amateur to sift through. Maybe we’d prefer to restrict ourselves to the SNP we heard from Dr Roke’s study above was of particular interest:  rs174537. Easy: just filter on the rsid:

select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_genotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")


Maybe you are curious as to what other traits that same SNP might be associated with? Just reverse the criteria for omega and fatty acid strings. Here I also added the journal reference title, in case I wanted to read up more into these trait associations.

 select(output_data, rsid, DISEASE.TRAIT, STUDY) %>% 
 filter(!str_detect(tolower(DISEASE.TRAIT), "omega") & !str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")


Now onto the second approach – this time, you don’t have a specific trait in mind. You’re more interested in discovering which traits have risk alleles that match the respective part of your genome. Please see all the above disclaimers. This is not remotely the same as saying which dreadful diseases you are going to get. But please stay away from this section if you are likely to be worried about seeing information that could even vaguely correspond to health concerns.

We already have our have_risk_allele_count field. If it’s 1 or 2 then you have some sort of match. So, the full list of your matches and the associated studies could be retrieved in a manner like this.

filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )

Note that this list is likely to be long. Some risk alleles are very common in some populations, and remember that there may be many studies that relate to a single SNP, or many SNPs referred to by a single study. You might want to pop it in a nice DT Javascript DataTable to allow easy searching and sorting.


library(DT)

filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY ) %>%
 datatable()

results in something like:


…which includes a handy search field at the top left of the output.

There are various other fields in the GWAS dataset you might consider using to filter down further. For example, you might be most interested in findings from studies that match your ethnicity, or occasions where you have risk alleles that are rare within the population. After all, we all like to think we’re special snowflakes, so if 95% of the general population have the risk allele for a trait, then that may be less interesting to an amateur genome explorer than one where you are in the lucky or unlucky 1%.

For the former, you might try searching within the INITIAL.SAMPLE.SIZE or REPLICATION.SAMPLE.SIZE fields, which have entries like: “272 Han Chinese ancestry individuals” or “1,180 European ancestry individuals from ~475 families”.

Similar to the caveats on searching the trait fields, one does need to be careful here if you’re looking for a comprehensive set of results. Some entries in the database have blanks in one of these fields, and others don’t specify ethnicities, having entries like “Up to 984 individuals”.

For the proportion of the studied population who had the risk allele, it’s the RISK.ALLELE.FREQUENCY field. Again, this can sometimes be blank or zero. But in theory, where it has a valid value, then, depending on the study design, you might find that lower frequencies are rarer traits.

We can use dplyr’s arrange and filter functions to do the above sort of narrowing-down. For example: what are the top 10 trait / study / SNP combinations you have the risk allele for that were explicitly studied within European folk, ordered by virtue of them having the lowest population frequencies reported in the study?

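# Note: the is.na() test and the arrange() step are reconstructed from the description above
filter(output_data, have_risk_allele_count > 0 & (str_detect(tolower(INITIAL.SAMPLE.SIZE), "european") | str_detect(tolower(REPLICATION.SAMPLE.SIZE), "european")) & (RISK.ALLELE.FREQUENCY > 0 & !is.na(RISK.ALLELE.FREQUENCY))) %>%
 arrange(RISK.ALLELE.FREQUENCY) %>%
 select(rsid, your_genotype = genotype, DISEASE.TRAIT, INITIAL.SAMPLE.SIZE, RISK.ALLELE.FREQUENCY) %>%
 head(10)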




Or perhaps you’d prefer to prioritise the order of the traits you have the risk allele for, for example, based on the number of entries in the GWAS database for that trait where the highest risk allele is one you have. You might argue that these could be some of the most reliably associated traits, in the sense that they would bias towards those that have so far been studied the most, at least within this database.

Let’s go graphical with this one, using the wonderful ggplot2 package.


library(ggplot2)

trait_entry_count <- group_by(output_data, DISEASE.TRAIT) %>%
 filter(have_risk_allele_count >= 1) %>%
 summarise(count_of_entries = n())

ggplot(filter(trait_entry_count, count_of_entries > 100), aes(x = reorder(DISEASE.TRAIT, count_of_entries, sum), y = count_of_entries)) +
 geom_col() +
 coord_flip() +
 theme_bw() +
 labs(title = "Which traits do you have the risk allele for\nthat have over 100 entries in the GWAS database?", y = "Count of entries", x = "Trait")


Now, as we’re counting combinations of studies and SNPs per trait here, this is obviously going to be somewhat self-fulfilling as some traits have been featured in way more studies than others. Likewise some traits may have been associated with many more SNPs than others. Also, recalling that many interesting traits seem to be related to a complex mix of SNPs, each of which may only have a tiny effect size, it might be that whilst you do have 10 of the risk alleles for condition X, you also don’t have the other 20 risk alleles that we discovered so far have an association (let alone the 100 that haven’t even been published on yet and hence aren’t in this data!).

Maybe then we can sort our output in a different way. How about we count the number of distinct SNPs where you have the risk allele, and then express those as a proportion of the count of all the distinct SNPs for the given trait in the database, whether or not you have the risk allele? This would let us say things such as, based (solely) on what is in this database, you have 60% of the known risk alleles associated with this trait.
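Here is a compact sketch of that proportion on invented data, just to show the arithmetic. The toy table, its rsids and its trait names are made up, and this is a simplified formulation for illustration rather than the exact pipeline used on the real data:

```r
library(dplyr)

# Invented rows: one row per SNP / study combination, as in the joined table
toy_output <- tibble(
  DISEASE.TRAIT = c(rep("Trait X", 6), rep("Trait Y", 2)),
  rsid = c("rs1", "rs1", "rs2", "rs3", "rs4", "rs5", "rs6", "rs7"),
  have_risk_allele_count = c(1, 1, 2, 0, 1, 0, 0, 2)
)

toy_proportions <- toy_output %>%
  group_by(DISEASE.TRAIT) %>%
  summarise(
    # Distinct SNPs where at least one of your alleles is the risk allele
    snps_with_risk_allele = n_distinct(rsid[have_risk_allele_count >= 1]),
    # All distinct SNPs recorded against the trait
    total_snps = n_distinct(rsid),
    pct_with_risk_allele = 100 * snps_with_risk_allele / total_snps
  )
toy_proportions
```

In this toy case Trait X has five distinct SNPs, three of which carry a risk allele match, giving 60%. Note that rs1 appears twice but is only counted once.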

One thing I noticed in the data: both the 23andme genome data and the gwascat highest risk allele have unusual values in the allele fields – things like ?, -, R, D, I and some numbers, which arise because the “uncleaned” STRONGEST.SNP.RISK.ALLELE didn’t have a -A, -C, -G or -T at the end of the SNP it named. Some of these entries may be meaningful – for example the D and I in the 23andme data refer to deletions and insertions, but won’t match up with anything in the gwascat data. Others may be more messy or missing data, for example 23andme reports “--” if no genotype result was provided for a specific SNP call.

In order to avoid these inflating the proportion’s denominator we’ll just filter down so that we only consider entries where our gwascat-derived “risk_allele_clean” and 23andme-derived “my_allele_1” and “my_allele_2” are all one of the standard A, C, G or T bases.

Let’s also colour code the results by the rarity of the SNP variant within the studied population. That might provide some insight to exactly what sort of special exception we are as an individual – although some of the GWAS data is missing that field and basic averaging won’t necessarily give the correct weighting, so this part is extra…”directional”.

You are no doubt getting bored with the sheer repetition of caveats here – but it is so important. Whilst these are refinements of sorts, they are simplistic and flawed and you should not even consider concluding something significant about your personal health without seeking professional advice here. This is fun only. Well, for those of us who could ever possibly classify data as fun anyway.

Here we go, building it up one piece at a time for clarity of some kind:


# Summarise proportion of SNPs for a given trait where you have a risk allele

trait_snp_proportion <-  filter(output_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) %>%
 group_by(DISEASE.TRAIT, you_have_risk_allele) %>%
 summarise(count_of_snps = n_distinct(rsid)) %>%
 mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) %>%
 filter(you_have_risk_allele == 1) %>%
 arrange(desc(proportion_of_snps_for_trait))

# Count the studies per trait in the database

trait_study_count <- filter(output_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
 group_by(DISEASE.TRAIT) %>%
 summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))

# Merge the above together

trait_snp_proportion <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")

# Plot the traits where there was more than 1 study and you have risk alleles for more than 70% of the SNPs studied

ggplot(filter(trait_snp_proportion, count_of_studies > 1 & proportion_of_snps_for_trait > 70), aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
 geom_col() +
 coord_flip() +
 theme_bw() + 
 labs(title = "Traits where I have more than 70% of the risk\nalleles studied, where > 1 study was involved", 
 y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency")



Again, beware that the same sort of trait can be expressed in different ways within the data, in which case these entries are not combined. If you wanted to be more comprehensive regarding a specific trait, you might feel inclined to produce your own categorisation first and group by that – e.g. lumping anything BMI or Body Mass Index into your own BMI category via creating a new field.
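A minimal sketch of that sort of home-made categorisation, using three trait labels quoted earlier in this post. The my_trait_category field name and the search pattern are assumptions chosen for illustration, so you would want to tune both to the traits you care about:

```r
library(dplyr)
library(stringr)

toy_traits <- tibble(
  DISEASE.TRAIT = c("Body mass index", "BMI in non-smokers", "Systolic blood pressure")
)

toy_traits <- toy_traits %>%
  mutate(my_trait_category = case_when(
    # Lump anything mentioning BMI or body mass index together
    str_detect(DISEASE.TRAIT, regex("body mass index|\\bbmi\\b", ignore_case = TRUE)) ~ "BMI",
    # Leave everything else under its original label
    TRUE ~ DISEASE.TRAIT
  ))
toy_traits
```

You could then group by my_trait_category instead of DISEASE.TRAIT in the summaries above.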

Happy exploring!

Analysing your 23andme genetic data in R part 1: importing your genome into R

23andme is one of the ever-increasing number of direct to consumer DNA testing companies. You send in a vial of your spit; and they analyse parts of your genome, returning you a bunch of reports on ancestry, traits and – if you wish – health.

Their business is highly regulated, as of course it should be (and some would say it oversteps the mark a little even with that), so they are, quite rightly, legally limited as to what info they can provide back to the consumer. However, the exciting news for us data geeks is that they do allow you to download the raw data behind their analysis. This means you can dig deeper into parts of your genome that their interpretations don’t cover.

It should be said that there is considerable risk involved here, unless – or perhaps even if – you happen to be a genetics expert. The general advice on interpretation for amateurs should be to seek a professional genetic counsellor before concluding anything from your DTC test – although in reality that might be easier said than done.

Whilst I might know a bit about how to play with data, I am not at all a genetics expert, so anything below must be taken with a large amount of skepticism. In fact, if you are in the perfectly legitimate camp of “best not to know” people when it comes to DNA analysis, or you feel there is any risk you won’t be able to treat the innards of your genome as solely a fun piece of analysis and steer clear of areas you don’t want to explore, it would be wise not to proceed.

Also, even as an amateur, I’m aware that the science behind a lot of the interpretation of one’s genome is in a nascent period, at best. There are many people or companies that may rather over-hype what is actually known here, perhaps even to the extent of fraud in some cases.

But if you are interested to browse your results, here is my first experience of playing with the 23andme raw data in R.

Firstly, you need to actually obtain your raw 23andme data. An obvious precondition to this is that you have purchased one of their analysis products, set up your 23andme account, and waited until they have delivered the results to you. Assuming that’s all done, you can visit this 23andme page, and press the “Download” button near the top of the screen. You’ll need to wait a while, but eventually they’ll notify you that your file is ready, and let you download a text file of results to your computer. Here, I called my example file “genome.txt”.

Once you have that, it’s time to load it into R!

The text file is in a tab-delimited format, and also contains 19 rows at the top describing the contents of the file in human-readable format. You’ll want to skip those rows before importing it into R. I used the readr package to do this, although it’s just as easy in base R.

A few notes:

  • It imported more successfully if I explicitly told R the data type of each column.
  • One of the column headers (i.e. the field names) starts with a # and includes spaces, which is a nuisance to deal with in R, so I renamed that right away
  • I decided to import it into a dataframe called “genome_data_test”

library(readr)
library(dplyr)

genome_data_test <- read_tsv(".\\data_files\\genome.txt", skip = 19, col_types = cols(
 `# rsid` = col_character(),
 chromosome = col_character(),
 position = col_integer(),
 genotype = col_character()))

genome_data_test <- rename(genome_data_test, rsid = `# rsid`)
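For comparison, the base R route mentioned above might look something like this sketch. It writes a tiny mock file first so that it runs on its own; the file contents, the mock_file and genome_base names, and the choice to skip every line starting with # (rather than a fixed 19 rows) are all assumptions made for illustration:

```r
# Write a tiny mock file in the 23andme layout (made-up values, not real data)
mock_file <- tempfile(fileext = ".txt")
writeLines(c(
  "# Mock header line describing the file",
  "# rsid\tchromosome\tposition\tgenotype",
  "rs4477212\t1\t82154\tAA",
  "rs3094315\t1\t752566\tAG",
  "i3001920\tMT\t16470\tT"
), mock_file)

# Treat every line starting with # as a comment, so the column names
# and types are supplied explicitly instead of being read from the file
genome_base <- read.table(
  mock_file,
  sep = "\t",
  comment.char = "#",
  col.names = c("rsid", "chromosome", "position", "genotype"),
  colClasses = c("character", "character", "integer", "character")
)
str(genome_base)
```

Declaring the column classes up front keeps chromosome labels such as MT as text rather than letting R guess a type.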

Let’s see what we have!



Sidenote: the genome data I am using is a mocked-up example in the 23andme format, rather than anyone’s real genome – so don’t be surprised if you see “impossible” results shown here. Call me paranoid, but I am not sure it’s necessarily a great idea to publicly share someone’s real results online, at least without giving it careful consideration.

OK, so we have a list of your SNP call data. The rsid column is the “Reference SNP cluster ID” used to refer to a specific SNP, the chromosome and position tell you whereabouts that SNP is located, and the genotype tells you which combination of the Adenine, Thymine, Cytosine and Guanine bases you happen to have in those positions.

(Again, I am not at all an expert here, so apologies for any incorrect terminology! Please feel free to let me know what I should have written 🙂 )

Now, let’s check that the import went well.

Many of the built in 23andme website reports do actually list what SNPs they refer to. For instance, if you click on “Scientific Details” on the life-changing trait report which tells you how likely it is that your urine will smell odd to you after eating asparagus, and look for the “marker tested” section, it tells you that it’s looking at the rs4481887 SNP.


And it also tells you what bases were found there in your test results. Compare that to the data for the same person’s genome imported in R, by filtering your imported data like this:

filter(genome_data_test, rsid == "rs4481887")

If the results of that match the results shown in the scientific details of your asparagus urine smell report, yay, things are going OK so far.

OK, so now your 23andme data is safely in R. But why did we do this, and what might it mean? Come back soon for part 2.

Books I read in 2017

Long term readers (hi!) may recall my failure to achieve the target I had of reading 50 books in 2016. I had joined the 2016 Goodreads reading challenge, logged my reading activity, and hence had access to the data needed to track my progress at the end of the year. It turns out that 41 books is less than 50.

Being a glutton for punishment, I signed up again in 2017, with the same cognitively terrifying 50 book target – basically one a week, although I cannot allow myself to think that way. It is now 2018, so time to review how I did.

Goodreads allows you to log which books you are reading and when you finished them. The finish date is what counts for the challenge. Nefarious readers may spot a few potential exploits here, especially if competing for only 1 year. However, I tried to play the game in good faith (but did I actually do so?  Perhaps the data will reveal!).

As you go through the year, Goodreads will update you on how you are doing with your challenge. Or for us nerd types, you can download a much more detailed and useful CSV. There’s also the Goodreads API to explore, if that floats your boat.

Similarly to last year, I went with the CSV. I did have to hand-edit the CSV a little, both to fill in a little data that appears to be absent from the Goodreads dataset, and also to add a couple of extra data fields that I wanted to track that Goodreads doesn’t natively support. I then popped the CSV into a Tableau dashboard, which you can explore interactively by clicking here.

Results time!

How much did I read

Joyful times! In 2017 I got to, and even exceeded, my target! 55 books read.

In comparison to my 2016 results, I got ahead right from the start of the year, and widened the gap notably in Q2. You can see a similar boost to that witnessed in 2016 around the time of the summer holidays, weeks 33-35ish. Not working is clearly good for one’s reading obligations.

What were the characteristics of the books I read?

Although page count is a pretty vague and manipulable measure – different books have different physical sizes, font sizes, spacing, editions – it is one of the few measures where data is easily available so we’ll go with that. In the case of eBooks or audio books (more on this later) without set “pages” I used the page count of the respective paper version. I fully acknowledge the rigour of this analysis as falling under “fun” rather than “science”.

So the first revelation is that this year’s average pages per read book was 300, a roughly 10% decrease from last year’s average book. Hmm. Obviously, if everything else remains the same,  the target of 50 books is easier to meet if you read shorter books! Size doesn’t always reflect complexity or any other influence around time to complete of course.

I hadn’t deliberately picked short books – in fact, being aware of this incentive I had tried to be conscious of avoiding doing this, and concentrate on reading what I wanted to read, not just what boosts the stats. However, even outside of this challenge, I (most likely?) only have a certain number of years to live, and hence do feel a natural bias towards selecting shorter books if everything else about them was to be perfectly equal. Why plough through 500 pages if you can get the same level of insight about a topic in 150?

The reassuring news is that, despite the shorter average length of book, I did read 20% more pages in total. This suggests I probably have upped the abstract “quantity” of reading, rather than just inflated the book count by picking short books. There was also a little less variation in page count between books this year than last by some measures.

In the distribution charts, you can see a spike of books at around 150 pages long this year which didn’t show up last year. I didn’t note a common theme in these books, but a relatively high proportion of them were audio books.

Although I am an avid podcast listener, I am not a huge fan of audio books as a rule. I love the idea as a method to acquire knowledge whilst doing endless chores or other semi-mindless activities. I would encourage anyone else with an interest in entering book contents into their brain to give them a whirl. But, for me, in practice I struggle to focus on them in any multi-tasking scenario, so end up hitting rewind a whole lot. And if I am in a situation where I can dedicate full concentration to informational intake, I’d rather use my eyes than my ears. For one, it’s so much faster, which is an important consideration when one has a book target! With all that, the fact that audio books are over-represented in the lower page-counts for me is perhaps therefore not surprising. I know my limits.

I have heard tell that some people may consider audio books as invalid for the book challenge. In defence, I offer up that Goodreads doesn’t seem to feel this way in their blog post on the 2018 challenge. Besides, this isn’t the Olympics – at least no-one has sent me a gold medal yet – so everyone can make their own personal choice. For me, if it’s a method to get a book’s contents into my brain, I’ll happily take it. I just know I have to be very discriminating with regards to selecting audio books I can be sure I will be able to focus on. Even I would personally regard it as cheating to log a book that happened to be audio-streaming in the background when I was asleep. If you don’t know what the book was about, you can’t count it.

So, what did I read about?

What did I read

Book topics are not always easy to categorise. The categories I used here are mainly the same as last year, based entirely on my 2-second opinion rather than any comprehensive Dewey Decimal-like system. This means some sort of subjectivity was necessary. Is a book on political philosophy regarded as politics or philosophy? Rather than spend too much time fretting about classification, I just made a call one way or the other. Refer to above comment re fun vs science.

The main changes I noted were indeed a move away from pure philosophical entries towards those of a political tone. Likewise, a new category entrant was seen this year in “health”. I developed an interest in improving one’s mental well-being via mindfulness and meditation type subjects, which led me to read a couple of books on this, as well as sleep, which I have classified as health.

Despite me continuing to subjectively feel that I read the large majority of books in eBook form, I actually moved even further away from that being true this year. Slightly under half were in that form. That decrease has largely been taken up by the afore-mentioned audio books, of which I apparently read (listened?) 10 this year. Similarly to last year, 2 of the audio entries were actually “Great Courses“, which are more like a sequence of university-style lectures, with an accompanying book containing notes and summaries.

My books have also been slightly less popular with the general Goodreads-rating audience this year, although not dramatically so.

Now, back to the subject of reading shorter books in order to make it easier to hit my target: the sheer sense of relief I felt when I finished book #50 and hence could go wild with relaxed, long and slow reading, made me concerned as to whether I had managed to beat that bias or not. I wondered whether as I got nearer to my target, the length of the books I selected might have risen, even though this was not my intention.

Below, the top chart shows the average page count per book completed on a monthly basis, year on year.

Book length over time


The 2016 data risks producing somewhat invalid conclusions, especially if interpreted without reference to the bottom “count of books” chart, mainly because of the existence of September 2016, a month in which I read a single book that happened to be over 1,000 pages long.

I also hadn’t actually decided to participate in the book challenge at the start of 2016. I was logging my books, but just for fun (imagine that!). I don’t remember quite when it was suggested I should explicitly join the challenge, but before then it’s less likely I felt pressure to read faster or shorter.

Let’s look then only at 2017:

Book length over time 2

Sidenote: What happened in July?! I only read one book, and it wasn’t especially long. I can only assume Sally Scholz’s intro to feminism must have been particularly thought-provoking.

For reference, I hit book #50 in November this year. There does seem to be some suggestion in the data that I did indeed read longer books as time went on, despite my mental disavowal of doing such.

Stats geeks might like to know that the line of best fit shown in the top chart above could be argued to explain 30% of the variation in book length over time, with each month cumulatively adding on an estimate of an extra 14 pages above a base of 211 pages. It should be stated that I didn’t spend too long considering the best model or fact-checking the relevant assumptions for this dataset. Instead I just pressed “insert trend line” in Tableau and let it decide :).
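For anyone who wants to see what those trend line figures imply, here is a minimal Python sketch. It assumes the fit is a plain straight line of the form pages = base + slope × month, and that months are numbered 1 (January) to 12 (December). The numbering is my assumption rather than something read out of Tableau, so treat the outputs as illustrative only.

```python
# Figures quoted from the Tableau trend line described above.
BASE_PAGES = 211       # intercept: estimated page count at "month 0"
PAGES_PER_MONTH = 14   # slope: estimated extra pages per month


def predicted_pages(month: int) -> int:
    """Estimated average book length for a month number.

    Assumes 1 = January ... 12 = December (an assumption, not from Tableau).
    """
    return BASE_PAGES + PAGES_PER_MONTH * month


# Show the estimate at the start, middle and end of the year.
for month in (1, 6, 12):
    print(month, predicted_pages(month))
```

Given that the fit accounts for only around 30% of the variation, any individual month could sit well away from these estimates.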

I’m afraid the regression should not be considered as being traditionally statistically significant at the 0.05 level though, having a p-value of – wait for it – 0.06. Fortunately, for my intention to publish the above in Nature :), I think people are increasingly aware of the silliness of uncontextual hardline p-value criteria and/or publication bias.

Nonetheless, as I participate in the 2018 challenge – now at 52 books, properly one a week – I shall be conscious of this trend and double-up my efforts to keep reading based on quality rather than length. Of course, I remain very open – some might say hopeful! – that one sign of a quality author is that they can convey their material in a way that would be described as concise. You generous readers of my ramblings may detect some hypocrisy here.

For any really interested readers out there, you can once more see the full list of the books I read, plus links to the relevant Goodreads description pages, on the last tab of the interactive viz.

Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies any amount of the history of, or the best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These specific transformations of data-into-diagram have stuck with us through the mists of time in order to become examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.


(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence caused the authorities to remove the Broad Street pump handle. People could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about perhaps is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause ->  action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had actually already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? He’d actually already found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but it certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: Here, then, is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by people who live on it. In this case, when the land was lower (vs sea level for example) then cholera deaths seemed to increase.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera“. It included this table:


The relationship seems quite clear; cholera deaths per 10k persons go up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a linechart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriology Reviews. Note the “observed mortality” line.


Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line concerning the calculated death rate according to “miasma theory”, which seems to have an outcome very similar on this metric to the actual cholera death rate. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later replaced with the knowledge of germs, but at the time the miasma theory was a strong contender for explaining the distribution of disease. This was probably helped because some potential actions one might take to reduce “miasma” evidently would overlap with those of dealing with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems  quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we discover later, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed them in a reasonable way. He was testing a hypothesis that was based on the common sense at the time he was working, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here though we can think of Karl Popper’s views of scientific knowledge being derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume certainty that the dominant one is correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I was to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view with regards to the explanation of the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will probably place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, can happen in day-to-day business analysis. Preexisting knowledge changes over time, and differs between groups. Who hasn’t seen (or had the experience of being) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data which was later revealed to have been affected by something entirely different?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to add value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that overnight 100% of the public stopped buying anything at all from your previously highly successful store, for instance.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer data viz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.


As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely important to take action on the findings equally as fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that was counter to the prevailing medical wisdom at the time. At least any modern-day analysts can take some solace in the knowledge that even our highest regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened –  it may well contribute towards great things in the future.

You made a chart. So what?

In the latest fascinating Periscope video from Chris Love, the conversation centred around a question that can be summarised as “Do data visualisations need a ‘so what’?“.

There are many ways of rephrasing this: one could ask whether it is (always) the responsibility of the viz author to highlight the story that their visualisations show? Or can a data visualisation be truly worthy of high merit even if it doesn’t lead the viewer to a conclusion?

This topic resonates strongly with me: part of my day job involves maintaining a reference library of the results from the analytical research or investigation we do. We publish this widely within our organisation, so that any employee who has cause or interest in what we found in the past can help themselves to the results. The title we happened to give the library is “So what?“.

Although the detailed results of our work may be reported in many different formats, each library entry has a templated front page that includes the same sections for each study:

  1. The title of the work.
  2. The question that the work intended to address.
  3. A summary of the scope and dataset that went into the analysis.
  4. A list of the main findings.
  5. And finally, the all-important “So what?” section.

Note the distinction between findings (e.g. “customers who don’t buy anything for 50 days are likely to never buy anything again”) and the so what (“we recommend you call the customer when it has been 40 days since you saw them if you wish to increase your sales by 10%”).

The simple answer

With the above in mind, my position is probably quite obvious. If you are going to demand a binary yes/no answer as to whether a data visualisation should have a “so what?”, then my simplistic generalization would be that the answer is yes, it should.

Most of the time, especially in a business context, the main intention behind the whole analytics pipeline is to provide some nugget of information that will lead to a specific decision or action being taken. If the data visualisation doesn’t lead to (or preferably even spoon-feed) a conclusion then there is a high risk that the audience might feel that they wasted their time in looking at it.

In reality though, the black-and-white answer portrayed above is naturally a series of various shades of grey.

A slightly more refined answer

Two key considerations are paramount to deciding whether a particular viz has to have a “so what” to be valuable.

The audience

Please note that I write this from the perspective of visualisations aimed at communities that are not necessarily all data scientist type professionals. If your intended audience is a startup data company populated entirely by computer science PhDs who live and breathe dataviz, then the answers may differ. But for most of us, hobbyists or pros, this is not the audience we have, or seek.

A rule of thumb here then might be:

  • If your audience consists entirely of other analysts, then no, it is not essential to have a “so what?” aspect to your viz. However under many circumstances it still would be extremely useful to do so.
  • If your audience includes non-analysts, particularly those people who might term themselves “busy executives” or claim that they “don’t need data to make decisions” (ugh) then it is in general absolutely essential that your viz points towards a “so what”, if a viz is indeed what you intend to deliver.

Why is it OK to lose the “so what” for analysts? Well, only because these people are probably very capable of using a well-designed viz to generate their own conclusions in an analytically safe way. It’s not that they don’t need a “so what”: they almost certainly do – it’s just that you can feel more secure that, whilst not producing it yourself, you can rely on them to do that aspect of the work properly.

They might even be better than you at interpreting the results, if for instance they have extensive subject domain knowledge that you don’t. Interpretation of data is almost always a mix of analytical proficiency and domain-specific knowledge.

Even the best technical analyst cannot have knowledge of all domains. This is why it’s generally not good to let a brand spanking new super-IQ multiple-PhD analyst join an existing company and sit on their own in a dark computer-filled room for a year before entering into discussion as to what kind of analysis you might be interested in to add maximum value to your world.

The lack of an explicit “so what?” ruins many great dashboards

I’m going to go a step further and say that in many cases – especially in non-data focussed organisations – “general” dashboards turn out to be not very useful.

This may be a controversial statement in a world where every analytical software provider sells fancy new ways to make dashboards, every consultant can build ten for you quicksmart, and every “stakeholder” falls over in amazement when they see that they can view and interact with several facets of data at once in a way that was never possible with their tedious plain .csv files.

But a pattern I have often seen is:

  1. Someone sees or suggests a fancy new tool that claims dashboarding as one of its abilities (and this is not to denigrate any tool; this happens plenty even with my favourite tool du jour, Tableau).
  2. A VIP loves the theoretical power of what they see and decides they need a dashboard “on sales” for example.
  3. An analyst happily creates a “sales dashboard” – usually based on what they think the VIP should probably want to see, given that “sales” is not a very fully fleshed out description of anything.
  4. The sponsor VIP is very happy and views it as soon as it’s accessible.
  5. They may even go and visit it every day for the first week, rejoicing in the up-to-date, integrated, comprehensive, colourful data. Joy!
  6. The administrator checks the server logs next month and realises that no-one in the entire company opened the sales dashboard since week 1.
  7. The analyst is sad.

Why? Everyone (sort of…arguably…) did their job. But, after the novelty wore off, the decision maker probably got bored or “too busy” to open the dashboard every day. At best, perhaps they ask an analytical type to monitor what’s going on with the dashboard. At worst, perhaps they go back to making up decisions based on the random decay of radioactive isotopes, or something similar.

They got “too busy” because, after they had waited for the  dashboard to load up, they’d see a few nice charts with interactive filters to go through in order to try and determine whether there was anything they should actually go and do in the real world based on what they showed.

Sales are a bit up in Canada vs yesterday, hooray! Yesterday they were a bit down, boo! Do I need to do something about this? Who knows? Do I want to fiddle around with 50 variations of a chart to try to work it out? No, it’s not my job and quite possibly I don’t have the time or expertise (and nor should I need it) to do that, sayeth the VIP.

So are dashboards useless? Of course not. But they have to be implemented with the reality of the audience capability, interest and use-case in mind. Most dashboards (at least those that are not solely for analysts to explore) should start with:

  • At least 1 clear pre-defined question to address; and
  • 1 clear pre-defined action that might realistically take place based on the answer to the question.

But I don’t want a computer running my business!

Shouldn’t you check that it would definitely be a bad idea before saying that? 🙂

But seriously, the above is not to say one has necessarily to commit blindly to taking the pre-defined action – not every organisation is ready for, or suited to, prescriptive analytics.

However, if there is no way at all that an answer that a dashboard provides could possibly lead to influencing an action, then is it really worth one’s time working on it, at least in a business context?

  • “Sales dashboard” is not a question or an action.
  • “Am I getting fewer sales this year than last year?” is a question.
  • “If I am getting fewer sales this year then I will spend more on marketing this year” is an action.
  • “What form of marketing gave the best ROI last year?” is a question.
  • “If I need to do more marketing this year then I’ll advertise using the method that gave the greatest ROI last year” is an action.

The list of questions doesn’t need to be exhaustive, in fact it usually can’t be. If someone can use a dashboard to answer 100 questions not even imagined at the time of creation, then great. Indeed this is one of the potential strengths of a well-designed dashboard – but there should be at least 1 question in mind before it is created.

Why does checking my dashboard bore me?

Note that in that example above, the listed actions actually imply that the dashboard user is only interested in the results shown on the dashboard under one particular condition: if the sales this year are lower than last year.

For 99 days in a row they might check the dashboard and see that the sales are higher this year, and hence do nothing. On the 100th day, perhaps there was a dramatic fall, so that day is the day when the appropriate advertising action is considered.

However, consider how many people will actually persist in checking the dashboard for 100 days in a row when 99% of the time the check results in no new action.

I myself am obviously very analytically inclined, am happy to interpret data and (I like to think) efficient at doing so, and yet even I have automated rules in my Outlook email client to immediately delete, unread, almost every “daily report” that gets emailed to me automatically (ssssh, don’t tell anyone, that’s just between you and me). Even the simple act of double-clicking to open the attachment is too much effort in comparison with the expected value of seeing the report contents on an average day.

In this sort of circumstance, what might enable a dashboard to be truly useful is the concept of alerting.

A possible use case is as follows:

  1. A sales dashboard aimed at answering the question of whether we are getting fewer sales this year is set up.
  2. Every day, alerting software routinely checks this data, and emails the VIP (only) if it shows that yes, sales have fallen. The email also provides a direct web link to the targeted sales dashboard.
  3. When the VIP receives this email, knowing that there is something “interesting” to see, they may well be concerned enough to open the dashboard and, to the best of their ability, use whatever context is available there to decide on their next action.
  4. If the information they need isn’t there, or they don’t have the time / expertise / inclination to interpret it, then of course they will legitimately request some more work from their analyst. But at least here we see that “data” provided a trigger that has alerted a relevant decision maker that they need to…make a decision, and made it easy for them to use the dashboard tool at their disposal specifically on the day that they are likely to gain value from doing so.
  5. Everyone is happy (well, except about the poor sales).
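The daily check in step 2 can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the function name, the example figures and the dashboard URL are all hypothetical, and a real setup would fetch the figures from your sales data and hand the message to whatever email or alerting tool you already use.

```python
from typing import Optional

# Hypothetical link to the targeted sales dashboard (an assumption for this sketch).
DASHBOARD_URL = "https://example.com/sales-dashboard"


def build_alert(sales_this_year: float, sales_last_year: float) -> Optional[str]:
    """Return an alert message if sales have fallen, otherwise None."""
    if sales_last_year <= 0:
        raise ValueError("sales_last_year must be positive to compute a change")
    if sales_this_year >= sales_last_year:
        return None  # nothing "interesting" today, so the VIP hears nothing
    drop_pct = (sales_last_year - sales_this_year) / sales_last_year * 100
    return (
        f"Sales are {drop_pct:.1f}% lower than at this point last year. "
        f"See the dashboard for context: {DASHBOARD_URL}"
    )


# Made-up figures: the alert only exists on the day there is something to see.
message = build_alert(sales_this_year=90.0, sales_last_year=100.0)
if message is not None:
    print(message)
```

The point of returning None on a normal day is that silence becomes the default: the VIP is only asked to look at the dashboard when doing so is likely to be worth their time.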

There is an implicit “so what” in the scenario above.

Main findings

  • Sales are lower than last year.
  • Last year, TV adverts produced tremendous ROI.

So what?

  • To get sales growing again, consider buying some advertising.
  • To be safest, use a method that was proven effective last year, TV.

But aren’t there some occasions that a “so what” isn’t needed?

Yes, rules of thumb have exceptions. There are some scenarios in which one might legitimately consider not producing an explicit “so what”.

Here are a few I could think of quickly.

1: Exploratory analysis: maybe you just got access to a dataset but you don’t really know what it contains, what the typical patterns and distributions are or its scope. Building a few visualisations on top of that is a great way for an analyst to get a quick understanding of the potential of what they have, and, later, what sort of “so what?” questions could potentially be asked.

2: Data quality testing: in a similar vein to the above, you can often use histograms, line charts and so on to get a quick idea of whether your data is complete and correct. If your viz shows that all your current customers were born in the 19th century then something is probably wrong.
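As a rough illustration of that kind of sanity check, here is a minimal Python sketch that buckets customer birth years by century and prints a crude text histogram. The birth years are made up for the example, and the function names are my own rather than part of any particular tool.

```python
from collections import Counter


def birth_century_counts(birth_years):
    """Count customers per hundred-year bucket, e.g. 1987 falls in the 1900s."""
    return Counter((year // 100) * 100 for year in birth_years)


def print_text_histogram(counts):
    """Print one row of '#' marks per bucket, oldest bucket first."""
    for century in sorted(counts):
        print(f"{century}s | {'#' * counts[century]}")


# Made-up data: two "current customers" apparently born in the 1800s.
birth_years = [1987, 1990, 1875, 2001, 1899]
counts = birth_century_counts(birth_years)
print_text_histogram(counts)
# 1800s | ##
# 1900s | ##
# 2000s | #
```

Any non-trivial bar against the 1800s is a prompt to go and check the data, not a finding about your customers.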

3: Getting inspiration: got too much time on your hands and can’t think of some other work to do? (!!!) You could pick a dataset, or set of datasets, and spend some time graphing their attributes and looking for interesting patterns, outliers, and so on that could form the basis of interesting questions.

  • Why does x correlate with y?
  • Why does x look like a Gaussian distribution whereas y looks like a gamma distribution?
  • Why does store X sell the most of product Y?

This doesn’t have to be done on an individual basis. An interactive dataviz might be a great basis for a group brainstorming discussion, whether within a group of analysts or a far wider audience of interested parties.

4: Learning technical skills: perhaps you are trialling new analysis software or techniques, or trying to improve your existing skills. Working with data you’re already familiar with in new tools is a great way to learn them; perhaps even recreating something you did elsewhere if it’s relevant. The aim here is to increase your skillset, not derive new insights.

5: “How to” guides for others to follow: whether formal training or blog posts (showing fancy extreme edge cases others can marvel at perhaps?), maybe your emphasis is not on what the data actually contains in a subject domain sense, but rather a demonstration of how to use a certain generic analytical feature or technique. Here the data is just a placeholder to provide a practical example for others to follow.

6: You’re an artist: perhaps you’re not actually trying to use data as a tool to generate insight, but rather to create art. This is no lesser a task than classic data analysis, but it’s a very different one, with very different priorities. Think for example of Nathalie Miebach, whose website’s tagline is:

“translating science data into sculpture, installations and musical scores.”


This might be fine art, but it does not try to lead to business insight.

7: You want to focus on promoting your work and become famous :-): a controversial one perhaps; but it is not always plain old bar charts that happen to show the greatest insights that get shared around the land of Twitter, Facebook and other such attention-requesting mediums.

If your goal is generically to get “coverage” – perhaps to increase advertising revenue based on CPM or to become more well-known for your work – and you feel that you have to choose between generating a true insight and making something that looks highly attractive, then the latter might actually be a better bet.

But one should acknowledge what they’re doing; perhaps the skills you demonstrate in doing this are closer to the afore-mentioned “data artist” than “data analyst”.

I have a sneaking suspicion for instance that – not to re-raise a never-ending debate! – David McCandless’ books are probably picked up in higher volumes than Stephen Few’s when both are presented together in a bookshop.

  • McCandless’ “Information is Beautiful”, a series of pretty, sometimes fascinating, infographics, many of which have little in the way of conclusions, is currently rank #1788 in Amazon UK books.
  • Stephen Few’s “Show me the numbers”, a more hardcore text on best practice in presenting information, with a cover consisting of very unglamorous bar and line charts, is at #7952.

This is not to compare one to the other in terms of worthwhileness; they are aimed at totally different audiences whose desire to have a book in the “data visualisation” category is motivated by very different reasons.

Even amongst the specialist dataviz-analyst community Tableau has, I note that around half of the visualisations that Tableau picks as their public “Viz of the day” are variations on geographical maps.

Geo-maps tend to look “fancier” and more enticing than bar charts, even though they are applicable only to analysis of a very specific type of data, and can provide only certain types of insights. For most organisations, whilst there is often relevance in geospatial analysis, I suspect that “geo-maps” analytics forms far less than 50% of total analytical output.

It’s therefore very unlikely that the winning “Viz of the day” entries reflect how Tableau is actually used most of the time. Hence you might conclude that, if you want to be in the running for that particular honour, you should bias your work towards visualisation with the sort of attention-grabbing graphics that maps often provide, irrespective of whether another form might generate a similar or stronger “so what?” output.

8: Regulatory / reporting requirements: in some circumstances you might be bound by regulation or other authority to produce certain analytical reports, irrespective of whether you think they add value or provide insight. Think for instance of the fields of accounting, for publicly traded companies, healthcare companies, investment products and so on.

9: Your job is explicitly, literally, to “visualise data”. It’s possible to imagine, perhaps in a large business department, employing someone whose job is to repeatedly convert data, for instance, from text tables into a best-practice chart form, without going further. It would be another person’s job to derive the “so what?”.

You could think of this as a horizontal slice of the analytics pipeline vs the “beginning-to-end” vertical pipeline. After all, analysts often rely on other people with different skills (e.g. IT) to do a preparatory phase of data analysis, the data provision/manipulation itself (including extract, transform and load operations). They could also rely on other people to do the conclusion-forming stage.

Many companies do seem to have a de-facto version of this setup by employing people to “create reports”. By this, they may mean something akin to blindly getting up-to-date data into a certain agreed template or dashboard format that managers are supposed to use to derive decisions from.

However, unless your managers happen to be keen analysts, or your organisation is extraordinarily predictable, then I tend to be concerned about the efficiency and reliability of this method for anything other than, for example, the regulatory purposes mentioned above. It’s hard to imagine someone else consistently gaining optimal insights from a chart they had no control over designing, without a large amount of overhead-inducing iteration between chart-creator and insight-finder. Let’s face it – most non-quant managers would prefer a bullet-point summary of findings rather than a 10 tab Excel workbook full of charts if they’re honest.

There may be many more such scenarios; do let me know in the comments!

Hang on, isn’t there a “so what” in some of the above?

Did you notice the semantic trickery in the above “no so what” viz reasons? In fact, most do either have an implicit “so what” or are simply facilitating the later creation of one.

Items 1, 2 & 3 could be considered as part of the data preparation phase of the analytics pipeline. It would be unlikely (and unwanted) for the products of them to be the end of the analysis. Almost certainly, they’re a step 1 to a further analysis. An implicit “so what” here is either that the data is safe to proceed with, or it is not.

The output of these approaches can also be useful for establishing baselines for metrics, even if this isn’t the intended use at this point. For instance, if your exploration reveals the average customer purchases £5 of products, this may be useful down the line to compare next year’s sales to. Did your later interventions improve sales or not?

Items 4 & 5 come down to being technical training for either yourself or for others. Once trained, you’re likely to be off analysing “so what?” scenarios next. If we’re looking to contrive a “so what?” here, it might be “so I am ready to put my skills to good use tackling real questions”.

Item 6 is quite unique. The data visualisation itself may never be useful as a “so what?” to anyone. It was never intended to be. It’s for a totally different audience who would no more ask “so what?” of data-inspired art than they would of Da Vinci’s ‘Mona Lisa’.

Item 7 again might be considered data-use for the sake of something other than intrinsic “analysis”. This type of work might well have an explicit “so what?”, and that could even be part of its allure. But that is not the primary reason why the visualisation was created, so it might not. Sometimes it could be considered a variant of #6 with a specific goal.

It may also itself be a tool that generates useful data. If viewcount is what is important to the creator, then they may be tracking those pageviews on their own “so what”-enabled dashboard in order to determine what sort of output creates the most value for them.

Items 8 and 9 are mid-parts of the analytics pipeline. Although you may not be explicitly defining a “so what”, you’re enabling someone else to come up with their own later.

For better or worse, mandatory reporting regulations are there for someone’s perceived reason. A chart of fund performance is supposed to be there to help inform potential clients whether they would like to invest or not, not simply to provide a nice curved shape.

And if your job is to create “standard” reports or charts, then almost certainly someone else is completing the later step of interpreting them to form their own “so what?”. Or at least they are supposed to be.

To conclude: (why) are we valuable?

Fiddling around with data may be somewhere between Big Bang level geekery and the sexiest job of the 21st century, and holds a personal fascination for some of us. But if we want someone to employ us to do it, or to add value in some other way to the world, we should remember why data as a vocation exists. For the average data analyst, it’s not to make a series of lines that look pleasant (although it’s always nice when that happens).

To quote the viz-lord Stephen Few, in his book “Now You See It”:

We analyse information not for its own sake but so we can make informed decisions. Good decisions are based on understanding. Analysis is performed to arrive at clear, accurate, relevant, and thorough understanding.

(my emphasis added)

Outside of that book, he uses the term “data sensemaking” frequently, which is a good description of what organisations tend to want from their data analysts, even if they don’t know to phrase it in that manner. It must be stressed again, many “busy execs” are far happier with a few bullet points or alerts on potential issues than a set of even the most beautiful, most best-practice, visualisations.

When one exists within the analyst community, it can be hard to remember that not everyone enjoys “data”. Even many of those who are intrigued may not yet have had the time, privilege or education that leads towards quick, accurate interpretation of data. It can be frustrating, or even impossible, for a non-quant specialist to try and understand the real-world implication of an abstract representation of some measure: they simply don’t want to, or can’t – and shouldn’t need to – hunt for their own takeaways in many cases.

When a crime is committed, we hope a professional detective will put together the clues and provide the real world interpretation that allows us to successfully confront the criminal in court. When non-trivial data appears, we should hope that a professional analyst is on hand to put together the data-clues and provide a real world interpretation that lets us successfully confront whatever issue is at hand.

Bonus addendum: some “so whats” are worse than no “so whats”

Before we go, there is perhaps one extra risk of “so whatting” a viz that should be considered. Producing a conclusion that could lead to action tends to necessitate taking a position; essentially you move from presenting a picture to arguing for the implications of what it shows.

Much data can provide multiple distinct answers to the same question if it is manipulated enough. There are indeed lies, damned lies and statistics, and dataviz inspired variations of all three.

If the analyst approaches the “so what?” aspect with bias, then human psychology is such that they may be inclined to provide an awesome conclusion that just coincidentally happens to match their pre-analysis viewpoint; cf. “confirmation bias”. Of course, many organisations effectively employ people, or subcontract out work, for this exact reason, but that is generally not an ethically fantastic, or professionally fulfilling, position (and whole other organisations exist to debunk such guff).

It’s pretty much impossible to provide even a basic chart without the risk of bias. Data analysis is surely part art, mixed amongst the maths and science – one can of course debate the precise split. But a data vizzer has inherently made some explicit choices: what source of data to use, how much data, which type of visualisation, which comparisons to make, the format of chart and much more – all of which can induce, consciously or not, bias to the audience.

Many best-practice “rules” of dataviz, and analytics in general, are in fact designed to reduce the risk of this. This is a key reason why it’s worth learning them. Outside of those memorisable basics though, it’s often interesting to try and test the opposing view to what you’re presenting as your “so what?”.

Perhaps this year you have a higher proportion of female customers than last year. So what? “Our 10 year strategy to redesign our product to be especially attractive to women has been successful, we deserve a bonus”? Well, perhaps, but what if:

  • Last year had a weirdly low proportion of female purchasers vs normal and you’re just seeing basic regression to the mean?
  • Or, for the past 9 years the proportion of women buying your product has plummeted 10% every year; only to increase 2% in the latest year. Does that make your 10 year strategy a success?
  • Or this year was the first year you advertised in Cosmo, instead of FHM. Have other changes produced a variable that confounds your results?
  • Or men have stopped buying your product whilst women continue to buy it at exactly the same rate…does that count as success?
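Two of these alternatives are easy to check with a little arithmetic. The Python sketch below uses entirely made-up customer counts, and it assumes that “10% every year” means a relative fall in the female share (it could equally be read as percentage points); the function names are hypothetical.

```python
def female_share(women: int, men: int) -> float:
    """Proportion of customers who are women."""
    return women / (women + men)


def share_after_decade(start_share: float, yearly_change: float = -0.10,
                       years: int = 9, final_change: float = 0.02) -> float:
    """Apply a relative change each year for `years`, then one final change."""
    share = start_share
    for _ in range(years):
        share *= 1 + yearly_change
    return share * (1 + final_change)


# Men stop buying, women buy at exactly the same rate (made-up counts):
last_year = female_share(women=400, men=600)   # 40% of customers
this_year = female_share(women=400, men=200)   # roughly 67% of customers
print(f"Female share: {last_year:.0%} -> {this_year:.0%}, with no extra women buying")

# Nine years of 10% relative decline followed by a 2% rise, starting from 50%:
print(f"Share after the decade: {share_after_decade(0.5):.1%}")
```

In the first case the proportion jumps even though not one additional woman bought the product; in the second, the latest year’s rise leaves the share far below where the 10 year strategy started.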

The right data displayed in the right way can help you eliminate or confirm these and other possibilities.

For any decision where the benefit likely outweighs the cost, it’s worth doing the exercise of disproving your first intuition in order to provide comfort that you are supporting the best quality of decision making; not to mention reducing the risk that some joker with half a spreadsheet invalidates your finely crafted interpretation of your charts.