How to be happy: the data driven answer (part 1)

A fundamental goal for many people, explicit or otherwise, is to be maximally happy. Easily said, not always so easily done. So how might we set about raising our level of happiness? OK, at some level, we’re all individuals with our own set of wishes and desires. But, at a more macro level, there are underlying patterns in many human attributes that lead me to believe that learning what typically makes other people happy might be useful with regards to understanding what may help maintain or improve our own happiness.

With that in mind, let’s see if we can pursue a data driven answer to this question: what should our priorities or behaviours likely be if we want to optimise for happiness?

For a data driven answer, we’re going to need some data. I settled on the freely available HappyDB corpus (thanks, researchers!).

The corpus contains around 100,000 free text answers, crowdsourced from members of the public who had signed up to Mechanical Turk,  to this question:

What made you happy today? Reflect on the past {24 hours|3 months}, and recall three actual events that happened to you that made you happy. Write down your happy moment in a complete sentence.

Write three such moments.

Examples of happy moments we are NOT looking for (e.g events in distant past, partial sentence):
– The day I married my spouse
– My Dog

It should be noted that the average Mechanical Turk user is not representative of the average person in the world, and hence any findings may have limited generalisability. However, the file does contain some demographics, which we may dig into later to see whether there’s any correspondence between people’s personal characteristics and what makes them happy. Perhaps most notably, around 86% of respondents were from the USA, so there will clearly be a geographic / cultural bias at play. However, being from the UK, a similarly “westernised” country, this may be less of a problem for me if I personally wish to act on the results. A more varied geographical or cultural comparison of the drivers of happiness would however be a fascinating exercise.

I’m most interested in eliciting the potential drivers of happiness in a manner conducive to informing my mid-to-longer term goals. With that in mind, here I’m only looking at the longer term variant of the question – what made people happy in the past 3 months? A follow-up could certainly be whether the types of things that people report having made them happy over a 3 month timeframe correspond to the things that people report making them happy on a day-to-day basis.

So, having downloaded the dataset, the first step is to read it into our analysis software. Here I will be using R. The researchers supply several data files, which are described on their Github page. I decided to use “cleaned_hm.csv“, the file where they’d generously cleaned up a few of the less valid-looking entries – for instance those that were blank, had only a single word, or seemed to be misspelled.

Whilst some metadata is also included in that file, for the first part of this exercise I am only interested in these columns:

  • hmid: the unique ID number for the “happy moment”
  • wid: an ID number that tells us which “worker”, i.e. respondent, answered the question. One person may have responded to the question several times. It’s also a way to link up the demographic attributes of the respondent in future if we want to.
  • cleaned_hm: the actual (cleaned up) text of the response the user gave to the happiness question

We’ll also use the “reflection_period” field to filter down to only the responses to the 3 month version of the question. This field contains “3m” in the row for these 3-month responses.

library(readr)
library(dplyr)

happy_db <- read_csv(".\\happy_db_data\\rit-public-HappyDB-b9e529e\\happydb\\data\\cleaned_hm.csv")

happy_data <- filter(happy_db, reflection_period == "3m") %>%
  select(hmid, wid, cleaned_hm)

Let’s take a look at a few rows of the data to ensure everything’s looking OK.

[screenshot: head(happy_data) output]

OK, that looks good, and perhaps already gives us a little preview of what sort of things make people happy.

Checking the record count – nrow(happy_data) – shows we have 50,704 happy moments to look at after the filters have been applied. Reading them all one by one isn’t going to be much fun, so let’s load up some text analysis tools!

First, I wanted to tokenise the text. That is to say, extract each individual word from each response to the happiness question and put it onto its own row. This makes certain types of data cleaning and analysis way easier, especially if, like me, you’re a tidyverse advocate. I may want to be able to reconstruct the tokenised sentences in future, so will keep track of which word comes from which happy moment, by logging the relevant happy moment ID, “hmid”, on each row.

Past experience (and common sense) tells me that some types of words may be more interesting to analyse than others. I won’t learn a lot from knowing how many respondents used the word “and”.

One approach could be to classify words as to their “part of speech” type – adjectives, verbs, nouns and so on. My intuition is that people’s happiness might be influenced by encountering certain objects (in the widest possible sense – awkwardly incorporating people into that classification) or engaging in specific activities. This feels like a potentially close fit to the noun and verb parts of speech. So let’s begin by extracting each word and classifying it as to whether it’s a noun, verb or something else.

One R option for this is the cleanNLP library. cleanNLP includes functions to tokenise and “annotate” each word with respect to what part of speech it is. It can actually use a variety of natural language processing backends to do this, some fancier than others. Some require installing non-R software like Python or Java, which, to keep it simple here, I’m going to avoid. So I’m going to use the udpipe backend, which is pure R, and hence can be installed with the standard install.packages(“udpipe”) command if necessary.

Once cleanNLP is installed, we need to ask it to split up and tag our text as to whether each word is a noun, verb and so on. First we initialise the backend we want to use, udpipe. Then we use the cnlp_annotate command to perform the tokenisation and annotation, passing it:

  • the name of the dataframe containing the question responses: happy_data.
  • the name of the field which contains the actual text of the responses, as text_var.
  • the name of the field that identifies each individual unique response as doc_var.

This process can take a long time to complete, so don’t start it if you’re in a hurry.

Finally, we’ll take the resulting “annotation” object and extract the table of annotated words (aka tokens) from it, to make it easy to analyse with standard tidy tools.

install.packages("cleanNLP")
library(cleanNLP)

# initialise udpipe annotation engine

cnlp_init_udpipe()

# annotate text

happy_data_annotated <- cnlp_annotate(happy_data, as_strings = TRUE, 
                                   text_var = "cleaned_hm", doc_var = "hmid")

# extract tokens into a data frame

happy_terms <- happy_data_annotated %>% 
  cnlp_get_token()

OK, let’s see what we have, comparing this output to the original response data for the first entry.

[screenshot: the original response data (before annotation)]

After:

[screenshot: the annotated token output (after annotation)]

Lovely. In our annotated token data, we can see the id field matches the hmid field of the response file, allowing us to keep track of which words came from which response.

Each word of the response is now a row, in the field “word”. Furthermore, the word has also been converted to its lemma in the lemma field – a lemma being the base “dictionary” version of the word – for instance the words “studying” and “studies” both have the lemma “study“. This lemmatisation seems useful if I’m interested in a general overview of the subjects people mention in response to the happiness question, insomuch as it may group words together that have the same basic meaning.
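As a quick illustrative aside – not something from the original analysis – you can see exactly which surface forms were collapsed into a given lemma with a simple dplyr query:

# which original words were mapped onto the lemma "study", and how often?
filter(happy_terms, lemma == "study") %>%
  count(word, sort = TRUE)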

Of course it can’t always be perfect – the above example in fact has a type of error, in that the word “nowing” has been lemmatised as “now”, whereas one has to assume the respondent actually meant “knowing”, with a lemma more like “know”. However, this is user error – they misspelt their response. I don’t fancy going through each response to fix up these types of errors, so for the purposes of this quick analysis I’ll just assume them to be relatively rare.

We can also see, in the field “upos”, that the words have been classified by their grammatical function – including the verbs and nouns we’re looking for. upos stands for “universal part of speech”, and you can see what the abbreviations mean here.

The “pos” field divides these categories up further, and can be deciphered as being Penn part of speech tags. In the above example, you can see some verbs are categorised as VBG and others as VB. This is distinguishing between the base form of a verb and its present participle. I’m not going to concern myself with these differences, so will stick to the upos field.
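As another illustrative aside, not part of the original post, a quick count shows how the tokens split across those upos categories:

# how many tokens were assigned to each universal part of speech category?
count(happy_terms, upos, sort = TRUE)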

Now we have this more usable format, there’s still a little more data cleansing I want to do.

  • One part of speech (hereafter “POS”) category is called PUNCT, meaning punctuation. I don’t really want to include punctuation marks in my analysis, so will remove any words that are classified as PUNCT.
  • Same goes for the category SYM, which are symbols.
  • I also want to remove stopwords. These are words like “the”, “a” and other very common words that are typically not too informative when it comes to analysing sentences at a per-word level. Here I used the snowball list to define which words are in that category. This list is included in the excellent tidytext package,  the usage of which is documented in its companion book.
  • I’m also going to remove words that directly correspond to “happy”. Given the question itself is “what makes you happy?” I feel safe to assume that all the responses should relate to happiness. Knowing people used the word “happy” a lot won’t add much to my understanding.

Here’s how I did it:

library(tidytext) # provides the stop_words dataset

happy_terms <- filter(happy_terms, upos != "PUNCT" & upos != "SYM") %>%
  anti_join(filter(stop_words, lexicon == "snowball"), by = c("lemma" = "word")) %>%
  filter(!(lemma %in% c("happy", "happiness", "happier")))

OK, now we have a nice cleanish list of the dictionary words from the responses. Let’s take a look at what they are! We’ll start with the most obvious type of analysis, a frequency count. What are the most common words people used in their responses?

Let’s start with “things”, i.e. nouns. Here are the top 20 nouns used in the answers to the happiness question.

library(ggplot2)

filter(happy_terms, upos == "NOUN") %>%
  count(lemma, sort = TRUE) %>%
  arrange(desc(n)) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(lemma, n, mean), y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = "Most common nouns in responses to happiness question", x = "", y = "Count")
[bar chart: most common nouns in responses to the happiness question]

OK, there’s already some strong themes emerging there! Many nouns relate to people (friend, family, etc.), there’s a mention of jobs, and some more “occasion” based terms like birthday, event etc.

There’s also a lot of time frame indicators. Whilst it may eventually be interesting that there are more mentions of day than month, and more month mentions than year, for a simple first pass I’m not sure they add a lot. Let’s exclude them, and plot a larger sample of nouns. This time we’ll use a word cloud, where the size of each word is relative to its frequency of usage. Larger words are those that are used more often in the responses. For this, we can use the appropriately named library “wordcloud“.

It should be noted that, whilst wordclouds are more visually appealing than bar charts to many people, they are certainly harder to interpret in a precise way. However, here I’m more looking for the major themes, so I’ll live with that and go with the prettier option.

install.packages("wordcloud")
library(wordcloud)

# create a count of nouns, excluding some time based ones

happy_token_frequency_nlp_noun <- 
  filter(happy_terms, upos == "NOUN" & !(lemma %in% c("day", "time", "month", "week", "year", "today"))) %>%
  count(lemma, sort = TRUE)

# open a png graphics device so wordcloud gets saved to disk as a png

png("wordcloud_packages.png", width=12,height=8, units='in', res=300) 

# create the wordcloud

wordcloud(words = happy_token_frequency_nlp_noun$lemma, freq = happy_token_frequency_nlp_noun$n, max.words = 150, colors = c("grey60", "darkgoldenrod1", "tomato"))

# close the png device

dev.off() 

A wordcloud of happy nouns:

[word cloud: most common nouns in happy moments]

A cornucopia of happiness-inducing things! In this format, some commonalities truly stand out. It seems that friends make people particularly happy. Families are key too, although the number of times “family” is directly mentioned is lower than “friend”. That said, parts of family-esque structures such as son, daughter, wife, husband, girlfriend and boyfriend are also specifically mentioned with relatively high frequency. Basically, people are made happy by other people. Or even by our furry relatives – dog and cat are both in there.

Next most common, perhaps less intuitively, are words around what is probably employment – job and work.

There’s plenty of potentially “event” type words in there – event itself, but also birthday, date, dinner, game, movie et al. Perhaps these show what type of occasions are most associated with happy times. I can’t help but note that many of them sound again like opportunities for socialising.

There’s a few actual “things” in the more conventional sense too; inanimate objects that you could own. They’re less prevalent in terms of mentions of any individual one, but car, computer, bike and phone are some relevant examples that appear.

And then some perhaps less controllable phenomena – weather, season and surprise.

It must be noted these interpretations have to rely on assumptions based on preexisting knowledge of how people usually construct sentences and the norms for answering questions. We’re looking at single words in isolation. There’s no context.

If people are actually writing “Everything except my friend made me happy”  then the word “friend” would still appear in the wordcloud. Intuitively though, this seems unlikely to be the main driver of it featuring so strongly. We may however dig deeper into context later.  For now, we can also increase our confidence in the most straightforward interpretation by noting that there’s a fair bit of external research out there that suggests a positive connection between friendship and happiness, and even health, that would support these results. Happify produced a nice infographic summarising the results of a few such studies.

Next up, let’s repeat the exercise using verbs, or as I vaguely recall learning at primary school, “doing words”. What kind of actions do people take that result in them having a happy memory?

The code is very similar to the nouns example. Firstly, let’s filter and plot the most common verbs we found in the responses.

filter(happy_terms, upos == "VERB") %>%
  count(lemma, sort = TRUE) %>%
  arrange(desc(n)) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(lemma, n, mean), y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = "Most common verbs in responses to happiness question", x = "", y = "Count")
[bar chart: most common verbs in responses to the happiness question]

The top 3 there – get, go and make – seem to be particularly prevalent compared to the rest. Getting things, going to places and the satisfaction of creating things are all things we might intuit are pleasing to the average human. However they’re somewhat vague, which makes me curious as to how exactly they’re being used in sentences. Are there particular things that show up as being got, gone to or made that generate happy memories?

To understand that, we’re going to have to look at a bit of context, even if only in a simplistic way. We’ll start by looking at the most common phrases those words appear in. To do this, we’ll use the tidytext library to generate “bigrams” – that is to say, each 2-word combination within the response sentences that those specific words are used in.

For example, if there’s a sentence ‘I love to make cakes’, then the bigrams involving ‘make’ here are:

  • to make
  • make cakes

If we calculate and count every such bigram, then, if enough people enjoy making cakes, that might pop out in our data.
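To make that concrete, here’s a tiny standalone illustration – not part of the original analysis – of what tidytext’s unnest_tokens produces for that toy sentence:

library(tidytext)

toy <- data.frame(id = 1, text = "I love to make cakes", stringsAsFactors = FALSE)

# token = "ngrams", n = 2 gives every consecutive 2-word combination (lowercased by default)
unnest_tokens(toy, bigram, text, token = "ngrams", n = 2)
# returns the bigrams: "i love", "love to", "to make", "make cakes"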

First up though, remember that our analysis so far is actually showing the lemmas of the words used by the respondents, rather than the exact words themselves. So we’ll need to determine which words in our dataset are represented by the lemmas “get”, “go” and “make” in order to comprehensively find the appropriate bigrams in the text.

We can do that by looking for all the distinct combinations of lemma and word in our tokenised dataset. Here’s one way to see which unique real-world words in our dataset are represented by the lemma “get”:

filter(happy_terms, lemma == "get") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct()
[table: the distinct words sharing the lemma “get” – got, get, getting, gets and gotten]

This means that we should look in the raw response text for any bigrams involving got, get, getting, gets or gotten if we want to see what the commonest contexts for the “get” responses are. To automate this a little, let’s create a vector for each of the 3 highly represented lemmas – get, go, and make – that include all the words that are grouped into that lemma.

get_words <- filter(happy_terms, lemma == "get") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct() %>%
  pull(word)

go_words <- filter(happy_terms, lemma == "go") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
    distinct() %>%
  pull(word)

make_words <- filter(happy_terms, lemma == "make") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
    distinct() %>%
  pull(word)

The next step is to generate the bigrams themselves. Here I’m also removing the common, generally tedious, stopwords as we did before. This simplistic approach to stopwords does have some potentially problematic side-effects – including that if an interesting single word is surrounded entirely by stopwords then it will be excluded from this analysis. However, in terms of detecting a few interesting themes as opposed to creating a detailed linguistic analysis, I’m OK with that for now.

# create the bigrams

library(tidyr) # for separate()

happy_bigrams <- unnest_tokens(happy_data, bigram, cleaned_hm, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!(word1 %in% filter(stop_words, lexicon == "snowball")$word) &
         !(word2 %in% filter(stop_words, lexicon == "snowball")$word))

# create a count of how many time each bigram was found - sorting from most to least frequent

bigram_counts <- happy_bigrams %>% 
  count(word1, word2, sort = TRUE)             
             
# show the top 20 bigrams that use the 'get' lemma
             
filter(bigram_counts, word1 %in% get_words | word2 %in% get_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving the 'get' lemma", x = "", y = "Count")

What do people like to get?

[bar chart: most common bigrams involving the ‘get’ lemma]

“finally got” may not tell us much about what was gotten, but evidently achieving something one has long waited for is particularly pleasing in this context. Other obvious themes there are mentions again of other people – son got, wife got, friend got, get together etc. Likewise there’s a few life events there – getting married, and work stuff like getting promoted or hired.

But my curiosity remains unabated. What sort of things are people happy that they “finally got”? To get some idea, I decided to split the responses up into five-word n-grams –  think of these as being like bigrams, but looking at groups of 5 consecutive words rather than 2.

Groups of 5 words here will tend to produce low counts, as people don’t necessarily express the same idea using precisely the same words – we may revisit this limitation in future! Nonetheless, it may give us a clue as to what some of the bigger topics are. We can use the unnest_tokens command from the tidytext library again. This is exactly what we did above to generate the bigrams, but this time specifying an n of 5, to specify that we want sequences of 5 words.

# make the 5-grams, keeping only those that were found at least twice
happy_fivegrams <-  unnest_tokens(happy_data, phrase, cleaned_hm, token = "ngrams", n = 5) %>%
  separate(phrase, c("word1", "word2", "word3", "word4", "word5"), sep = " ") %>%
  group_by(word1, word2, word3, word4, word5) %>%
  filter(n() >= 2 ) %>%
  ungroup()

# count how many times each 5-gram was used
fivegram_counts <- happy_fivegrams %>% 
  count(word1, word2, word3, word4, word5, sort = TRUE)   
  
 # show the top 15 that started with "finally got" 
filter(fivegram_counts, word1 == "finally" & word2 == "got" ) %>%
   head(15)
[table: most common 5-grams beginning with “finally got”]

Only a couple of people used precisely the same phrases, but the most obvious feature that stands out here relates to employment. People “finally” getting new jobs, or progressing towards the jobs they want, are occasions that make them happy.

Back now to drilling down into the common verb lemma bigrams: where are people going when they make happy memories?

# show the top 20 bigrams that use the 'go' lemma   
	  
filter(bigram_counts, word1 %in% go_words | word2 %in% go_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving 'go' lemma", x = "", y = "Count")  
[bar chart: most common bigrams involving the ‘go’ lemma]

Here, going shopping wins the frequency competition – although it’s certainly not usually an experience I personally enjoy! There’s a few references to family and friends again, and a predilection for going home. A couple of specific outdoor leisure activities – hiking and fishing – seem to suit some folk well. A less expected top 20 entry there was Pokemon Go, the “gotta-catch-em-all” augmented reality game, which, sure enough, does have the word “go” in its title. The human urge to collect apparently persists in the virtual world 🙂

Finally, what are people making that pleases them?

# show the top 20 bigrams that use the 'make' lemma 

filter(bigram_counts, word1 %in% make_words | word2 %in% make_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving 'make' lemma", x = "", y = "Count")
[bar chart: most common bigrams involving the ‘make’ lemma]

It’s not all about making in the sense of arts and crafts here. Making plans appeals to some – perhaps creating something that they can then look forward to occurring in the future. References to “moments” are pretty vague but perhaps describe happiness coming from specific events that could later be drilled down into.

There’s the ever-present social side of wives, sons, girlfriends et al featuring in the commonest bigrams. The hedonistic pleasures of money and sex also creep into the top 20.

OK, now we’ve dug a little into the 3 verb lemmas that dominate the most common verbs used, let’s take a look at the next most frequently used selection of verbs. Here again, we’ll use the (controversial) wordcloud, and hope it allows us to elucidate at least some common themes.

Removing “get”, “make” and “go” from the tokenised verb list, before wordclouding the most common of the remaining verb lemmas:

# filter the terms to show only verbs that are not get, make or go.

happy_token_frequency_nlp_verb_filtered <- filter(happy_terms, upos == "VERB" & !lemma %in% c("get", "make", "go")) %>%
count(lemma, sort = TRUE)

# open a png graphics device so wordcloud gets saved to disk as a png

png("wordcloud_verb_filtered.png", width=12,height=8, units='in', res=300) 

# create the wordcloud

wordcloud(words = happy_token_frequency_nlp_verb_filtered$lemma, freq = happy_token_frequency_nlp_verb_filtered$n, max.words = 150, colors = c("grey60", "darkgoldenrod1", "tomato"))

# close the png device

dev.off() 

A wordcloud of happy verbs:

[word cloud: most common verb lemmas, excluding get, make and go]

The highest volumes here are seen in experiential words – see and feel, vague as they might seem in this single-word analysis. Perhaps though they hint towards the conclusions of some external research that suggests, at least after a certain point of privilege, people tend to gain more happiness from spending their money on experiences as opposed to objects.

Beyond that, some hints towards – surprise, surprise – social interactions appear. Give, receive, take, visit, love, meet, tell, say, play, talk, share, participate, listen, speak, invite and call, amongst other words, may all potentially fall into that category.

There’s “spend”, a further analysis of which could differentiate between whether we’re typically talking about spending time, spending money, or both. A hint that the latter exists at least to some extent is given by “buy” being a relatively common verb. “Work” also features, along with some almost-by-definition happiness inducers such as “win” and “celebrate”.
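As an illustrative aside – not something done in the original post – the bigram counts we built earlier could give a rough answer to the spend-time-versus-spend-money question:

# which bigrams involving the "spend" lemma are most common?
spend_words <- filter(happy_terms, lemma == "spend") %>%
  mutate(word = tolower(word)) %>%
  distinct(word) %>%
  pull(word)

filter(bigram_counts, word1 %in% spend_words | word2 %in% spend_words) %>%
  head(10)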

In terms of specific activities, we see hints at some intellectual pursuits: read, book, learn, know, graduate – together with some potentially fitness related activities: walk, run and move. Per the famous saying, eating and drinking also makes some people merry.

So, in conclusion:

That’s the end of our first step here, essentially variations on a frequency analysis of the words used by respondents when they’re asked to recall what made them happy in the past 3 months. Did it produce any insights?

My main takeaway here is really the critical importance of the social. Happiness seekers should usually prioritise people, where there’s an option to do so.

Yes, shopping, cars and phones and a few other inanimate objects of potential desire popped up. We also saw a few recreational activities that may or may not be social. Here I’m thinking of walking, running, fishing and hiking. It’s perhaps also notable that those examples happen to be physical activities that often take place outside. There’s external research extolling the well-being benefits of both being outdoors and physical exercise.

Words associated with education, learning and employment also appeared with relatively high frequency, so a focus on optimising those aspects of life might also be worthwhile for improving satisfaction.

But references to friends, family and occasions that may involve other people dominated the discourse. Thus, when making life plans, it’s probably wise to consider the social – meaningful interaction with real live human beings – as a particularly high priority.

In our noun analysis above, the word “friend” stuck out. Whilst as life goes on, and often gets busier, it can sometimes be hard to find time to focus on these wider relationships, there’s external evidence that placing a high value on friendship is associated with improved well-being – potentially even more so than the equivalent variation in how highly people value family relationships.

These effects seem to associate with even the most dramatic measures of well-being, with several studies suggesting that people who maintain strong social relationships of any kind tend to be happier, healthier and even live longer.


Using R to run many hypothesis tests (or other functions) on subsets of your data in one go

It’s easy to run a basic hypothesis test in R, once you know how. For example, if you’ve a nice set of data that you know meets the relevant assumptions, then you can run a t test in the following sort of way. Here we’ll assume that you’re interested in comparing the differences in the number of apples eaten between people, based on which group they’re in, from an imaginary “foods eaten” experiment dataset.

t.test(apples_eaten ~ group, data = foods_eaten)

The above uses the default values for hypothesis type, expected difference, confidence level and so on, but these are all configurable via various parameters should you need to change them.

The default output looks something like this:

Welch Two Sample t-test

data: apples by group
t = 0.033192, df = 76.123, p-value = 0.9736
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.950128 3.050128
sample estimates:
mean in group a mean in group b 
9.05 9.00

Here we can see that the difference in means does not seem significant by pretty much any definition; the most obvious indicator being that the p-value is way above the traditional – if increasingly controversial – “significance” cutoff of 0.05.

This output is nice enough for interactive use if you’re analysing a single variable from an experiment or two. But, let’s say your dataset actually includes the results of several experiments, several outcome variables, or some other scenario where you want to conduct multiple hypothesis tests in one go. Do you really want to run the above command many times and manually collate the results into a nice table to let you see a quick overview of results? Well, of course not! R, being your servant rather than your master, should run them all for you and output the results into a nice clean table, right?

First up though, you should probably consider why you’re doing this. Running multiple unadjusted t-tests for different groups within the same data is often a very bad idea. Per usual, xkcd provides an awesome explanation as to why.

[xkcd comic: “Significant”]

If you’re looking at the differences between multiple subgroups in one experiment, you should often prefer, for example, an ANOVA test to several t-tests. This helps to avoid inflating the chances of a type 1 error, and falsely concluding you found an effect when in reality there isn’t one. I have also seen some instances of people running (unadjusted) multiple t-tests after an ANOVA has reported an effect, in order to try and determine between which of the ANOVA’s groups the difference is significant – but this holds the same sort of danger. In that case, there’s plenty of standard post-hoc tests it would be more legitimate to use instead – Tukey’s HSD test for instance is similar to a t test, but contains adjustments to reduce the error rate.
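For completeness, here’s a minimal sketch of that ANOVA-plus-post-hoc route, using the same imaginary foods_eaten data as above (so the variable names are just this example’s assumptions):

# one-way ANOVA: is there any difference in mean apples eaten between the groups?
anova_fit <- aov(apples_eaten ~ group, data = foods_eaten)
summary(anova_fit)

# if so, Tukey's HSD gives pairwise comparisons whilst controlling the family-wise error rate
TukeyHSD(anova_fit)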

Anyway, enough with the theoretical warnings. Perhaps you do have a valid use case. Or perhaps you realise that the below technique really allows you to run many different kinds of functions on a dataset that aren’t t tests at all. These may be other types of hypothesis tests, or functions entirely unrelated to such testing. If this is the case, you can think of this as a template.

Let’s think through the mechanics of what we want to do. Let’s take a contrived (and fake) scenario in which we conducted 2 experiments, each involving 2 sets of people, a control and a test group. We measured how many of various different types of fruits and vegetable they ate. We’re interested in whether each group within an individual experiment differed in terms of the mean average of each type of fruit eaten.

Here’s a few rows of the sample data this non-existent experiment generated.

experiment_number  group  person_number  apples  bananas  carrots  dandelions  eggplant
Experiment 1       a      1              1       3        3        3           1
Experiment 1       a      3              1       1        3        3           5
…
Experiment 1       b      103            2       5        3        5           2
Experiment 1       b      104            8       1        3        1           2
…
Experiment 2       c      7              19      5        8        0           3
Experiment 2       c      12             4       5        6        1           0
…
Experiment 2       d      111            16      3        6        0           5
Experiment 2       d      113            12      3        4        4           0

Our goal here, for the sake of argument, is to perform a t test for each experiment (identified by the “experiment number” column), comparing the 2 relevant groups of participants (the “group” column) in terms of how many apples they ate. Then, separately, on how many bananas they ate, and so on through the other edibles. Then we want to move on to experiment 2 and repeat the same exercise, until we have a hypothesis test result for each experiment on each of the foodstuffs. Yes, I know, if this was a real analysis, we’d need to revisit the above xkcd cartoon and think hard about whether this is the appropriate design for analysing the experiment.

But, that aside, this means we need to:

  1. Select only the group of data that has the first experiment_number, i.e. anything with experiment_number = Experiment 1 here.
  2. Select the first outcome variable of interest (apples)
  3. Run a t test on it, comparing participant groups against each other (here group a vs group b).
  4. Record the results of that t test (let’s say the t statistic, the p value, and the estimated difference in means) as a row in a results table
  5. Go back to step 2, but moving to the next outcome variable, here examining bananas instead of apples, until we’ve tested all the outcome variables.
  6. Go back to step 1, moving to the next experiment number, i.e. Experiment 2, and repeating the whole process again.

In this dataset, 2 experiments with 5 outcomes, that’s 10 t-tests. I don’t want to look at 10 of the default outputs for R’s t.test function, thank you. I’d prefer a result that looks something like this…

experiment_number  food_type   estimated mean of first group  estimated mean of second group  t statistic  p value
Experiment 1       apples      9.05                           9                               0.03         0.97
Experiment 1       bananas     2.5                            2.875                           -1.02        0.31
…
Experiment 2       dandelions  2.16                           2.84                            -1.76        0.08
Experiment 2       eggplant    3.2                            3.66                            -0.86        0.39

…where we’ve a table, or “dataframe” as R calls it, showing 1 row per t test with the most relevant statistics.

OK, let’s go, building it up step by step. Being R, of course there’s a multitudinous variety of ways we could do this. Being a tidyverse fan, I’m going with what could be described as a dplyr approach. We’re going to need to install and load the following libraries to use this method.

  • tidyr: to get the data into a “tidy” format.
  • dplyr: to manipulate the data into the groups you wish to hypothesis test on.
  • broom: to tidy up the results of the default R t.test output into a nice tabular format

So, load up these libraries, remembering to install the packages first if you’ve not already done so, as below.

install.packages("tidyr") # only needed if you didn't ever install tidyr
install.packages("dplyr") # only needed if you didn't ever install dplyr
install.packages("broom") # only needed if you didn't ever install broom

library(tidyr) # to get data into "tidy" format
library(dplyr) # to manipulate data
library(broom) # to tidy up the output of the default t test function.

Step 2 is to transform the data table of experiment results from a “wide” format to a “long” format.

Many tidyverse functions, and much of R in general, work better with long data. We know our lowest level of grouping – i.e. the granularity we want to conduct our tests at – is at a per experiment per outcome variable level. For example, we’ll need to group all of the “apples” data for “experiment 1” together so we can analyse it with an individual test.

If we’re going to group then the experiment number and the outcome variable type should probably be individual columns. The experiment_number already is. But the outcome variables aren’t – they each have their own column. We thus want to get from this situation: 1 row per person who participated in the experiment…

experiment_number  person_number  group  apples  bananas  carrots
Experiment 1       1              a      1       2        3
Experiment 2       7              c      19      24       6

…to this format – 1 row per outcome variable per person. Note how the experiment number, group, and person number are repeated down the rows, in order to keep track of whose apples outcome the row concerns.

experiment_number  group  person_number  food_type  count_eaten
Experiment 1       a      1              apples     1
Experiment 1       a      1              bananas    2
Experiment 1       a      1              carrots    3
…
Experiment 2       c      7              apples     19
Experiment 2       c      7              bananas    24
Experiment 2       c      7              carrots    6

We’re using a “food_type” column to record which outcome variable we’re talking about (apples vs bananas), and the “person_number” column to ensure we keep track of which entries belong to the same person.

This “unpivoting” is a perfect use-case for tidyr’s gather() function. Here we go:

# here "data" is the original wide dataframe of experiment results shown above
results <- gather(data, key = "food_type", value = "count_eaten", -experiment_number, -person_number, -group)

The "key" parameter of the command lets us set the name of the column that will store the name of the outcome variable we're looking at. The "value" parameter lets us set the name of the column that will store the actual result obtained by the participant for the specified outcome variable.

Where we see "-experiment_number, -person_number, -group" at the end of the function call, that's to tell it that – because experiment number, person number and group are not outcome variables – we don’t want to collapse their contents into the same column as the count of apples, bananas etc. We need these variables to be replicated down the list to know which experiment, person and group the respective outcome variable was obtained from.

Note the minuses in front of those variable names. This means “don’t unpivot these columns”. If you only had a few columns to unpivot, you could instead write the names of the columns you do want to unpivot there, without the minuses.
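For example, with this dataset the equivalent call naming the columns to unpivot explicitly would look something like this:

# same result, but listing the outcome columns to unpivot rather than excluding the others
results <- gather(data, key = "food_type", value = "count_eaten",
                  apples, bananas, carrots, dandelions, eggplant)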

Let’s check what “results” now looks like, by typing head(results).

[screenshot: head(results) showing the long-format data]

OK, that looks very much like what we were aiming for. Onto the next step!

Now we need to ensure our analysis is only carried out using the data from a single experiment for a single outcome. Our outcomes here are the food type counts. For example, we want to consider the data from Experiment 1’s apples outcome separately from both Experiment 1’s bananas outcome, and Experiment 2’s apples outcome. In other words, we want to group the data by which experiment it’s from and which food type it’s measuring.

Handily, dplyr has a group_by function. Pass in the data you’re working with, and the names of the columns you want to group it by.

results <- group_by(results, experiment_number, food_type)

Checking the head of the results again, it looks pretty similar to the previous step, only this time the output notes, quite faintly near the top, that we’ve grouped by the appropriate variables.

[screenshot: head(results) showing the grouping variables]

Now, it’s time to do the actual t tests!

First of all, remember that we don’t want a bunch of those verbose default t.test outputs to work with. We want a nice table summarising the most important parts of the test results. We thus need to make the key parts of t test output “tidy“, so it can be put into a table, aka dataframe.

How to tidy the results of a hypothesis test? Use broom. Hooray for R package names that attempt some kind of pun.

Let’s first take a moment to see how tidying works with a single t test, run in the conventional way.

Here’s the output of the default t.test function looking at the apples outcome of experiment 1. First I constructed a dataframe “exp1_apples_data”, which contains only data about apples, from experiment 1. Then I ran the standard t.test – being explicit about many of the options available, although if you like the defaults you don’t really need to be.


# construct the dataframe first: experiment 1 rows only, with just the apples outcome
# (one plausible way to do it - the original construction wasn't shown)
exp1_apples_data <- data %>%
  filter(experiment_number == "Experiment 1") %>%
  select(group, person_number, apples)

t_test_results <- t.test(apples ~ group,
                         data = exp1_apples_data,
                         alternative = "two.sided",
                         mu = 0,
                         paired = FALSE,
                         var.equal = FALSE,
                         conf.level = 0.95)

t_test_results

This outputs:

[screenshot: default t.test output for experiment 1’s apples]

Now, if we apply the tidy() function from the broom package to the results of the t test, what do we get?

tidy(t_test_results)

Outputs:

[screenshot: the one-row tibble produced by tidy(t_test_results)]

Yes, a nice tidy 1-row summary for the results of the t test. Comparing it to the default output above, you can see how:

  • field “estimate1” corresponds to the mean in group a
  • “estimate2” corresponds to the mean in group b (and the “estimate” field corresponds to the difference between the mean in group a and b)
  • “statistic” corresponds to the t statistic
  • “p.value” corresponds to the p-value
  • “parameter” corresponds to the degrees of freedom
  • “conf.low” and “conf.high” correspond to the confidence interval
  • “method” and “alternative” record the type & nature of the test options selected

So now we have a set of grouped data, and a way of producing t test results in a tabular format. Let’s combine!

For this, we can use the dplyr function “do“. This applies what the documentation refers to as an “arbitrary computation” (i.e. most any function) to a set of data, treating each of the groups we already set up separately.

The first parameter is the name of the (grouped) dataset to use. The second is the function you want to perform on it.

results <- do(results, tidy(t.test(.$count_eaten ~ .$group,
                 alternative = "two.sided",
                 mu = 0,
                 paired = FALSE,
                 var.equal = FALSE,
                 conf.level = 0.95
                 )))

 

Note that you have to express the t test formula as .$count_eaten ~ .$group rather than the more usual count_eaten ~ group. Inside the do() function, the . represents the current group of data, and the $ is the standard R way of accessing the relevant column of a dataframe.

What do we have now? Here's the head of the table.

[screenshot: head of the resulting table of tidied t test results]

Success! Here we see a row for each of the tests we did – 1 per experiment per food type – along with the key outputs from the t tests themselves.

This is now just a regular data frame, so you can sort, filter, manipulate or chart as you would with any other dataset. Figuring out which of your results have a p-value of 0.05 or lower, for instance, could be done via:

filter(results, p.value <= 0.05)

Below is a version where the commands are all pieced together, using the pipe operator. That’s likely easier to copy and paste should you or I have a need to do so. Do remember that this approach is applicable – and perhaps more often actually statistically valid – to the world outside of t tests.

The tidy() command itself is capable of tidying up many types of hypothesis tests and models so that their results appear in a useful table format. Beyond that, many other functions will also work, so long as they return something coercible into a row of a dataframe or an item in an R list.

data %>%
  gather(key = "food_type", value = "count_eaten",
         -experiment_number, -person_number, -group) %>%
  group_by(experiment_number, food_type) %>%
  do(tidy(t.test(.$count_eaten ~ .$group,
                 alternative = "two.sided",
                 mu = 0,
                 paired = FALSE,
                 var.equal = FALSE,
                 conf.level = 0.95
                 )))
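And because tidy() understands many kinds of test objects, the same pattern works with other tests swapped in – for example a Wilcoxon rank-sum test (purely illustrative, not something from the analysis above):

results_wilcox <- data %>%
  gather(key = "food_type", value = "count_eaten",
         -experiment_number, -person_number, -group) %>%
  group_by(experiment_number, food_type) %>%
  do(tidy(wilcox.test(.$count_eaten ~ .$group)))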

Extracting the date and time a UUID was created with Bigquery SQL (with a brief foray into the history of the Gregorian calendar)

I was recently working with records in a database that were identified by a Universally Unique Identifier, aka a UUID. These IDs are strings of characters that look something like “31ae75f0-cbe0-11e8-b568-0800200c9a66”.

I needed to know which records were generated during a particular time period, but sadly there was no date field to be found. Unambiguously a database design flaw given what I needed to do – but it did lead me to discover that at least “version 1 UUIDs” have a date and time of creation embedded within them.

So how do we get from 31ae75f0-cbe0-11e8-b568-0800200c9a66 to the date and time the UUID was generated? I’d say “simple”, but it isn’t exactly. Thanks to Wikipedia, and famkruithof.net for the behind-the-scenes info of how this works.

So, the key components of the UUID to note are parts of the first three groups of characters: the first group (31ae75f0), the second group (cbe0), and the third group minus its leading “1”, which is the UUID version digit (giving 1e8):

31ae75f0-cbe0-11e8-b568-0800200c9a66

Take those chunks aside, reverse their order and concatenate them, so as to get:

1e8cbe031ae75f0

There’s your “60-bit timestamp, being the number of 100-nanosecond intervals since midnight 15 October 1582” (thanks Wikipedia).

Rambling sidenote alert:

In case you’re wondering why 15 October 1582 is so special, then it’s because that was the date that the previous “Julian” calendar system was first replaced with the newer “Gregorian” calendar system, now the most widely used calendar throughout the world, under the diktat of Pope Gregory XIII.

Why? The Julian calendar had worked on the premise that the average year is 365.25 days long (the 0.25 days being compensated for by the existence of a leap day every 4 years).

However, that’s slightly different to the true length of the solar year, which is 365 days, 5 hours, 48 minutes, 45.25 seconds. Consequently, it was noticed that there was some “drift”, whereby the dates the calendar said the equinoxes should occur on slowly became out of sync with real-life observations of the equinox. Effectively, as the Britannica notes, this discrepancy causes the “calendar dates of the seasons to regress almost one day per century”. This is relevant to religions such as Catholicism in that it affects the calculation of, for instance, when to celebrate Easter.

The Gregorian calendar made a couple of changes, perhaps most notably introducing the rule that a century year (e.g. 1900) only counts as a leap year if its number is divisible by 400 with no remainder, instead of adhering to the Julian system where it only needs to be divisible by 4. On average, this shortens the length of the measured year by 0.0075 days, which keeps it in better sync with the reality of the solar year. It’s still not perfect, but leads to a much lower rate of drift of around 1 day per 3,030 years.
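For the curious, the full Gregorian leap-year rule is simple enough to express in R – a purely illustrative aside:

# Gregorian rule: divisible by 4, except century years, which must also be divisible by 400
is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

is_leap_year(c(1900, 2000, 2019, 2020))
# FALSE  TRUE FALSE  TRUE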

In order to account for the drift that had occurred by the time of this realisation, the Pope also advocated for fast forwarding the date by a few days to catch up with reality. So for Gregorian advocates, there was literally never a 14 October 1582. Overnight, the date skipped from October 4th 1582 through to October 15 1582, at least in countries where this system was accepted right away (and subsequently by the relevant ISO standards that most computing systems adhere to).

Not everywhere was keen to adopt this system right away – perhaps unsurprisingly, Catholic countries were more likely to take the leap quicker. But version 1 UUIDs presumably don’t care too much about religious politics.

End of rambling side note

Note that the base value of this timestamp, 15 October 1582, is a different date than the classic  January 1st, 1970-based timestamp you may know and love from Unix-type systems, which many databases, including Google BigQuery, work with. So it needs conversion.

Let’s start by getting it into decimal (using one of many web-based converters – I picked this one for no particular reason).

1e8cbe031ae75f0 hex = 137583952401430000 decimal

This is in units of 100 nanoseconds. I really don’t care about nanosecond granularity for my use case, so let’s divide it by 10,000,000 to get to seconds.

137583952401430000 100-nanosecond intervals = 13758395240.1 seconds

This is now the number of seconds that have elapsed between the start of October 15 1582 and the date / time my UUID was created.

To get it into a more conventional  0 = 1970-01-01 based timestamp format, we then need to subtract the number of seconds between October 15 1582 and January 1st 1970 from it (12,219,292,800, as it happens):

13758395240 - 12219292800 = 1539102440

So now you have the number of seconds since Jan 1st 1970 and the time your UUID was generated. Pop it into a Unix timestamp interpreter (this one, for example) to translate it into something human-understandable, and you’ll discover I generated that UUID on October 9th 2018. Hooray.

Sidenote: I generated my test UUID on this site, if you ever need a valid one in a hurry.
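If you’d rather sanity-check that arithmetic without the web converters, here’s a rough R sketch of the same conversion (not part of the original workflow; it relies on double precision, which is fine at second-level granularity):

uuid <- "31ae75f0-cbe0-11e8-b568-0800200c9a66"

# reassemble the timestamp chunks in reverse order: time_hi (minus the version digit), time_mid, time_low
hex_timestamp <- paste0("0x",
                        substr(uuid, 16, 18),
                        substr(uuid, 10, 13),
                        substr(uuid, 1, 8))

# the number of 100-nanosecond intervals since 1582-10-15
intervals <- as.numeric(hex_timestamp)

# convert to seconds, then rebase from 1582-10-15 to 1970-01-01 (the two are 12,219,292,800 seconds apart)
unix_seconds <- intervals / 1e7 - 12219292800

as.POSIXct(unix_seconds, origin = "1970-01-01", tz = "UTC")
# should print a timestamp on 2018-10-09 (UTC)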

Flicking between websites and calculators is clearly a painful and inefficient way to do this operation, especially if you have a whole column of UUIDs to decode. Luckily many databases, including Bigquery, have the necessary functions to do it en masse.

Below is some Bigquery Standard SQL code that works for me – with most thanks and credit due to Roland Bouman, whose MySQL code version was nice and easy to adapt to the Bigquery SQL dialect.

In the below, imagine your UUID is stored in the field “uuid”. Pop this into your query, and the “extracted_timestamp” field will now contain the timestamp showing when the UUID was generated.

TIMESTAMP_SECONDS(
  CAST(
    (
      CAST(
        CONCAT('0x',                 -- starting the value with 0x will ensure Bigquery realises it's a hex string
               SUBSTR(uuid, 16, 3),  -- extract the relevant parts of the UUID needed to recreate the timestamp...
               SUBSTR(uuid, 10, 4),  -- ...reassembled in reverse order
               SUBSTR(uuid, 1, 8)
        )
        AS INT64                     -- convert hex to dec
      ) / 10000000                   -- divide to get from 100-nanosecond periods to seconds
    ) - (141427 * 24 * 60 * 60)      -- rebase from start of Gregorian calendar (1582-10-15) to unixtime (1970-01-01)
    AS INT64
  )
) AS extracted_timestamp

 

Average age at menarche by country

A question came up recently about variations in the age at menarche – the first occurrence of menstruation for a female human – with regards to the environment. A comparison by country seemed like a reasonable first step in noting whether there were in fact any significant, potentially environmental, differences in this age.

A quick Google search seemed to suggest a nice reliable chart summarising this data by country was surprisingly (?) hard to find. However, delving more into the academic publishing world, I did find a relevant paper called “International variability of ages at menarche and menopause: patterns and main determinants” (Thomas et al, Hum Biol. 2001 Apr;73(2):271-90), which stated that the purpose of their study was:

to review published studies on the variability of age at menarche and age at menopause throughout the world, and to identify the main causes for age variation in the timing of these events.

Score! And sure enough, it does contain a lengthyish datatable showing the average age of menarche by country from a survey of prior literature that they did. However, it’s not super easy to immediately get a sense of the magnitude of any variation by looking at a datatable from a scanned PDF that was printed over 3 pages. Instead, I decided to extract the data from that table and visualise it in a couple of charts which are shown below, in case anyone else has a similar curiosity about the subject.

Firstly a simple bar chart. You may wish to click through to get a larger copy with more readable axes:

[bar chart: average age at menarche by country]

And now plotted geographically:

[map: average age at menarche by country]

 

There’s also an interactive dashboard version available if you click here, where you can filter, highlight, hover and otherwise interact with the data.

If you have a use for the underlying data yourself, I uploaded it in a machine-readable format to data.world, whereby you can connect directly to it from various analytics tools, use the querying or visualisation tools on the site itself, or download it for your own use.  Again, this data is all consolidated by the authors of the Human Biology study above, so full credit is due to them.

A couple of notes to be aware of:

Firstly, the study I took the data from was published in 2001 i.e. is not particularly recent. The menarche age averages shown were collected from other studies, which obviously took place before 2001 – some even decades before, and in countries that technically no longer exist! They therefore may not reflect current values.

It’s generally considered that menarche ages have changed over time to some extent – for example a study concerning self-reported age of menarche in the US, “Has age at menarche changed? Results from the National Health and Nutrition Examination Survey (NHANES) 1999-2004.” (McDowell et al, J Adolesc Health. 2007 Mar;40(3):227-31), concluded that “Mean age of menarche declined by .9 year overall in women born before 1920 compared to women born in 1980-84” (and also that ages and changes differed by ethnicity, showing that country averages may mask underlying structural differences). A few other such studies I skimmed, usually focusing on individual countries, also tended to show trends in the same direction. One might hypothesize that countries where the general population has been through even more radical living-condition shifts than residents of the US over the past few decades may have seen larger changes than those reported by McDowell et al.

Secondly, some of the averages reported are means, others are medians. These may not be directly comparable. The bar chart I’ve shown above differentiates between those based on the colour of the bar. In the interactive version there’s the ability to filter to only show results based on a single one of those aggregation types.

So now onto the interesting theoretical part – OK, menarche age may differ between countries, but why? As noted, that was part of the question driving the authors of the source study, so you should certainly read their whole study to get their take. In summary though, they created a sequence of linear models.

The first one shows a negative association between life expectancy and menarche age.  OK, but what factors drive life expectancy in the first place, that also correlate with menarche?

They produced 2 alternative models to investigate that. The first was a combination of illiteracy rate (with a positive correlation) and vegetable calorie consumption (negative). The second kept illiteracy in, but switched in the country’s average gross national product for vegetable calorie consumption.

Vegetable calorie consumption is perhaps somewhat intuitive – it’s been previously found that what their paper describes as “good nutritional conditions” tend to lower the age of menarche, with the body’s fat:mass ratio being potentially involved. There are many papers on that topic – not being remotely a subject matter expert here I’ll pick one at random – “Body weight and the initiation of puberty” (Baker, Clin Obstet Gynecol. 1985 Sep;28(3):573-9.), which concluded that malnutrition may “retard the onset of menarche”.

But the larger influence in the model was the country’s illiteracy rate. Does reading books induce menstruation? Well, hey, probably not directly, so it likely proxies for something that may more plausibly affect menarche. The theory the researchers present here is that societies with higher illiteracy also tend to have a higher incidence of child labour. This labour is often physical in nature, implying a higher rate of energy expenditure. Citing previous evidence that excessive exercise may affect the fat balance, again altering menarche, they form a new hypothesis that it’s in fact the energy balance within a person that may affect menarche.

Analysing your 23andme genetic data in R part 2: exploring the traits associated with your genome

In part one of this mini-series, you heroically obtained and imported your 23andme raw genome data into R. Fun as that was, let’s see if we can learn something interesting from it. After all, 23andme does automatically provide several genomic analysis reports, but – for many sensible reasons – it is certainly limited in what it can show when compared to the entire literature of this exciting field.

It would be tiresome for me to repeat all the caveats you can find in part one, and all over the more responsible parts of the internet. But do remember that in the grand scheme of things we likely know only somewhere between “nothing” and “a little” so far about the implications of most of the SNPs you imported into R in part one.

Whilst there are rare exceptions, for the most part it seems like many “interesting” real-world human traits are a product of many variations in different SNPs, each of which has a relatively tiny effect size. In these cases, if you happen to have a T somewhere rather than an A, it’s usually unlikely on its own to produce any huge “real world” outcome you’d notice. Or if it does, we don’t necessarily know it yet. Or perhaps it could, but only if you have certain other bases in certain other positions, many of which 23andme may not report on – even if we knew which ones were critical. That was a long and meandering sentence which could be summarised as “things are complicated”.

Note also that 23andme do provide a disclaimer on the raw data, as follows:

This data has undergone a general quality review however only a subset of markers have been individually validated for accuracy. As such, the data from 23andMe’s Browse Raw Data feature is suitable only for research, educational, and informational use and not for medical or other use.

So, all in all, what follows should be treated as a fun-only investigation. You should first seek the services of genetic medical professionals, including the all-important genetic counsellor, if you have any real concerns about what you might find.

But, for the data thrill-seekers, let’s start by finding some info as to which genotypes at which SNPs are thought to have some potential association with a known trait. This is part of what 23andme does for you in their standard reporting. However, they have legal obligations and true experts working on this. I, on the other hand, recently read "Genetics for Dummies", so tread carefully.

How do we find out about the associations between your SNP results and any subsequent phenotypes? The obvious place, especially if you’re interested in a particular trait, is the ever-growing published scientific literature on genetics. Being privileged enough to count amongst my good friends a scientist who has actually published papers involving this subject, I started there. Here, for example, is Dr Kaitlin Roke’s paper: "Evaluating Changes in Omega-3 Fatty Acid Intake after Receiving Personal FADS1 Genetic Information: A Randomized Nutrigenetic Intervention". Extra credit is also very much due to her for being kind enough to help me out with some background knowledge.

Part of that fascinating study involved explaining to members of the public what the implications of the differing alleles possible at a specific SNP, rs174537, were with regards to the levels of Omega 3 fatty acids typically found in a person’s body, and how well the relevant conversion process proceeds. This may have dietary implications in terms of how much the person concerned should focus on increasing the amount of omega-3 laden food they eat – although it would be remiss of me to fail to mention the good doctor’s general advice that mostly everyone needs to up their levels anyway!

Anyway, to quote from their wonderfully clear description of the implications of the studied SNP:

…the document provided a brief overview of the reported difference in omega-3 FA levels in relation to a common SNP in FADS1 (rs174537)…individuals who are homozygote GG allele carriers have been reported to have more EPA in their bodies and an increased ability to convert ALA into EPA and DHA while individuals with at least one copy of the minor allele (GT or TT) were shown to have less EPA in their bodies and a reduced ability to convert ALA into EPA and DHA

Awesome, so here we have a definitive explanation as to which SNP was examined, and what the implications of the various genotypes are (which I’ve bolded for your convenience).

In part 1 of this post, we already saw how to filter your R-based 23andme data to view your results for a specific SNP, right? If you already completed all those import steps, you can do the same, just switching in the rsid of interest:

library(dplyr)
filter(genome_data_test, rsid == "rs174537")

[screenshot of the output]

Run that, and if you are returned a result of either GT or TT then you’ll know you should be especially careful to ensure you are getting a good amount of omega 3 in your diet.

OK, this is super cool, but what if you don’t happen to know a friendly scientist, or don’t know what traits you’re particularly interested in – how might you evaluate the SNP implications at scale?

Whilst there’s no substitute for actual expertise, luckily there is an R library called "gwascat", which enables you to access the NHGRI-EBI Catalog of published genome-wide association studies via a data structure in R. It contains a whole lot of info, with descriptions of the fields you end up with mostly shown here. The critical point is that it contains a list of SNPs, associated traits and relevant genotypes, together with references to the publications that found the associations, should you want more details.

The first thing to do is install gwascat. gwascat is a Bioconductor package, rather than one of the CRAN packages more typical for business users. So if you’re not a Bioconductor user, there’s a slightly different installation routine, which you can see here.

But essentially:

# Bioconductor's installation method at the time of writing; more recent
# Bioconductor releases use BiocManager::install("gwascat") instead
source("https://bioconductor.org/biocLite.R")
biocLite("gwascat")
library(gwascat)

This took a while to install for me, but can just be left to get on with itself.

It sometimes feels like a new genomics study is released almost every few seconds, so the next step may be to get an up-to-date version of the catalog data – I think the one that installs by default is a few years out of date.

Imagine that we’d like our new GWAS catalogue to end up as a data frame called “updated_gwas_data”:

updated_gwas_data <- as.data.frame(makeCurrentGwascat())

This might take a few minutes to run, depending on how fast your download speed is. You can get some idea of the recency once it’s done by checking the latest date that any publication was added to the catalogue.

max(updated_gwas_data$DATE.ADDED.TO.CATALOG)

At the time of writing, this date is May 21st 2018. And what was that study?

library(dplyr)
filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(STUDY) %>% distinct()

“Key HLA-DRB1-DQB1 haplotypes and role of the BTNL2 gene for response to a hepatitis B vaccine.”

Where can I find that study, if I want to read what it actually said?

library(dplyr)
filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(LINK) %>% distinct()

www.ncbi.nlm.nih.gov/pubmed/29534301

OK, now we have up-to-date data, let’s figure out how to join it to the personal raw genome data we imported in part 1 (or, to be precise here, a mockup file in the same format, so as to avoid sharing anyone’s real genomic data).

The GWAS data lists each SNP (e.g. “rs9302874”) in a field called SNPS. Our imported 23andme data has the same info in the rsid field. Hence we can do a simple join, here using the dplyr library.

Assuming the 23andme data you imported is called “genome_data”, then:

library(dplyr)

output_data <- inner_join(genome_data, updated_gwas_data, by = c("rsid" = "SNPS"))

Note the consequences of the inner join here. 23andme analyses SNPs that don’t appear in this GWAS database, and the GWAS database may contain SNPs that 23andme doesn’t provide for you. In either case, these will be dropped from the joined result. There’ll just be rows for SNPs that 23andme does provide you, and that do have an entry in the GWAS database.

Also, the GWAS database may have several rows for a single SNP. It could be that several studies examined a single SNP, or that one study found many traits potentially associated with a SNP. This means your final "output_data" table above will have many rows for some SNPs.
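As a quick sanity check on that, you could count how many GWAS catalog rows each SNP picked up in the join – a minimal sketch, using the "output_data" dataframe created above:

# How many GWAS catalog rows did each of my SNPs match?
library(dplyr)

count(output_data, rsid, sort = TRUE) %>%
 head(10)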

OK, so at the time of writing there are nearly 70,000 studies in the GWAS database, and over 600,000 SNPs in the 23andme data export. How shall we narrow down this data-mass to find something potentially interesting?

There are many fields in the GWAS database you might care about – the definitions being listed here. For us amateur folks here, DISEASE.TRAIT and STRONGEST.SNP.RISK.ALLELE might be of most interest.

DISEASE.TRAIT gives you a generic-ish name for the trait whose potential association with a given SNP the study investigated (e.g. "Plasma omega-3 polyunsaturated fatty acid levels", or "Systolic blood pressure"). Note that the values are not actually all "diseases" in the common-sense meaning – unless you consider traits like being tall a type of illness anyway.

STRONGEST.SNP.RISK.ALLELE gives you the specific allele of the SNP that was “most strongly” associated with that trait in the study (or potentially a ? if unknown, but let’s ignore those for now). The format here is to show the name of the SNP first, then append a dash and the allele of interest afterwards e.g. “rs10787517-A” or “rs7977462-C”.

This can easily give the impression of greater specificity than the real world has – only 1 allele ever appears in this field, so if there are multiple associations then only the strongest will be listed. If there are associations in tandem with other alleles or other SNPs, then that information also cannot be fully represented here. Also, it’s not necessarily the case that alleles are additive; so without further research we shouldn’t assume that having 2 of the high risk bases gives increased risk over a single base.

That’s what the journal reference is for – another reason it’s critical you do the reading and seek the help of appropriate genetic professionals before any rejoicing or worrying about your results.

Taking the above example from Roke et al’s Omega 3 study, this GWAS database records the most relevant strongest SNP risk allele for the SNP they analysed as being “rs174537-T”. You’d want to read the study in order to know whether that meant that the TT genotype was the one to watch for, or whether GT had similar implications.

Back to an exploration of your genome – the two most obvious approaches that come to mind are either: 1) check whether your 23andme results suggest an association with a specific trait you’re interested in, or 2) check which traits your results may be associated with.

In either case, it’ll be useful to create a field that highlights whether your 23andme results indicate that you have the “strongest risk allele” for each study. This is one way to help narrow down towards the interesting traits you may have inherited.

The 23andme part of your dataframe contains your personal allele results in the genotype field. There you’ll see entries like "AC" or "TT". What we really want to do here is, for every combination of SNP and study, check to see if either of the letters in your genotype matches up with the letter part of the strongest risk allele.

One method would be to separate out your "genotype" data field into two individual allele fields (so "AC" becomes "A" and "C"). Next, clean up the strongest risk allele so you only have the individual allele (so "rs10787517-A" becomes "A"). Finally, check whether either or both of your personal alleles match the strongest risk allele. If they do, there might be something of interest here.

In code, that could look like this:

library(dplyr)
library(stringr)

# The risk allele is the final character of STRONGEST.SNP.RISK.ALLELE, e.g. "rs10787517-A" -> "A"
output_data$risk_allele_clean <- str_sub(output_data$STRONGEST.SNP.RISK.ALLELE, -1)

# Split your two-letter genotype, e.g. "AC", into its two individual alleles
output_data$my_allele_1 <- str_sub(output_data$genotype, 1, 1)
output_data$my_allele_2 <- str_sub(output_data$genotype, 2, 2)

# Count how many of your alleles (0, 1 or 2) match the risk allele
output_data$have_risk_allele_count <- if_else(output_data$my_allele_1 == output_data$risk_allele_clean, 1, 0) +
 if_else(output_data$my_allele_2 == output_data$risk_allele_clean, 1, 0)

Now you have your two individual alleles stored in my_allele_1 and my_allele_2, and the "highest risk" allele stored in risk_allele_clean (the letter part of the GWAS STRONGEST.SNP.RISK.ALLELE field). Finally, have_risk_allele_count is either 0, 1 or 2, depending on whether your 23andme genotype result at that SNP contains 0, 1 or 2 copies of the risk allele.
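A quick way to see how those counts are distributed across your whole joined dataset – just a sketch using the fields created above:

# How many SNP / study combinations have 0, 1 or 2 copies of the risk allele?
library(dplyr)

count(output_data, have_risk_allele_count)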

The previously mentioned DISEASE.TRAIT field contains a summary of the trait involved. So by filtering your dataset to only look for studies about a trait you care about, you can see a summary of the risk allele and whether or not you have it, and the relevant studies that elicited that connection.

I did notice that this trait field can be kind of messy to use. You’ll see several different entries for similar topics; e.g. some studied traits around Body Mass Index are indeed classified as "Body mass index", others as "BMI in non-smokers" or several other BMI-related phrases. So you might want to try a few different search strings to access everything on the topic you care about.
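One way to see which variations of a trait name exist, before settling on your search strings, is something like this – using BMI purely as an illustrative example:

# List the distinct trait names that mention BMI / body mass
library(dplyr)
library(stringr)

filter(output_data, str_detect(tolower(DISEASE.TRAIT), "body mass|bmi")) %>%
 distinct(DISEASE.TRAIT)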

For example, let’s assume that by now we also inevitably developed a strong interest in omegas and fatty acids. Which SNPs may relate to that topic, and do we personally have the risk allele for any of them?

We can use the str_detect function of the stringr library in order to search for any entries that contain the substring “omega” or “fatty acid”.

library(dplyr)
library(stringr)
select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_genotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct()

 

[screenshot of the output]

The full table this outputs is actually 149 rows long. That’s a fair few for an amateur to sift through. Maybe we’d prefer to restrict ourselves to the SNP we heard about from Dr Roke’s study above, which was of particular interest: rs174537. Easy: just filter on the rsid:

library(dplyr) 
library(stringr)
select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_genotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")

[screenshot of the output]

Maybe you are curious as to what other traits that same SNP might be associated with? Just reverse the criteria for the omega and fatty acid strings. Here I also added the study title, in case I wanted to read up more on these trait associations.

library(dplyr)
library(stringr)

select(output_data, rsid, DISEASE.TRAIT, STUDY) %>% 
 filter(!str_detect(tolower(DISEASE.TRAIT), "omega") & !str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")

[screenshot of the output]

Now onto the second approach – this time, you don’t have a specific trait in mind. You’re more interested in discovering which traits have risk alleles that match the respective part of your genome. Please see all the above disclaimers. This is not remotely the same as saying which dreadful diseases you are going to get. But please stay away from this section if you are likely to be worried about seeing information that could even vaguely correspond to health concerns.

We already have our have_risk_allele_count field. If it’s 1 or 2 then you have some sort of match. So, the full list of your matches and the associated studies could be retrieved in a manner like this:

library(dplyr)
filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )

Note that this list is likely to be long. Some risk alleles are very common in some populations, and remember that there may be many studies that relate to a single SNP, or many SNPs referred to by a single study. You might want to pop it in a nice DT Javascript DataTable to allow easy searching and sorting.

library(DT)
library(dplyr)

datatable(
 filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )
)

results in something like:

[screenshot of the datatable output]

…which includes a handy search field at the top left of the output.

There are various other fields in the GWAS dataset you might consider using to filter down further. For example, you might be most interested in findings from studies that match your ethnicity, or occasions where you have risk alleles that are rare within the population. After all, we all like to think we’re special snowflakes, so if 95% of the general population have the risk allele for a trait, then that may be less interesting to an amateur genome explorer than one where you are in the lucky or unlucky 1%.

For the former, you might try searching within the INITIAL.SAMPLE.SIZE or REPLICATION.SAMPLE.SIZE fields, which have entries like: "272 Han Chinese ancestry individuals" or "1,180 European ancestry individuals from ~475 families".

Similar to the caveats on searching the trait fields, one does need to be careful here if you’re looking for a comprehensive set of results. Some entries in the database have blanks in one of these fields, and others don’t specify ethnicities, having entries like “Up to 984 individuals”.

For the proportion of the studied population who had the risk allele, it’s the RISK.ALLELE.FREQUENCY field. Again, this can sometimes be blank or zero. But in theory, where it has a valid value then, depending on the study design, lower frequencies should correspond to rarer variants.

We can use dplyr’s arrange and filter functions to do the above sort of narrowing down. For example: what are the top 10 trait / study / SNP combinations you have the risk allele for that were explicitly studied within European folk, ordered by virtue of them having the lowest population frequencies reported in the study?

library(stringr)
library(dplyr)

head(
 filter(output_data,
        have_risk_allele_count > 0 &
         (str_detect(tolower(INITIAL.SAMPLE.SIZE), "european") | str_detect(tolower(REPLICATION.SAMPLE.SIZE), "european")) &
         (RISK.ALLELE.FREQUENCY > 0 & !is.na(RISK.ALLELE.FREQUENCY))) %>%
  arrange(RISK.ALLELE.FREQUENCY) %>%
  select(rsid, your_genotype = genotype, DISEASE.TRAIT, INITIAL.SAMPLE.SIZE, RISK.ALLELE.FREQUENCY),
 10)

 

[screenshot of the output]

 

Or perhaps you’d prefer to prioritise the order of the traits you have the risk allele for, for example, based on the number of entries in the GWAS database for that trait where the highest risk allele is one you have. You might argue that these could be some of the most reliably associated traits, in the sense that they would bias towards those that have so far been studied the most, at least within this database.

Let’s go graphical with this one, using the wonderful ggplot2 package.

library(dplyr)
library(ggplot2)

trait_entry_count <- group_by(output_data, DISEASE.TRAIT) %>%
 filter(have_risk_allele_count >= 1) %>%
 summarise(count_of_entries = n())

ggplot(filter(trait_entry_count, count_of_entries > 100), aes(x = reorder(DISEASE.TRAIT, count_of_entries, sum), y = count_of_entries)) +
 geom_col() +
 coord_flip() +
 theme_bw() +
 labs(title = "Which traits do you have the risk allele for\nthat have over 100 entries in the GWAS database?", y = "Count of entries", x = "Trait")

[chart output]

Now, as we’re counting combinations of studies and SNPs per trait here, this is obviously going to be somewhat self-fulfilling, as some traits have been featured in way more studies than others. Likewise some traits may have been associated with many more SNPs than others. Also, recalling that many interesting traits seem to be related to a complex mix of SNPs, each of which may only have a tiny effect size, it might be that whilst you do have 10 of the risk alleles for condition X, you also don’t have the other 20 risk alleles so far found to have an association (let alone the 100 that haven’t even been published on yet and hence aren’t in this data!).

Maybe then we can sort our output in a different way. How about we count the number of distinct SNPs where you have the risk allele, and then express those as a proportion of the count of all the distinct SNPs for the given trait in the database, whether or not you have the risk allele? This would let us say things such as: based (solely) on what is in this database, you have 60% of the known risk alleles associated with this trait.

One thing I noted in the data: both the 23andme genome data and the gwascat strongest risk allele fields can contain unusual values – things like ?, -, R, D, I and some numbers, arising where the "uncleaned" STRONGEST.SNP.RISK.ALLELE didn’t have a -A, -C, -G or -T at the end of the SNP it named. Some of these entries may be meaningful – for example the D and I in the 23andme data refer to deletions and insertions – but they won’t match up with anything in the gwascat data. Others may be more messy or missing data; for example, 23andme reports "--" if no genotype result was provided for a specific SNP call.
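If you want to see exactly which values turn up in those fields in your own data, a quick tally does the job – a sketch using the fields created earlier:

# Which values appear in the cleaned risk allele and genotype-derived fields?
library(dplyr)

count(output_data, risk_allele_clean, sort = TRUE)
count(output_data, my_allele_1, sort = TRUE)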

In order to avoid these inflating the proportion’s denominator, we’ll just filter down so that we only consider entries where our gwascat-derived "risk_allele_clean" and 23andme-derived "my_allele_1" and "my_allele_2" are all one of the standard A, C, G or T bases.

Let’s also colour code the results by the rarity of the SNP variant within the studied population. That might provide some insight as to exactly what sort of special exception we are as an individual – although some of the GWAS data is missing that field, and basic averaging won’t necessarily give the correct weighting, so this part is extra…"directional".

You are no doubt getting bored with the sheer repetition of caveats here – but it is so important. Whilst these are refinements of sorts, they are simplistic and flawed, and you should not even consider concluding something significant about your personal health without seeking professional advice. This is fun only. Well, for those of us who could ever possibly classify data as fun anyway.

Here we go, building it up one piece at a time for clarity of some kind:

library(dplyr)
library(ggplot2)

# Summarise proportion of SNPs for a given trait where you have a risk allele

trait_snp_proportion <- filter(output_data, risk_allele_clean %in% c("C", "A", "G", "T") & my_allele_1 %in% c("C", "A", "G", "T") & my_allele_2 %in% c("C", "A", "G", "T")) %>%
 mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) %>%
 group_by(DISEASE.TRAIT, you_have_risk_allele) %>%
 summarise(count_of_snps = n_distinct(rsid)) %>%
 mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) %>%
 filter(you_have_risk_allele == 1) %>%
 arrange(desc(proportion_of_snps_for_trait)) %>%
 ungroup()

# Count the studies per trait in the database

trait_study_count <- filter(output_data, risk_allele_clean %in% c("C", "A", "G", "T") & my_allele_1 %in% c("C", "A", "G", "T") & my_allele_2 %in% c("C", "A", "G", "T")) %>%
 group_by(DISEASE.TRAIT) %>%
 # note: mean() returns NA for any trait where a study lacks RISK.ALLELE.FREQUENCY;
 # add na.rm = TRUE if you would rather average over the non-missing values
 summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))

# Merge the above together

trait_snp_proportion <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")

# Plot the traits where there was more than one study and you have risk alleles for more than 70% of the SNPs studied

ggplot(filter(trait_snp_proportion, count_of_studies > 1 & proportion_of_snps_for_trait > 70), aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
 geom_col() +
 coord_flip() +
 theme_bw() + 
 labs(title = "Traits where I have more than 70% of the risk\nalleles studied, with more than one study involved", 
 y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency") +
 theme(legend.position="bottom")

 

[chart output]

Again, beware that the same sort of trait can be expressed in different ways within the data, in which case these entries are not combined. If you wanted to be more comprehensive regarding a specific trait, you might feel inclined to produce your own categorisation first and group by that – e.g. lumping anything "BMI" or "Body mass index" into your own BMI category by creating a new field, as sketched below.
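A minimal sketch of that idea – the groupings here are purely illustrative, not a proper categorisation:

# Roll similar trait names up into broader categories before summarising
library(dplyr)
library(stringr)

output_data <- mutate(output_data, trait_category = case_when(
 str_detect(tolower(DISEASE.TRAIT), "body mass|bmi") ~ "BMI",
 str_detect(tolower(DISEASE.TRAIT), "omega|fatty acid") ~ "Fatty acids",
 TRUE ~ as.character(DISEASE.TRAIT)
))

# ...then use group_by(trait_category) in place of group_by(DISEASE.TRAIT) in the code above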

Happy exploring!

Analysing your 23andme genetic data in R part 1: importing your genome into R

23andme is one of the ever-increasing number of direct-to-consumer DNA testing companies. You send in a vial of your spit, and they analyse parts of your genome, returning you a bunch of reports on ancestry, traits and – if you wish – health.

Their business is highly regulated, as of course it should be (and some would say it oversteps the mark a little even with that), so they are, quite rightly, legally limited as to what info they can provide back to the consumer. However, the exciting news for us data geeks is that they do allow you to download the raw data behind their analysis. This means you can dig deeper into parts of your genome that their interpretations don’t cover.

It should be said that there is considerable risk involved here, unless – or perhaps even if – you happen to be a genetics expert. The general advice for amateurs is to seek a professional genetic counsellor before concluding anything from your DTC test – although in reality that might be easier said than done.

Whilst I might know a bit about how to play with data, I am not at all a genetics expert, so anything below must be taken with a large amount of skepticism. In fact, if you are in the perfectly legitimate camp of "best not to know" people when it comes to DNA analysis, or you feel there is any risk you won’t be able to treat the innards of your genome as solely a fun piece of analysis and avoid areas you don’t want to explore, it would be wise not to proceed.

Also, even as an amateur, I’m aware that the science behind a lot of the interpretation of one’s genome is in a nascent period, at best. There are many people or companies that may rather over-hype what is actually known here, perhaps even to the extent of fraud in some cases.

But if you are interested to browse your results, here is my first experience of playing with the 23andme raw data in R.

Firstly, you need to actually obtain your raw 23andme data. An obvious precondition to this is that you have purchased one of their analysis products, set up your 23andme account, and waited until they have delivered the results to you. Assuming that’s all done, you can visit this 23andme page and press the "Download" button near the top of the screen. You’ll need to wait a while, but eventually they’ll notify you that your file is ready, and let you download a text file of results to your computer. Here, I called my example file "genome.txt".

Once you have that, it’s time to load it into R!

The text file is in a tab-delimited format, and also contains 19 rows at the top describing the contents of the file in human-readable format. You’ll want to skip those rows before importing it into R. I used the readr package to do this, although it’s just as easy in base R.

A few notes:

  • It imported more successfully if I explicitly told R the data type of each column.
  • One of the column headers (i.e. the field names) starts with a # and includes spaces, which is a nuisance to deal with in R, so I renamed that right away.
  • I decided to import it into a dataframe – called "genome_data_test" here, to match the example data used below.
library(readr)
library(dplyr) # needed for the rename() below

genome_data_test <- read_tsv(".\\data_files\\genome.txt", skip = 19, col_types = cols(
 `# rsid` = col_character(),
 chromosome = col_character(),
 position = col_integer(),
 genotype = col_character())
 )

genome_data_test <- rename(genome_data_test, rsid = `# rsid`)

Let’s see what we have!

head(genome_data_test)

[screenshot of the output]

Sidenote: the genome data I am using is a mocked-up example in the 23andme format, rather than anyone’s real genome – so don’t be surprised if you see “impossible” results shown here. Call me paranoid, but I am not sure it’s necessarily a great idea to publicly share someone’s real results online, at least without giving it careful consideration.

OK, so we have a list of your SNP call data. The rsid column is the "Reference SNP cluster ID" used to refer to a specific SNP, the chromosome and position tell you whereabouts that SNP is located, and the genotype tells you which combination of the Adenine, Thymine, Cytosine and Guanine bases you happen to have in those positions.

(Again, I am not at all an expert here, so apologies for any incorrect terminology! Please feel free to let me know what I should have written :) )
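As a quick first poke around the imported data – nothing to do with the official 23andme reports, just a sketch using the dataframe created above – you could count how many SNP calls you have per chromosome:

# How many SNP calls does the imported file contain per chromosome?
library(dplyr)

count(genome_data_test, chromosome, sort = TRUE)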

Now, let’s check that the import went well.

Many of the built-in 23andme website reports do actually list which SNPs they refer to. For instance, if you click on "Scientific Details" on the life-changing trait report which tells you how likely it is that your urine will smell odd to you after eating asparagus, and look for the "marker tested" section, it tells you that it’s looking at the rs4481887 SNP.

[screenshot of the 23andme "Scientific Details" section]

And it also tells you what bases were found there in your test results. Compare that to the data for the same person’s genome imported in R, by filtering your imported data like this:

library(dplyr)
filter(genome_data_test, rsid == "rs4481887")

If the results of that match the results shown in the scientific details of your asparagus urine smell report, yay, things are going OK so far.

OK, so now your 23andme data is safely in R. But why did we do this, and what might it mean? Come back soon for part 2.

R packages for summarising data – part 2

In a recent post, I searched a tiny percentage of the CRAN packages in order to check out the options for R functions that quickly and comprehensively summarise data, in a way conducive to tasks such as data validation and exploratory analytics.

Since then, several generous people have been kind enough to contact me with suggestions of other packages aimed at the same sort of task that I might enjoy. Always being grateful for new ideas, I went ahead and tested them out. Below I’ll summarise what I found.

I also decided to add another type of variable to the small test set from last time – a date. I now have a field “date_1” which is a Date type, with no missing data, and a “date_2” which does have some missing data in it.

My “personal preference” requirements when evaluating these tools for my specific use-cases haven’t changed from last time, other than that I’d love the tool to be able to handle date fields sensibly.

For dates, I’d ideally like to see the summarisation output acknowledge that the fields are indeed dates, show the range of them (i.e. max and min), if any are missing and how many distinct dates are involved. Many a time I have found that I have fewer dates in my data than I thought, or I had a time series that was missing a certain period. Both situations would be critical to know about for a reliable analysis. It would be great if it could somehow indicate if there was a missing block of dates – e.g. if you think you have all of 2017’s data, but actually June is missing, that would be important to know before starting your analysis.
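In the meantime, one rough manual check for that last point is sketched below, assuming a dataframe called "data" with a Date field called "date_1", as in the test set described above:

# List any year-months between the earliest and latest date that have no records at all
first_month <- as.Date(format(min(data$date_1, na.rm = TRUE), "%Y-%m-01"))
last_month <- as.Date(format(max(data$date_1, na.rm = TRUE), "%Y-%m-01"))

all_months <- format(seq(first_month, last_month, by = "month"), "%Y-%m")
observed_months <- unique(format(data$date_1, "%Y-%m"))

setdiff(all_months, observed_months)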

Before we examine the new packages, I wanted to note that one of my favourites from last time, skimr, received a wonderful update, such that it now displays sparkline histograms when describing certain types of data, even on a Windows machine. Update to version 1.0.1 and you too will get to see handy mini-charts as opposed to strange unicode content when summarising. This only makes an already great tool even more great, kudos to all involved.

Let’s check how last time’s favourite for most of my uses cases, skimr, treats dates – and see the lovely fixed histograms!

library(skimr)

skim(data)

[screenshot of the skim() output]

This is good. It identifies and separates out the date fields. It shows the min, max and median date to give an idea of the distribution, along with the number of unique dates. This will provide invaluable clues as to how complete your time series is. As with the other data types, it clearly shows how much missing data there is. Very good. Although I might love it even more if it displayed a histogram for dates that showed the volume of records (y) per time period (x; granularity depending on range), giving a quick visual answer to the question of missing time periods.
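For what it’s worth, that kind of per-period view is easy enough to build yourself – again just a sketch, using the "data" dataframe and its "date_1" field:

# Count of records per month for the date_1 field, shown as a simple bar chart
library(dplyr)
library(ggplot2)

mutate(data, month = format(date_1, "%Y-%m")) %>%
 count(month) %>%
 ggplot(aes(x = month, y = n)) +
 geom_col() +
 theme_bw() +
 labs(x = "Month", y = "Count of records")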

The skim-by-groups output for dates is equally sensible:

library(dplyr)
library(skimr)
group_by(data, category) %>% skim()

[screenshot of the grouped skim() output]

Now then, onto some new packages! I collated the many kind suggestions, and had a look at:

  • CreateTableOne, from the tableone package
  • desctable from desctable
  • ggpairs from GGally
  • ds_summary_stats from descriptr
  • I was also going to look at the CompareGroups package, but unfortunately it had apparently been removed from CRAN recently because "check problems were not corrected in time". Maybe I’ll try again in future.

CreateTableOne, from the tableone package

Documentation

library(tableone)
 CreateTableOne(data = data)

[screenshot of the CreateTableOne output]

Here we can easily see the count of observations. It has automatically differentiated between the categorical and numeric variables, and presented likely appropriate summaries on that basis. There are fewer summary stats for the numeric variables than I would have liked to see when I am evaluating data quality and similar tasks.

There’s also no highlighting of missing data. You could infer that "type" must have some missing data, as it has fewer observations than exist in the overall dataset, but that’s subtle and wouldn’t be obvious for a numeric field. However, the counts and percentage breakdown on the categorical variables is very useful, making it a potential dedicated frequency table tool replacement.

The warning message shows that date fields aren’t supported or shown.

A lot of these “limitations” likely come down to the actual purpose this tool is aimed at. The package is called tableone because it’s aimed at producing a typical “table 1” of a biomedical journal paper, which is often a quantitative breakdown of the population being studied, with exactly these measures. That’s a different and more well-specified task than a couple of use-cases I commonly use summary tools for, such as getting a handle on a new dataset, or trying to get some idea of data quality.

This made me realise that perhaps I am over-simplifying to imagine a single “summary tool” would be best for every one of my use-cases. Whilst I don’t think CreateTableOne’s default output is the most comprehensive for judging data quality, at the same time no journal is going to publish the output of skimr directly.

There are a few more tableone tricks though! You can summary() the object it creates, whereupon you get an output that is less journaly, but contains more information.

summary(CreateTableOne(data = data))

[screenshot of the output]

This view produces a very clear view of how much data is missing in both absolute and percentage terms, and most of the summary stats (and more!) I wanted for the numeric variables.

tableone can also do nice summarisation by group, using the “strata” parameter. For example:

CreateTableOne(strata = "category", data = data)

[screenshot of the output]

You might note above that not only do you get the summary segmented by group, but you also automatically get p values from a statistical hypothesis test as to whether the groups differ from each other. By default these come from either chi-squared tests for categorical variables or ANOVA for continuous variables. You can add parameters to use non-parametric or exact tests though, if you feel they are more appropriate. This also changes the summary stats that are shown for numeric variables – which also happens if you apply the nonnormal parameter even without stratification.

For example, if we think “score” should not be treated as normal:

print(CreateTableOne(strata = "category", data = data), nonnormal = "score")

[screenshot of the output]

Note how “score” is now summarised by its median & IQR as opposed to “rating”, which is summarised by mean and standard deviation. A “nonnorm” test has also been carried out when checking group differences in score, in this case a Kruskal Wallis test.

The default output does not work with kable, nor is the resulting summarisation in a “tidy” format, so no easy piping the results in or out for further analysis.
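That said, if you do want the output in a report, I believe the print method for a TableOne object can return its output as a character matrix rather than printing it (via the printToggle argument described in the package documentation), which kable will then accept – a sketch:

# Capture the printed table as a character matrix, then pass it to kable
library(tableone)
library(knitr)

tab <- CreateTableOne(data = data)
tab_matrix <- print(tab, printToggle = FALSE, noSpaces = TRUE)
kable(tab_matrix)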

desctable, from the desctable package

Documentation

library(desctable)
desctable(data)

[screenshot of the desctable output]

The vignette for desctable describes how it is intended to fulfil a similar function to the aforementioned tableone, but is built to be customisable, manipulable and fit in well with common tidy data tools such as dplyr.

By default, you can see it shows summary data in a tabular format, showing appropriate summaries based on the type of the variables (numeric vs categorical – although it does not show anything other than counts of non-missing data for dates). The basic summaries are shown, although, like tableone, I would prefer a few more by default if I was using it as an exploratory tool. Like tableone though, exploratory analysis of messy data is likely not really its primary aim.

The tabular format is quite easy to manipulate downstream, in a “tidy” way. It actually produces a list of dataframes, one containing the variable names and a second containing the summary stats. But you can as.data.frame() it to pop them into a standard usable dataframe.
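For example, something like this – a minimal sketch – gets you a single standard dataframe you could then filter or pipe onwards:

# Convert the desctable output into one ordinary dataframe for downstream use
library(desctable)

desc <- desctable(data)
desc_df <- as.data.frame(desc)
head(desc_df)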

I did find it a little difficult to get a quick overview at a glance in the console. Everything was displayed perfectly well – but it took me a moment to make my conclusions. Perhaps part of this is because the variables of different types are all displayed together, making me take a second to realise that some fields are essentially hierarchical – row 1 shows 60 "type" fields, and rows 2-6 show the breakdown by value of "type".

There’s no special highlighting of missing data, although with a bit of mental arithmetic one could work out that if there are N = 64 records in the date_1 field and only 57 entries in the date_2 field, we must be missing at least 7 date_2 entries. But there’s no overall summary of record count by default, so you would not necessarily know the full count if every field had some missing data.

It works nicely with kable and has features allowing output to markdown and html.

The way it selects which summaries to produce for the numerical fields is clever. It automatically runs a Shapiro-Wilk test on the field contents to determine whether it is normally distributed. If yes, then you get mean and standard deviation. If no, then median and inter-quartile range. I like the idea of tools that nudge you towards using distribution-appropriate summaries and tests, without you having to take the time (and remember) to check distributions yourself, in a transparent fashion.

Group comparison is possible within the same command, by using the dplyr group_by function, and everything is pipeable – which is very convenient for those of us already immersed in the tidyverse.

group_by(data, category) %>%
 desctable()

[screenshot of the output]

The output is “wide” with lengthy field names, which makes it slightly hard to follow if you have limited horizontal space. It might also be easier to compare certain summary stats, e.g. the mean of each group, if they were presented next to each other when the columns are wide.

Note also the nice way that it has picked an appropriate statistical test to automatically test for significant differences between groups. Here it picked the Kruskal-Wallis test for the numeric fields, as the data was detected as being non-normally distributed.

Although I am not going to dive deep into it here, this function is super customisable. I mentioned above that I might like to see more summary stats shown.  If I wanted to always see the mean, standard deviation and range of numeric fields I could do:

desctable(data,stats = list("N" = length, "Mean" = mean, "SD" = sd, "Min" = min, "Max" = max))

[screenshot of the output]

You can also include conditional formulae (“show stat x only if this statement is true”), or customise which statistical tests are applied when comparing groups.

If you spent a bit of time working this through, you could probably construct a very useful output for almost any reasonable requirements. All this and more is documented in the vignette, which also shows how it can be used to generate interactive tables via the datatable function of DT.

 

ggpairs, from the GGally package

Documentation

library(GGally)
ggpairs(data)

[ggpairs plot output]

ggpairs is a very different type of tool than the other summarising tools tried here so far. It’s not designed to tell you the count, mean, standard deviation or other one-number summary stats of each field, or any information on missing data. Rather, it’s like a very fancy pairs plot, showing the interactions of each of your variables with each of the others.

It is sensitive to the types of data available – continuous vs discrete, so makes appropriate choices as to the visualisations to use. These can be customised if you prefer to make your own choices.

The diagonal of the pairs plot – which would compare each variable to itself – does provide univariate information depending on the type of the variable. For example, a categorical variable would, by default, show a histogram of its distribution. This can be a great way to spot potential issues such as outliers or unusual distributions. It can also reveal stats that could otherwise be represented numerically, for instance the range or variance, in a way that is often more intuitive to humans.

It looks to handle dates as though they were numbers, rather than as a separate date type.

There isn’t exactly a built-in group comparison feature, although the nature of a pairs plot may mean relevant comparisons are done by default. You can also pass ggplot2 aesthetics through, meaning you can essentially compare groups by assigning them different colours for example.

Here for example is the output where I have specified that the colours in the charts should reflect the category variables in my data.

ggpairs(data, mapping = aes(colour = category))

[ggpairs plot output]

As the output is entirely graphic, it doesn’t need to work with kable and obviously doesn’t produce a tidy data output.

I wouldn’t categorise this as having the same aim as the other tools mentioned here or in the past post – although I find it excellent for its intended use-case.

After first investigating data with the more conventional summarisation tools to the point I have a mental overview of its contents and cleanliness, I might often run it through ggpairs as the next step. Visualising data is often a great, sometimes overlooked, method to increase your understanding of a dataset. This tool will additionally allow you to get an overview of the relationship between any two of your variables in a fast and efficient way.

ds_summary_stats from descriptr

Documentation

library(descriptr)
ds_summary_stats(data$score)

[screenshot of the ds_summary_stats output]

descriptr is a package that contains several tools to summarise data. The most obvious of these aimed at my use cases is “ds_summary_stats”, shown above. This does only work on numeric variables, but the summary it produces is extremely comprehensive. Certainly every summary statistic I normally look at for numeric data is shown somewhere in the above!

The output it produces is very clear in terms of human readability, although it doesn’t work directly with kable, nor does it produce tidy data if you wished to use the results downstream.

One concern I had regards its reporting of missing data. The above output suggests that my data$score field has 58 entries, and 0 missing data. In reality, the dataframe contains 64 entries and has 6 records with a missing (NA) score. Interestingly enough, the same descriptr package does contain another function called ds_screener, which does correctly note the missing data in this field.

ds_screener(data)

[screenshot of the ds_screener output]

This function is in itself a nicely presented way of getting a high-level overview of how your data is structured, perhaps as a precursor to the more in-depth summary statistics that concern the values of the data.

Back on the topic of producing tidy data: another command in the package, ds_multi_stats, does produce tidy, kableable and comprehensive summaries that can include more than one variable – again restricted to numeric variables only.

I also had problems with this command when fields contained missing data – it gives the classic "missing values and NaN's not allowed if 'na.rm' is FALSE" error. I have to admit I did not dig deep to resolve this, only confirming that simply adding "na.rm = TRUE" did not fix the problem. But here’s how the output would look, if you didn’t have any missing data.

ds_multi_stats(filter(data, !is.na(score)), score, rating)

[screenshot of the output]

There is also a command that has specific functionality to do group comparisons between numeric variables.

ds_group_summary(data$category, data$rating)

[screenshot of the output]

This one also requires that you have no missing data in the numeric field you are summarising. The output isn’t for kable or tidy tools, but its presentation is one of the best I’ve seen for making it easy to visually compare any particular summary stat between groups at a glance.

Whilst ds_summary_stats only works with numeric data, the package does contain other tools that work with categorical data. One such function builds frequency tables.

ds_freq_table(data$category)

[screenshot of the output]

This would actually have been a great contender for my previous post regarding R packages aimed at producing frequency tables to my taste. It does miss information that I would have liked to have seen regarding the presence of missing data. The percentages it displays are calculated out of the total non-missing data, so you would be wise to first ensure that your dataset is indeed complete, perhaps with the afore-mentioned ds_screener command.

Some of these commands also define plot methods, allowing you to produce graphical summaries with ease.

Books I read in 2017

Long term readers (hi!) may recall my failure to achieve the target I had of reading 50 books in 2016. I had joined the 2016 Goodreads reading challenge, logged my reading activity, and hence had access to the data needed to track my progress at the end of the year. It turns out that 41 books is less than 50.

Being a glutton for punishment, I signed up again in 2017, with the same cognitively terrifying 50 book target – basically one a week, although I cannot allow myself to think that way. It is now 2018, so time to review how I did.

Goodreads allows you to log which books you are reading and when you finished them. The finish date is what counts for the challenge. Nefarious readers may spot a few potential exploits here, especially if competing for only 1 year. However, I tried to play the game in good faith (but did I actually do so?  Perhaps the data will reveal!).

As you go through the year, Goodreads will update you on how you are doing with your challenge. Or, for us nerd types, you can download a much more detailed and useful CSV. There’s also the Goodreads API to explore, if that floats your boat.

Similarly to last year, I went with the CSV. I did have to hand-edit the CSV a little, both to fill in a little missing data that appears to be absent from the Goodreads dataset, and also to add a couple of extra data fields that I wanted to track that Goodreads doesn’t natively support. I then popped the CSV into a Tableau dashboard, which you can explore interactively by clicking here.

Results time!

How much did I read?

Joyful times! In 2017 I got to, and even exceeded, my target! 55 books read.

In comparison to my 2016 results, I got ahead right from the start of the year, and widened the gap notably in Q2. You can see a similar boost to that witnessed in 2016 around the time of the summer holidays, weeks 33-35ish. Not working is clearly good for one’s reading obligations.

What were the characteristics of the books I read?

Although page count is a pretty vague and manipulable measure – different books have different physical sizes, font sizes, spacing, editions – it is one of the few measures where data is easily available, so we’ll go with that. In the case of eBooks or audio books (more on this later) without set "pages", I used the page count of the respective paper version. I fully acknowledge the rigour of this analysis as falling under "fun" rather than "science".

So the first revelation is that this year’s average pages per book read was 300, a roughly 10% decrease from last year’s average book. Hmm. Obviously, if everything else remains the same, the target of 50 books is easier to meet if you read shorter books! Size doesn’t always reflect complexity, or any other influence on time to complete, of course.

I hadn’t deliberately picked short books – in fact, being aware of this incentive I had tried to be conscious of avoiding doing this, and concentrate on reading what I wanted to read, not just what boosts the stats. However, even outside of this challenge, I (most likely?) only have a certain number of years to live, and hence do feel a natural bias towards selecting shorter books if everything else about them was to be perfectly equal. Why plough through 500 pages if you can get the same level of insight about a topic in 150?

The reassuring news is that, despite the shorter average length of book, I did read 20% more pages in total. This suggests I probably have upped the abstract “quantity” of reading, rather than just inflated the book count by picking short books. There was also a little less variation in page count between books this year than last by some measures.

In the distribution charts, you can see a spike of books at around 150 pages long this year which didn’t show up last year. I didn’t note a common theme in these books, but a relatively high proportion of them were audio books.

Although I am an avid podcast listener, I am not a huge fan of audio books as a rule. I love the idea as a method to acquire knowledge whilst doing endless chores or other semi-mindless activities. I would encourage anyone else with an interest in entering book contents into their brain to give them a whirl. But, for me, in practice I struggle to focus on them in any multi-tasking scenario, so end up hitting rewind a whole lot. And if I am in a situation where I can dedicate full concentration to informational intake, I’d rather use my eyes than my ears. For one, it’s so much faster, which is an important consideration when one has a book target! With all that, the fact that audio books are over-represented in the lower page-counts for me is perhaps not surprising. I know my limits.

I have heard tell that some people may consider audio books as invalid for the book challenge. In defence, I offer up that Goodreads doesn’t seem to feel this way in their blog post on the 2018 challenge. Besides, this isn’t the Olympics – at least no-one has sent me a gold medal yet – so everyone can make their own personal choice. For me, if it’s a method to get a book’s contents into my brain, I’ll happily take it. I just know I have to be very discriminating with regards to selecting audio books I can be sure I will be able to focus on. Even I would personally regard it cheating to log a book that happened to be audio-streaming in the background when I was asleep. If you don’t know what the book was about, you can’t count it.

So, what did I read about?

[chart: what did I read]

Book topics are not always easy to categorise. The categories I used here are mainly the same as last year, based entirely on my 2-second opinion rather than any comprehensive Dewey Decimal-like system. This means some sort of subjectivity was necessary. Is a book on political philosophy regarded as politics or philosophy? Rather than spend too much time fretting about classification, I just made a call one way or the other. Refer to above comment re fun vs science.

The main changes I noted were indeed a move away from pure philosophical entries towards those of a political tone. Likewise, a new category entrant was seen this year in “health”. I developed an interest in improving one’s mental well-being via mindfulness and meditation type subjects, which led me to read a couple of books on this, as well as sleep, which I have classified as health.

Despite me continuing to subjectively feel that I read the large majority of books in eBook form, I actually moved even further away from that being true this year. Slightly under half were in that form. That decrease has largely been taken up by the afore-mentioned audio books, of which I apparently read (listened?) 10 this year. Similarly to last year, 2 of the audio entries were actually “Great Courses“, which are more like a sequence of university-style lectures, with an accompanying book containing notes and summaries.

My books have also been slightly less popular with the general Goodreads-rating audience this year, although not dramatically so.

Now, back to the subject of reading shorter books in order to make it easier to hit my target: the sheer sense of relief I felt when I finished book #50, and hence could go wild with relaxed, long and slow reading, made me concerned as to whether I had managed to beat that bias or not. I wondered whether, as I got nearer to my target, the length of the books I selected might have risen, even though this was not my intention.

Below, the top chart shows the average page count of books completed on a monthly basis, year on year.

[chart: average book length over time, year on year]

 

The 2016 data risks producing somewhat invalid conclusions, especially if interpreted without reference to the bottom "count of books" chart, mainly because of September 2016, a month in which I read a single book that happened to be over 1,000 pages long.

I also hadn’t actually decided to participate in the book challenge at the start of 2016. I was logging my books, but just for fun (imagine that!). I don’t remember quite when it was suggested I should explicitly join the challenge, but before then it’s less likely I felt pressure to read faster or shorter.

Let’s look then only at 2017:

[chart: average book length over time, 2017 only]

Sidenote: What happened in July?! I only read one book, and it wasn’t especially long. I can only assume Sally Scholz’s intro to feminism must have been particularly thought-provoking.

For reference, I hit book #50 in November this year. There does seem to be some suggestion in the data that I did indeed read longer books as time went on, despite my mental disavowal of doing such.

Stats geeks might like to know that the line of best fit shown in the top chart above could be argued to explain around 30% of the variation in book length over time, with each month cumulatively adding an estimated extra 14 pages on top of a base of 211 pages. It should be stated that I didn’t spend too long considering the best model or fact-checking the relevant assumptions for this dataset. Instead I just pressed "insert trend line" in Tableau and let it decide :).

I’m afraid the regression should not be considered traditionally statistically significant at the 0.05 level though, having a p-value of – wait for it – 0.06. Fortunately for my intention to publish the above in Nature :), I think people are increasingly aware of the silliness of uncontextual hardline p-value criteria and/or publication bias.
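For anyone who prefers R to Tableau trend lines, here is roughly how you could fit the same sort of model yourself. This is just a sketch – "books_2017" is a hypothetical dataframe with one row per completed book, holding a date_read (Date) and a num_pages column, rather than the exact structure of the Goodreads export:

# Fit a simple linear model of mean pages per completed book against month of completion
library(dplyr)

monthly <- books_2017 %>%
 mutate(month_number = as.integer(format(date_read, "%m"))) %>%
 group_by(month_number) %>%
 summarise(mean_pages = mean(num_pages))

model <- lm(mean_pages ~ month_number, data = monthly)
summary(model) # shows R-squared, the slope (extra pages per month), intercept and p-value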

Nonetheless, as I participate in the 2018 challenge – now at 52 books, properly one a week – I shall be conscious of this trend and double up my efforts to keep reading based on quality rather than length. Of course, I remain very open to the idea – some might say hopeful! – that one sign of a quality author is that they can convey their material in a way that would be described as concise. You generous readers of my ramblings may detect some hypocrisy here.

For any really interested readers out there, you can once more see the full list of the books I read, plus links to the relevant Goodreads description pages, on the last tab of the interactive viz.

My favourite R package for: summarising data

Hot on the heels of delving into the world of R frequency table tools, it’s now time to expand the scope and think about data summary functions in general. One of the first steps analysts should perform when working with a new dataset is to review its contents and shape.

How many records are there? What fields exist? Of which type? Is there missing data? Is the data in a reasonable range? What sort of distribution does it have? Whilst I am a huge fan of data exploration via visualisation, running a summary statistical function over the whole dataset is a great first step to understanding what you have, and whether it’s valid and/or useful.

So, in the usual format, what would I like my data summarisation tool to do in an ideal world? You may note some copy and paste from my previous post. I like consistency :)

  1. Provide a count of how many observations (records) there are.
  2. Show the number, names and types of the fields.
  3. Be able to provide info on as many types of fields as possible (numeric, categorical, character, etc.).
  4. Produce appropriate summary stats depending on the data type. For example, if you have a continuous numeric field, you might want to know the mean. But a “mean” of an unordered categorical field makes no sense.
  5. Deal with missing data transparently. It is often important to know how many of your observations are missing. Other times, you might only care about the statistics derived from those which are not missing.
  6. For numeric data, produce at least these types of summary stats. And not to produce too many more esoteric ones, cluttering up the screen. Of course, what I regard as esoteric may be very different to what you would.
    1. Mean
    2. Median
    3. Range
    4. Some measure of variability, probably standard deviation.
    5. Optionally, some key percentiles
    6. Also optionally, some measures of skew, kurtosis etc.
  7. For categorical data, produce at least these types of summary stats:
    1. Count of distinct categories
    2. A list of the categories – perhaps restricted to the most popular if there are a high number.
    3. Some indication as to the distribution – e.g. does the most popular category contain 10% or 99% of the data?
  8. Be able to summarise a single field or all the fields in a particular dataframe at once, depending on user preference.
  9. Ideally, optionally be able to summarise by group, where group is typically some categorical variable. For example, maybe I want to see a summary of the mean average score in a test, split by whether the test taker was male or female.
  10. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  11. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or similar, for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance in the R console, without having to use further coding.
  12. Work with “kable” from the Knitr package, or similar table output tools. I often use R markdown and would like the ability to show the summary statistics output in a reasonably presentable manner.
  13. Have a sensible set of defaults (aka facilitate my laziness).

What’s in base R?

The obvious place to look is the “summary” command.

This is the output, when run on a very simple data file consisting of two categorical (“type”, “category”) and two numeric (“score”, “rating”) fields. Both type and score have some missing data; the others do not. Rating has both one particularly high and one particularly low outlier.

summary(data)

basesummary

This isn’t too terrible at all.

It clearly shows we have 4 fields, and it has determined that type and category are categorical, hence displaying the distribution of counts per category. It works out that score and rating are numerical, so gives a different, sensible, summary.

It highlights which fields have missing data. But it doesn’t show the overall count of records, although you could manually work it out by summing up the counts in the categorical variables (but why would you want to?). There’s no standard deviation. And whilst it’s OK to read interactively, it is definitely not “tidy”, pipeable or kable-compatible.

Just as with many other commands, analysing by groups can be done with the base R “by” command. But the output is “vertical”, making it hard to compare the same stats between groups at a glance. Determining the difference in means between category X and category Z in the below would be a lot easier if they were visually closer together – especially if you had many more than 3 categories.

by(data, data$category, summary)

bybasesummary

So, can we improve on that effort by using libraries that are not automatically installed as part of base R? I tested 5 options. Inevitably, there are many more possibilities, so please feel free to write in if you think I missed an even better one.

  • describe, from the Hmisc package
  • stat.desc from pastecs
  • describe from psych
  • skim from skimr
  • descr and dfSummary from summarytools

Was there a winner from the point of view of fitting nicely to my personal preferences? I think so, although the choice may depend on your specific use-case.

For readability, compatibility with the tidyverse, and ability to use the resulting statistics downstream, I really like the skimr feature set. It also facilitates group comparisons better than most. This is my new favourite.

If you prefer to prioritise the visual quality of the output, at the expense of processing time and flexibility, dfSummary from summarytools is definitely worth a look. It’s a very pleasing way to see a summary of an entire dataset.
Update: thanks to Dominic, the package author, who left a comment below after very quickly fixing the processing time issue in version 0.8.2.

If you don’t enjoy either of those, you are probably fussy :). But for reference, Hmisc’s describe was my most-used variant before conducting this exploration.

describe, from the Hmisc package

Documentation

library(Hmisc)
Hmisc::describe(data)

hmiscdescribe

This clearly provides the count of variables and observations. It works well with both categorical and numerical data, giving appropriate summaries in each case, even adapting its output to take into account for instance how many categories exist in a given field. It shows how much data is missing, if any.

For numeric data, instead of giving the range as such, it shows the highest and lowest 5 entries. I actually like that a lot. It helps to show at a glance whether you have one weird outlier (e.g. a totals row that got accidentally included in the dataframe) or whether there are several values many standard deviations away from the mean. On the subject of deviations, there’s no specific variance or standard deviation value shown – although you can infer much about the distribution from the several percentiles it shows by default.

The output is nicely formatted and spacious for reading interactively, but isn’t tidy or kableable.

There’s no specific summary-by-group function, although again you can pass this function into the by() command to have it run once per group, i.e. by(data, data$type, Hmisc::describe)

The output from that, however, is very “long” and ordered by group rather than by variable, rendering comparisons of the same stat between different groups quite challenging at a glance.

stat.desc, from the pastecs package

Documentation

library(pastecs)
stat.desc(data)

statdesc

The first thing to notice is that this only handles numeric variables, producing NA for the fields that are categorical. It does provide all the key stats and missingness info you would usually want for the numeric fields though, and it is great to see measures of uncertainty like confidence intervals and standard errors available. With other parameters you can also apply tests of normality.
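For example, my reading of the pastecs documentation is that setting the norm parameter to TRUE appends skewness, kurtosis and Shapiro-Wilk normality test results for each numeric column – treat the exact output as an assumption on my part rather than something verified in depth here.

# norm = TRUE should add skewness, kurtosis and a Shapiro-Wilk normality test per column
stat.desc(data, norm = TRUE)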

It works well with kable. The output is fairly manipulable in terms of being tidy, although the measures show up as row labels as opposed to a true field. You get one column per variable, which may or may not be what you want if passing onwards for further analysis.

There’s no inbuilt group comparison function, although of course the by() function works with it, producing a list containing one copy of the above style of table for each group – again, great if you want to see a summary of a particular group, less great if you want to compare the same statistic across groups.
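For reference, that by() approach is a one-liner – you get one stat.desc table per category, stacked vertically in a list:

# One stat.desc summary per value of category
by(data, data$category, pastecs::stat.desc)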

describe and describeBy, from the psych package

Documentation

library(psych)
psych::describe(data)

psychdescribe

OK, this is different! It has included all the numeric and categorical fields in its output, but the categorical fields show up, somewhat surprisingly if you’re new to the package, with the summary stats you’d normally associate with numeric fields. This is because the default behaviour is to recode categories as numbers, as described in the documentation:

…variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name…Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected. For instance, if education is coded “high school”, “some college” , “finished college”, then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).

As the docs indicate, this can be risky! It is certainly risky if you are not expecting it :). I don’t generally have use-cases where I want this to happen automatically, but if you did, and you were very careful how you named your categories, it could be handy for you.

For the genuinely numeric data though, you get most of the key statistics and a few nice extras. It does not indicate where data is missing though.

The output works with kable, and is pretty tidy, outside of the common issue of using rownames to represent the variable the statistics are summarising, if we are being pernickety.

This command does have a specific summary-by-group variation, describeBy. Here’s how we’d use it if we want the stats for each “type” in my dataset, A – E.

psych::describeBy(data, data$type)

psychdescribeby

Everything you need is there, subject to the limitations of the basic describe(). It’s much more compact than using the by() command on some of the other summary tools, but it’s still not super easy to compare the same stat across groups visually. It also does not work with kable and is not tidy.

The “mat” parameter does allow you to produce a matrix output of the above.

psych::describeBy(data, data$type, mat = TRUE)

psychdescribebymat

This is visually less pleasant, but it does enable you to produce a potentially useful dataframe, which you could tidy up or use to produce group comparisons downstream, if you don’t mind a little bit of post-processing.
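As a sketch of that post-processing – assuming, per my reading of the psych docs, that the matrix-style output uses rownames for the variables and has columns called group1, n, mean and sd – you could move towards a tidy group comparison with something like:

library(dplyr)
library(tibble)

# Column names below (group1, n, mean, sd) are my assumption from the psych documentation
psych::describeBy(data, data$type, mat = TRUE) %>%
 as.data.frame() %>%
 rownames_to_column(var = "variable") %>%
 select(variable, group1, n, mean, sd)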

skim, from the skimr package

Documentation

library(skimr)
skim(data)

skim

At the top skim clearly summarises the record and variable count. It is adept at handling both categorical and numeric data. For readability, I like the way it separates them into different sections dependent on data type, which makes for quick interpretation given that different summary stats are relevant for different data types.

It reports missing data clearly, and has all the most common summary stats I like.

Sidenote: see the paragraph in red below. The issue mentioned in this section is no longer an issue as of skimr 1.0.1, although the skim_with function may still be of interest.

There is what appears to be a strange sequence of unicode-esque characters like <U+2587> shown at the bottom of the output. In reality, these are intended to be a graphical visualisation of distributions using sparklines, hence the column name “hist”, referring to histograms. This is a fantastic idea, especially to see in-line with the other stats in the table. Unfortunately, they do not by default display properly in the Windows environment which is why I see the U+ characters instead.

The skimr documentation details how this is actually a problem with underlying R code rather than this library, which is unfortunate, as I suspect it means there cannot be a quick fix. There is a workaround involving changing one’s locale, although I have not tried this, and probably won’t before establishing whether there would be any side effects in doing so.

In the meantime, if the nonsense-looking U+ characters bother you, you can turn off the column that displays them by changing the default summary that skim uses per data type. There’s a skim_with function that you can use to add your own summary stats into the display, but it also works to remove existing ones. For example, to remove the “hist” column:

skim_with(integer = list(hist = NULL))
skim(data)

skimnohist

Now we don’t see the messy unicode characters, and we won’t for the rest of our skimming session.

UPDATE 2018-01-22: the geniuses who designed skimr actually did find a way to make the sparklines appear in Windows after all! Just update your skimr version to 1.0.1 and you’re back in graphical business, as the rightmost column of the integer variables below demonstrates.

skim2


The output works well with kable. Happily, it also respects the group_by function from dplyr, which means you can produce summaries by group. For example:

group_by(data, category) %>%
 skim()

groupbyskim

Whilst the output is still arranged by the grouping variable before the summary variable, making it slightly inconvenient to visually compare categories, this seems to be the nicest “at a glance” way yet to perform that operation without further manipulation.

But if you are OK with a little further manipulation, life becomes surprisingly easy! Although the output above does not look tidy or particularly manipulable, behind the scenes it does create a tidy dataframe-esque representation of each combination of variable and statistic. Here’s the top of what that looks like by default:

mydata <- group_by(data, category) %>%
 skim()

head(mydata, 10)

skimdf

It’s not super-readable to the human eye at a glance – but you might be able to tell that it has produced a “long” table that contains one row for every combination of group, variable and summary stat that was shown horizontally in the interactive console display. This means you can use standard methods of dataframe manipulation to programmatically post-process your summary.

For example, sticking to the tidyverse, let’s graphically compare the mean, median and standard deviation of the “score” variable, comparing the results between each value of the 3 “categories” we have in the data.

mydata %>%
 filter(variable == "score", stat %in% c("mean", "median", "sd")) %>%
 ggplot(aes(x = category, y = value)) +
 facet_grid(stat ~ ., scales = "free") +
 geom_col()

skimggplot

descr and dfSummary, from the summarytools package

Documentation

Let’s start with descr.

library(summarytools)
summarytools::descr(data)

descrsummarytools

The first thing I note is that this is another of the summary functions that (deliberately) only works with numerical data. Here though, a useful red warning at the top shows which columns have been ignored. You also get a record count, and a nice selection of standard summary stats for the numeric variables, including information on missing data (for instance, Pct.Valid is the proportion of data which isn’t missing).

kable does not work here, although you can recast to a dataframe and later kable that, i.e.

kable(as.data.frame(summarytools::descr(data)))

The data comes out relatively tidy although it does use rownames to represent the summary stat.

mydata <- summarytools::descr(data)
View(mydata)

viewmydata
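If you do want the statistic names as a proper column rather than rownames, one quick fix – assuming you have the tibble package to hand – is something like:

library(tibble)

# Recast to a dataframe, then promote the rownames (the stat names) into a column
mydata_tidy <- rownames_to_column(as.data.frame(summarytools::descr(data)), var = "stat")
head(mydata_tidy)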

There is also a transpose option if you prefer to arrange your variables by row and summary stats as columns.

summarytools::descr(data, transpose = TRUE)

descrtranspose

There is no special functionality for group comparisons, although by() works, with the standard limitations.

The summarytools package also includes a fancier, more comprehensive, summarising function called dfSummary, intended to summarise a whole dataframe – which is often exactly what I want to do with this type of summarisation.

dfSummary(data)

summarytoolsdf

This function can deal with both categorical and numeric variables and provides a pretty output in the console with all of the most used summary stats, info on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as beautiful as the sparklines that the skimr function tries to show, but have the advantage that they work right away on Windows machines.

On the downside, the function seems very slow to perform its calculations at the moment. Even though I’m using a relatively tiny dataset, I had to wait an annoyingly long time for the command to complete – perhaps 1-2 minutes, vs other summary functions which complete almost instantly. This may be worth it to you for the clarity of output it produces, and if you are careful to run it once with all the variables and options you are interested in – but it can be quite frustrating when engaged in interactive exploratory analysis where you might have reason to run it several times.

Update 2018-02-10: the processing time issues should be fixed in version 0.8.2. Thanks very much to Dominic, the package author, for leaving a comment below and performing such a quick fix!

There is no special grouping feature.

Whilst it does work with kable, it doesn’t make for nice output. But don’t despair, there’s a good reason for that. The function has built-in capabilities to output directly into markdown or HTML.

This goes way beyond dumping a load of text into HTML format – instead giving you rather beautiful output like that shown below. This would be perfectly acceptable for sharing with other users, and is less headache-inducing than other representations when staring at it to gain an understanding of your dataset. Again though, it does take a surprisingly long time to generate.
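If I’ve read the summarytools documentation correctly, the pretty HTML version can be produced via the package’s own view function, or by picking an HTML-flavoured print method – but please treat the exact calls below as a sketch from the docs rather than anything definitive.

# Open the HTML rendering in the RStudio viewer (or similar)
view(dfSummary(data))

# Or explicitly ask the print method for a browser-based HTML rendering
print(dfSummary(data), method = "browser")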

summarytoolsdfhtml.PNG

My favourite R package for: frequency tables

Back for the next part of the “which of the infinite ways of doing a certain task in R do I most like today?” series. This time, what could be a more fascinating aspect of analysis to focus on than: frequency tables?

OK, most topics might actually be more fascinating. Especially when my definition of frequency tables here will restrict itself to 1-dimensional variations, which in theory a primary school kid could calculate manually, given time. But they are such a common tool – one analysts use for all sorts of data validation and exploratory data analysis jobs – that finding a nice implementation might prove to be a time-and-sanity-saving task over a lifetime of counting how many things are of which type.

Here’s the top of an example dataset. Imagine a “tidy” dataset, such that each row is one observation. I would like to know how many observations (e.g. people) are of which type (e.g. demographic – here a category between A and E inclusive).

Type   Person ID
E      1
E      2
B      3
B      4
B      5
B      6
C      7

I want to be able to say things like: “4 of my records are of type E”, or “10% of my records are of type A”. The dataset I will use in my below example is similar to the above table, only with more records, including some with a blank (missing) type.

What would I like my 1-dimensional frequency table tool to do in an ideal world?

  1. Provide a count of how many observations are in which category.
  2. Show the percentages or proportions of total observations that each category represents.
  3. Be able to sort by count, so I see the most popular options at the top – but only when I want to, as sometimes the order of data is meaningful for other reasons.
  4. Show a cumulative %, sorted by count, so I can see quickly that, for example, the top 3 options make up 80% of the data – useful for some swift Pareto analysis and the like.
  5. Deal with missing data transparently. It is often important to know how many of your observations are “missing”. Other times, you might only care about the statistics derived from those which are not missing.
  6. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  7. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or whatever for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance, without having to use further coding.
  8. Work with “kable” from the Knitr package, or similar table output tools. I often use R markdown and would like the ability to show the frequency table output in a reasonably presentable manner.
  9. Have a sensible set of defaults (aka facilitate my laziness).


So what options come by default with base R?

Most famously, perhaps the “table” command.

table(data$Type)

table

A super simple way to count up the number of records by type. But it doesn’t show percentages or any sort of cumulation. By default it hasn’t highlighted that there are some records with missing data. It does have a useNA parameter that will show that though if desired.

table(data$Type, useNA = "ifany")

tablena

The output also isn’t tidy and doesn’t work well with Knitr.

The table command can be wrapped in the prop.table command to show proportions.

prop.table(table(data$Type))

proptable

But you’d need to run both commands to understand the count and percentages, and the latter inherits many of the limitations from the former.
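If you’re determined to stay in base R, you can at least glue the two together yourself, for instance:

# Combine counts and proportions into one (still untidy, un-kable-able) matrix
counts <- table(data$Type, useNA = "ifany")
cbind(n = counts, proportion = prop.table(counts))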

So what’s available outside of base R? I tested 5 options, although there are, of course, countless more. In no particular order:

  • tabyl, from the janitor package
  • tab1, from epiDisplay
  • freq, from summarytools
  • CrossTable, from gmodels
  • freq, from questionr

Because I am fussy, I managed to find some slight personal niggle with all of them, so it’s hard to pick an overall personal winner for all circumstances. Several came very close. I would recommend looking at any of the janitor, summarytools and questionr package functions outlined below if you have similar requirements and tastes to me.

tabyl, from the janitor package

Documentation

library(janitor)
tabyl(data$Type, sort = TRUE)

janitor

This is a pretty good start! By default, it shows counts, percents, and percent of non-missing data. It can optionally sort in order of frequency. The output is tidy, and works with kable just fine. The only thing missing really is a cumulative percentage option, although that’s easy enough to add yourself, as sketched below. But it’s a great improvement over base table.
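Assuming, as I believe is the case, that tabyl’s output includes a column called percent, a dplyr mutate is all that sketch takes:

library(dplyr)

# Add a cumulative percentage column to the (already sorted) tabyl output
tabyl(data$Type, sort = TRUE) %>%
 mutate(cum_percent = cumsum(percent))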

I do find myself constantly misspelling “tabyl” as “taybl” though, which is annoying, but not really something I can really criticise anyone else for.

tab1, from the epiDisplay package

Documentation

library(epiDisplay)
tab1(data$Type, sort.group = "decreasing", cum.percent = TRUE)

epidisplayepidisplay2

This one is pretty fully featured. It even (optionally) generates a visual frequency chart output as you can see above. It shows the frequencies, proportions and cumulative proportions both with and without missing data. It can sort in order of frequency, and has a totals row so you know how many observations you have all in.

However it isn’t very tidy by default, and doesn’t work with knitr. I also don’t really like the column names it assigns, although one can certainly claim that’s pure personal preference.

A greater issue may be that the cumulative columns don’t seem to work as I would expect when the table is sorted, as in the above example. The first entry in the table is “E”, because that’s the largest category. However, it isn’t 100% of the non-missing dataset, as you might infer from the fifth numerical column. In reality it’s 31.7%, per column 4.

As far as I can tell, the function is working out the cumulative frequencies before sorting the table – so as category E is the last category in the data file it has calculated that by the time you reach the end of category E you have 100% of the non-missing data in hand. I can’t envisage a situation where you would want this behaviour, but I’m open to correction if anyone can.

freq, from the summarytools package

Documentation

library(summarytools)
summarytools::freq(data$Type, order = "freq")

summarytools

This looks pretty great. Has all the variations of counts, percents and missing-data output I want – here you can interpret the “% valid” column as “% of all non-missing”. Very readable in the console, and works well with Knitr. In fact it has some further nice formatting options that I wasn’t particularly looking for.

It is pretty much tidy, although it has a minor niggle in that the output always includes the total row. It’s often important to know your totals, but if you’re piping it to other tools or charts, you may have to use another command to filter that row out each time, as there doesn’t seem to be an obvious way to prevent it being included with the rest of the dataset when running it directly.

Update 2018-04-28: thanks to Roland in the comments below for pointing out that a new feature to disable the totals display has been added: set the “totals” parameter to false, and the totals row won’t show up, potentially making it easier to pass on for further analysis.
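So, per Roland’s tip, something along these lines should give a totals-free table that is easier to pass downstream:

# totals = FALSE suppresses the totals row, leaving purely one row per category
summarytools::freq(data$Type, order = "freq", totals = FALSE)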

CrossTable, from the gmodels library

Documentation

library(gmodels)
CrossTable(data$Type)

crosstable

Here the results are displayed in a horizontal format, a bit like the base “table”. The proportions are clearly shown though, albeit not with a cumulative version. It doesn’t highlight that there are missing values, and isn’t “tidy”. You can get it to display a vertical version (add the parameter max.width = 1), which is visually distinctive, but untidy in the usual R tidyverse sense.

It’s not a great tool for my particular requirements here, but most likely this is because, as you may guess from the command name, it’s not particularly designed for 1-way frequency tables. If you are crosstabulating multiple dimensions it may provide a powerful and visually accessible way to see counts, proportions and even run hypothesis tests.

freq, from the questionr package

Documentation

library(questionr)
questionr::freq(data$Type, cum = TRUE, sort = "dec", total = TRUE)

questionr

Counts, percentages, cumulative percentages, missing values data, yes, all here! The table can optionally be sorted in descending frequency, and works well with kable.

It is mostly tidy, but also has an annoyance in that the category values themselves (A–E) are row labels rather than a standalone column. This means you may have to pop them into a new column for best use in any downstream tidy tools. That’s easy enough with e.g. dplyr’s add_rownames command, as sketched below. But that is another processing step to remember, which is not a huge selling point.
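To spell that extra step out – here using tibble’s rownames_to_column, the successor to add_rownames, and leaving off the total row so it doesn’t get treated as a category – it’s something like:

library(questionr)
library(tibble)

# Promote the category row labels (A-E) into a proper column called Type
freq_table <- questionr::freq(data$Type, cum = TRUE, sort = "dec")
freq_table <- rownames_to_column(as.data.frame(freq_table), var = "Type")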

There is a total row at the bottom, but it’s optional, so just don’t use the “total” parameter if you plan to pass the data onwards in a way where you don’t want to risk double-counting your totals. There’s an “exclude” parameter if you want to remove any particular categories from analysis before performing the calculations as well as a couple of extra formatting options that might be handy.