How to be happy: the data driven answer (part 1)

A fundamental goal for many people, explicit or otherwise, is to be maximally happy. Easily said, not always so easily done. So how might we set about raising our level of happiness? OK, at some level, we’re all individuals with our own set of wishes and desires. But, at a more macro level, there are underlying patterns in many human attributes that lead me to believe that learning what typically makes other people happy might be useful with regards to understanding what may help maintain or improve our own happiness.

With that in mind, let’s see if we can pursue a data driven answer to this question: what should our priorities or behaviours likely be if we want to optimise for happiness?

For a data driven answer, we’re going to need some data. I settled on the freely available   HappyDB corpus (thanks, researchers!).

The corpus contains around 100,000 free text answers, crowdsourced from members of the public who had signed up to Mechanical Turk,  to this question:

What made you happy today? Reflect on the past {24 hours|3 months}, and recall three actual events that happened to you that made you happy. Write down your happy moment in a complete sentence.

Write three such moments.

Examples of happy moments we are NOT looking for (e.g events in distant past, partial sentence):
– The day I married my spouse
– My Dog

It should be noted that the average Mechanical Turk user is not representative of the average person in the world and hence any findings may have limited generalisability. However, the file does contain some demographics which we consider digging into later in case there’s any correspondence between one’s personal characteristics and what makes them happy. Perhaps most notably, around 86% of respondents were from the USA, so there will clearly be a geographic / cultural bias at play. However, being from the UK, a similarly “westernised country”, this may be less of a problem for me if I personally wish to act on the results. A more varied geographical or cultural comparison on drivers of happiness would however be a fascinating exercise.

I’m most interested in eliciting the potential drivers of happiness in a manner conducive to informing my mid-to-longer term goals. With that in mind, here I’m only looking at the longer term variant of the question – what made people happy in the past 3 months? A followup could certainly be whether or not the same type of things that people report having made them happy over a 3 month timeframe correspond to the things that people report making them happy on a day-to-day basis.

So, having downloaded the dataset , the first step is to read it into our analysis software. Here I will be using R. The researchers supply several few data files, which are described on their Github page. Here I decided to use the file where they’d generously cleaned up a few of the less valid looking entries, for instance if they were blank, had only a single word, or seemed to be misspelled, called “cleaned_hm.csv“.

Whilst some metadata is also included in that file, for the first part of this exercise I am only interested in these columns:

  • hmid: the unique ID number for the “happy moment”
  • wmid: an ID number that tells us which “worker” i.e. respondent answered the question. One person may have responded to the question several times. It’s also a way to link up the demographic attributes of the respondent in future if we want to.
  • cleaned_hm: the actual (cleaned up) text of the response the user gave to the happiness question

We’ll also use the “reflection_period” field to filter down to only the responses to the 3 month version of the question. This field contains “3m” in the row for these 3-month responses.

library(readr)
library(dplyr)

happy_db <- read_csv(".\\happy_db_data\\rit-public-HappyDB-b9e529e\\happydb\\data\\cleaned_hm.csv")

happy_data <- filter(happy_db, reflection_period == "3m") %>%
select(hmid, wid, cleaned_hm)

Let’s take a look at a few rows of the data to ensure everything’s looking OK.

head

OK, that looks good, and perhaps already gives us a little preview of what sort of things  make people happy,

Checking the recordcount – nrow(happy_data) – shows we have 50,704  happy moments to look at after the filters have been applied. Reading them all one by one isn’t going to be much fun, so let’s load up some text analysis tools!

First, I wanted to tokenise the text. That is to say, extract each individual word from the each response to the happiness question and put it onto its own row. This makes certain types of data cleaning and analysis way easier, especially if, like me,  you’re a tidyverse advocate. I may want to be able to reconstruct the tokenised sentences in future, so will keep track of which word comes from which happy moment, by logging the relevant happy moment ID, “hmid”, on each row.

Past experience (and common sense) tells me that some types of words may be more interesting to analyse than others. I won’t learn a lot from knowing how many respondents used the word “and”.

One approach could be to classify words as to their “part of speech” type – adjectives, verbs, nouns and so on. My intuition is that people’s happiness might be influenced by encountering certain objects (in the widest possible sense – awkwardly incorporating people into that classification) or engaging in specific activities. This feels like a potential close fit to the nouns and verb speech parts. So let’s begin by extracting each word and classifying it as to whether it’s a noun, verb or something else.

One R option for this is the cleanNLP library. cleanNLP includes functions to tokenise and “annotate” each word with respect to what part of speech it is. It can actually use a variety of natural language processing backends to do this, some more fancy than others.  Some require installing non-R software like Python or Java, which, to keep it simple here, I’m going to avoid. So I’m going to use the udpipe library, which is pure R, and hence can be installed with the standard install.packages(“udpipe”) command if necessary.

Once cleanNLP is installed, we need to ask it to split up and tag our text as to whether each word is a noun, verb and so on. First we initialise the backend we want to use, udpipe. Then we use the cnlp_annotate command to perform the tokenisation and annotation, passing it:

  • the name of the dataframe containing the question responses: happy_data.
  • the name of the field which contains the actual text of the responses, as text_var.
  • the name of the field that identifies each individual unique response as doc_var.

This process can take a long time to complete, so don’t start it if you’re in a hurry.

Finally, we’ll take the resulting “annotation” object and extract the table of annotated words (aka tokens) from it, to make it easy to analyse with standard tidy tools.

install.packages("cleanNLP")
library(cleanNLP)

# initialise udpipe annotation engine

cnlp_init_udpipe()

# annotate text

happy_data_annotated <- cnlp_annotate(happy_data, as_strings = TRUE, 
                                   text_var = "cleaned_hm", doc_var = "hmid")

# extract tokens into a data frame

happy_terms <- happy_data_annotated %>% 
  cnlp_get_token()

OK, let’s see what we have, comparing this output to the original response data for the first entry.

before

After:

after

Lovely. In our annotated token data, we can see the id field matches the hmid field of the response file, allowing us to keep track of which words came from which response.

Each word of the response is now a row, in the field “word”. Furthermore, the word has also been converted to its lemma in the lemma field – a lemma being the base “dictionary” version of the word – for instance the words “studying” and “studies” both have the lemma “study“. This lemmatisation seems useful if I’m interested in a general overview of the subjects people mention in response to the happiness question, insomuch as it may group words together that have the same basic meaning.

Of course it can’t always be perfect – the above example in fact has a type of error, in that the word “nowing” has been lemmatised as “now”, whereas one has to assume the respondent actually meant “knowing”, with a lemma more like “know”. However, this is user error – they misspelt their response. I don’t fancy going through each response to fix up these types of errors, so for the purposes of this quick analysis I’ll just assume them to be relatively rare.

We can also see the in the field “upos” that the words have been classified into their grammatical function – including the verbs and nouns we’re looking for.  upos stands for “universal part of speech”, and you can see what the abbreviations mean here.

The “pos” field divides these categories up further, and can be deciphered as being Penn part of speech tags. In the above example, you can see some verbs are categorised as VBG and others as VB. This is distinguishing between the base form of a verb and its present participle. I’m not going to concern myself with these differences, so will stick to the upos field.

Now we have this more usable format, there’s still a little more data cleansing I want to do.

  • One part of speech (hereafter “POS”) category is called PUNCT, meaning punctuation. I don’t really want to include punctuation marks in my analysis, so will remove any words that are classified as PUNCT.
  • Same goes for the category SYM, which are symbols.
  • I also want to remove stopwords. These are words like “the”, “a” and other very common words that are typically not too informative when it comes to analysing sentences at a per-word level. Here I used the snowball list to define which words are in that category. This list is included in the excellent tidytext package,  the usage of which is documented in its companion book.
  • I’m also going to remove words that directly correspond to “happy”. Given the question itself is “what makes you happy?” I feel safe to assume that all the responses should relate to happiness. Knowing people used the word “happy” a lot won’t add much to my understanding.

Here’s how I did it:

happy_terms <- filter(happy_terms, upos != "PUNCT" & upos != "SYM") %>%
  anti_join(filter(stop_words, lexicon == "snowball"), by = c("lemma" = "word")) %>%
  filter(!(lemma %in% c("happy", "happiness", "happier")))

OK, now we have a nice cleanish list of the dictionary words from the responses. Let’s take a look at what they are! We’ll start with the most obvious type of analysis, a frequency count. What are the most common words people used in their responses?

Let’s start with “things”, i.e. nouns. Here are the top 20 nouns used in the answers to the happiness question.

library(ggplot2)

filter(happy_terms, upos == "NOUN") %>%
  count(lemma, sort = TRUE) %>%
  arrange(desc(n)) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(lemma, n, mean), y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = "Most common nouns in responses to happiness question", x = "", y = "Count")
common_nouns

OK, there’s already some strong themes emerging there! Many nouns relate to people (friend, family, etc.), there’s a mention of jobs, and some more “occasion” based terms like birthday, event etc.

There’s also a lot of time frame indicators. Whilst it may eventually be interesting that there are more mentions of day than month, and more month mentions than year, for a simple first pass I’m not sure they add a lot. Let’s exclude them, and plot a larger sample of nouns. This time we’ll use a word cloud, where the size of the words are relative to the frequency of usage. Larger words are those that are used more often in the responses. For this, we can use the appropriately named library “wordcloud“.

It should be noted that, whilst wordclouds are more visually appealing than bar charts to many people, they are certainly harder to interpret in a precise way. However, here I’m more looking for the major themes, so I’ll live with that and go with the prettier option.

install.packages("wordcloud")
library(wordcloud)

# create a count of nouns, excluding some time based ones

happy_token_frequency_nlp_noun <- 
  filter(happy_terms, upos == "NOUN" & !(lemma %in% c("day", "time", "month", "week", "year", "today"))) %>%
  count(lemma, sort = TRUE)

# open a png graphics device so wordcloud gets saved to disk as a png

png("wordcloud_packages.png", width=12,height=8, units='in', res=300) 

# create the wordcloud

wordcloud(words = happy_token_frequency_nlp_noun$lemma, freq = happy_token_frequency_nlp_noun$n, max.words = 150, colors = c("grey60", "darkgoldenrod1", "tomato"))

# close the png device

dev.off() 

A wordcloud of happy nouns:

nouncloud

A cornucopia of happiness-inducing things! In this format, some commonalities truly stand out. It seems that friends make people particularly happy. Families are key too, although the number of times “family” is directly mentioned is lower than “friend”. That said, parts of family-esque structures such as son, daughter, wife, husband, girlfriend and boyfriend are also specifically mentioned with relatively high frequency. Basically, people are made happy by other people. Or even by our furry relatives – dog and cat are both in there.

Next most common, perhaps less intuitively, are words around what is probably employment – job and work.

There’s plenty of potentially “event” type words in there – event itself, but also birthday, date, dinner, game, movie et al. Perhaps these show what type of occasions are most associated with happy times. I can’t help but note that many of them sound again like opportunities for socialising.

There’s a few actual “things” in the more conventional sense too; insentient objects that you could own.  They’re less prevalent in terms of mentions of any individual one, but car, computer, bike and phone some relevant examples that appear.

And then some perhaps less controllable phenomena – weather, season and surprise.

It must be noted these interpretations have to rely on assumptions based on preexisting knowledge of how people usually construct sentences and the norms for answering questions. We’re looking at single words in isolation. There’s no context.

If people are actually writing “Everything except my friend made me happy”  then the word “friend” would still appear in the wordcloud. Intuitively though, this seems unlikely to be the main driver of it featuring so strongly. We may however dig deeper into context later.  For now, we can also increase our confidence in the most straightforward interpretation by noting that there’s a fair bit of external research out there that suggests a positive connection between friendship and happiness, and even health, that would support these results. Happify produced a nice infographic summarising the results of a few such studies.

Next up, let’s repeat the exercise using verbs, or as I vaguely recall learning at primary school, “doing words”. What kind of actions do people take that results in them having a happy memory?

The code is very similar to the nouns example. Firstly, let’s filter and plot the most common verbs we found in the responses.

filter(happy_terms, upos == "VERB") %>%
  count(lemma, sort = TRUE) %>%
  arrange(desc(n)) %>%
  slice(1:20) %>%
  ggplot(aes(x = reorder(lemma, n, mean), y = n)) +
  geom_col() +
  coord_flip() +
  theme_bw() +
  labs(title = "Most common verbs in responses to happiness question", x = "", y = "Count")
common_verbs

The top 3 there – get, go and make – seem be particularly prevalent compared to the rest. Getting things, going to places and the satisfaction of creating things are all things we might intuit are pleasing to the average human. However they’re somewhat vague, which makes me curious as to how exactly they’re being used in sentences. Are there particular things that show up as being got, gone to or made that generate happy memories?

To understand that, we’re going to have look at a bit of, at least simplistic, context. We’ll start by looking at the most common phrases those words appear in. To do this, we’ll use the tidytext library to generate “bigrams” – that is to say, each 2-word combination within the response sentences that those specific words are used in.

For example, if there’s a sentence ‘I love to make cakes’, then the bigrams involving ‘make’ here are:

  • to make
  • make cakes

If we calculate and count every such bigram, then, if enough people enjoy making cakes, that might pop out in our data.

First up though, remember that our analysis so far is actually showing the lemmas of the words used by the respondents, rather than the exact words themselves. So we’ll need to determine which words in our dataset are represented by the lemmas “get”, “go” and “make” in order to comprehensively find the appropriate bigrams in the text.

We can do that by looking for all the distinct combinations of lemma and word in our tokenised dataset. Here’s one way to see which unique real-world words in our dataset are represented by the lemma “get”:

filter(happy_terms, lemma == "get") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct()
getlemma

This means that we should look in the raw response text for any bigrams involving got, get, getting, gets or gotten if we want to see what the commonest contexts for the “get” responses are. To automate this a little, let’s create a vector for each of the 3 highly represented lemmas – get, go, and make – that include all the words that are grouped into that lemma.

get_words <- filter(happy_terms, lemma == "get") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
  distinct() %>%
  pull(word)

go_words <- filter(happy_terms, lemma == "go") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
    distinct() %>%
  pull(word)

make_words <- filter(happy_terms, lemma == "make") %>%
  select(word) %>%
  mutate(word = tolower(word)) %>%
    distinct() %>%
  pull(word)

The next step is to generate the bigrams themselves. Here I’m also removing the common, generally tedious, stopwords as we did before. This simplistic approach to stopwords does have some potentially problematic side-effects – including that if an interesting single word is surrounded entirely by stopwords then it will be excluded from this analysis. However, in terms of detecting a few interesting themes as opposed to creating a detailed linguistic analysis, I’m OK with that for now.

# create the bigrams

happy_bigrams <-  unnest_tokens(happy_data, bigram, cleaned_hm, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!(word1 %in% filter(stop_words, lexicon == "snowball")$word) & !(word2 %in% filter(stop_words, lexicon == "snowball")$word))

# create a count of how many time each bigram was found - sorting from most to least frequent

bigram_counts <- happy_bigrams %>% 
  count(word1, word2, sort = TRUE)             
             
# show the top 20 bigrams that use the 'get' lemma
             
filter(bigram_counts, word1 %in% get_words | word2 %in% get_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving the 'get' lemma", x = "", y = "Count")

What do people like to get?

get_common

“finally got” may not tell us much about what was gotten, but evidently achieving something one has waited for for some time is particularly pleasing in this context. Other obvious themes there are mentions again of other people – son got, wife got, friend got, get together etc. Likewise there’s a few life events there – getting married, and work stuff like getting promoted or hired.

But my curiosity remains unabated. What sort of things are people happy that they “finally got”? To get some idea, I decided to split the responses up into five-word n-grams –  think of these as being like bigrams, but looking at groups of 5 consecutive words rather than 2.

Groups of 5 words here will tend to produce low counts as people don’t necessarily express the same idea using precisely the same words – we may revisit this limitation in future! Nonetheless, it may give us a clue as to what some of the bigger topics are. We can use the unnest_tokens command from the tidytext library again. This is exactly what we did above to generate the bigrams, but this time specifying an n of 5, to specific that we want sequences of 5 words.

# make the 5-grams, keeping only those that were found at least twice
happy_fivegrams <-  unnest_tokens(happy_data, phrase, cleaned_hm, token = "ngrams", n = 5) %>%
  separate(phrase, c("word1", "word2", "word3", "word4", "word5"), sep = " ") %>%
  group_by(word1, word2, word3, word4, word5) %>%
  filter(n() >= 2 ) %>%
  ungroup()

# count how many times each 5-gram was used
fivegram_counts <- happy_fivegrams %>% 
  count(word1, word2, word3, word4, word5, sort = TRUE)   
  
 # show the top 15 that started with "finally got" 
filter(fivegram_counts, word1 == "finally" & word2 == "got" ) %>%
   head(15)
finallygot

Only a couple of people used most of the precise same phrases, but the most obvious feature that stands out here related to employment. People “finally” getting new jobs, or progressing towards the jobs they want, are occasions that make them happy.

Back now to drilling down into the common verb lemma bigrams: where are people going when they make happy memories?

# show the top 20 bigrams that use the 'go' lemma   
	  
filter(bigram_counts, word1 %in% go_words | word2 %in% go_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving 'go' lemma", x = "", y = "Count")  
go_common

Here, going shopping wins the frequency competition – although it’s certainly not usually an experience I personally enjoy! There’s a few references to family and friends again, and a predilection for going home. A couple of specific outdoor leisure activities – hiking and fishing – seem to suit some folk well. A less expected top 20 entry there was Pokemon Go, the “gotta-catch-em-all” augmented reality game, which, sure enough, does  have the word “go” in its title. The human urge to collect apparently persists in the virtual world ūüôā

Finally, what are people making that pleases them?

# show the top 20 bigrams that use the 'make' lemma 

filter(bigram_counts, word1 %in% make_words | word2 %in% make_words) %>%
    head(20) %>%
    ggplot(aes(x = reorder(paste(word1, word2), n, mean), y = n)) +
    geom_col() +
    coord_flip() +
    theme_bw() +
    labs(title = "Most common bigrams involving 'make' lemma", x = "", y = "Count")
make_common

It’s not all about making in the sense of arts and crafts here. Making plans appeals to some – perhaps creating something that they can then look forward to occurring in the future. References to “moments” are pretty vague but perhaps describe happiness coming from specific events that could later be drilled down into.

There’s the ever-present social side of wives, sons, girlfriends et al featuring in the commonest bigrams. The hedonistic pleasures of money and sex also creep into the top 20.

OK, now we’ve dug a little into the 3 verb lemmas that dominate the most common verbs used, let’s take a look at the next most frequently used selection of verbs. Here again, we’ll use the (controversial) wordcloud, and hope it allows us to elucidate at least some common themes.

Removing “get”, “make” and “go” from the tokenised verb list, before wordclouding the most common of the remaining verb lemmas:

# filter the terms to show only verbs that are not get, make or go.

happy_token_frequency_nlp_verb_filtered <- filter(happy_terms, upos == "VERB" & !lemma %in% c("get", "make", "go")) %>%
count(lemma, sort = TRUE)

# open a png graphics device so wordcloud gets saved to disk as a png

png("wordcloud_verb_filtered.png", width=12,height=8, units='in', res=300) 

# create the wordcloud

wordcloud(words = happy_token_frequency_nlp_verb_filtered$lemma, freq = happy_token_frequency_nlp_verb_filtered$n, max.words = 150, colors = c("grey60", "darkgoldenrod1", "tomato"))

# close the png device

dev.off() 

A wordcloud of happy verbs:

verbcloudfilterd

The highest volumes here are seen in experiential words – see and feel, vague as they might seem in this uni-word analysis. Perhaps though they hint towards the conclusions of  some external research that suggests, at least after a certain point of privilege, people tend to gain more happiness from spending their money on experiences as opposed to objects.

Beyond that, some hints towards – suprise, surprise – social interactions appear. Give, receive, take, visit, love, meet, tell, say, play, talk, share, participate, listen, speak, invite and call, amongst other words, may all potentially fall into that category.

There’s “spend”, a further analysis of which could differentiate between whether we’re typically talking about spending time, spending money, or both. A hint towards the former existing at least to some extent is given by “buy” being a relatively common verb. “Work” also features, along with some almost-by-definition happiness inducers such as “win” and “celebrate”.

In terms of specific activities, we see hints at some intellectual pursuits: read, book, learn, know, graduate – together with some potentially fitness related activities: walk, run and move. Per the famous saying, eating and drinking also makes some people merry.

So, in conclusion:

That’s the end of our first step here, essentially variations on a frequency analysis of the words used by respondents when they’re asked to recall what made them happy in the past 3 months. Did it produce any insights?

My main takeaway here is really the key criticality of the social. Happiness seekers should usually prioritise people, where there’s an option to do so.

Yes, shopping, cars and phones and a few other inanimate objects of potential desire popped up. We also saw a few recreational activities that may or may not be social. Here I’m thinking of walking, running, fishing and hiking. It’s perhaps also notable that those examples happen to be potentially physically active activities that often take place outside. There’s external research extolling the well-being benefits of both being outdoors and physical exercise.

Words associated with education, learning and employment also appeared with relatively high frequency, so a focus on optimising those aspects of life might also be worthwhile for improving satisfaction.

But references to friends, family and occasions that may involve other people dominated the discourse. Thus, when making life plans, it’s probably wise to consider the social – meaningful interaction with real live human beings – as a particularly high priority.

In our noun analysis above, the word “friend” stuck out. Whilst as life goes on, and often gets busier, it can sometimes be hard to find time to focus on these wider relationships, there’s external evidence that¬†placing high a value on friendship is associated with improved well-being – potentially even moreso than the variation in valuing family relationships.

These effects seem to associate with even the most dramatic measures of well-being, with several studies suggesting that people who maintain strong social relationships of any kind tend to be happier, healthier and even live longer.

Advertisements

Using R to run many hypothesis tests (or other functions) on subsets of your data in one go

It’s easy to run a basic hypothesis test in R, once you know how. For example, if you’ve a nice set of data that you know meets the relevant assumptions, then you can run a t test in the following sort of way . Here we’ll assume that you’re interested in comparing the differences in the number of apples eaten between people, based on which group they’re in, from an imaginary “foods eaten” experiment dataset.

t.test(apples_eaten ~ group, data = foods_eaten)

The above uses the default values for hypothesis type, expected difference, confidence level et al. but these are all configurable by various parameters should you need to do so.

The default output looks something like this:

Welch Two Sample t-test

data: apples by group
t = 0.033192, df = 76.123, p-value = 0.9736
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.950128 3.050128
sample estimates:
mean in group a mean in group b 
9.05 9.00

Here we can see that the difference in means does not seem significant by pretty much any definition; the most obvious indicator being that the p-value is way above the traditional – if increasingly¬†controversial –¬† “significance” cutoff of 0.05.

This output nice enough for interactive use if you’re analyzing a single variable from an experiment or two. But, let’s say your dataset actually includes the results of several experiments, several outcome variables, or some other scenario where you want to conduct multiple hypothesis tests in one go. Do you really want to run the above command many times and manually collate the results into a nice table to let you see a quick overview of results? Well, of course not! R, being your servant rather than your master, should run them all for you and output the results into a nice clean table, right?

First up though, you should probably consider why you’re doing this. Running multiple unadjusted t-tests for different groups within the same data is often a very bad idea. Per usual, xkcd provides an awesome explanation as to why.

significant

If you’re looking at the differences between multiple subgroups in one experiment, you should often prefer, for example, an ANOVA test to several t-tests. This helps to avoid inflating the chances of a type 1 error, and falsely concluding you found an effect when in reality there isn’t one. I have also seen some instances of people running (unadjusted) multiple t-tests after an ANOVA has reported an effect in order to try and determine between which of the groups in the ANOVA the difference is significant in –¬† but this holds the same sort of danger. In that case, there’s plenty of standard post-hoc tests it would be more legitimate to use instead – Tukey’s HSD test for instance is similar to a t test, but contains adjustments to reduce the error rate.

Anyway, enough with the theoretical warnings. Perhaps you do have a valid use case. Or perhaps you realise that the below technique really allows you to run many different kind of functions that aren’t t tests on a dataset. These may be other types of hypothesis tests, or functions entirely unrelated to such testing. If this is the case, you can think of this as a template.

Let’s think through the mechanics of what we want to do. Let’s take a contrived (and fake) scenario in which we conducted 2 experiments, each involving 2 sets of people, a control and a test group. We measured how many of various different types of fruits and vegetable they ate. We’re interested in whether each group within an individual experiment differed in terms of the mean average of each type of fruit eaten.

Here’s a few rows of the sample data this non-existent experiment generated.

experiment_number group person_number apples bananas carrots dandelions eggplant
Experiment 1 a 1 1 3 3 3 1
Experiment 1 a 3 1 1 3 3 5
…
Experiment 1 b 103 2 5 3 5 2
Experiment 1 b 104 8 1 3 1 2
…
Experiment 2 c 7 19 5 8 0 3
Experiment 2 c 12 4 5 6 1 0
…
Experiment 2 d 111 16 3 6 0 5
Experiment 2 d 113 12 3 4 4 0

Our goal here, for the sake of argument, is to perform a t test for each experiment (identified by the “experiment number” column), comparing the 2 relevant groups of participants (the “group” column) in terms of how many apples they ate. Then, separately, on how many bananas they ate, and so on through the other edibles. Then we want to move on to experiment 2 and repeat the same exercise, until we have a hypothesis test result for each experiment on each of the foodstuffs. Yes, I know, if this was a real analysis, we’d need to revisit the above xkcd cartoon and think hard about whether this is the appropriate design for analysing the experiment.

But, that aside, this means we need to:

  1. Select only the group of data that has the first experiment_number, i.e. anything with experiment_number = Experiment 1 here.
  2. Select the first outcome variable of interest (apples)
  3. Run a t test on it, comparing participant groups against each other (here group a vs group b).
  4. Record the results of that t test (let’s say the t statistic, the p value, and the estimated difference in means) as a row in a results table
  5. Go back to step 2, but moving to the next outcome variable, here examining bananas instead of apples, until we’ve tested all the outcome variables.
  6. Go back to step 1, moving to the next experiment number, i.e. Experiment 2, and repeating the whole process again.

In this dataset, 2 experiments with 5 outcomes, that’s 10 t-tests. I don’t want to look at 10 of the default outputs for R’s t.test function, thank you. I’d prefer a result that looks something like this…

experiment_number food_type estimated mean of first group estimated mean of second  group t statistic p value
Experiment 1 apples 9.05 9 0.03 0.97
Experiment 1 bananas 2.5 2.875 -1.02 0.31
…
Experiment 2 dandelions 2.16 2.84 -1.76 0.08
Experiment 2 eggplant 3.2 3.66 -0.86 0.39

…where we’ve a table, or “dataframe” as R calls it, showing 1 row per t test with the most relevant statistics.

OK, let’s go, building it up step by step. Being R, of course there’s a multitudinous variety of ways we could do this. Being a tidyverse fan, I’m going with what could be described as a¬†dplyr approach. We’re going to need to install and load the following libraries to use this method.

  • tidyr: to get the data into a “tidy” format.
  • dplyr: to manipulate the data into the groups you wish to hypothesis test on.
  • broom: to tidy up the results of the default R t.test output into a nice tabular format

So, load up these libraries, remembering to install the packages first if you’ve not already done so, as below.

install.packages("tidyr") # only needed if you didn't ever install tidyr
install.packages("dplyr") # only needed if you didn't ever install dplyr
install.packages("broom") # only needed if you didn't ever install broom

library(tidyr) # to get data into "tidy" format
library(dplyr) # to manipulate data
library(broom) # to tidy up the output of the default t test function.

Step 2 is to transform the data table of experiment results from a “wide” format to a “long” format.

Many tidyverse functions, and much of R in general, works better with long data. We know our lowest level of grouping – i.e. the granularity we want to conduct our tests at – is at a per experiment per outcome variable level. For example, we’ll need to group all of the “apples” data for “experiment 1” together so we can analyse it with an individual test.

If we’re going to group then the experiment number and the outcome variable type should probably be individual columns. The experiment_number already is. But the outcome variables aren’t – they each have their own column. We thus want to get from this situation: 1 row per person who participated in the experiment…

experiment_number person_number group apples bananas carrots
1 1 a 1 2 3
2 7 c 19 24 6

…to this format – 1 row per outcome variable per person. Note how the experiment number, group, and person number are repeated down the rows, in order to keep track of whose apples outcome the row concerns.

experiment_number group person_number food_type count_eaten
Experiment 1 a 1 apples 1
Experiment 1 a 1 bananas 2
Experiment 1 a 1 carrots 3
…
Experiment 2 c 7 apples 19
Experiment 2 c 7 bananas 24
Experiment 2 c 7 bananas 6

We’re using a “food_type” column to record which outcome variable we’re talking about (apples vs bananas), and the “person_number” column to ensure we keep track of which entries belong to the same person:

This “unpivoting” is a perfect use-case for dplyr’s gather() function. Here we go:

results <- gather(data, key = "food_type", value = "count_eaten", -experiment_number, -person_number, -group)

The "key" parameter of the command lets us set the name of the column that will store the name of the outcome variable we're looking at. The "value" parameter lets us set the name of the column that will store the actual result obtained by the participant for the specified outcome variable.

Where we see "-experiment_number, -person_number, -group" at the end of the function call, that's to tell it that – because experiment number, person number and group are not outcome variables – we don’t want to collapse their contents into the same column as the count of apples, bananas etc. We need these variables to be replicated down the list to know which experiment, person and group the respective outcome variable was obtained from.

Note the minuses in front of those variable names. This means “don’t unpivot these columns’. If you only had¬† a few columns to unpivot, you could instead write the names of the columns you do want to unpivot there, without the minuses.

Let’s check what “results” now looks like, by typing head(results).

results1

OK, that looks very much like what we were aiming for. Onto the next step!

Now we need to ensure our analysis is only carried out using the data from a single experiment for a single outcome. Our outcomes here are the food type counts. For example, we want to consider the data from Experiment 1’s apples outcome separately from both Experiment 1’s bananas outcome, and Experiment 2’s apples outcome. In other words, we want to group the data by which experiment it’s from and which food type it’s measuring.

Handily, dplyr has a group_by function. Pass in the data you’re working with, and the names of the columns you want to group it by.

results <- group_by(results, experiment_number, food_type)

Checking the  head of the results now looks pretty similar to the previous step, only this time the output tells you, quite faintly, near the top, that we've grouped by the appropriate variables.

results2

Now, it’s time to do the actual t tests!

First of all, remember that we don’t want a bunch of those verbose default t.test outputs to work with. We want a nice table summarising the most important parts of the test results. We thus need to make the key parts of t test output “tidy“, so it can be put into a table, aka dataframe.

How to tidy the results of a hypothesis test? Use broom. Horray for R package names that attempt some kind of pun.

Let’s first take a moment to see how tidying works with a single t test, run in the conventional way.

Here’s the output of the default t.test function looking at the apples outcome of experiment 1. First I constructed a dataframe “exp1_apples_data”, which contains only data about apples, from experiment 1. Then I ran the standard t.test – being explicit about many of the options available, although if you like the defaults you don’t really need to be.


exp1_apples_data %>%
  select(group, person_number, apples)

t_test_results <- t.test(apples ~ group,
       data = exp1_apples_data,
        alternative = "two.sided",
        mu = 0,
        paired = FALSE,
        var.equal = FALSE,
        conf.level = 0.95)

t_test_results

This outputs:

ttest.PNG

Now, if we apply the tidy() function from the broom package to the results of the t test, what do we get?

tidy(t_test_results)

Outputs:

tidyt

Yes, a nice tidy 1-row summary for the results of the t test. Comparing it to the default output above, you can see how:

  • field “estimate1” corresponds to the mean in group a
  • “estimate2” corresponds to the mean in group b (and the “estimate” field corresponds to the difference between the mean in group a and b)
  • “statistic” corresponds to the t statistic
  • “p.value” corresponds to the p-value
  • “parameter” corresponds to the degrees of freedom
  • “conf.low” and “conf.high” correspond to the confidence interval
  • “method” and “alternative” record the type & nature of the test options selected

So now we have a set of grouped data, and a way of producing t test results in in a tabular format. Let’s combine!

For this, we can use the dplyr function “do“. This applies what the documentation refers to as an “arbitrary computation” (i.e. most any function) to a set of data, treating each of the groups we already set up separately.

The first parameter is the name of the (grouped) dataset to use. The second is the function you want to perform on it.

results <- do(results, tidy(t.test(.$count_eaten ~ .$group,
                 alternative = "two.sided",
                 mu = 0,
                 paired = FALSE,
                 var.equal = FALSE,
                 conf.level = 0.95
                 )))

 

Note that you have to express the t test formula as .$count_eaten ~ .$group rather than the more usual count_eaten ~ group. Inside, the do() function, the . represents the current group of data, and the $ is the standard R way of accessing the relevant column of a dataframe.

What do we have now? Here's the head of the table.

results_3

Success! Here we see a row for each of the tests we did –¬† 1 per experiment per food type –¬† along with the key outputs from the t tests themselves.

This is now just a regular data frame, so you can sort, filter, manipulate or chart as you would with any other dataset. Figuring out which of your results have  p value of 0.05 or lower for instance could be done via:

filter(results, p.value <= 0.05)

Below is a version where the commands are all pieced together, using the pipe operator.¬† That’s likely easier to copy and paste should you or I have a need to do so. Do remember that this is applicable, and perhaps more often actually theoretically valid, to the world outside of t tests.

The tidy() command itself is capable of tidying up many types of hypothesis tests and models so that their results appear in a useful table format. Beyond that, many many other functions that will return something coercible into a row of a dataframe or as an item in a R list, so will work.

results %>%
  group_by(experiment_number, food_type) %>%
  do(tidy(t.test(.$count_eaten ~ .$group,
                 alternative = "two.sided",
                 mu = 0,
                 paired = FALSE,
                 var.equal = FALSE,
                 conf.level = 0.95
                 )))

Analysing your 23andme genetic data in R part 2: exploring the traits associated with your genome

In part one of this mini-series, you heroically obtained and imported your 23andme raw genome data into R. Fun as that was, let’s see if we can learn something interesting from it.¬† After all, 23andme does automatically provide several genomic analysis reports, but – for many sensible reasons – it is certainly limited in what it can show when compared to the entire literature of this exciting field.

It would be tiresome for me to repeat all the caveats you can find in part one, and all over the more responsible parts of the internet. But do remember that in the grand scheme of things we likely know only somewhere between “nothing” and “a little” so far about the implications of most of the SNPs you imported into R in part one.

Whilst there are rare exceptions, for the most part it seems like many “interesting” real-world human traits are a product of many variations in different SNPs, each of which have a relatively tiny effect size. In these cases, if you happen to have a T somewhere rather than an A, it’s usually unlikely on its own to produce to any huge “real world” outcome you’d notice alone. Or if it does, we don’t necessarily know it yet. Or perhaps it could, but only if you have certain other bases in certain other positions, many of which 23andme may not report on – even if we knew which ones were critical. That was a long and meandering sentence which could be summarised as “things are complicated”.

Note also that 23andme do provide a disclaimer on the raw data, as follows:

This data has undergone a general quality review however only a subset of markers have been individually validated for accuracy. As such, the data from 23andMe’s Browse Raw Data feature is suitable only for research, educational, and informational use and not for medical or other use.

So, all in all, what follows should be treated as a fun-only investigation. You should first seek the services of genetic medical professionals, including the all-important genetic counsellor, if you have any real concerns about what you might find.

But, for the data thrill-seekers,¬† let’s start by finding some info as which genotypes at which SNPs are thought to have some potential association with a known trait. This is part of what 23andme does for you in their standard reporting. However, they have legal obligations and true experts working on this. I, on the other hand, recently read “Genetics for Dummies”, so tread carefully.

How do we find out about the associations between your SNP results and any subsequent phenotypes? The obvious place, especially if you’re interested in a particular trait, is the ever-growing published scientific literature on genetics. Being privileged enough to count amongst my good friends a scientist who actually published papers involving this subject, I started there. Here, for example, is Dr Kaitlin Roke’s paper: “Evaluating Changes in Omega-3 Fatty Acid Intake after Receiving Personal FADS1 Genetic Information: A Randomized Nutrigenetic Intervention“.¬†Extra credit is also very much due to her for being kind enough to help me out with some background knowledge.

Part of that fascinating study involved explaining to members of the public what the implications of the differing alleles possible at a specific SNP,¬†rs174537, were with regards to the levels of Omega 3 fatty acids typically found in a person’s body, and how well the relevant conversion process proceeds. This may have dietary implications in terms of how much the person concerned should focus on increasing the amount of omega-3 laden food they eat – although it would be remiss of me to fail to mention the good doctor’s general advice that mostly everyone needs to up their levels anyway!

Anyway, to quote from their wonderfully clear description of the implications of the studied SNP:

…the document provided a brief overview of the reported difference in omega-3 FA levels in relation to a common SNP in FADS1 (rs174537)…individuals who are homozygote GG allele carriers have been reported to have more EPA in their bodies and an increased ability to convert ALA into EPA and DHA while individuals with at least one copy of the minor allele (GT or TT) were shown to have less EPA in their bodies and a reduced ability to convert ALA into EPA and DHA

Awesome, so here we have a definitive explanation as to which SNP was examined, and what the implications of the various genotypes are (which I’ve bolded for your convenience).

In part 1 of this post, we already saw how to filter your R-based 23andme data to view your results for a specific SNP, right? If you already completed all those import steps, you can do the same, just switching in the rsid of interest:

library(dplyr)
filter(genome_data_test, rsid == "rs174537")

g

Run that, and if you are returned a result of either GT or TT then you’ll know you should be especially careful to ensure you are getting a good amount of omega 3 in your diet.

OK, this is super cool, but what if you don’t happen to know a friendly scientist, or don’t know what traits you’re particular interested in –¬† how might you evaluate the SNP implications at scale?

Whilst there’s no substitute for actual expertise, luckily there is a R library called “gwascat“, which enables you to access the¬† NHGRI-EBI Catalog of published genome-wide association studies via a data structure in R. It has a whole lot of info in it, descriptions of the fields you end up with mostly being shown¬†here. The critical point is that it contains a list of SNPs, associated traits and relevant genotypes, together with the references to the publications that found the associations should you want to get more details.

The first thing to do is install gwascat. gwascat is a bioconductor package, rather than the business-user-typical cran packages. So if you’re not a bioconductor user, there’s a slightly different installation routine, which you can see here.

But essentially:

source("https://bioconductor.org/biocLite.R")
biocLite("gwascat")
library(gwascat)

This took a while to install for me, but can just be left to get on with itself.

It sometimes feels like a new genomics study is released almost every few seconds, so the next step may be to get an up-to-date version of the catalog data –¬† I think the one that installs by default is a few years out of date.

Imagine that we’d like our new GWAS catalogue to end up as a data frame called “updated_gwas_data”:

updated_gwas_data <- as.data.frame(makeCurrentGwascat())

This might take a few minutes to run, depending on how fast your download speed is. You can get some idea of the recency once it’s done by checking the latest date that any publication was added to the catalogue.

max(updated_gwas_data$DATE.ADDED.TO.CATALOG)

At the time of writing, this date is May 21st 2018. And what was that study?

library(dplyr)
filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(STUDY) %>% distinct()

“Key HLA-DRB1-DQB1 haplotypes and role of the BTNL2 gene for response to a hepatitis B vaccine.”

Where can I find that study, if I want to read what it actually said?

library(dplyr)
filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(LINK) %>% distinct()

www.ncbi.nlm.nih.gov/pubmed/29534301

OK, now we have up-to-date data, let’s figure out how join it to your personal raw genome data we imported in part 1 (or to be precise here, a mockup file in the same format, so as to avoid sharing anyone’s real genomic data here).

The GWAS data lists each SNP (e.g. “rs9302874”) in a field called SNPS. Our imported 23andme data has the same info in the rsid field. Hence we can do a simple join, here using the dplyr library.

Assuming the 23andme data you imported is called “genome_data”, then:

library(dplyr)

output_data <- inner_join(genome_data, updated_gwas_data, by = c("rsid" = "SNPS"))

Note the consequences of the inner join here. 23andme analyses SNPs that don’t appear in this GWAS database, and the GWAS database may contain SNPs that 23andme doesn’t provide for you. In either case, these will be removed in the above file result. There’ll just be rows for SNPs that 23andme does provide you, and that do have an entry in the GWAS database.

Also, the GWAS database may have several rows for a single SNP. It could be that several studies examined a single SNP, or that one study found many traits potentially associated with a SNP. This means your final “output_data” table above will have many rows per for some SNPs.

OK, so at the time of writing there are nearly 70,000 studies in the GWAS database, and over 600,000 SNPs in the 23andme data export. How shall we narrow down this data-mass to find something potentially interesting?

There are many fields in the GWAS database you might care about – the definitions being listed here. For us amateur folks here, DISEASE.TRAIT and STRONGEST.SNP.RISK.ALLELE might be of most interest.

DISEASE.TRAIT gives you a genericish name for the trait that a study investigated whether there was an association with a given SNP (e.g. “Plasma omega-3 polyunsaturated fatty acid levels”, or “Systolic blood pressure”). Note that the values are not actually all “diseases” by the common-sense meaning – unless you consider traits like being tall a type of illness anyway.

STRONGEST.SNP.RISK.ALLELE gives you the specific allele of the SNP that was “most strongly” associated with that trait in the study (or potentially a ? if unknown, but let’s ignore those for now). The format here is to show the name of the SNP first, then append a dash and the allele of interest afterwards e.g. “rs10787517-A” or “rs7977462-C”.

This can easily give the impression of greater specificity than the real world has – only 1 allele ever appears in this field, so if there are multiple associations then only the strongest will be listed. If there are associations in tandem with other alleles or other SNPs, then that information also cannot be fully represented here. Also, it’s not necessarily the case that alleles are additive; so without further research we shouldn’t assume that having 2 of the high risk bases gives increased risk over a single base.

That’s what the journal reference is for – another reason it’s critical you do the reading and seek the help of appropriate genetic professionals before any rejoicing or worrying about your results.

Taking the above example from Roke et al’s Omega 3 study, this GWAS database records the most relevant strongest SNP risk allele for the SNP they analysed as being “rs174537-T”. You’d want to read the study in order to know whether that meant that the TT genotype was the one to watch for, or whether GT had similar implications.

Back to an exploration of your genome – the two most obvious approaches that come to mind are either: 1) check whether your 23andme results suggest an association with a specific trait you’re interested in, or 2) check which traits your results may be associated with.

In either case, it’ll be useful to create a field that highlights whether your 23andme results indicate that you have the “strongest risk allele” for each study. This is one way to help narrow down towards the interesting traits you may have inherited.

The 23andme part of of your dataframe contains your personal allele results in the genotype field. There you’ll see entries like “AC” or “TT”. What we really want to do here is, for every combination of SNP and study, check to see if either of the letters in your genotype match up with the letter part of the strongest risk allele.

One method would be to separate out your “genotype” data field into two individual allele fields (so “AC” becomes “A” and “C”). Next, clean up the strongest risk allele so you only have the individual allele (so “rs10787517-A”¬† becomes “A”). Finally check whether either or both of your personal alleles match the strongest risk allele. If they do, there might be something of interest here.

One method:

library(dplyr)
library(stringr)

output_data$risk_allele_clean <- str_sub(output_data$STRONGEST.SNP.RISK.ALLELE, -1)
output_data$my_allele_1 <- str_sub(output_data$genotype, 1, 1)
output_data$my_allele_2 <- str_sub(output_data$genotype, 2, 2)
output_data$have_risk_allele_count <- if_else(output_data$my_allele_1 == output_data$risk_allele_clean, 1, 0) + if_else(output_data$my_allele_2 == output_data$risk_allele_clean, 1, 0)

Now you have your two individual alleles stored in my_allele_1 and my_allele_2, and the allele for the “highest risk” stored in risk_allele_clean. Risk_allele_clean is the letter part of the GWAS STRONGEST.SNP.RISK.ALLELE field. And finally, the have_risk_allele_count is either 0, 1 or 2 depending on whether your 23andme genotype result at that SNP contains 0, 1 or 2 of the risk alleles.

The previously mentioned DISEASE.TRAIT field contains a summary of the trait involved. So by filtering your dataset to only look for studies about a trait you care about, you can see a summary of the risk allele and whether or not you have it, and the relevant studies that elicited that connection.

I did notice that this trait field can be kind of messy to use. You’ll see several different entries for similar topics; e.g. some studied traits around Body Mass Index are indeed classified as “Body mass index”, others as “BMI in non-smokers” or several other BMI-related phrases. So you might want to try a few different search strings in the below to access everything on the topic you care about.

For example, let’s assume that by now we also inevitably developed a strong interest in omegas and fatty acids. Which SNPs may relate to that topic, and do we personally have the risk allele for any of them?

We can use the str_detect function of the stringr library in order to search for any entries that contain the substring “omega” or “fatty acid”.

library(dplyr)
library(stringr)
select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_geneotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct()

 

rs

The full table this outputs is actually 149 rows long. That’s a fair few for an amateur to sift through. Maybe we’d prefer to restrict ourselves to the SNP we heard from Dr Roke’s study above was of particular interest:¬† rs174537. Easy: just filter on the rsid:

library(dplyr) 
library(stringr)
select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_geneotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")

rof

Maybe you are curious as to what other traits that same SNP might be associated with? Just reverse the criteria for omega and fatty acid strings. Here I also added the journal reference title, in case I wanted to read up more into these trait associations.

library(dplyr)
 select(output_data, rsid, DISEASE.TRAIT, STUDY) %>% 
 filter(!str_detect(tolower(DISEASE.TRAIT), "omega") & !str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")

rs174537

Now onto the second approach – this time, you don’t have a specific trait in mind. You’re more interested in discovering which traits have risk alleles that match the respective part of your genome. Please see all the above disclaimers. This is not remotely the same as saying which dreadful diseases you are going to get. But please stay away from this section if you are likely to be worried about seeing information that could even vaguely correspond to health concerns.

We already have our¬†have_risk_allele_count field. If it’s 1 or 2 then you have some sort of match. So, the full list of your matches and the associated studies could be retrieved in a manner like this.

library(dplyr)
filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )

Note that this list is likely to be long. Some risk alleles are very common in some populations, and remember that there may be many studies that relate to a single SNP, or many SNPs referred to by a single study. You might want to pop it in a nice DT Javascript DataTable to allow easy searching and sorting.

library(DT)
library(dplyr)

datatable(
 filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )
)

results in something like:

dt

…which includes a handy search field at the top left of the output.

There are various other fields in the GWAS dataset you might consider using to filter down further. For example, you might be most interested in findings from studies that match your ethnicity, or occasions where you have risk alleles that are rare within the population. After all, we all like to think we’re special snowflakes, so if 95% of the general population have the risk allele for a trait, then that may be less interesting to an amateur genome explorer than one where you are in the lucky or unlucky 1%.

For the former, you might try searching within the INITIAL.SAMPLE.SIZE or REPLICATION.SAMPLE.SIZE fields, which has entries like: “272 Han Chinese ancestry individuals” or “1,180 European ancestry individuals from ~475 families”.

Similar to the caveats on searching the trait fields, one does need to be careful here if you’re looking for a comprehensive set of results. Some entries in the database have blanks in one of these fields, and others don’t specify ethnicities, having entries like “Up to 984 individuals”.

For the proportion of the studied population who had the risk allele, it’s the RISK.ALLELE.FREQUENCY field. Again, this can sometimes be blank or zero. But in theory, where it has a valid value, then, depending on the study design, you might find that lower frequencies are rarer traits.

We can use dplyr‘s arrange and filter functions to sort do the above sort of narrowing-down. For example: what are the top 10 trait / study / SNP combinations you have the risk allele for that were explicitly studied within European folk, ordered by virtue of them having the lowest population frequencies reported in the study?

library(stringr)
library(dplyr)
head(
 filter(output_data,have_risk_allele_count > 0 & (str_detect(tolower(INITIAL.SAMPLE.SIZE), "european") | str_detect(tolower(REPLICATION.SAMPLE.SIZE), "european")) & (RISK.ALLELE.FREQUENCY > 0 & !is.na(RISK.ALLELE.FREQUENCY))) %>%
 arrange(RISK.ALLELE.FREQUENCY) %>%
 select(rsid, your_genotype = genotype, DISEASE.TRAIT, INITIAL.SAMPLE.SIZE,RISK.ALLELE.FREQUENCY)
 , 10)

 

dt2

 

Or perhaps you’d prefer to prioritise the order of the traits you have the risk allele for, for example, based on the number of entries in the GWAS database for that trait where the highest risk allele is one you have. You might argue that these could be some of the most reliably associated traits, in the sense that they would bias towards those that have so far been studied the most, at least within this database.

Let’s go graphical with this one, using the wonderful ggplot2 package.

library(dplyr)
library(ggplot2)

trait_entry_count <- group_by(output_data, DISEASE.TRAIT) %>%
 filter(have_risk_allele_count >= 1) %>%
 summarise(count_of_entries = n())

ggplot(filter(trait_entry_count, count_of_entries > 100), aes(x = reorder(DISEASE.TRAIT, count_of_entries, sum), y = count_of_entries)) +
 geom_col() +
 coord_flip() +
 theme_bw() +
 labs(title = "Which traits do you have the risk allele for\nthat have over 100 entries in the GWAS database?", y = "Count of entries", x = "Trait")

over100

Now, as we’re counting combinations of studies and SNPs per trait here, this is obviously going to be somewhat self-fulfilling as some traits have been featured in way more studies than others. Likewise some traits may have been associated with many more SNPs than others. Also, recalling that many interesting traits seem to be related to a complex mix of SNPs, each of which may only have a tiny effect size, it might be that whilst you do have 10 of the risk alleles for condition X, you also don’t have the other 20 risk alleles that we discovered so far have an association (let alone the 100 weren’t even publish on yet and hence aren’t in this data!).

Maybe then we can sort our output in a different way. How about we count the number of distinct SNPs where you have the risk allele, and then express those as a proportion of the count of all the distinct SNPs for the given trait in the database, whether not you have the risk allele? This would let us say things such as, based (solely) on what in this database, you have 60% of the known risk alleles associated with this trait.

One thing noted in the data, both the 23andme genome data and the gwascat highest risk allele have unusual values in the allele fields – things like ?, -, R, D, I and some numbers based on the fact the “uncleaned” STRONGEST.SNP.RISK.ALLELE didn’t have a -A, -C, -G or -T at the end of the SNP it named. Some of these entries may be meaningful – for example the D and I in the 23andme data refer to deletes and insertions, but won’t match up with anything in the gwascat data. Others may be more messy or missing data, for example 23andme reports “–” if no genotype result was provided for a specific SNP call.

In order to avoid these inflating the proportion’s denominator we’ll just filter down so that we only consider entries where our gwascat-derived “risk_allele_clean” and¬† 23andme-derived “my_allele_1” and “my_allele_2″ are all one of the standard A, C, G or T bases.

Let’s also colour code the results by the rarity of the SNP variant within the studied population. That might provide some insight to exactly what sort of special exception we are as an individual – although some of the GWAS data is missing that field and basic averaging won’t necessarily give the correct weighting, so this part is extra…”directional”.

You are no doubt getting bored with the sheer repetition of caveats here – but it is so important. Whilst these are refinements of sorts, they are simplistic and flawed and you should not even consider concluding something significant about your personal health without seeking professional advice here. This is fun only. Well, for for those of us who could ever possibly classify data as fun anyway.

Here we go, building it up one piece at a time for clarity of some kind:

library(dplyr)
library(ggplot2)

# Summarise proportion of SNPs for a given trait where you have a risk allele

trait_snp_proportion <-  filter(output_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) %>%
 group_by(DISEASE.TRAIT, you_have_risk_allele) %>%
 summarise(count_of_snps = n_distinct(rsid)) %>%
 mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) %>%
 filter(you_have_risk_allele == 1) %>%
 arrange(desc(proportion_of_snps_for_trait)) %>%
 ungroup()

# Count the studies per trait in the database

trait_study_count <- filter(output_data, risk_allele_clean %in% c("C" ,"A", "G", "T") & my_allele_1 %in% c("C" ,"A", "G", "T") & my_allele_2 %in% c("C" ,"A", "G", "T") ) %>%
 group_by(DISEASE.TRAIT) %>%
 summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))

# Merge the above together

trait_snp_proportion <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")

# Plot the traits where there were more than 2 studies and you have risk alleles for more than 70% of the SNPs studied

ggplot(filter(trait_snp_proportion, count_of_studies > 1 & proportion_of_snps_for_trait > 70), aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
 geom_col() +
 coord_flip() +
 theme_bw() + 
 labs(title = "Traits I have more than half of the risk\nalleles studied where > 1 studies involved", 
 y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency") +
 theme(legend.position="bottom")

 

proptraits

Again, beware that the same sort of trait can be expressed in different ways within the data, in which case these entries are not combined. If you wanted to be more comprehensive regarding a specific trait, you might feel inclined to produce your own categorisation first and group by that – e.g. lumping anything BMI or Body Mass Index into your own BMI category via creating a new field.

Happy exploring!

Analysing your 23andme genetic data in R part 1: importing your genome into R

23andme is one of the ever-increasing number of direct to consumer DNA testing companies. You send in a vial of your spit; and they analyse parts of your genome, returning you a bunch of reports on ancestry, traits and – if you wish – health.

Their business is highly regulated, as of course it should be (and some would say it oversteps the mark a little even with that), so they are, quite rightly, legally limited as to what info they can provide back to the consumer. However, the exciting news for us data geeks is that they do allow you to download the raw data behind their analysis. This means you can dig deeper into parts of your genome that their interpretations don’t cover.

It should be said that there is considerable risk involved here, unless – or perhaps even if – you happen to be a genetics expert. The general advice on interpretation for amateurs should be to seek a professional genetic counseller before concluding anything from your DTC test – although in reality that might be easier said than done.

Whilst I might know a bit about how to play with data, I am not at all a genetics expert, so anything below must be taken with a large¬† amount of skepticism. In fact, if you are in the perfectly legitimate camp of “best not to know” people when it comes to DNA analysis, or you feel there is any risk you won’t be able to constrain yourself to treat the innards of your genome as solely a fun piece of analysis and constrain yourself to avoid areas you don’t want to explore, it would be wise not to proceed.

Also, even as an amateur, I’m aware that the science behind a lot of the interpretation of one’s genome is in a nascent period, at best. There are many people or companies that may rather over-hype what is actually known here, perhaps even to the extent of fraud in some cases.

But if you are interested to browse your results, here is my first experience of playing with the 23andme raw data in R.

Firstly, you need to actually obtain your raw 23andme data. A obvious precondition to this is that you have purchased one of their analysis products, set up your 23andme account, and waited until they have delivered the results to you. Assuming that’s all done, you can visit this 23andme page, and press the “Download” button near the top of the screen. You’ll need to wait a while, but eventually they’ll notify you that your file is ready, and let you download a text file of results to your computer. Here, I called my example file “genome.txt”.

Once you have that, it’s time to load it into R!

The text file is in a tab-delimited format, and also contains 19 rows at the top describing the contents of the file in human-readable format. You’ll want to skip those rows before importing it into R. I used the readr package to do this, although it’s just as easy in base R.

A few notes:

  • It imported more successfully if I explicitly told R the data type of each column.
  • One of the column headers (i.e. the field names) starts with a # and includes spaces, which is a nuisance to deal with in R, so I renamed that right away
  • I decided to import it into a dataframe called “genome_data”
library(readr)

genome_data_test <- read_tsv(".\\data_files\\genome.txt", skip = 19, col_types = cols(
 `# rsid` = col_character(),
 chromosome = col_character(),
 position = col_integer(),
 genotype = col_character())
 )

genome_data_test <- rename(genome_data_test, rsid = `# rsid`)

Let’s see what we have!

head(genome_data_test)

genomedatatest

Sidenote: the genome data I am using is a mocked-up example in the 23andme format, rather than anyone’s real genome – so don’t be surprised if you see “impossible” results shown here. Call me paranoid, but I am not sure it’s necessarily a great idea to publicly share someone’s real results online, at least without giving it careful consideration.

OK, so we have a list of your SNP call data. The rsid column is the “Reference SNP cluster ID” used to refer to a specific SNP, the chromosome and position tell you whereabouts that SNP is located, and the genotype tells you which combination of the Adenine, Thymine, Cytosine and Guanine bases you happen have in those positions.

(Again, I am not at all an expert here, so apologies for any incorrect terminology! Please feel free to let me know what I should have written ūüôā )

Now, let’s check that the import went well.

Many of the built in 23andme website reports do actaully¬† list what SNPs they refer to. For instance, if you click on “Scientific Details” on the life-changing trait report which tells you how likely it is that you urine will smell odd to you after eating asparagus, and look for the “marker tested” section, it tells you that it’s looking at the¬†rs4481887 SNP.

Capture

And it also tells you what bases were found there in your test results. Compare that to the data for the same person’s genome imported in R, by filtering your imported data like this:

library(dplyr)
filter(genome_data_test, rsid == "rs4481887")

If the results of that match the results shown in the scientific details of your asparagus urine smell report, yay, things are going OK so far.

OK, so now your 23andme data is safely in R. But why did we do this, and what might it mean? Come back soon for part 2.

R packages for summarising data – part 2

In a recent post, I searched a tiny percentage of the CRAN packages in order to check out the options for R functions that quickly and comprehensively summarise data, in a way conducive to tasks such as data validation and exploratory analytics.

Since then, several generous people have been kind enough to contact me with suggestions of other packages aimed at the same sort of task that I might enjoy. Always being grateful for new ideas, I went ahead and tested them out. Below I’ll summarise what I found.

I also decided to add another type of variable to the small test set from last time – a date. I now have a field “date_1” which is a Date type, with no missing data, and a “date_2” which does have some missing data in it.

My “personal preference” requirements when evaluating these tools for my specific use-cases haven’t changed from last time, other than that I’d love the tool to be able to handle date fields sensibly.

For dates, I’d ideally like to see the summarisation output acknowledge that the fields are indeed dates, show the range of them (i.e. max and min), if any are missing and how many distinct dates are involved. Many a time I have found that I have fewer dates in my data than I thought, or I had a time series that was missing a certain period. Both situations would be critical to know about for a reliable analysis. It would be great if it could somehow indicate if there was a missing block of dates – e.g. if you think you have all of 2017’s data, but actually June is missing, that would be important to know before starting your analysis.

Before we examine the new packages, I wanted to note that one of my favourites from last time, skimr, received a wonderful update, such that it now displays sparkline histograms when describing certain types of data, even on a Windows machine. Update to version 1.0.1 and you too will get to see handy mini-charts as opposed to strange unicode content when summarising. This only makes an already great tool even more great, kudos to all involved.

Let’s check how last time’s favourite for most of my uses cases, skimr, treats dates – and see the lovely fixed histograms!

library(skimr)

skim(data)

skim

This is good. It identifies and separates out the date fields. It shows the min, max and median date to give an idea of the distribution, along with the number of unique dates. This will provide invaluable clues as to how complete your time series is. As with the other data types, it clearly shows how much missing data there is. Very good. Although I might love it even more if it displayed a histogram for dates that showed the volume of records (y) per time period (x; granularity depending on range), giving a quick visual answer to the question of missing time periods.

The skim-by-groups output for dates is equally as sensible:

library(dplyr)
library(skimr)
group_by(data, category) %>% skim()

skimgroup

Now then, onto some new packages! I collated the many kind suggestions, and had a look at:

  • CreateTableOne, from the tableone package
  • desctable from desctable
  • ggpairs from GGally
  • ds_summary_stats from Descriptr
  • I was also going to look at the CompareGroups package, but unfortunately it had apparently been removed from CRAN recently because “check problems were not corrected in time” apparently. Maybe I’ll try again in future.

CreateTableOne, from the tableone package

Documentation

library(tableone)
 CreateTableOne(data = data)

tableone

Here we can easily see the count of observations. It has automatically differentiated between the categorical and numeric variables, and presented likely appropriate summaries on that basis. There are fewer summary stats for the numeric variables than I would have liked to see when I am evaluating data quality and similar tasks.

There’s also no highlighting of missing data. You could summarise that “type” must have some missing data as it has fewer observations than exist in the overall dataset, but that’s subtle and wouldn’t be obvious on a numeric field. However, the counts and percentage breakdown on the categorical variables is very useful, making it a potential dedicated frequency table tool¬†replacement.

The warning message shows that date fields aren’t supported or shown.

A lot of these “limitations” likely come down to the actual purpose this tool is aimed at. The package is called tableone because it’s aimed at producing a typical “table 1” of a biomedical journal paper, which is often a quantitative breakdown of the population being studied, with exactly these measures. That’s a different and more well-specified task than a couple of use-cases I commonly use summary tools for, such as getting a handle on a new dataset, or trying to get some idea of data quality.

This made me realise that perhaps I am over-simplifying to imagine a single “summary tool” would be best for every one of my use-cases. Whilst I don’t think CreateTableOne’s default output is the most comprehensive for judging data quality, at the same time no journal is going to publish the output of skimr directly.

There are a few more tableone tricks though! You can summary() the object it creates, whereupon you get an output that is less journaly, but contains more information.

summary(CreateTableOne(data = data))

summarytableone

This view produces a very clear view of how much data is missing in both absolute and percentage terms, and most of the summary stats (and more!) I wanted for the numeric variables.

tableone can also do nice summarisation by group, using the “strata” parameter. For example:

CreateTableOne(strata = "category", data = data)

stratatableone.PNG

You might note above that not only do you get the summary segmented by group, but you also automatically get p values from a statistical hypothesis  test as to whether the groups differ from each other. By default these are either chi-squared for categorical variables or ANOVA for continuous variables. You can add parameters to use non-parametric or exact tests though if you feel they are more appropriate. This also changes up the summary stats that are shown for numeric variables Рwhich also happens if you apply the nonnormal parameter even without stratification.

For example, if we think “score” should not be treated as normal:

print(CreateTableOne(strata = "category", data = data), nonnormal = "score")

nonnormaltableone

Note how “score” is now summarised by its median & IQR as opposed to “rating”, which is summarised by mean and standard deviation. A “nonnorm” test has also been carried out when checking group differences in score, in this case a Kruskal Wallis test.

The default output does not work with kable, nor is the resulting summarisation in a “tidy” format, so no easy piping the results in or out for further analysis.

desctable, from the desctable package

Documentation

library(desctable)
desctable(data)

desctable

The vignette for desctable describes how it is intended to fulfil a similar function to the aforementioned tableone, but is built to be customisable, manipulable and fit in well with common tidy data tools such as dplyr.

By default, you can see it shows summary data in a tabular format, taking into showing appropriate summaries based on the type of the variables (numeric vs categorical, although it does not show anything other than counts of non-missing data for dates). The basic summaries are shown, although, like tableone, I would prefer a few more shown by default if I was using it as an exploratory tool. Like tableone though, exploratory analysis of messy data is likely not really its primary aim.

The tabular format is quite easy to manipulate downstream, in a “tidy” way. It actually produces a list of dataframes, one containing the variable names and a second containing the summary stats. But you can as.data.frame() it to pop them into a standard usable dataframe.

I did find it a little difficult to get an quick overview at a glance in the console. Everything was displayed perfectly well – but¬† it took me a moment to make my conclusions. Perhaps part of this is because the variables of different types are all displayed together, making me take a second to realise that some fields are essentially hierarchical – row 1 shows 60 “type” fields, and rows 2-6 show the breakdown by value of “type”.

There’s no special highlighting of missing data, although with a bit of mental arithmetic one could work out that if there are N = 64 records in the date_1 field then if there are 57 entries in the date_2 field, we must be missing at least 7 date_2 entries. But there’s no overall summary of record count by default so you would not necessarily know the full count if every field has some missing data.

It works nicely with kable and has features allowing output to markdown and html.

The way it selects which summaries to produce for the numerical fields is clever. It automatically runs a Shapiro-Wilk test on the field contents to determine whether it is normally distributed. If yes, then you get mean and standard deviation. If no, then median and inter-quartile range. I like the idea of tools that nudge you towards using distribution-appropriate summaries and tests, without you having to take the time (and remember) to check distributions yourself, in a transparent fashion.

Group comparison is possible within the same command, by using the dplyr group_by function, and everything is pipeable – which is a very convenient method for those of already immersed in the tidyverse.

group_by(data, category) %>%
 desctable()

desctableg

The output is “wide” with lengthy field names, which makes it slightly hard to follow if you have limited horizontal space. It might also be easier to compare certain summary stats, e.g. the mean of each group, if they were presented next to each other when the columns are wide.

Note also the nice way that it has picked an appropriate statistical test to automatically test for significant differences between groups. Here it picked the Kruskall-Wallis test for the numeric fields as the data was detected as being non-normally distributed.

Although I am not going to dive deep into it here, this function is super customisable. I mentioned above that I might like to see more summary stats shown.  If I wanted to always see the mean, standard deviation and range of numeric fields I could do:

desctable(data,stats = list("N" = length, "Mean" = mean, "SD" = sd, "Min" = min, "Max" = max))

desctablec

You can also include conditional formulae (“show stat x only if this statement is true”), or customise which statistical tests are applied when comparing groups.

If you spent a bit of time working this through, you could probably construct a very useful output for almost any reasonable requirements. All this and more is documented in the vignette, which also shows how it can be used to generate interactive tables via the datatable function of DT.

 

ggpairs, from the GGally package

Documentation

library(GGally)
ggpairs(data)

ggpairs

ggpairs is a very different type of tool than the other summarising tools tried here so far. It’s not designed to tell you the count, mean, standard deviation or other one-number summary stats of each field, or any information on missing data. Rather, it’s like a very fancy pairs plot, showing the interactions of each of your variables with each of the others.

It is sensitive to the types of data available – continuous vs discrete, so makes appropriate choices as to the visualisations to use. These can be customised if you prefer to make your own choices.

The diagonal of the pairs plot Рwhich would compare each variable to itself, does provide univariate information depending on the type of the variable. For example, a categorical variable would, by default,  show a histogram of the its distribution. This can be a great way to spot potential issues such as outliers or unusual distributions. It also can may reveal stats that could otherwise be represented numerically, for instance the range or variance, in a way often more intuitive to humans.

It looks to handle dates as though they were numbers, rather than as a separate date type.

There isn’t exactly a built-in group comparison feature, although the nature of a pairs plot may mean relevant comparisons are done by default. You can also pass ggplot2 aesthetics through, meaning you can essentially compare groups by assigning them different colours for example.

Here for example is the output where I have specified that the colours in the charts should reflect the category variables in my data.

ggpairs(data, mapping = aes(colour = category))

ggpairsc

As the output is entirely graphic, it doesn’t need to work with kable and obviously doesn’t produce a tidy data output.

I wouldn’t categorise this as having the same aim as the other tools mentioned here or in the past post – although I find it excellent for its intended use-case.

After first investigating data with the more conventional summarisation tools to the point I have a mental overview of its contents and cleanliness, I might often run it through ggpairs as the next step. Visualising data is often a great, sometimes overlooked, method to increase your understanding of a dataset. This tool will additionally allow you to get an overview of the relationship between any two of your variables in a fast and efficient way.

ds_summary_stats from descriptr

Documentation

library(descriptr)
ds_summary_stats(data$score)

descriptrss

descriptr is a package that contains several tools to summarise data. The most obvious of these aimed at my use cases is “ds_summary_stats”, shown above. This does only work on numeric variables, but the summary it produces is extremely comprehensive. Certainly every summary statistic I normally look at for numeric data is shown somewhere in the above!

The output it produces is very clear in terms of human readability, although it doesn’t work directly with kable, nor does it produce tidy data if you wished to use the results downstream.

One concern I had regards its reporting of missing data. The above output suggests that my data$score field has 58 entries, and 0 missing data. In reality, the dataframe contains 64 entries and has 6 records with a missing (NA) score. Interestingly enough, the same descriptr package does contain another function called ds_screener, which does correctly note the missing data in this field.

ds_screener(data)

descriptrscreener

This function is in itself a nicely presented way of getting a high-level overview of how your data is structured, perhaps as a precursor to the more in-depth summary statistics that concern the values of the data.

Back on the topic of producing tidy data: another command in the package, ds_multi_stats, does produce tidy, kableable and comprehensive summaries that can include more than one variable – again restricted to numeric variables only.

I also had problems with this command fields that contained missing data – with it giving the classic “missing values and NaN’s not allowed if ‘na.rm’ is FALSE.” error. I have to admit I did not dig deep to resolve this, only confirming that simply adding “na.rm = TRUE” did not fix the problem. But here’s how the output would look, if you didn’t have any missing data.

ds_multi_stats(filter(data, !is.na(score)), score, rating)

descriptrms

There is also a command that has specific functionality to do group comparisons between numeric variables.

ds_group_summary(data$category, data$rating)

descriptrg

This one also requires that you have no missing data in the numeric field you are summarising. The output isn’t for kable or tidy tools, but its presentation is one of the best I’ve seen for making it easy to visually compare any particular summary stat between groups at a glance.

Whilst ds_summary_stats only works with numeric data, the package does contain other tools that work with categorical data. One such function builds frequency tables.

ds_freq_table(data$category)

descriptrft

This would actually have been a great contender for my previous post regarding R packages aimed at producing frequency tables to my taste. It does miss information that I would have liked to have seen regarding the presence of missing data. The percentages it displays are calculated out of the total non-missing data, so you would be wise to first pre-ensure you that your dataset is indeed complete, perhaps with the afore-mentioned ds_screener command.

Some of these commands also define plot methods, allowing you to produce graphical summaries with ease.

My favourite R package for: summarising data

Hot on the heels of delving into the world of R frequency table tools, it’s now time to expand the scope and think about data summary functions in general. One of the first steps analysts should perform when working with a new dataset is to review its contents and shape.

How many records are there? What fields exist? Of which type? Is there missing data? Is the data in a reasonable range? What sort of distribution does it have? Whilst I am a huge fan of data exploration via visualisation, running a summary statistical function over the whole dataset is a great first step to understanding what you have, and whether it’s valid and/or useful.

So, in the usual format, what would I like my data summarisation tool to do in an ideal world? You may note some copy and paste from my previous post. I like consistency ūüôā

  1. Provide a count of how many observations (records) there are.
  2. Show the number, names and types of the fields.
  3. Be able to provide info on as many types of fields as possible (numeric, categorical, character, etc.).
  4. Produce appropriate summary stats depending on the data type. For example, if you have a continuous numeric field, you might want to know the mean. But a “mean” of an unordered categorical field makes no sense.
  5. Deal with missing data transparently. It is often important to know how many of your observations are missing. Other times, you might only care about the statistics derived from those which are not missing.
  6. For numeric data, produce at least these types of summary stats. And not to produce too many more esoteric ones, cluttering up the screen. Of course, what I regard as esoteric may be very different to what you would.
    1. Mean
    2. Median
    3. Range
    4. Some measure of variability, probably standard deviation.
    5. Optionally, some key percentiles
    6. Also optionally, some measures of skew, kurtosis etc.
  7. For categorical data, produce at least these types of summary stats:
    1. Count of distinct categories
    2. A list of the categories – perhaps restricted to the most popular if there are a high number.
    3. Some indication as to the distribution – e.g. does the most popular category contain 10% or 99% of the data?
  8. Be able to summarise a single field or all the fields in a particular dataframe at once, depending on user preference.
  9. Ideally, optionally be able to summarise by group, where group is typically some categorical variable. For example, maybe I want to see a summary of the mean average score in a test, split by whether the test taker was male or female.
  10. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  11. Output data in a ‚Äútidy‚ÄĚ but human-readable format. Being a big fan of the tidyverse, it‚Äôd be great if I could pipe the results directly into¬†ggplot,¬†dplyr, or similar, for some quick plots and manipulations. Other times, if working interactively, I‚Äôd like to be able to see the key results at a glance in the R console, without having to use further coding.
  12. Work with ‚Äúkable‚ÄĚ from the¬†Knitr¬†package, or similar table output tools. I often use¬†R markdown¬†and would like the ability to show the summary statistics output in reasonably presentable manner.
  13. Have a sensible set of defaults (aka facilitate my laziness).

What’s in base R?

The obvious place to look is the “summary” command.

This is the output, when run on a very simple data file consisting of two categorical (“type”, “category”) and two numeric (“score”, “rating”) fields. Both type and score have some missing data. The others do not. Rating has a both one¬† particularly high and one particularly low outlier.

summary(data)

basesummary

This isn’t too terrible at all.

It clearly shows we have 4 fields, and it has determined that type and category are categorical, hence displaying the distribution of counts per category. It works out that score and rating are numerical, so gives a different, sensible, summary.

It highlights which fields have missing data. But it doesn’t show the overall count of records, although you could manually work it out by summing up the counts in the categorical variables (but why would you want to?). There’s no standard deviation. And whilst it’s OK to read interactively, it is definitely not “tidy”, pipeable or kable-compatible.

Just as with many other commands, analysing by groups could be done with the base R “by” command. But the output is “vertical”, making it hard to compare the same stats between groups at a glance, especially if there are a large number of categories. To determine the difference in means between category X and category Z in the below would be a lot easier if they were visually closer together. Especially if you had many more than 3 categories.

by(data, data$category, summary)

bybasesummary

So, can we improve on that effort by using libraries that are not automatically installed as part of base R? I tested 5 options. Inevitably, there are many more possibilities, so please feel free to write in if you think I missed an even better one.

  • describe, from the Hmisc package
  • stat.desc from pastecs
  • describe from psych
  • skim from skimr
  • descr and dfSummary from summarytools

Was there a winner from the point of view of fitting nicely to my personal preferences? I think so, although the choice may depend on your specific use-case.

For readability, compatibility with the tidyverse, and ability to use the resulting statistics downstream, I really like the skimr feature set. It also facilitates group comparisons better than most. This is my new favourite.

If you prefer to prioritise the visual quality of the output, at the expense of processing time and flexibility, dfSummary from summarytools is definitely worth a look. It’s a very pleasing way to see a summary of an entire dataset.
Update: thanks to Dominic who left a comment after having fixed the processing time issue very quickly in version 0.8.2

If you don’t enjoy either of those, you are probably fussy :). But for reference, Hmisc’s describe was my most-used variant before conducting this exploration.

describe, from the Hmisc package

Documentation

library(Hmisc)
Hmisc::describe(data)

hmiscdescribe

This clearly provides the count of variables and observations. It works well with both categorical and numerical data, giving appropriate summaries in each case, even adapting its output to take into account for instance how many categories exist in a given field. It shows how much data is missing, if any.

For numeric data, instead of giving the range as such, it shows the highest and lowest 5 entries. I actually like that a lot. It helps to show at a glance whether you have one weird outlier (e.g. a totals row that got accidentally included in the dataframe) or whether there are several values many standard deviations away from the mean. On the subject of deviations, there’s no specific variance or standard deviation value shown – although you can infer much about the distribution from the several percentiles it shows by default.

The output is nicely formatted and spacious for reading interactively, but isn’t tidy or kableable.

There’s no specific summary by group function although again you can pass this function into the by() command to have it run once per group, i.e.¬†by(data, data$type, Hmisc::describe)

The output from that however is very “long” and in order of groups rather than variables naturally, rendering comparisons of the same stat between different groups quite challenging at a glimpse.

stat.desc, from the pastecs package

Documentation

library(pastecs)
stat.desc(data)

statdesc

The first thing to notice is that this only handles numeric variables, producing NA for the fields that are categorical. It does provide all the key stats and missingness info you would usually want for the numeric fields though, and it is great to see measures of uncertainty like confidence intervals and standard errors available. With other parameters you can also apply tests of normality.

It works well with kable. The output is fairly manipulable in terms of being tidy, although the measures show up as row labels as opposed to a true field. You get one column per variable, which may or may not be what you want if passing onwards for further analysis.

There’s no inbuilt group comparison function, although of course the by() function works with it, producing a list containing one copy of the above style of table for each group – again, great if you want to see a summary of a particular group, less great if you want to compare the same statistic across groups.

describe and describeBy, from the psych package

Documentation

library(psych)
psych::describe(data)

psychdescribe

OK, this is different! It has included all the numeric and categorical fields in its output, but the categorical fields show up, somewhat surprisingly if you’re new to the package, with the summary stats you’d normally associate with numeric fields. This is because the default behaviour is to recode categories as numbers, as described in the documentation:

…variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name…Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected. For instance, if education is coded “high school”, “some college” , “finished college”, then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).

As the docs indicate, this can be risky! It is certainly risky if you are not expecting it :). I don’t generally have use-cases where I want this to happen automatically, but if you did, and you were very careful how you named your categories, it could be handy for you.

For the genuinely numeric data though, you get most of the key statistics and a few nice extras. It does not indicate where data is missing though.

The output works with kable, and is pretty tidy, outside of the common issue of using rownames to represent the variable the statistics are summarising, if we are being pernickety.

This command does have a specific summary-by-group variation, describeBy. Here’s how we’d use it if we want the stats for each “type” in my dataset, A – E.

psych::describeBy(data, data$type)

psychdescribeby

Everything you need is there, subject to the limitations of the basic describe(). It’s much more compact than using the by() command on some of the other summary tools, but it’s still not super easy to compare the same stat across groups visually.¬† It also does not work with kable and is not tidy.

The “mat” parameter does allow you to produce a matrix output of the above.

psych::describeBy(data, data$type, mat = TRUE)

psychdescribebymat

This is visually less pleasant, but it does enable you to produce a potentially useful dataframe, which you could tidy up or use to produce group comparisons downstream, if you don’t mind a little bit of post-processing.

skim, from the skimr package

Documentation

library(skimr)
skim(data)

skim

At the top skim clearly summarises the record and variable count. It is adept at handling both categorical and numeric data. For readability, I like the way it separates them into different sections dependent on data type, which makes for quick interpretation given that different summary stats are relevant for different data types.

It reports missing data clearly, and has all the most common summary stats I like.

Sidenote: see the paragraph in red below. This issue mentioned in this section is no longer an issue as of skimr 1.0.1, although the skim_with function may still be of interest.

There is what appears to be a strange sequence of unicode-esque characters like <U+2587> shown at the bottom of the output. In reality, these are intended to be a graphical visualisation of distributions using sparklines, hence the column name “hist”, referring to histograms. This is a fantastic idea, especially to see in-line with the other stats in the table. Unfortunately, they do not by default display properly in the Windows environment which is why I see the U+ characters instead.

The skimr documentation details how this is actually a problem with underlying R code rather than this library, which is unfortunate as I suspect this means there cannot be a quick fix. There is a workaround involving changing ones locale, although I have not tried this, and probably won’t before establishing if there would be any side effects in doing so.

In the mean time, if the nonsense-looking U+ characters bother you, you can turn off the column that displays them by changing the default summary that skim uses per data type. There’s a skim_with function that you can use to add your own summary stats into the display, but it also works to remove existing ones. For example, to remove the “hist” column:

skim_with(integer = list(hist = NULL))
skim(data)

skimnohist

Now we don’t see the messy unicode characters, and we won’t for the rest of our skimming session.

UPDATE 2018-01-22 : the geniuses who designed skimr actually did find a way to make the sparklines appear in Windows after all! Just update your skimr version to version 1.0.1 and you’re back in graphical business, as the rightmost column of the integer variables below demonstrate.

skim2

 

 

The output works well with kable. Happily, it also respects the group_by function from dplyr, which means you can produce summaries by group. For example:

group_by(data, category) %>%
 skim()

groupbyskim

Whilst the output is still arranged by the grouping variable before the summary variable, making it slightly inconvenient to visually compare categories, this seems to be the nicest “at a glimpse” way yet to perform that operation without further manipulation.

But if you are OK with a little further manipulation, life becomes surprisingly easy! Although the output above does not look tidy or particularly manipulable, behind the scenes it does create a tidy dataframe-esque representation of each combination of variable and statistic. Here’s the top of what that looks like by default:

mydata <- group_by(data, category) %>%
 skim()

head(mydata, 10)

skimdf

It’s not super-readable to the human eye at a glimpse – but you might be able to tell that it has produced a “long” table that contains one row for every combination of group, variable and summary stat that was shown horizontally in the interactive console display. This means you can use standard methods of dataframe manipulation to programmatically post-process your summary.

For example, sticking to the tidyverse, let’s graphically compare the mean, median and standard deviation of the “score” variable, comparing the results between each value of the 3 “categories” we have in the data.

mydata %>%
 filter(variable == "score", stat %in% c("mean", "median", "sd")) %>%
 ggplot(aes(x = category, y = value)) +
 facet_grid(stat ~ ., scales = "free") +
 geom_col()

skimggplot

descr and dfSummary, from the summarytools package

Documentation

Let’s start with descr.

library(summarytools)
summarytools::descr(data)

descrsummarytools

The first thing I note is that this is another one of the summary functions that (deliberately) only works with numerical data. Here though, a useful red warning showing which columns have thus been ignored is shown at the top. You also get a record count, and a nice selection of standard summary stats for the numeric variables, including information on missing data (for instance Pct.Valid is the proportion of data which isn’t missing).

kable does not work here, although you can recast to a dataframe and later kable that, i.e.

kable(as.data.frame(summarytools::descr(data)))

The data comes out relatively tidy although it does use rownames to represent the summary stat.

mydata <- summarytools::descr(data)
View(mydata)

viewmydata

There is also a transpose option if you prefer to arrange your variables by row and summary stats as columns.

summarytools::descr(data, transpose = TRUE)

descrtranspose

There is no special functionality for group comparisons, although by() works, with the standard limitations.

The summarytools package also includes a fancier, more comprehensive, summarising function called dfSummary, intended to summarise a whole dataframe – which is often exactly what I want to do with this type of summarisation.

dfSummary(data)

summarytoolsdf

This function can deal with both categorical and numeric variables and provides a pretty output in the console with all of the most used summary stats, info on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as beautiful as the sparklines that the skimr function tries to show, but have the advantage that they work right away on Windows machines.

On the downside, the function seems very slow to perform its calculations at the moment. Even though I’m using a relatively tiny dataset, I had to wait an annoyingly large amount of time for the command to complete – perhaps 1-2 minutes, vs other summary functions which complete almost instantly. This may be worth it to you for the clarity of output it produces, and if you are careful to run it once with all the variables and options you are interested in – but it can be quite frustrating when engaged in interactive exploratory analysis where you might have reason to run it several times.

Update 2018-02-10: the processing time issues should be fixed in version 0.82. Thanks very much to Dominic, the package author, for leaving a comment below and performing such a quick fix!

There is no special grouping feature.

Whilst it does work with kable, it doesn’t make for nice output. But don’t despair, there’s a good reason for that. The function has built-in capabilities to output directly into markdown or HTML.

This goes way beyond dumping a load of text into HTML format – instead giving you rather beautiful output like that shown below. This would be perfectly acceptable for sharing with other users, and less-headache inducing than other representations if staring at in order to gain an understanding of your dataset. Again though, it does take a surprisingly long time to generate.

summarytoolsdfhtml.PNG

My favourite R package for: frequency tables

Back for the next part of the “which of the infinite ways of doing a certain task in R do I most like today?” series. This time, what could more more fascinating an aspect of analysis to focus on than: frequency tables?

OK, most topics might actually be more fascinating. Especially when my definition of frequency tables here will restrict itself to 1-dimensional variations, which in theory a primary school kid could calculate manually, given time. But they are such a common tool, that analysts can use for all sorts of data validation and exploratory data analysis jobs, that finding a nice implementation might prove to be a time-and-sanity saving task over a lifetime of counting how many things are of which type.

Here’s the top of an example dataset. Imagine a “tidy” dataset, such that each row is an one observation. I would like to know how many observations (e.g. people) are of¬† which type (e.g. demographic – here a category between A and E inclusive)

Type Person ID
E 1
E 2
B 3
B 4
B 5
B 6
C 7

I want to be able to say things like: “4 of my records are of type E”, or “10% of my records are of type A”. The dataset I will use in my below example is similar to the above table, only with more records, including some with a blank (missing) type.

What would I like my 1-dimensional frequency table tool to do in an ideal world?

  1. Provide a count of how many observations are in which category.
  2. Show the percentages or proportions of total observations that represents
  3. Be able to sort by count, so I see the most popular options at the top – but only when I want to, as sometimes the order of data is meaningful for other reasons.
  4. Show a cumulative %, sorted by count, so I can see quickly that, for example, the top 3 options make up 80% of the data – useful for some swift Pareto analysis and the like.
  5. Deal with missing data transparently. It is often important to know how many of your observations are “missing”. Other times, you might only care about the statistics derived from those which are not missing.
  6. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  7. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or whatever for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance, without having to use further coding.
  8. Work with “kable” from the¬†Knitr¬†package, or similar table output tools. I often use R markdown and would like the ability to show the frequency table output in reasonably presentable manner.
  9. Have a sensible set of defaults (aka facilitate my laziness).

 

So what options come by default with base R?

Most famously, perhaps the “table” command.

table(data$Type)

table

A super simple way to count up the number of records by type. But it doesn’t show percentages or any sort of cumulation. By default it hasn’t highlighted that there are some records with missing data. It does have a useNA parameter that will show that though if desired.

table(data$Type, useNA = "ifany")

tablena

The output also isn’t tidy and doesn’t work well with Knitr.

The table command can be wrapped in the prop.table command to show proportions.

prop.table(table(data$Type))

proptable

But you’d need to run both commands to understand the count and percentages, and the latter inherits many of the limitations from the former.

So what’s available outside of base R? I tested 5 options, although there are, of course , countless more. In no particular order:

  • tabyl, from the janitor package
  • tab1, from epidisplay
  • freq, from summarytools
  • CrossTable, from gmodels
  • freq, from questionr

Because I am fussy, I managed to find some slight personal niggle with all of them, so it’s hard to pick an overall personal winner for all circumstances. Several came very close. I would recommend looking at any of the janitor, summarytools and questionr package functions outlined below if you have similar requirements and tastes to me.

tabyl, from the janitor package

Documentation

library(janitor)
tabyl(data$Type, sort = TRUE)

janitor

This is a pretty good start! By default, it shows counts, percents, and percent of non-missing data. It can optionally sort in order of frequency. It the output is tidy, and works with kable just fine. The only thing missing really is a cumulative percentage option. But it’s a great improvement over base table.

I do find myself constantly misspelling “tabyl” as “taybl” though, which is annoying, but not really something I can really criticise anyone else for.

tab1, from the epidisplay package

Documentation

library(epiDisplay)
tab1(data$Type, sort.group = "decreasing", cum.percent = TRUE)

epidisplayepidisplay2

This one is pretty fully featured. It even (optionally) generates a visual frequency chart output as you can see above. It shows the frequencies, proportions and cumulative proportions both with and without missing data. It can sort in order of frequency, and has a totals row so you know how many observations you have all in.

However it isn’t very tidy by default, and doesn’t work with knitr. I also don’t really like the column names it assigns, although one can certainly claim that’s pure personal preference.

A greater issue may be that the cumulative columns don’t seem to work as I would expect when the table is sorted, as in the above example. The first entry in the table is “E”, because that’s the largest category. However, it isn’t 100% of the non-missing dataset, as you might infer from the fifth numerical column. In reality it’s 31.7%, per column 4.

As far as I can tell, the function is working out the cumulative frequencies before sorting the table – so as category E is the last category in the data file it has calculated that by the time you reach the end of category E you have 100% of the non-missing data in hand. I can’t envisage a situation where you would want this behaviour, but I’m open to correction if anyone can.

freq, from the summarytools package

Documentation

library(summarytools)
summarytools::freq(data$Type, order = "freq")

summarytools

This looks pretty great. Has all the variations of counts, percents and missing-data output I want – here you can interpret the “% valid” column as “% of all non-missing”. Very readable in the console, and works well with Knitr. In fact it has some further nice formatting options that I wasn’t particularly looking for.

It it pretty much tidy, although has a minor niggle in that the output always includes the total row. It’s often important to know your totals, but if you’re piping it to other tools or charts, you may have to use another command to filter that row out each time, as there doesn’t seem to be an obvious way to prevent it being included with the rest of the dataset when running it directly.

Update 2018-04-28: thanks to Roland in the comments below pointing out that a new feature to disable the totals display has been added: set the “totals” parameter to false, and the totals row won’t show up, potential making it easier to pass on for further analysis.

CrossTable, from the gmodels library

Documentation

library(gmodels)
CrossTable(data$Type)

crosstable

Here the results are displayed in a horizontal format, a bit like the base “table”. Here though, the proportions are clearly shown, albeit not with a cumulative version. It doesn’t highlight that there are missing values, and isn’t “tidy”. You can get it to display a vertical version (add the parameter max.width = 1 ) which is visually distinctive, but untidy in the usual R tidyverse sense.

It’s not a great tool for my particular requirements here, but most likely this is because, as you may guess from the command name, it’s not particularly designed for 1-way frequency tables. If you are crosstabulating multiple dimensions it may provide a powerful and visually accessible way to see counts, proportions and even run hypothesis tests.

freq, from the questionr package

Documentation

library(questionr)
questionr::freq(data$Type, cum = TRUE, sort = "dec", total = TRUE)

questionr

Counts, percentages, cumulative percentages, missing values data, yes, all here! The table can optionally be sorted in descending frequency, and works well with kable.

It is mostly tidy, but also has an annoyance in that the category values themselves (A -E are row labels rather than a standalone column. This means you may have to pop them into in a new column for best use in any downstream tidy tools. That’s easy enough with e.g. dplyr’s add_rownames command. But that is another processing step to remember, which is not a huge selling point.

There is a total row at the bottom, but it’s optional, so just don’t use the “total” parameter if you plan to pass the data onwards in a way where you don’t want to risk double-counting your totals. There’s an “exclude” parameter if you want to remove any particular categories from analysis before performing the calculations as well as a couple of extra formatting options that might be handy.

My favourite R package for: correlation

R is a wonderful, flexible, if somewhat arcane tool for analytics of all kinds. Part of its power, yet also its ability to bewilder, comes from the fact that there are so many ways of doing the same, or similar, things. Many of these ways are instantly available thanks to many heroes of the R world creating and distributing free libraries to refine existing functionality and add new abilities.

Looking at a list of one from the most popular sources for these packages, CRAN, shows that their particular collection gets new entries on a several-times-per-day basis, and there are 11,407 of them at the time of writing.

With that intimidating stat in mind, I will keep a few notes on this blog as to my current favourite base or package-based methods for some common analytics tasks. Of course these may change regularly, based on new releases or my personal whims. But for now, let’s tackle correlations. Here I mean simple statistical correlations between 2 sets of data, the most famous one of which is likely the Pearson correlation coefficient, aka Pearson’s R.

What would I like to see in my ideal version of a correlation calculator? Here’s a few of my personal preferences in no particular order.

  1. Can deal with multiple correlation tests at once. For example, maybe I have 5 variables and I’d like to see the correlation between each one of them with each of the other 4 variables).
  2. Visualises the results nicely, for example in a highlighted correlation matrix. Default R often produces informative but somewhat uninspiring text output. I have got spoiled with the luxury of data visualisation tools so after a heavy day’s analysis I prefer to take advantage of the many ways dataviz can make analytic output easier to decipher for humans.
  3. If the output is indeed a dataviz, I have a slight preference for it to use ggplot charting all other things being equal. Ending up with a proper ggplot object is nice both in terms of the default visual settings vs some other forms of R chart, and also that you can then in theory use ggplot methods to adapt or add to it.
  4. Can produce p values, confidence intervals, or some other way of suggesting whether any correlations found are statistically significant or not.
  5. Can handle at least a couple of types of correlation calculations, the most common of which are probably Pearson correlation coefficient and Spearman’s rank correlation coefficient.

Default R has a couple of correlation commands built in to it. The most common is probably “cor“. Here’s an example of what it produces, using a test dataset named test_data of 5 variables, named a, b, c, d and e (which are in columns .

cor(test_data)

cor

So, it does multiple tests at once, and can handle Pearson, Spearman and Kendall correlation calculations, via changing the “method” parameter (which defaults to Pearson if you don’t specify, as in my example). But it doesn’t show the statistical significance of the correlations, and a potentially large text table of 8-place decimal numbers is not the sort of visualised output that would help me interpret the results in a particularly efficient way.

A second relevant default R command is “cor.test“. This one only allows you to make a single correlation comparison, for example between variable a and variable b.

cor.test(test_data$a, test_data$b)

cortest

So here we see it does return both a p value and a confidence interval to help us judge the significance of the correlation size noted. You can change the alternative hypothesis and confidence interval range via parameters. It can also do the same 3 types of correlation that “cor” supports. But, as noted, it can only compare two variables at once without further commands. And the output is again a bunch of text. That is really OK here, as you are focusing only on one comparison. But it’s going to be pretty tedious to run and decipher if you want to compare each one of a few variables against each of the others.

So, is there a package solution that makes me happy? As you might guess, yes, there’s actually a few contenders. But my current favourite is probably “ggcorrplot“. The manual is here, and there’s a handy usage guide here.

Suffice to say:

  1. It allows you to compare several variables against each other at the same time.
  2. It visualises the variables in a colour-coded correlation matrix
  3. The visualisation is a ggplot
  4. It can produce p values, using the accompanying function cor_pmat(), which can then be shown on the visualisation in various ways.
  5. It uses the results from the built in cor() function, so can handle the same 3 types of correlation.

There’s a bunch of options to select from, but here’s the default output

# calculate correlations using cor and store the results
corrdata <- cor(test_data)

# use the package's cor_pmat function to calculate p-values for the correlations
p.mat <- cor_pmat(test_data)

# produce a nice highlighted correlation matrix
ggcorrplot(corrdata, title = "Correlation matrix for test data")

The results look like:

ggcordefault

You can see it produces a correlation matrix, colour coded as to the direction and strength of the correlations. It doesn’t show anything about the statistical significance. Kind of pretty for an overview glance, but it could be rather more informative.

I much prefer to use a couple of options that show the actual correlation values and the significance; the ones I most commonly use probably being this set.

ggcorrplot(corrdata, title = "Correlation matrix for test data", lab=TRUE, p.mat = p.mat, sig.level = .05)

ggcorrplot

Here, the correlation coefficients are superimposed on the grid, so you can check immediately the strength of the correlation rather than try and compare to the colour scale.

You can also see that some of the cells are crossed out (for example the correlation between variable c and variable e in the above). This means that the correlation detected is not considered to be significant at the 0.05 level. That level can be changed, or the insignificant correlations be totally hidden if you prefer to not get distracted by them in the first place.

Retrieving Adobe SiteCatalyst data with R

Adobe¬†SiteCatalyst (part of Adobe Analytics) is a nicely comprehensive tool for tracking user interactions upon one’s website, app and more. However, in the past I’ve had a fair amount of trouble de-siloing its¬†potentially immensely¬†useful data into external tools, such that I could connect, link and¬†process it for insights over and above those you can get within the default web tool (which, to be fair, is itself improving over time).

I’ve written in the past¬†about my happiness when Alteryx released a data connector allowing one to access Sitecatalyst data from inside their own tool. That’s still great, but the tool is necessarily constrained to the specific tasks the creator designed it to do, and subject to the same API limits as everyone else is.¬†I have no doubt that there are ways and means to get around that in Alteryx (after all, it speaks R). But sometimes, when trying to automate operations,¬†coding in something like R¬†might actually be easier…or at least cheaper!

With that in mind, after having a recent requirement to analyse web browsing data at a individual customer level, I successfully experimented with the awesome RSiteCatalyst package, and have some notes below as to the method which worked well for me.

Note that RSiteCatalyst is naturally subject to the usual Adobe API limits – the main one that causes me sadness being the inability to retrieve over 50k rows at a time¬†– but, due to the incessant flexibility of R¬†and the comprehensiveness of this package, I’ve not encountered a problem I couldn’t solve just yet.

So, how to set up?

Install the RSiteCatalyst package

First, open up R, or your favourite R environment, and install the RSiteCatalyst package (the first time you use it) or load the library (each new session).

if(!require(RSiteCatalyst)) install.packages("RSiteCatalyst")
library(RSiteCatalyst)

Log in to your SiteCatalyst account

You next need to authenticate against your Adobe Sitecatalyst installation, using the SCAuth function. There’s an old way involving a shared secret, and a new way using OAuth. The latter is to be preferred, but at the¬†time I first looked at it there seemed to be a web service issue that prevented the OAuth process completing. At the moment then, I can confirm that the old method still works!

 key <- "username:company>"
 secret <- "https://sc.omniture.com/p/suite/1.3/index.html?a=User.GetAccountInfo >"
 SCAuth(key, secret)

 

Retrieve metadata about your installation

Once you’re in, there’s a plethora of¬†commands to retrieve useful metadata, and then to run and retrieve various types of report data. For¬†several such commands, you’ll need to know the ID of the¬†Adobe Report Suite concerned, which is fortunately as easy as:

suites <- GetReportSuites()

whereby you’ll receive a dataframe containing all available report suites by title and ID.

If you already know the title of the report suite you’re interested in then you can grab the ID¬†directly with something like:

my.rsid <- suites[suites$site_title=="My favourite website",1]

You can then find out which elements are available within your report suite:

elements.available <- GetElements(my.rsid)

and later on which metrics are available for a given element

metrics.available <- GetMetrics(my.rsid, elements = "<<myFavouriteElement>>")

Retrieve Adobe Analytics report data

There are a few different RSitecatalyst functions you can call, depending on the type of report you’re interested in. In each case they start with “queue”, as what you’re actually doing is submitting a data request to the Sitecatalyst reporting queue.

If your request is pretty quick, you can wait a few seconds and get the results sent back to you immediately in R. If it’s going to take a very long time, you can instead store a report request ID and then use the GetReport function to go back and get it later, once¬†it’s finished.

The current list of queue functions, which are named such that Adobe aficionados will probably be able to guess which type of data they facilitate, is:

  • QueueDataWarehouse
  • QueueFallout
  • QueueOvertime
  • QueuePathing
  • QueueRanked
  • QueueSummary
  • QueueTrended

Here I’ll show just a couple of examples – but the full official documentation for all of them and much more besides is available¬†at Cran.

Firstly, an “over¬† time” report to get me the daily pageview and visit counts my site received in the first half of 2016.

Here’s how the documentation describes this type of report:

“A QueueOvertime report is a report where the only granularity allowed is time. This report allows for a single report suite, time granularity, multiple metrics, and a single segment”

An example would be a day by day count of total pageviews to your site.

Remember that above we set the variable “my.rsid” to the Report Suite ID of the report suite I am looking at. So:

dailyStats <- QueueOvertime(my.rsid,
                            date.from = "2016-01-01",
                            date.to = "2016-06-30",
                            metrics = c("pageviews","visits"),
                            date.granularity = "day"
                            )

The parameters are fairly self-evident, especially in combination with the documentation. There are many more available, including the ability to isolate by segment and to forecast data.

Your site will probably have some different metrics available to mine, depending on how the Adobe admin set it up, but the basics like page views and visits should be accessible pretty much anywhere.

What you’ll get back with the above command, i.e. the contents of the dailyStats variable after it has run, is a dataframe in this sort of format:

datetime name year month day segment.id segment.name pageviews visits
01/01/2016 Fri. 1 Jan. 2016 2016 1 1 100 75
02/01/2016 Sat. 2 Jan. 2016 2016 1 2 200 150
03/01/2016 Sun. 3 Jan. 2016 2016 1 3 250 180

Which you can then go on to process as you would any other such snippet of data in R.

My second, slightly more complex, example concerns the QueueRanked function. It’s described like this:

A QueueRanked report is a report that shows the ranking of values for one or more elements relative to a metric, aggregated over the time period selected.

That’s a bit more vague, but was useful to me in my original goal of identifying specifically which customers logged into my website in December, and how many times they visited.

A key feature of this function is that you can ask for the results from rank [x] to rank [y], instead of just the top [n].

This is super-useful where, like I did,  you expect to get over 50k rows. 50k is maximum row limit you can retrieve in one request via the Adobe API, which this R package uses. But R is full of the typical program language features like loops, thus allowing one to iterate through the commands to retrieve for instance results 1-50,000, then results 50,001 -100,000, then 100,001 Р150,000 and so on.

So, I built a loop that would generate these “ranked reports”, starting at row ‘i’ and giving me the next ‘count.step’ records, where count.step = 50000, the maximum I’m allowed to retrieve in one go.

Thus, I’d call the function repeatedly, each time asking for the next 50,000 records. At some point, when there were no more customers to download, I’d get a blank report sent back to me. At that point, I know I have everything so quit the loop.

I wanted to retrieve the ID of the customer using the website, which in my setup is stored in an custom element called “prop1”. All that sort of detail is controlled by your Adobe Sitecatalyst administrator, should you have exactly the same sort of requirement as I did – so best go ask them which element to look in, as there’s no real chance your setup is identical to mine at that level.

Nonetheless, the code pattern below could likely be used without much modification in order to iterate through any SiteCatalyst data that exceeds the API row limits.

 

count.limit <- 500000 #the max number of records we're interested in
count.step <- 50000 #how many records to retrieve per request, must not exceed 50k
count.start <- 1 #which record number to start with
CustomerList <- NULL #a variable to store the results in
fromDate <- "2016-12-01"
toDate <- "2016-12-31"

for(i in seq(1, count.limit, by = count.step)) {
  print(paste("Requesting rows",i, "through", i + count.step - 1))

  tempCustomerList <- QueueRanked(my.rsid,
                          date.from = fromDate,
                          date.to = toDate,
                          metrics = "visits",
                          elements = "prop1",
                          top = count.step,
                          start = i
                          )

  if  (nrow(tempCustomerList) == 0 ) {   # no more rows were returned - presumably we have them all now
    print("Last batch had no rows, exiting loop")
    break
  }
   
  tempCustomerList$batch.start.row <- i

  CustomerList <- rbind(customerList, tempCustomerList)

}

 

After running this, you should end up with a “CustomerList” dataframe that looks something like this, where “name” is the value of prop1, in my case the customer ID:

name url visits segment.id segment.name batch.start.row
ID123 10 1
ID234 5 1

which, again, you can process as though it was any standard type of R data.

Clustering categorical data with R

Clustering is one of the most common¬†unsupervised machine learning tasks. In¬†Wikipedia‘s¬†current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some ability to cluster in them. For example,¬†Alteryx has K-Centroids Analysis.¬†R, Python, SPSS, Statistica and any other proper data sciencey tools all likely have¬†many methods¬†–¬†and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility. ¬†You can do¬†the calculations in Excel, should you really want to (although why not cheat and use a nice addin¬†if you want to save time?).

However,¬†many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better for clustering objects that have¬†quantitative numeric¬†fields, rather than those that are categorical.¬†I’m not going delve¬†into the¬†details of why¬†here, but, simplistically, they tend to be based on concepts like¬†Euclidean distance¬†–¬†and in that domain, it’s¬†conceptually difficult¬†to say that [bird] is Euclideanly “closer” to [fish] than¬†[animal]; vs the much more straightforward¬†task of¬†knowing that an income of ¬£100k is nearer to one of ¬£90k than it is to¬†50p.¬†IBM has a bit more about that¬†here.

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has¬†a ton of libraries that might help you out. Below¬†are a couple I’ve used, and¬†a few notes as to the very basics of how to use them – not that it’s too difficult¬†once you’ve found¬†them. The art of selecting the optimum parameters for the very finest of clusters though¬†is still yours to master, just like it is on most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see ūüôā ? A paper called ‘Extensions to the k-Means Algorithm for Clustering¬†Large Data Sets with Categorical Values‘ by Huang gives¬†the gory details.

Luckily though, a R implementation is available within the klaR package. The klaR documentation is available in PDF format here and certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID FieldA FieldB FieldC FieldD
1 0 0 0 1
2 0 0 0 0
3 0 0 0 1
4 1 1 0 0

Here’s how you might read it in, and cluster the records¬†based on the contents of fields¬†“FieldA”, “FieldB”, “FieldC”, and “FieldD”.

install.packages("klaR")
library(klaR)
setwd("C:/Users/Adam/CatCluster/kmodes")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-kmodes(data.to.cluster[,2:5], 3, iter.max = 10, weighted = FALSE ) #don't use the record ID as a clustering variable!

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want¬†so you have a few variations to compare the quality or real-world utility of.

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables
    in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a
    number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted¬†version of this distance.

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster

Here’s an example what it looks like when output to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
 FieldA FieldB FieldC FieldD
1 1 0 0 0
2 1 0 1 1
3 0 0 0 0

Clustering vector:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
 3 3 3 1 3 1 2 3 3 3 2 2 2 3 3 2 1 3 3 3

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters onto the original dataset, you can just add the cluster back onto your original dataset as a new column, and perhaps write it out as a file to analyse elsewhere, like this:

cluster.output <- cbind(data.to.cluster,cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)

 

The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again,¬†a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters based on the how the algorithms differ.

Here’s what you’d do to cluster the same data as above, and write it back out, this time with the Rock clusters appended. Note here that ideally you’re specifically passing in a matrix to the rockCluster function.

install.packages("cba")
library(cba)
setwd("C:/Users/Adam/CatCluster/rock")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-rockCluster(as.matrix(data.to.cluster[,2:5]), 3 )
cluster.output <- cbind(data.to.cluster,cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)

The full list of parameters to the relevant function, rockCluster is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.

This is the output, which is of class “rock”, when printed to the screen:

data: x 
 beta: 0.5 
theta: 0.5 
 fun: dist 
 args: list(method = "binary") 
 1 2 3 
14 5 1

The object is¬†a¬†list, and its most useful component is probably “cl”, which is a factor containing the assignments of clusters to your data.

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re¬†free to visualise in R, or any other tool.