Extracting the date and time a UUID was created with BigQuery SQL (with a brief foray into the history of the Gregorian calendar)

I was recently working with records in a database that were identified by a Universally Unique Identifier, aka a UUID. These IDs are strings of characters that look something like “31ae75f0-cbe0-11e8-b568-0800200c9a66”.

I needed to know which records were generated during a particular time period, but sadly there was no date field to be found. Unambiguously a database design flaw given what I needed to do – but it did lead me to discover that at least “version 1” UUIDs have the date and time of their creation embedded within them.

So how do we get from 31ae75f0-cbe0-11e8-b568-0800200c9a66 to the date and time the UUID was generated? I’d say “simple”, but it isn’t, exactly. Thanks to Wikipedia and famkruithof.net for the behind-the-scenes info on how this works.

So, the key components of the UUID to note, by position, are the first group of characters (“31ae75f0”), the second group (“cbe0”), and the third group minus its leading version digit (the “1e8” of “11e8”):

31ae75f0-cbe0-11e8-b568-0800200c9a66

Take those parts of the UUID aside, reversing the order of the chunks, so as to get:

1e8cbe031ae75f0

There’s your “60-bit timestamp, being the number of 100-nanosecond intervals since midnight 15 October 1582” (thanks Wikipedia).

Rambling sidenote alert:

In case you’re wondering why 15 October 1582 is so special: that was the date on which the older “Julian” calendar system was first replaced with the newer “Gregorian” calendar system, now the most widely used calendar throughout the world, under the diktat of Pope Gregory XIII.

Why? The Julian calendar had worked on the premise that the average year is 365.25 days long (the 0.25 days being compensated for by the existence of a leap day every 4 years).

However, that’s slightly different to the true length of the solar year, which is 365 days, 5 hours, 48 minutes, 45.25 seconds. Consequently, it was noticed that there was some “drift”, whereby the dates on which the calendar said the equinoxes should occur slowly became out of sync with real-life observations of the equinox. Effectively, as the Britannica notes, this discrepancy causes the “calendar dates of the seasons to regress almost one day per century”. This is relevant to religions such as Catholicism in that it affects the calculation of, for instance, when to celebrate Easter.

The Gregorian calendar made a couple of changes, perhaps most notably introducing the rule that a century year (e.g. 1900) only counts as a leap year if its number is divisible by 400 with no remainder, instead of adhering to the Julian system where it only needs to be divisible by 4. On average, this shortens the measured year by 0.0075 days (3 fewer leap days every 400 years), which keeps it in better sync with the reality of the solar year. It’s still not perfect, but it leads to a much lower rate of drift, of around 1 day per 3,030 years.

In order to account for the drift that had occurred by the time of this realisation, the Pope also advocated fast-forwarding the date by several days to catch up with reality. So for Gregorian advocates, there was literally never a 14 October 1582: overnight, the date skipped from 4 October 1582 straight through to 15 October 1582, at least in countries where this system was accepted right away (and subsequently by the relevant ISO standards that most computing systems adhere to).

Not everywhere was keen to adopt this system right away – perhaps unsurprisingly, Catholic countries were more likely to take the leap quicker. But version 1 UUIDs presumably don’t care too much about religious politics.

End of rambling side note

Note that the base value of this timestamp, 15 October 1582, is a different date from the classic January 1st 1970 base of the Unix-style timestamps you may know and love, which many databases, including Google BigQuery, work with. So it needs conversion.

Let’s start by getting it into decimal (using one of many web-based converters – I picked this one for no particular reason).

1e8cbe031ae75f0 hex = 137583952401430000 decimal

This is in units of 100 nanoseconds. I really don’t care about nanosecond granularity for my use case, so let’s divide it by 10,000,000 to get to seconds.

137583952401430000 100-nanosecond intervals = 13758395240.1 seconds

This is now the number of seconds that have elapsed between the start of October 15 1582 and the date / time my UUID was created.

To get it into a more conventional  0 = 1970-01-01 based timestamp format, we then need to subtract the number of seconds between October 15 1582 and January 1st 1970 from it (12,219,292,800, as it happens):

13758395240 - 12219292800 = 1539102440

So now you have the number of seconds between Jan 1st 1970 and the time your UUID was generated. Pop it into a Unix timestamp interpreter (this one, for example) to translate it into something human-understandable, and you’ll discover I generated that UUID on October 9th 2018. Hooray.
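If you’d like to sanity-check the arithmetic without bouncing between web converters, the whole decode can be scripted in a few lines. Here’s a quick Python sketch of exactly the steps above (note that the 12,219,292,800-second offset is just 141,427 days times 86,400 seconds):

```python
from datetime import datetime, timezone

uuid = "31ae75f0-cbe0-11e8-b568-0800200c9a66"
groups = uuid.split("-")

# Reassemble the 60-bit timestamp: third group minus its version digit,
# then the second group, then the first group.
ts_hex = groups[2][1:] + groups[1] + groups[0]  # "1e8cbe031ae75f0"

ticks = int(ts_hex, 16)  # 100-nanosecond intervals since 1582-10-15
GREGORIAN_TO_UNIX = 141427 * 24 * 60 * 60  # = 12219292800 seconds

unix_seconds = ticks // 10_000_000 - GREGORIAN_TO_UNIX
print(unix_seconds)  # 1539102440
print(datetime.fromtimestamp(unix_seconds, tz=timezone.utc))  # 2018-10-09 16:27:20+00:00
```

Same answer as the manual method: that UUID dates from October 9th 2018.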

Sidenote: I generated my test UUID on this site, if you ever need a valid one in a hurry.

Flicking between websites and calculators is clearly a painful and inefficient way to do this operation, especially if you have a whole column of UUIDs to decode. Luckily many databases, including BigQuery, have the necessary functions to do it en masse.

Below is some BigQuery Standard SQL code that works for me – with most thanks and credit due to Roland Bouman, whose MySQL version was nice and easy to adapt to the BigQuery SQL dialect.

In the below, imagine your UUID is stored in the field “uuid”. Pop this into your query, and the “extracted_timestamp” field will now contain the timestamp showing when the UUID was generated.

TIMESTAMP_SECONDS(
  CAST(
    (
      CAST(
        CONCAT('0x', -- starting the value with 0x ensures BigQuery parses it as a hex string
          SUBSTR(uuid, 16, 3), -- extract the parts of the UUID needed to recreate the timestamp,
          SUBSTR(uuid, 10, 4), -- highest-order chunk first
          SUBSTR(uuid, 1, 8)
        )
      AS INT64) -- convert hex to decimal
      / 10000000 -- divide to get from 100-nanosecond periods to seconds
    )
    - (141427 * 24 * 60 * 60) -- rebase from start of Gregorian calendar (1582-10-15) to unixtime (1970-01-01)
  AS INT64)
) AS extracted_timestamp

 


Average age at menarche by country

A question came up recently about variations in the age at menarche – the first occurrence of menstruation for a female human –  with regards to the environment. A comparison by country seemed like a reasonable first step in noting whether there were in fact any significant, potentially environmental, differences in this age.

A quick Google search suggested that a nice reliable chart summarising this data by country was surprisingly (?) hard to find. However, delving more into the academic publishing world, I did find a relevant paper called “International variability of ages at menarche and menopause: patterns and main determinants” (Thomas et al, Hum Biol. 2001 Apr;73(2):271-90), which stated that the purpose of their study was:

to review published studies on the variability of age at menarche and age at menopause throughout the world, and to identify the main causes for age variation in the timing of these events.

Score! And sure enough, it does contain a lengthyish datatable showing the average age of menarche by country, from a survey of prior literature that the authors carried out. However, it’s not super easy to immediately get a sense of the magnitude of any variation by looking at a datatable from a scanned PDF that was printed over 3 pages. Instead, I decided to extract the data from that table and visualise it in a couple of charts, which are shown below in case anyone else has a similar curiosity about the subject.

Firstly a simple bar chart. You may wish to click through to get a larger copy with more readable axes:

Bar chart

And now plotted geographically:

Age at menarche by country map

 

There’s also an interactive dashboard version available if you click here, where you can filter, highlight, hover and otherwise interact with the data.

If you have a use for the underlying data yourself, I uploaded it in a machine-readable format to data.world, whereby you can connect directly to it from various analytics tools, use the querying or visualisation tools on the site itself, or download it for your own use.  Again, this data is all consolidated by the authors of the Human Biology study above, so full credit is due to them.

A couple of notes to be aware of:

Firstly, the study I took the data from was published in 2001 i.e. is not particularly recent. The menarche age averages shown were collected from other studies, which obviously took place before 2001 – some even decades before, and in countries that technically no longer exist! They therefore may not reflect current values.

It’s generally considered that menarche ages have changed over time to some extent – for example a study concerning self-reported age of menarche in the US “Has age at menarche changed? Results from the National Health and Nutrition Examination Survey (NHANES) 1999-2004.” (McDowell et al, J Adolesc Health. 2007 Mar;40(3):227-31) concluded that “Mean age of menarche declined by .9 year overall in women born before 1920 compared to women born in 1980-84” (and also that ages and changes differed by ethnicity, showing that country averages may mask underlying structural differences). A few other such studies I skimmed, usually focusing on individual countries, also tended to show trends in the same direction. One might hypothesize that countries where the general population has been through even more radical living-condition shifts than residents of the US over the past few decades may have seen larger changes than those reported by McDowell et al.

Secondly, some of the averages reported are means, others are medians. These may not be directly comparable. The bar chart I’ve shown above differentiates between those based on the colour of the bar. In the interactive version there’s the ability to filter to only show results based on a single one of those aggregation types.

So now onto the interesting theoretical part – OK, menarche age may differ between countries, but why? As noted, that was part of the question driving the authors of the source study, so you should certainly read their whole study to get their take. In summary though, they created a sequence of linear models.

The first one shows a negative association between life expectancy and menarche age.  OK, but what factors drive life expectancy in the first place, that also correlate with menarche?

They produced 2 alternative models to investigate that. The first was a combination of illiteracy rate (with a positive correlation) and vegetable calorie consumption (negative). The second kept illiteracy in, but swapped vegetable calorie consumption out for the country’s average gross national product.

Vegetable calorie consumption is perhaps somewhat intuitive – it’s been previously found that what their paper describes as “good nutritional conditions” tend to lower the age of menarche, with the body’s fat:mass ratio being potentially involved. There are many papers on that topic – not being remotely a subject matter expert here I’ll pick one at random – “Body weight and the initiation of puberty” (Baker, Clin Obstet Gynecol. 1985 Sep;28(3):573-9.), which concluded that malnutrition may “retard the onset of menarche”.

But the larger influence in the model was the country’s illiteracy rate. Does reading books induce menstruation? Well, hey, probably not directly, so it likely proxies for something that may more plausibly affect menarche. The theory the researchers present here is that societies with higher illiteracy also tend to have a higher incidence of child labour. This labour is often physical in nature, implying a higher rate of energy expenditure. Citing previous evidence that excessive exercise may affect the fat balance, again altering menarche, they form a new hypothesis that it’s in fact the energy balance within a person that may affect menarche.

Analysing your 23andme genetic data in R part 2: exploring the traits associated with your genome

In part one of this mini-series, you heroically obtained and imported your 23andme raw genome data into R. Fun as that was, let’s see if we can learn something interesting from it.  After all, 23andme does automatically provide several genomic analysis reports, but – for many sensible reasons – it is certainly limited in what it can show when compared to the entire literature of this exciting field.

It would be tiresome for me to repeat all the caveats you can find in part one, and all over the more responsible parts of the internet. But do remember that in the grand scheme of things we likely know only somewhere between “nothing” and “a little” so far about the implications of most of the SNPs you imported into R in part one.

Whilst there are rare exceptions, for the most part it seems like many “interesting” real-world human traits are a product of many variations in different SNPs, each of which has a relatively tiny effect size. In these cases, if you happen to have a T somewhere rather than an A, it’s usually unlikely on its own to produce any huge “real world” outcome you’d notice. Or if it does, we don’t necessarily know it yet. Or perhaps it could, but only if you have certain other bases in certain other positions, many of which 23andme may not report on – even if we knew which ones were critical. That was a long and meandering sentence which could be summarised as “things are complicated”.

Note also that 23andme do provide a disclaimer on the raw data, as follows:

This data has undergone a general quality review however only a subset of markers have been individually validated for accuracy. As such, the data from 23andMe’s Browse Raw Data feature is suitable only for research, educational, and informational use and not for medical or other use.

So, all in all, what follows should be treated as a fun-only investigation. You should first seek the services of genetic medical professionals, including the all-important genetic counsellor, if you have any real concerns about what you might find.

But, for the data thrill-seekers, let’s start by finding some info as to which genotypes at which SNPs are thought to have some potential association with a known trait. This is part of what 23andme does for you in their standard reporting. However, they have legal obligations and true experts working on this. I, on the other hand, recently read “Genetics for Dummies”, so tread carefully.

How do we find out about the associations between your SNP results and any subsequent phenotypes? The obvious place, especially if you’re interested in a particular trait, is the ever-growing published scientific literature on genetics. Being privileged enough to count amongst my good friends a scientist who actually published papers involving this subject, I started there. Here, for example, is Dr Kaitlin Roke’s paper: “Evaluating Changes in Omega-3 Fatty Acid Intake after Receiving Personal FADS1 Genetic Information: A Randomized Nutrigenetic Intervention“. Extra credit is also very much due to her for being kind enough to help me out with some background knowledge.

Part of that fascinating study involved explaining to members of the public what the implications of the differing alleles possible at a specific SNP, rs174537, were with regards to the levels of Omega 3 fatty acids typically found in a person’s body, and how well the relevant conversion process proceeds. This may have dietary implications in terms of how much the person concerned should focus on increasing the amount of omega-3 laden food they eat – although it would be remiss of me to fail to mention the good doctor’s general advice that mostly everyone needs to up their levels anyway!

Anyway, to quote from their wonderfully clear description of the implications of the studied SNP:

…the document provided a brief overview of the reported difference in omega-3 FA levels in relation to a common SNP in FADS1 (rs174537)…individuals who are homozygote GG allele carriers have been reported to have more EPA in their bodies and an increased ability to convert ALA into EPA and DHA while individuals with at least one copy of the minor allele (GT or TT) were shown to have less EPA in their bodies and a reduced ability to convert ALA into EPA and DHA

Awesome, so here we have a definitive explanation as to which SNP was examined, and what the implications of the various genotypes are (which I’ve bolded for your convenience).

In part 1 of this post, we already saw how to filter your R-based 23andme data to view your results for a specific SNP, right? If you already completed all those import steps, you can do the same, just switching in the rsid of interest:

library(dplyr)
filter(genome_data_test, rsid == "rs174537")


Run that, and if you are returned a result of either GT or TT then you’ll know you should be especially careful to ensure you are getting a good amount of omega 3 in your diet.

OK, this is super cool, but what if you don’t happen to know a friendly scientist, or don’t know which traits you’re particularly interested in – how might you evaluate the SNP implications at scale?

Whilst there’s no substitute for actual expertise, luckily there is an R library called “gwascat“, which enables you to access the NHGRI-EBI Catalog of published genome-wide association studies via a data structure in R. It has a whole lot of info in it; descriptions of the fields you end up with are mostly shown here. The critical point is that it contains a list of SNPs, associated traits and relevant genotypes, together with references to the publications that found the associations, should you want more details.

The first thing to do is install gwascat. gwascat is a Bioconductor package, rather than one of the business-user-typical CRAN packages. So if you’re not a Bioconductor user, there’s a slightly different installation routine, which you can see here.

But essentially:

source("https://bioconductor.org/biocLite.R")
biocLite("gwascat")
library(gwascat)

This took a while to install for me, but can just be left to get on with itself.

It sometimes feels like a new genomics study is released almost every few seconds, so the next step may be to get an up-to-date version of the catalog data –  I think the one that installs by default is a few years out of date.

Imagine that we’d like our new GWAS catalogue to end up as a data frame called “updated_gwas_data”:

updated_gwas_data <- as.data.frame(makeCurrentGwascat())

This might take a few minutes to run, depending on how fast your download speed is. You can get some idea of the recency once it’s done by checking the latest date that any publication was added to the catalogue.

max(updated_gwas_data$DATE.ADDED.TO.CATALOG)

At the time of writing, this date is May 21st 2018. And what was that study?

library(dplyr)
filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(STUDY) %>% distinct()

“Key HLA-DRB1-DQB1 haplotypes and role of the BTNL2 gene for response to a hepatitis B vaccine.”

Where can I find that study, if I want to read what it actually said?

library(dplyr)
filter(updated_gwas_data, DATE.ADDED.TO.CATALOG == "2018-05-21") %>% select(LINK) %>% distinct()

www.ncbi.nlm.nih.gov/pubmed/29534301

OK, now we have up-to-date data, let’s figure out how to join it to the personal raw genome data we imported in part 1 (or, to be precise here, a mockup file in the same format, so as to avoid sharing anyone’s real genomic data).

The GWAS data lists each SNP (e.g. “rs9302874”) in a field called SNPS. Our imported 23andme data has the same info in the rsid field. Hence we can do a simple join, here using the dplyr library.

Assuming the 23andme data you imported is called “genome_data”, then:

library(dplyr)

output_data <- inner_join(genome_data, updated_gwas_data, by = c("rsid" = "SNPS"))

Note the consequences of the inner join here. 23andme analyses SNPs that don’t appear in this GWAS database, and the GWAS database may contain SNPs that 23andme doesn’t provide for you. In either case, these will be removed from the resulting table. There’ll just be rows for SNPs that 23andme does provide you, and that do have an entry in the GWAS database.

Also, the GWAS database may have several rows for a single SNP. It could be that several studies examined a single SNP, or that one study found many traits potentially associated with a SNP. This means your final “output_data” table above will have many rows for some SNPs.

OK, so at the time of writing there are nearly 70,000 studies in the GWAS database, and over 600,000 SNPs in the 23andme data export. How shall we narrow down this data-mass to find something potentially interesting?

There are many fields in the GWAS database you might care about – the definitions being listed here. For us amateur folks here, DISEASE.TRAIT and STRONGEST.SNP.RISK.ALLELE might be of most interest.

DISEASE.TRAIT gives you a genericish name for the trait whose potential association with a given SNP the study investigated (e.g. “Plasma omega-3 polyunsaturated fatty acid levels”, or “Systolic blood pressure”). Note that the values are not actually all “diseases” by the common-sense meaning – unless you consider traits like being tall a type of illness anyway.

STRONGEST.SNP.RISK.ALLELE gives you the specific allele of the SNP that was “most strongly” associated with that trait in the study (or potentially a ? if unknown, but let’s ignore those for now). The format here is to show the name of the SNP first, then append a dash and the allele of interest afterwards e.g. “rs10787517-A” or “rs7977462-C”.

This can easily give the impression of greater specificity than the real world has – only 1 allele ever appears in this field, so if there are multiple associations then only the strongest will be listed. If there are associations in tandem with other alleles or other SNPs, then that information also cannot be fully represented here. Also, it’s not necessarily the case that alleles are additive; so without further research we shouldn’t assume that having 2 of the high risk bases gives increased risk over a single base.

That’s what the journal reference is for – another reason it’s critical you do the reading and seek the help of appropriate genetic professionals before any rejoicing or worrying about your results.

Taking the above example from Roke et al’s Omega 3 study, this GWAS database records the most relevant strongest SNP risk allele for the SNP they analysed as being “rs174537-T”. You’d want to read the study in order to know whether that meant that the TT genotype was the one to watch for, or whether GT had similar implications.

Back to an exploration of your genome – the two most obvious approaches that come to mind are either: 1) check whether your 23andme results suggest an association with a specific trait you’re interested in, or 2) check which traits your results may be associated with.

In either case, it’ll be useful to create a field that highlights whether your 23andme results indicate that you have the “strongest risk allele” for each study. This is one way to help narrow down towards the interesting traits you may have inherited.

The 23andme part of your dataframe contains your personal allele results in the genotype field. There you’ll see entries like “AC” or “TT”. What we really want to do here is, for every combination of SNP and study, check to see if either of the letters in your genotype matches up with the letter part of the strongest risk allele.

One method would be to separate out your “genotype” data field into two individual allele fields (so “AC” becomes “A” and “C”). Next, clean up the strongest risk allele so you only have the individual allele (so “rs10787517-A”  becomes “A”). Finally check whether either or both of your personal alleles match the strongest risk allele. If they do, there might be something of interest here.

One method:

library(dplyr)
library(stringr)

output_data$risk_allele_clean <- str_sub(output_data$STRONGEST.SNP.RISK.ALLELE, -1)
output_data$my_allele_1 <- str_sub(output_data$genotype, 1, 1)
output_data$my_allele_2 <- str_sub(output_data$genotype, 2, 2)
output_data$have_risk_allele_count <- if_else(output_data$my_allele_1 == output_data$risk_allele_clean, 1, 0) + if_else(output_data$my_allele_2 == output_data$risk_allele_clean, 1, 0)

Now you have your two individual alleles stored in my_allele_1 and my_allele_2, and the allele for the “highest risk” stored in risk_allele_clean. Risk_allele_clean is the letter part of the GWAS STRONGEST.SNP.RISK.ALLELE field. And finally, the have_risk_allele_count is either 0, 1 or 2 depending on whether your 23andme genotype result at that SNP contains 0, 1 or 2 of the risk alleles.

The previously mentioned DISEASE.TRAIT field contains a summary of the trait involved. So by filtering your dataset to only look for studies about a trait you care about, you can see a summary of the risk allele and whether or not you have it, and the relevant studies that elicited that connection.

I did notice that this trait field can be kind of messy to use. You’ll see several different entries for similar topics; e.g. some studied traits around Body Mass Index are indeed classified as “Body mass index”, others as “BMI in non-smokers” or several other BMI-related phrases. So you might want to try a few different search strings in the below to access everything on the topic you care about.

For example, let’s assume that by now we also inevitably developed a strong interest in omegas and fatty acids. Which SNPs may relate to that topic, and do we personally have the risk allele for any of them?

We can use the str_detect function of the stringr library in order to search for any entries that contain the substring “omega” or “fatty acid”.

library(dplyr)
library(stringr)
select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_genotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct()

 


The full table this outputs is actually 149 rows long. That’s a fair few for an amateur to sift through. Maybe we’d prefer to restrict ourselves to the SNP we heard from Dr Roke’s study above was of particular interest:  rs174537. Easy: just filter on the rsid:

library(dplyr) 
library(stringr)
select(output_data, rsid, DISEASE.TRAIT, risk_allele = risk_allele_clean, your_genotype = genotype) %>% 
 filter(str_detect(tolower(DISEASE.TRAIT), "omega") | str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")


Maybe you are curious as to what other traits that same SNP might be associated with? Just reverse the criteria for omega and fatty acid strings. Here I also added the journal reference title, in case I wanted to read up more into these trait associations.

library(dplyr)
 select(output_data, rsid, DISEASE.TRAIT, STUDY) %>% 
 filter(!str_detect(tolower(DISEASE.TRAIT), "omega") & !str_detect(tolower(DISEASE.TRAIT), "fatty acid")) %>%
 distinct() %>%
 filter(rsid == "rs174537")


Now onto the second approach – this time, you don’t have a specific trait in mind. You’re more interested in discovering which traits have risk alleles that match the respective part of your genome. Please see all the above disclaimers. This is not remotely the same as saying which dreadful diseases you are going to get. But please stay away from this section if you are likely to be worried about seeing information that could even vaguely correspond to health concerns.

We already have our have_risk_allele_count field. If it’s 1 or 2 then you have some sort of match. So, the full list of your matches and the associated studies could be retrieved in a manner like this.

library(dplyr)
filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )

Note that this list is likely to be long. Some risk alleles are very common in some populations, and remember that there may be many studies that relate to a single SNP, or many SNPs referred to by a single study. You might want to pop it in a nice DT Javascript DataTable to allow easy searching and sorting.

library(DT)
library(dplyr)

datatable(
 filter(output_data, have_risk_allele_count >= 1) %>%
 select(rsid, your_genotype = genotype, strongest_risk_allele = risk_allele_clean, DISEASE.TRAIT, STUDY )
)

results in a searchable, sortable table, which includes a handy search field at the top left of the output.

There are various other fields in the GWAS dataset you might consider using to filter down further. For example, you might be most interested in findings from studies that match your ethnicity, or occasions where you have risk alleles that are rare within the population. After all, we all like to think we’re special snowflakes, so if 95% of the general population have the risk allele for a trait, then that may be less interesting to an amateur genome explorer than one where you are in the lucky or unlucky 1%.

For the former, you might try searching within the INITIAL.SAMPLE.SIZE or REPLICATION.SAMPLE.SIZE fields, which have entries like: “272 Han Chinese ancestry individuals” or “1,180 European ancestry individuals from ~475 families”.

Similar to the caveats on searching the trait fields, one does need to be careful here if you’re looking for a comprehensive set of results. Some entries in the database have blanks in one of these fields, and others don’t specify ethnicities, having entries like “Up to 984 individuals”.

For the proportion of the studied population who had the risk allele, it’s the RISK.ALLELE.FREQUENCY field. Again, this can sometimes be blank or zero. But in theory, where it has a valid value then, depending on the study design, lower frequencies should indicate rarer variants.

We can use dplyr‘s arrange and filter functions to do the above sort of narrowing-down. For example: what are the top 10 trait / study / SNP combinations you have the risk allele for that were explicitly studied within European folk, ordered by virtue of them having the lowest population frequencies reported in the study?

library(stringr)
library(dplyr)
head(
 filter(output_data,have_risk_allele_count > 0 & (str_detect(tolower(INITIAL.SAMPLE.SIZE), "european") | str_detect(tolower(REPLICATION.SAMPLE.SIZE), "european")) & (RISK.ALLELE.FREQUENCY > 0 & !is.na(RISK.ALLELE.FREQUENCY))) %>%
 arrange(RISK.ALLELE.FREQUENCY) %>%
 select(rsid, your_genotype = genotype, DISEASE.TRAIT, INITIAL.SAMPLE.SIZE,RISK.ALLELE.FREQUENCY)
 , 10)

 


 

Or perhaps you’d prefer to prioritise the order of the traits you have the risk allele for, for example, based on the number of entries in the GWAS database for that trait where the highest risk allele is one you have. You might argue that these could be some of the most reliably associated traits, in the sense that they would bias towards those that have so far been studied the most, at least within this database.

Let’s go graphical with this one, using the wonderful ggplot2 package.

library(dplyr)
library(ggplot2)

trait_entry_count <- group_by(output_data, DISEASE.TRAIT) %>%
 filter(have_risk_allele_count >= 1) %>%
 summarise(count_of_entries = n())

ggplot(filter(trait_entry_count, count_of_entries > 100), aes(x = reorder(DISEASE.TRAIT, count_of_entries, sum), y = count_of_entries)) +
 geom_col() +
 coord_flip() +
 theme_bw() +
 labs(title = "Which traits do you have the risk allele for\nthat have over 100 entries in the GWAS database?", y = "Count of entries", x = "Trait")


Now, as we’re counting combinations of studies and SNPs per trait here, this is obviously going to be somewhat self-fulfilling, as some traits have featured in way more studies than others. Likewise, some traits may have been associated with many more SNPs than others. Also, recalling that many interesting traits seem to be related to a complex mix of SNPs, each of which may only have a tiny effect size, it might be that whilst you do have 10 of the risk alleles for condition X, you also don’t have the other 20 risk alleles discovered so far to have an association (let alone the 100 that haven’t even been published on yet and hence aren’t in this data!).

Maybe then we can sort our output in a different way. How about we count the number of distinct SNPs where you have the risk allele, and then express that as a proportion of the count of all the distinct SNPs for the given trait in the database, whether or not you have the risk allele? This would let us say things such as: based (solely) on what’s in this database, you have 60% of the known risk alleles associated with this trait.

One thing noted in the data: both the 23andme genome data and the gwascat highest risk allele have unusual values in the allele fields – things like ?, -, R, D, I and some numbers, arising where the “uncleaned” STRONGEST.SNP.RISK.ALLELE didn’t have a -A, -C, -G or -T at the end of the SNP it named. Some of these entries may be meaningful – for example, the D and I in the 23andme data refer to deletions and insertions – but they won’t match up with anything in the gwascat data. Others may be more messy or missing data; for example, 23andme reports “–” if no genotype result was provided for a specific SNP call.

In order to avoid these inflating the proportion’s denominator, we’ll just filter down so that we only consider entries where our gwascat-derived “risk_allele_clean” and 23andme-derived “my_allele_1” and “my_allele_2” are all one of the standard A, C, G or T bases.

Let’s also colour code the results by the rarity of the SNP variant within the studied population. That might provide some insight to exactly what sort of special exception we are as an individual – although some of the GWAS data is missing that field and basic averaging won’t necessarily give the correct weighting, so this part is extra…”directional”.

You are no doubt getting bored with the sheer repetition of caveats here – but it is so important. Whilst these are refinements of sorts, they are simplistic and flawed, and you should not even consider concluding something significant about your personal health without seeking professional advice here. This is fun only. Well, for those of us who could ever possibly classify data as fun anyway.

Here we go, building it up one piece at a time for clarity of some kind:

library(dplyr)
library(ggplot2)

# Summarise proportion of SNPs for a given trait where you have a risk allele

trait_snp_proportion <- filter(output_data, risk_allele_clean %in% c("A", "C", "G", "T") & my_allele_1 %in% c("A", "C", "G", "T") & my_allele_2 %in% c("A", "C", "G", "T")) %>%
 mutate(you_have_risk_allele = if_else(have_risk_allele_count >= 1, 1, 0)) %>%
 group_by(DISEASE.TRAIT, you_have_risk_allele) %>%
 summarise(count_of_snps = n_distinct(rsid)) %>%
 mutate(total_snps_for_trait = sum(count_of_snps), proportion_of_snps_for_trait = count_of_snps / sum(count_of_snps) * 100) %>%
 filter(you_have_risk_allele == 1) %>%
 arrange(desc(proportion_of_snps_for_trait)) %>%
 ungroup()

# Count the studies per trait in the database

trait_study_count <- filter(output_data, risk_allele_clean %in% c("A", "C", "G", "T") & my_allele_1 %in% c("A", "C", "G", "T") & my_allele_2 %in% c("A", "C", "G", "T")) %>%
 group_by(DISEASE.TRAIT) %>%
 summarise(count_of_studies = n_distinct(PUBMEDID), mean_risk_allele_freq = mean(RISK.ALLELE.FREQUENCY))

# Merge the above together

trait_snp_proportion <- inner_join(trait_snp_proportion, trait_study_count, by = "DISEASE.TRAIT")

# Plot the traits where there was more than one study and you have risk alleles for more than 70% of the SNPs studied

ggplot(filter(trait_snp_proportion, count_of_studies > 1 & proportion_of_snps_for_trait > 70), aes(x = reorder(DISEASE.TRAIT, proportion_of_snps_for_trait, sum), y = proportion_of_snps_for_trait, fill = mean_risk_allele_freq)) +
 geom_col() +
 coord_flip() +
 theme_bw() +
 labs(title = "Traits where I have over 70% of the risk\nalleles studied, where > 1 study was involved",
 y = "% of SNPs with risk allele", x = "Trait", fill = "Mean risk allele frequency") +
 theme(legend.position = "bottom")

[Image: chart of traits by proportion of risk alleles held]

Again, beware that the same sort of trait can be expressed in different ways within the data, in which case these entries are not combined. If you wanted to be more comprehensive regarding a specific trait, you might feel inclined to produce your own categorisation first and group by that – e.g. lumping anything BMI or Body Mass Index into your own BMI category via creating a new field.
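
To sketch that idea (the patterns and category names here are invented for illustration, not taken from the GWAS data itself):

```r
library(dplyr)
library(stringr)

# Hypothetical illustration: collapse variant trait names into your own
# custom categories before counting entries per category
trait_categories <- output_data %>%
  mutate(my_trait_category = case_when(
    str_detect(tolower(DISEASE.TRAIT), "bmi|body mass index") ~ "BMI",
    str_detect(tolower(DISEASE.TRAIT), "cholesterol") ~ "Cholesterol",
    TRUE ~ DISEASE.TRAIT
  )) %>%
  group_by(my_trait_category) %>%
  summarise(count_of_entries = n())
```

The case_when falls through to the original trait name, so anything you haven't explicitly categorised is left untouched.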

Happy exploring!

Analysing your 23andme genetic data in R part 1: importing your genome into R

23andme is one of the ever-increasing number of direct to consumer DNA testing companies. You send in a vial of your spit; and they analyse parts of your genome, returning you a bunch of reports on ancestry, traits and – if you wish – health.

Their business is highly regulated, as of course it should be (and some would say it oversteps the mark a little even with that), so they are, quite rightly, legally limited as to what info they can provide back to the consumer. However, the exciting news for us data geeks is that they do allow you to download the raw data behind their analysis. This means you can dig deeper into parts of your genome that their interpretations don’t cover.

It should be said that there is considerable risk involved here, unless – or perhaps even if – you happen to be a genetics expert. The general advice for amateurs should be to seek a professional genetic counsellor before concluding anything from your DTC test – although in reality that might be easier said than done.

Whilst I might know a bit about how to play with data, I am not at all a genetics expert, so anything below must be taken with a large amount of skepticism. In fact, if you are in the perfectly legitimate camp of “best not to know” people when it comes to DNA analysis, or you feel there is any risk you won’t be able to treat the innards of your genome as solely a fun piece of analysis, avoiding areas you don’t want to explore, it would be wise not to proceed.

Also, even as an amateur, I’m aware that the science behind a lot of the interpretation of one’s genome is in a nascent period, at best. There are many people or companies that may rather over-hype what is actually known here, perhaps even to the extent of fraud in some cases.

But if you are interested to browse your results, here is my first experience of playing with the 23andme raw data in R.

Firstly, you need to actually obtain your raw 23andme data. An obvious precondition to this is that you have purchased one of their analysis products, set up your 23andme account, and waited until they have delivered the results to you. Assuming that’s all done, you can visit this 23andme page, and press the “Download” button near the top of the screen. You’ll need to wait a while, but eventually they’ll notify you that your file is ready, and let you download a text file of results to your computer. Here, I called my example file “genome.txt”.

Once you have that, it’s time to load it into R!

The text file is in a tab-delimited format, and also contains 19 rows at the top describing the contents of the file in human-readable format. You’ll want to skip those rows before importing it into R. I used the readr package to do this, although it’s just as easy in base R.

A few notes:

  • It imported more successfully if I explicitly told R the data type of each column.
  • One of the column headers (i.e. the field names) starts with a # and includes spaces, which is a nuisance to deal with in R, so I renamed that right away
  • I decided to import it into a dataframe called “genome_data_test”
library(readr)

genome_data_test <- read_tsv(".\\data_files\\genome.txt", skip = 19, col_types = cols(
 `# rsid` = col_character(),
 chromosome = col_character(),
 position = col_integer(),
 genotype = col_character())
 )

genome_data_test <- rename(genome_data_test, rsid = `# rsid`)
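
As noted, base R can do the same job; a sketch of a rough equivalent (assuming the same file layout) might look like this:

```r
# Base-R equivalent of the readr import: skip the 19 descriptive rows,
# then read the tab-separated values with explicit column types
genome_data_base <- read.delim(
  ".\\data_files\\genome.txt",
  skip = 19,
  sep = "\t",
  comment.char = "",  # ensure the "# rsid" header line isn't treated as a comment
  colClasses = c("character", "character", "integer", "character")
)

# make.names() mangles the "# rsid" header on import, so rename the first column
names(genome_data_base)[1] <- "rsid"
```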

Let’s see what we have!

head(genome_data_test)

[Image: head of genome_data_test]

Sidenote: the genome data I am using is a mocked-up example in the 23andme format, rather than anyone’s real genome – so don’t be surprised if you see “impossible” results shown here. Call me paranoid, but I am not sure it’s necessarily a great idea to publicly share someone’s real results online, at least without giving it careful consideration.

OK, so we have a list of your SNP call data. The rsid column is the “Reference SNP cluster ID” used to refer to a specific SNP, the chromosome and position tell you whereabouts that SNP is located, and the genotype tells you which combination of the Adenine, Thymine, Cytosine and Guanine bases you happen to have in those positions.

(Again, I am not at all an expert here, so apologies for any incorrect terminology! Please feel free to let me know what I should have written 🙂 )

Now, let’s check that the import went well.

Many of the built-in 23andme website reports do actually list what SNPs they refer to. For instance, if you click on “Scientific Details” on the life-changing trait report which tells you how likely it is that your urine will smell odd to you after eating asparagus, and look for the “marker tested” section, it tells you that it’s looking at the rs4481887 SNP.

[Image: 23andme scientific details screenshot]

And it also tells you what bases were found there in your test results. Compare that to the data for the same person’s genome imported in R, by filtering your imported data like this:

library(dplyr)
filter(genome_data_test, rsid == "rs4481887")

If the results of that match the results shown in the scientific details of your asparagus urine smell report, yay, things are going OK so far.

OK, so now your 23andme data is safely in R. But why did we do this, and what might it mean? Come back soon for part 2.

R packages for summarising data – part 2

In a recent post, I searched a tiny percentage of the CRAN packages in order to check out the options for R functions that quickly and comprehensively summarise data, in a way conducive to tasks such as data validation and exploratory analytics.

Since then, several generous people have been kind enough to contact me with suggestions of other packages aimed at the same sort of task that I might enjoy. Always being grateful for new ideas, I went ahead and tested them out. Below I’ll summarise what I found.

I also decided to add another type of variable to the small test set from last time – a date. I now have a field “date_1” which is a Date type, with no missing data, and a “date_2” which does have some missing data in it.

My “personal preference” requirements when evaluating these tools for my specific use-cases haven’t changed from last time, other than that I’d love the tool to be able to handle date fields sensibly.

For dates, I’d ideally like to see the summarisation output acknowledge that the fields are indeed dates, show the range of them (i.e. max and min), if any are missing and how many distinct dates are involved. Many a time I have found that I have fewer dates in my data than I thought, or I had a time series that was missing a certain period. Both situations would be critical to know about for a reliable analysis. It would be great if it could somehow indicate if there was a missing block of dates – e.g. if you think you have all of 2017’s data, but actually June is missing, that would be important to know before starting your analysis.
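
For example, a rough sketch of the kind of missing-month check I mean, run against a hypothetical date_1 field (this naive version assumes the minimum date seq's cleanly by month):

```r
# Sketch: list calendar months between the min and max of date_1
# that have no records at all
observed_months <- unique(format(na.omit(data$date_1), "%Y-%m"))
all_months <- format(seq(min(data$date_1, na.rm = TRUE),
                         max(data$date_1, na.rm = TRUE),
                         by = "month"), "%Y-%m")
setdiff(all_months, observed_months)  # the months with no data
```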

Before we examine the new packages, I wanted to note that one of my favourites from last time, skimr, received a wonderful update, such that it now displays sparkline histograms when describing certain types of data, even on a Windows machine. Update to version 1.0.1 and you too will get to see handy mini-charts as opposed to strange unicode content when summarising. This only makes an already great tool even more great, kudos to all involved.

Let’s check how last time’s favourite for most of my uses cases, skimr, treats dates – and see the lovely fixed histograms!

library(skimr)

skim(data)

[Image: skim output]

This is good. It identifies and separates out the date fields. It shows the min, max and median date to give an idea of the distribution, along with the number of unique dates. This will provide invaluable clues as to how complete your time series is. As with the other data types, it clearly shows how much missing data there is. Very good. Although I might love it even more if it displayed a histogram for dates that showed the volume of records (y) per time period (x; granularity depending on range), giving a quick visual answer to the question of missing time periods.

The skim-by-groups output for dates is equally as sensible:

library(dplyr)
library(skimr)
group_by(data, category) %>% skim()

[Image: grouped skim output]

Now then, onto some new packages! I collated the many kind suggestions, and had a look at:

  • CreateTableOne, from the tableone package
  • desctable from desctable
  • ggpairs from GGally
  • ds_summary_stats from descriptr
  • I was also going to look at the CompareGroups package, but unfortunately it had apparently been removed from CRAN recently because “check problems were not corrected in time”. Maybe I’ll try again in future.

CreateTableOne, from the tableone package

Documentation

library(tableone)
CreateTableOne(data = data)

[Image: CreateTableOne output]

Here we can easily see the count of observations. It has automatically differentiated between the categorical and numeric variables, and presented likely appropriate summaries on that basis. There are fewer summary stats for the numeric variables than I would have liked to see when I am evaluating data quality and similar tasks.

There’s also no highlighting of missing data. You could surmise that “type” must have some missing data, as it has fewer observations than exist in the overall dataset, but that’s subtle and wouldn’t be obvious on a numeric field. However, the counts and percentage breakdown on the categorical variables is very useful, making it a potential dedicated frequency table tool replacement.
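
If you want an explicit per-column missing data count to sit alongside tableone’s output, base R can provide one easily enough:

```r
# Count of NA values per column, plus the same as a percentage of rows
sapply(data, function(x) sum(is.na(x)))
round(colMeans(is.na(data)) * 100, 1)
```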

The warning message shows that date fields aren’t supported or shown.

A lot of these “limitations” likely come down to the actual purpose this tool is aimed at. The package is called tableone because it’s aimed at producing a typical “table 1” of a biomedical journal paper, which is often a quantitative breakdown of the population being studied, with exactly these measures. That’s a different and more well-specified task than a couple of use-cases I commonly use summary tools for, such as getting a handle on a new dataset, or trying to get some idea of data quality.

This made me realise that perhaps I am over-simplifying to imagine a single “summary tool” would be best for every one of my use-cases. Whilst I don’t think CreateTableOne’s default output is the most comprehensive for judging data quality, at the same time no journal is going to publish the output of skimr directly.

There are a few more tableone tricks though! You can summary() the object it creates, whereupon you get an output that is less journaly, but contains more information.

summary(CreateTableOne(data = data))

[Image: summary of CreateTableOne output]

This view produces a very clear view of how much data is missing in both absolute and percentage terms, and most of the summary stats (and more!) I wanted for the numeric variables.

tableone can also do nice summarisation by group, using the “strata” parameter. For example:

CreateTableOne(strata = "category", data = data)

[Image: stratified CreateTableOne output]

You might note above that not only do you get the summary segmented by group, but you also automatically get p values from a statistical hypothesis test as to whether the groups differ from each other. By default these are either chi-squared tests for categorical variables or ANOVA for continuous variables. You can add parameters to use non-parametric or exact tests if you feel they are more appropriate. This also changes the summary stats that are shown for numeric variables – which also happens if you apply the nonnormal parameter even without stratification.

For example, if we think “score” should not be treated as normal:

print(CreateTableOne(strata = "category", data = data), nonnormal = "score")

[Image: nonnormal CreateTableOne output]

Note how “score” is now summarised by its median & IQR, as opposed to “rating”, which is summarised by mean and standard deviation. A “nonnorm” test has also been carried out when checking group differences in score – in this case a Kruskal-Wallis test.

The default output does not work with kable, nor is the resulting summarisation in a “tidy” format, so no easy piping the results in or out for further analysis.

desctable, from the desctable package

Documentation

library(desctable)
desctable(data)

[Image: desctable output]

The vignette for desctable describes how it is intended to fulfil a similar function to the aforementioned tableone, but is built to be customisable, manipulable and fit in well with common tidy data tools such as dplyr.

By default, you can see it shows summary data in a tabular format, showing appropriate summaries based on the type of the variables (numeric vs categorical, although it does not show anything other than counts of non-missing data for dates). The basic summaries are shown, although, like tableone, I would prefer a few more by default if I were using it as an exploratory tool. Like tableone though, exploratory analysis of messy data is likely not really its primary aim.

The tabular format is quite easy to manipulate downstream, in a “tidy” way. It actually produces a list of dataframes, one containing the variable names and a second containing the summary stats. But you can as.data.frame() it to pop them into a standard usable dataframe.

I did find it a little difficult to get a quick overview at a glance in the console. Everything was displayed perfectly well – but it took me a moment to make my conclusions. Perhaps part of this is because the variables of different types are all displayed together, making me take a second to realise that some fields are essentially hierarchical – row 1 shows 60 “type” fields, and rows 2-6 show the breakdown by value of “type”.

There’s no special highlighting of missing data, although with a bit of mental arithmetic one could work out that if there are N = 64 records in the date_1 field and 57 entries in the date_2 field, we must be missing at least 7 date_2 entries. But there’s no overall summary of record count by default, so you would not necessarily know the full count if every field had some missing data.

It works nicely with kable and has features allowing output to markdown and html.

The way it selects which summaries to produce for the numerical fields is clever. It automatically runs a Shapiro-Wilk test on the field contents to determine whether it is normally distributed. If yes, then you get mean and standard deviation. If no, then median and inter-quartile range. I like the idea of tools that nudge you towards using distribution-appropriate summaries and tests, without you having to take the time (and remember) to check distributions yourself, in a transparent fashion.
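
You can replicate that decision logic yourself for any numeric field; here’s a minimal sketch of the idea (the 0.05 threshold is my assumption rather than necessarily what desctable uses internally):

```r
# Sketch: choose normal vs non-normal summaries via a Shapiro-Wilk test
summarise_numeric <- function(x) {
  x <- x[!is.na(x)]
  if (shapiro.test(x)$p.value > 0.05) {
    c(mean = mean(x), sd = sd(x))       # plausibly normal
  } else {
    c(median = median(x), IQR = IQR(x)) # non-normal
  }
}
```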

Group comparison is possible within the same command, by using the dplyr group_by function, and everything is pipeable – which is a very convenient method for those of us already immersed in the tidyverse.

group_by(data, category) %>%
 desctable()

[Image: grouped desctable output]

The output is “wide” with lengthy field names, which makes it slightly hard to follow if you have limited horizontal space. It might also be easier to compare certain summary stats, e.g. the mean of each group, if they were presented next to each other when the columns are wide.

Note also the nice way that it has picked an appropriate statistical test to automatically test for significant differences between groups. Here it picked the Kruskal-Wallis test for the numeric fields, as the data was detected as being non-normally distributed.

Although I am not going to dive deep into it here, this function is super customisable. I mentioned above that I might like to see more summary stats shown. If I wanted to always see the mean, standard deviation and range of numeric fields, I could do:

desctable(data, stats = list("N" = length, "Mean" = mean, "SD" = sd, "Min" = min, "Max" = max))

[Image: customised desctable output]

You can also include conditional formulae (“show stat x only if this statement is true”), or customise which statistical tests are applied when comparing groups.

If you spent a bit of time working this through, you could probably construct a very useful output for almost any reasonable requirements. All this and more is documented in the vignette, which also shows how it can be used to generate interactive tables via the datatable function of DT.

 

ggpairs, from the GGally package

Documentation

library(GGally)
ggpairs(data)

[Image: ggpairs output]

ggpairs is a very different type of tool than the other summarising tools tried here so far. It’s not designed to tell you the count, mean, standard deviation or other one-number summary stats of each field, or any information on missing data. Rather, it’s like a very fancy pairs plot, showing the interactions of each of your variables with each of the others.

It is sensitive to the types of data available – continuous vs discrete, so makes appropriate choices as to the visualisations to use. These can be customised if you prefer to make your own choices.

The diagonal of the pairs plot – which would compare each variable to itself – does provide univariate information depending on the type of the variable. For example, a categorical variable would, by default, show a histogram of its distribution. This can be a great way to spot potential issues such as outliers or unusual distributions. It can also reveal stats that could otherwise be represented numerically, for instance the range or variance, in a way often more intuitive to humans.

It appears to handle dates as though they were numbers, rather than as a separate date type.

There isn’t exactly a built-in group comparison feature, although the nature of a pairs plot may mean relevant comparisons are done by default. You can also pass ggplot2 aesthetics through, meaning you can essentially compare groups by assigning them different colours for example.

Here for example is the output where I have specified that the colours in the charts should reflect the category variables in my data.

ggpairs(data, mapping = aes(colour = category))

[Image: ggpairs output coloured by category]

As the output is entirely graphic, it doesn’t need to work with kable and obviously doesn’t produce a tidy data output.

I wouldn’t categorise this as having the same aim as the other tools mentioned here or in the past post – although I find it excellent for its intended use-case.

After first investigating data with the more conventional summarisation tools to the point I have a mental overview of its contents and cleanliness, I might often run it through ggpairs as the next step. Visualising data is often a great, sometimes overlooked, method to increase your understanding of a dataset. This tool will additionally allow you to get an overview of the relationship between any two of your variables in a fast and efficient way.

ds_summary_stats from descriptr

Documentation

library(descriptr)
ds_summary_stats(data$score)

[Image: ds_summary_stats output]

descriptr is a package that contains several tools to summarise data. The most obvious of these aimed at my use cases is “ds_summary_stats”, shown above. This does only work on numeric variables, but the summary it produces is extremely comprehensive. Certainly every summary statistic I normally look at for numeric data is shown somewhere in the above!

The output it produces is very clear in terms of human readability, although it doesn’t work directly with kable, nor does it produce tidy data if you wished to use the results downstream.

One concern I had regards its reporting of missing data. The above output suggests that my data$score field has 58 entries, and 0 missing data. In reality, the dataframe contains 64 entries and has 6 records with a missing (NA) score. Interestingly enough, the same descriptr package does contain another function called ds_screener, which does correctly note the missing data in this field.

ds_screener(data)

[Image: ds_screener output]

This function is in itself a nicely presented way of getting a high-level overview of how your data is structured, perhaps as a precursor to the more in-depth summary statistics that concern the values of the data.

Back on the topic of producing tidy data: another command in the package, ds_multi_stats, does produce tidy, kableable and comprehensive summaries that can include more than one variable – again restricted to numeric variables only.

I also had problems with this command when fields contained missing data – it gives the classic “missing values and NaN’s not allowed if ‘na.rm’ is FALSE” error. I have to admit I did not dig deep to resolve this, only confirming that simply adding “na.rm = TRUE” did not fix the problem. But here’s how the output would look if you didn’t have any missing data.

ds_multi_stats(filter(data, !is.na(score)), score, rating)

[Image: ds_multi_stats output]

There is also a command that has specific functionality to do group comparisons between numeric variables.

ds_group_summary(data$category, data$rating)

[Image: ds_group_summary output]

This one also requires that you have no missing data in the numeric field you are summarising. The output isn’t for kable or tidy tools, but its presentation is one of the best I’ve seen for making it easy to visually compare any particular summary stat between groups at a glance.

Whilst ds_summary_stats only works with numeric data, the package does contain other tools that work with categorical data. One such function builds frequency tables.

ds_freq_table(data$category)

[Image: ds_freq_table output]

This would actually have been a great contender for my previous post regarding R packages aimed at producing frequency tables to my taste. It does miss information that I would have liked to see regarding the presence of missing data. The percentages it displays are calculated out of the total non-missing data, so you would be wise to first ensure that your dataset is indeed complete, perhaps with the afore-mentioned ds_screener command.

Some of these commands also define plot methods, allowing you to produce graphical summaries with ease.

Books I read in 2017

Long term readers (hi!) may recall my failure to achieve the target I had of reading 50 books in 2016. I had joined the 2016 Goodreads reading challenge, logged my reading activity, and hence had access to the data needed to track my progress at the end of the year. It turns out that 41 books is less than 50.

Being a glutton for punishment, I signed up again in 2017, with the same cognitively terrifying 50 book target – basically one a week, although I cannot allow myself to think that way. It is now 2018, so time to review how I did.

Goodreads allows you to log which books you are reading and when you finished them. The finish date is what counts for the challenge. Nefarious readers may spot a few potential exploits here, especially if competing for only 1 year. However, I tried to play the game in good faith (but did I actually do so?  Perhaps the data will reveal!).

As you go through the year, Goodreads will update you on how you are doing with your challenge. Or, for us nerd types, you can download a much more detailed and useful CSV. There’s also the Goodreads API to explore, if that floats your boat.

Similarly to last year, I went with the CSV. I did have to hand-edit the CSV a little, both to fill in a little missing data that appears to be absent from the Goodreads dataset, and also to add a couple of extra data fields that I wanted to track that Goodreads doesn’t natively support. I then popped the CSV into a Tableau dashboard, which you can explore interactively by clicking here.

Results time!

How much did I read?

Joyful times! In 2017 I got to, and even exceeded, my target! 55 books read.

In comparison to my 2016 results, I got ahead right from the start of the year, and widened the gap notably in Q2. You can see a similar boost to that witnessed in 2016 around the time of the summer holidays, weeks 33-35ish. Not working is clearly good for one’s reading obligations.

What were the characteristics of the books I read?

Although page count is a pretty vague and manipulable measure – different books have different physical sizes, font sizes, spacing and editions – it is one of the few measures where data is easily available, so we’ll go with that. In the case of eBooks or audio books (more on this later) without set “pages”, I used the page count of the respective paper version. I fully acknowledge the rigour of this analysis as falling under “fun” rather than “science”.

So the first revelation is that this year’s average pages per book read was 300, a roughly 10% decrease from last year’s average book. Hmm. Obviously, if everything else remains the same, the target of 50 books is easier to meet if you read shorter books! Size doesn’t always reflect complexity, or any other influence on time to complete, of course.

I hadn’t deliberately picked short books – in fact, being aware of this incentive I had tried to be conscious of avoiding doing this, and concentrate on reading what I wanted to read, not just what boosts the stats. However, even outside of this challenge, I (most likely?) only have a certain number of years to live, and hence do feel a natural bias towards selecting shorter books if everything else about them was to be perfectly equal. Why plough through 500 pages if you can get the same level of insight about a topic in 150?

The reassuring news is that, despite the shorter average length of book, I did read 20% more pages in total. This suggests I probably have upped the abstract “quantity” of reading, rather than just inflated the book count by picking short books. There was also a little less variation in page count between books this year than last by some measures.

In the distribution charts, you can see a spike of books at around 150 pages long this year which didn’t show up last year. I didn’t note a common theme in these books, but a relatively high proportion of them were audio books.

Although I am an avid podcast listener, I am not a huge fan of audio books as a rule. I love the idea as a method to acquire knowledge whilst doing endless chores or other semi-mindless activities. I would encourage anyone else with an interest of entering book contents into their brain to give them a whirl. But, for me, in practice I struggle to focus on them in any multi-tasking scenario, so end up hitting rewind a whole lot. And if I am in a situation where I can dedicate full concentration to informational intake, I’d rather use my eyes than my ears. For one, it’s so much faster, which is an important consideration when one has a book target!  With all that, the fact that audio books are over-represented in the lower page-counts for me is perhaps therefore not surprising. I know my limits.

I have heard tell that some people may consider audio books as invalid for the book challenge. In defence, I offer up that Goodreads doesn’t seem to feel this way in their blog post on the 2018 challenge. Besides, this isn’t the Olympics – at least no-one has sent me a gold medal yet – so everyone can make their own personal choice. For me, if it’s a method to get a book’s contents into my brain, I’ll happily take it. I just know I have to be very discriminating with regards to selecting audio books I can be sure I will be able to focus on. Even I would personally regard it cheating to log a book that happened to be audio-streaming in the background when I was asleep. If you don’t know what the book was about, you can’t count it.

So, what did I read about?

[Image: chart of books read by topic]

Book topics are not always easy to categorise. The categories I used here are mainly the same as last year, based entirely on my 2-second opinion rather than any comprehensive Dewey Decimal-like system. This means some sort of subjectivity was necessary. Is a book on political philosophy regarded as politics or philosophy? Rather than spend too much time fretting about classification, I just made a call one way or the other. Refer to above comment re fun vs science.

The main change I noted was indeed a move away from pure philosophical entries towards those of a more political tone. A new category also entered this year: “health”. I developed an interest in improving one’s mental well-being via mindfulness and meditation type subjects, which led me to read a couple of books on those, as well as on sleep, all of which I have classified as health.

Despite my continuing subjective feeling that I read the large majority of books in eBook form, I actually moved even further away from that being true this year. Slightly under half were in that form. That decrease has largely been taken up by the afore-mentioned audio books, of which I apparently read (listened to?) 10 this year. Similarly to last year, 2 of the audio entries were actually “Great Courses”, which are more like a sequence of university-style lectures, with an accompanying book containing notes and summaries.

My books have also been slightly less popular with the general Goodreads-rating audience this year, although not dramatically so.

Now, back to the subject of reading shorter books in order to make it easier to hit my target: the sheer sense of relief I felt when I finished book #50 – and hence could go wild with relaxed, long and slow reading – made me wonder whether I had managed to beat that bias or not. Had the length of the books I selected risen as I got nearer to my target, even though this was not my intention?

Below, the top chart shows the average page count of books completed each month, year on year.

Book length over time

 

The 2016 data risks producing somewhat invalid conclusions, especially if interpreted without reference to the bottom “count of books” chart – mainly because of September 2016, a month in which I read a single book that happened to be over 1,000 pages long.

I also hadn’t actually decided to participate in the book challenge at the start of 2016. I was logging my books, but just for fun (imagine that!). I don’t remember quite when it was suggested I should explicitly join the challenge, but before then it’s less likely I felt pressure to read faster or shorter.

Let’s look then only at 2017:

Book length over time 2

Sidenote: what happened in July?! I only read one book, and it wasn’t especially long. I can only assume Sally Scholz’s intro to feminism must have been particularly thought-provoking.

For reference, I hit book #50 in November this year. There does seem to be some suggestion in the data that I did indeed read longer books as time went on, despite my mental disavowal of doing so.

Stats geeks might like to know that the line of best fit shown in the top chart above could be argued to explain around 30% of the variation in book length over time, with each month cumulatively adding an estimated extra 14 pages on top of a base of 211 pages. It should be stated that I didn’t spend too long considering the best model or fact-checking the relevant assumptions for this dataset. Instead, I just pressed “insert trend line” in Tableau and let it decide :).

I’m afraid the regression should not be considered as being traditionally statistically significant at the 0.05 level though, having a p-value of – wait for it – 0.06. Fortunately for my intention to publish the above in Nature :), I think people are increasingly aware of the silliness of uncontextualised, hard-line p-value criteria and/or publication bias.

Nonetheless, as I participate in the 2018 challenge – now at 52 books, a proper one-a-week – I shall be conscious of this trend and redouble my efforts to keep reading based on quality rather than length. Of course, I remain very open – some might say hopeful! – to the idea that one sign of a quality author is that they can convey their material concisely. You generous readers of my ramblings may detect some hypocrisy here.

For any really interested readers out there, you can once more see the full list of the books I read, plus links to the relevant Goodreads description pages, on the last tab of the interactive viz.

My favourite R package for: summarising data

Hot on the heels of delving into the world of R frequency table tools, it’s now time to expand the scope and think about data summary functions in general. One of the first steps analysts should perform when working with a new dataset is to review its contents and shape.

How many records are there? What fields exist? Of which type? Is there missing data? Is the data in a reasonable range? What sort of distribution does it have? Whilst I am a huge fan of data exploration via visualisation, running a summary statistical function over the whole dataset is a great first step to understanding what you have, and whether it’s valid and/or useful.

So, in the usual format, what would I like my data summarisation tool to do in an ideal world? You may note some copy and paste from my previous post. I like consistency 🙂

  1. Provide a count of how many observations (records) there are.
  2. Show the number, names and types of the fields.
  3. Be able to provide info on as many types of fields as possible (numeric, categorical, character, etc.).
  4. Produce appropriate summary stats depending on the data type. For example, if you have a continuous numeric field, you might want to know the mean. But a “mean” of an unordered categorical field makes no sense.
  5. Deal with missing data transparently. It is often important to know how many of your observations are missing. Other times, you might only care about the statistics derived from those which are not missing.
  6. For numeric data, produce at least these types of summary stats. And not to produce too many more esoteric ones, cluttering up the screen. Of course, what I regard as esoteric may be very different to what you would.
    1. Mean
    2. Median
    3. Range
    4. Some measure of variability, probably standard deviation.
    5. Optionally, some key percentiles
    6. Also optionally, some measures of skew, kurtosis etc.
  7. For categorical data, produce at least these types of summary stats:
    1. Count of distinct categories
    2. A list of the categories – perhaps restricted to the most popular if there are a high number.
    3. Some indication as to the distribution – e.g. does the most popular category contain 10% or 99% of the data?
  8. Be able to summarise a single field or all the fields in a particular dataframe at once, depending on user preference.
  9. Ideally, optionally be able to summarise by group, where group is typically some categorical variable. For example, maybe I want to see a summary of the mean average score in a test, split by whether the test taker was male or female.
  10. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  11. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or similar, for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance in the R console, without having to use further coding.
  12. Work with “kable” from the Knitr package, or similar table output tools. I often use R markdown and would like the ability to show the summary statistics output in reasonably presentable manner.
  13. Have a sensible set of defaults (aka facilitate my laziness).

What’s in base R?

The obvious place to look is the “summary” command.

This is the output, when run on a very simple data file consisting of two categorical (“type”, “category”) and two numeric (“score”, “rating”) fields. Both type and score have some missing data; the others do not. Rating has both one particularly high and one particularly low outlier.
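For anyone who wants to follow along, here is a minimal sketch of a comparable dataset. The real file isn’t shared in this post, so the values below are entirely invented – only the field names and general shape (missing values in type and score, two outliers in rating) match the description above.

```r
# Hypothetical reconstruction of the example dataset: two categorical and
# two numeric fields, some missing data, and one low plus one high outlier.
set.seed(123)

data <- data.frame(
  type     = sample(c("A", "B", "C", "D", "E", NA), 100, replace = TRUE),
  category = sample(c("X", "Y", "Z"), 100, replace = TRUE),
  score    = c(round(rnorm(90, mean = 50, sd = 10), 1), rep(NA, 10)),
  rating   = c(-500, round(rnorm(98, mean = 5, sd = 1), 1), 500),
  stringsAsFactors = TRUE  # so base summary() tabulates the categorical fields
)
```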

summary(data)

basesummary

This isn’t too terrible at all.

It clearly shows we have 4 fields, and it has determined that type and category are categorical, hence displaying the distribution of counts per category. It works out that score and rating are numerical, so gives a different, sensible, summary.

It highlights which fields have missing data. But it doesn’t show the overall count of records, although you could manually work it out by summing up the counts in the categorical variables (but why would you want to?). There’s no standard deviation. And whilst it’s OK to read interactively, it is definitely not “tidy”, pipeable or kable-compatible.

Just as with many other commands, analysing by groups could be done with the base R “by” command. But the output is “vertical”, making it hard to compare the same stats between groups at a glance, especially if there are a large number of categories. To determine the difference in means between category X and category Z in the below would be a lot easier if they were visually closer together. Especially if you had many more than 3 categories.

by(data, data$category, summary)

bybasesummary

So, can we improve on that effort by using libraries that are not automatically installed as part of base R? I tested 5 options. Inevitably, there are many more possibilities, so please feel free to write in if you think I missed an even better one.

  • describe, from the Hmisc package
  • stat.desc from pastecs
  • describe from psych
  • skim from skimr
  • descr and dfSummary from summarytools

Was there a winner from the point of view of fitting nicely to my personal preferences? I think so, although the choice may depend on your specific use-case.

For readability, compatibility with the tidyverse, and ability to use the resulting statistics downstream, I really like the skimr feature set. It also facilitates group comparisons better than most. This is my new favourite.

If you prefer to prioritise the visual quality of the output, at the expense of processing time and flexibility, dfSummary from summarytools is definitely worth a look. It’s a very pleasing way to see a summary of an entire dataset.
Update: thanks to Dominic, who left a comment below after fixing the processing time issue very quickly in version 0.8.2.

If you don’t enjoy either of those, you are probably fussy :). But for reference, Hmisc’s describe was my most-used variant before conducting this exploration.

describe, from the Hmisc package

Documentation

library(Hmisc)
Hmisc::describe(data)

hmiscdescribe

This clearly provides the count of variables and observations. It works well with both categorical and numerical data, giving appropriate summaries in each case, even adapting its output to take into account for instance how many categories exist in a given field. It shows how much data is missing, if any.

For numeric data, instead of giving the range as such, it shows the highest and lowest 5 entries. I actually like that a lot. It helps to show at a glance whether you have one weird outlier (e.g. a totals row that got accidentally included in the dataframe) or whether there are several values many standard deviations away from the mean. On the subject of deviations, there’s no specific variance or standard deviation value shown – although you can infer much about the distribution from the several percentiles it shows by default.

The output is nicely formatted and spacious for reading interactively, but isn’t tidy or kableable.

There’s no specific summary by group function although again you can pass this function into the by() command to have it run once per group, i.e. by(data, data$type, Hmisc::describe)

The output from that, however, is very “long”, and naturally ordered by group rather than by variable, rendering comparisons of the same stat between different groups quite challenging at a glance.

stat.desc, from the pastecs package

Documentation

library(pastecs)
stat.desc(data)

statdesc

The first thing to notice is that this only handles numeric variables, producing NA for the fields that are categorical. It does provide all the key stats and missingness info you would usually want for the numeric fields though, and it is great to see measures of uncertainty like confidence intervals and standard errors available. With other parameters you can also apply tests of normality.

It works well with kable. The output is fairly manipulable in terms of being tidy, although the measures show up as row labels as opposed to a true field. You get one column per variable, which may or may not be what you want if passing onwards for further analysis.

There’s no inbuilt group comparison function, although of course the by() function works with it, producing a list containing one copy of the above style of table for each group – again, great if you want to see a summary of a particular group, less great if you want to compare the same statistic across groups.

describe and describeBy, from the psych package

Documentation

library(psych)
psych::describe(data)

psychdescribe

OK, this is different! It has included all the numeric and categorical fields in its output, but the categorical fields show up, somewhat surprisingly if you’re new to the package, with the summary stats you’d normally associate with numeric fields. This is because the default behaviour is to recode categories as numbers, as described in the documentation:

…variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name…Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected. For instance, if education is coded “high school”, “some college” , “finished college”, then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).

As the docs indicate, this can be risky! It is certainly risky if you are not expecting it :). I don’t generally have use-cases where I want this to happen automatically, but if you did, and you were very careful how you named your categories, it could be handy for you.

For the genuinely numeric data though, you get most of the key statistics and a few nice extras. It does not indicate where data is missing though.

The output works with kable, and is pretty tidy, outside of the common issue of using rownames to represent the variable the statistics are summarising, if we are being pernickety.

This command does have a specific summary-by-group variation, describeBy. Here’s how we’d use it if we want the stats for each “type” in my dataset, A – E.

psych::describeBy(data, data$type)

psychdescribeby

Everything you need is there, subject to the limitations of the basic describe(). It’s much more compact than using the by() command on some of the other summary tools, but it’s still not super easy to compare the same stat across groups visually.  It also does not work with kable and is not tidy.

The “mat” parameter does allow you to produce a matrix output of the above.

psych::describeBy(data, data$type, mat = TRUE)

psychdescribebymat

This is visually less pleasant, but it does enable you to produce a potentially useful dataframe, which you could tidy up or use to produce group comparisons downstream, if you don’t mind a little bit of post-processing.
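As a sketch of that post-processing, something like the following should pull out a per-group comparison for a single variable. The column names used (group1, n, mean, sd, median) are those I see in the psych documentation, so do double-check them against your installed version.

```r
library(psych)
library(dplyr)
library(tibble)

# Matrix-style output: one row per variable/group combination
grouped <- psych::describeBy(data, data$type, mat = TRUE)

# Keep just the rows for "score" and a few stats of interest
grouped %>%
  rownames_to_column("variable") %>%
  filter(grepl("^score", variable)) %>%
  select(group1, n, mean, sd, median)
```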

skim, from the skimr package

Documentation

library(skimr)
skim(data)

skim

At the top skim clearly summarises the record and variable count. It is adept at handling both categorical and numeric data. For readability, I like the way it separates them into different sections dependent on data type, which makes for quick interpretation given that different summary stats are relevant for different data types.

It reports missing data clearly, and has all the most common summary stats I like.

Sidenote: see the paragraph in red below. The issue mentioned in this section is no longer an issue as of skimr 1.0.1, although the skim_with function may still be of interest.

There is what appears to be a strange sequence of unicode-esque characters like <U+2587> shown at the bottom of the output. In reality, these are intended to be a graphical visualisation of distributions using sparklines, hence the column name “hist”, referring to histograms. This is a fantastic idea, especially to see in-line with the other stats in the table. Unfortunately, they do not by default display properly in the Windows environment which is why I see the U+ characters instead.

The skimr documentation details how this is actually a problem with underlying R code rather than this library, which is unfortunate as I suspect this means there cannot be a quick fix. There is a workaround involving changing one’s locale, although I have not tried this, and probably won’t before establishing whether there would be any side effects in doing so.

In the meantime, if the nonsense-looking U+ characters bother you, you can turn off the column that displays them by changing the default summary that skim uses per data type. There’s a skim_with function that you can use to add your own summary stats into the display, but it also works to remove existing ones. For example, to remove the “hist” column:

skim_with(integer = list(hist = NULL))
skim(data)

skimnohist

Now we don’t see the messy unicode characters, and we won’t for the rest of our skimming session.

UPDATE 2018-01-22: the geniuses who designed skimr actually did find a way to make the sparklines appear in Windows after all! Just update your skimr version to version 1.0.1 and you’re back in graphical business, as the rightmost column of the integer variables below demonstrates.

skim2

 

 

The output works well with kable. Happily, it also respects the group_by function from dplyr, which means you can produce summaries by group. For example:

group_by(data, category) %>%
 skim()

groupbyskim

Whilst the output is still arranged by the grouping variable before the summary variable, making it slightly inconvenient to visually compare categories, this seems to be the nicest “at a glance” way yet to perform that operation without further manipulation.

But if you are OK with a little further manipulation, life becomes surprisingly easy! Although the output above does not look tidy or particularly manipulable, behind the scenes it does create a tidy dataframe-esque representation of each combination of variable and statistic. Here’s the top of what that looks like by default:

mydata <- group_by(data, category) %>%
 skim()

head(mydata, 10)

skimdf

It’s not super-readable to the human eye at a glance – but you might be able to tell that it has produced a “long” table that contains one row for every combination of group, variable and summary stat that was shown horizontally in the interactive console display. This means you can use standard methods of dataframe manipulation to programmatically post-process your summary.

For example, sticking to the tidyverse, let’s graphically compare the mean, median and standard deviation of the “score” variable, comparing the results between each value of the 3 “categories” we have in the data.

mydata %>%
 filter(variable == "score", stat %in% c("mean", "median", "sd")) %>%
 ggplot(aes(x = category, y = value)) +
 facet_grid(stat ~ ., scales = "free") +
 geom_col()

skimggplot

descr and dfSummary, from the summarytools package

Documentation

Let’s start with descr.

library(summarytools)
summarytools::descr(data)

descrsummarytools

The first thing I note is that this is another one of the summary functions that (deliberately) only works with numerical data. Here though, a useful red warning showing which columns have thus been ignored is shown at the top. You also get a record count, and a nice selection of standard summary stats for the numeric variables, including information on missing data (for instance Pct.Valid is the proportion of data which isn’t missing).

kable does not work here, although you can recast to a dataframe and later kable that, i.e.

kable(as.data.frame(summarytools::descr(data)))

The data comes out relatively tidy although it does use rownames to represent the summary stat.

mydata <- summarytools::descr(data)
View(mydata)

viewmydata

There is also a transpose option if you prefer to arrange your variables by row and summary stats as columns.

summarytools::descr(data, transpose = TRUE)

descrtranspose

There is no special functionality for group comparisons, although by() works, with the standard limitations.

The summarytools package also includes a fancier, more comprehensive, summarising function called dfSummary, intended to summarise a whole dataframe – which is often exactly what I want to do with this type of summarisation.

dfSummary(data)

summarytoolsdf

This function can deal with both categorical and numeric variables and provides a pretty output in the console with all of the most used summary stats, info on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as beautiful as the sparklines that the skimr function tries to show, but have the advantage that they work right away on Windows machines.

On the downside, the function seems very slow to perform its calculations at the moment. Even though I’m using a relatively tiny dataset, I had to wait an annoyingly large amount of time for the command to complete – perhaps 1-2 minutes, vs other summary functions which complete almost instantly. This may be worth it to you for the clarity of output it produces, and if you are careful to run it once with all the variables and options you are interested in – but it can be quite frustrating when engaged in interactive exploratory analysis where you might have reason to run it several times.

Update 2018-02-10: the processing time issues should be fixed in version 0.8.2. Thanks very much to Dominic, the package author, for leaving a comment below and performing such a quick fix!

There is no special grouping feature.

Whilst it does work with kable, it doesn’t make for nice output. But don’t despair, there’s a good reason for that. The function has built-in capabilities to output directly into markdown or HTML.

This goes way beyond dumping a load of text into HTML format – instead giving you rather beautiful output like that shown below. This would be perfectly acceptable for sharing with other users, and less headache-inducing than other representations if you are staring at it in order to gain an understanding of your dataset. Again though, it does take a surprisingly long time to generate.
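For reference, the calls below are the entry points the summarytools documentation describes for this at the time of writing – check them against your installed version.

```r
library(summarytools)

# Open the pretty HTML rendering in the RStudio viewer or a browser
view(dfSummary(data))

# Or render it within an R markdown document
print(dfSummary(data), method = "render")
```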

summarytoolsdfhtml.PNG

My favourite R package for: frequency tables

Back for the next part of the “which of the infinite ways of doing a certain task in R do I most like today?” series. This time, what could be a more fascinating aspect of analysis to focus on than: frequency tables?

OK, most topics might actually be more fascinating. Especially when my definition of frequency tables here will restrict itself to 1-dimensional variations, which in theory a primary school kid could calculate manually, given time. But they are such a common tool, that analysts can use for all sorts of data validation and exploratory data analysis jobs, that finding a nice implementation might prove to be a time-and-sanity saving task over a lifetime of counting how many things are of which type.

Here’s the top of an example dataset. Imagine a “tidy” dataset, such that each row is one observation. I would like to know how many observations (e.g. people) are of which type (e.g. demographic – here a category between A and E inclusive).

Type  Person ID
E     1
E     2
B     3
B     4
B     5
B     6
C     7

I want to be able to say things like: “4 of my records are of type E”, or “10% of my records are of type A”. The dataset I will use in my below example is similar to the above table, only with more records, including some with a blank (missing) type.
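For anyone following along, here is a hypothetical reconstruction of that dataset – the actual values are invented, but the shape (one row per person, a Type between A and E, some missing Types) matches the description above.

```r
# Invented example data: one observation per row, some Types missing
set.seed(1)

data <- data.frame(
  Type      = sample(c("A", "B", "C", "D", "E", NA), 60, replace = TRUE),
  Person.ID = seq_len(60),
  stringsAsFactors = FALSE
)

head(data)
```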

What would I like my 1-dimensional frequency table tool to do in an ideal world?

  1. Provide a count of how many observations are in which category.
  2. Show the percentages or proportions of total observations that each category represents.
  3. Be able to sort by count, so I see the most popular options at the top – but only when I want to, as sometimes the order of data is meaningful for other reasons.
  4. Show a cumulative %, sorted by count, so I can see quickly that, for example, the top 3 options make up 80% of the data – useful for some swift Pareto analysis and the like.
  5. Deal with missing data transparently. It is often important to know how many of your observations are “missing”. Other times, you might only care about the statistics derived from those which are not missing.
  6. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  7. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or whatever for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance, without having to use further coding.
  8. Work with “kable” from the Knitr package, or similar table output tools. I often use R markdown and would like the ability to show the frequency table output in reasonably presentable manner.
  9. Have a sensible set of defaults (aka facilitate my laziness).

 

So what options come by default with base R?

Most famously, perhaps the “table” command.

table(data$Type)

table

A super simple way to count up the number of records by type. But it doesn’t show percentages or any sort of cumulation. By default it hasn’t highlighted that there are some records with missing data. It does have a useNA parameter that will show that though if desired.

table(data$Type, useNA = "ifany")

tablena

The output also isn’t tidy and doesn’t work well with Knitr.

The table command can be wrapped in the prop.table command to show proportions.

prop.table(table(data$Type))

proptable

But you’d need to run both commands to understand the count and percentages, and the latter inherits many of the limitations from the former.
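One base-R workaround, if you don’t mind a little gluing, is to bind the two outputs together so counts and proportions appear side by side:

```r
# Counts and proportions in one view, missing values included
counts <- table(data$Type, useNA = "ifany")
cbind(n = counts, prop = round(prop.table(counts), 3))
```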

So what’s available outside of base R? I tested 5 options, although there are, of course, countless more. In no particular order:

  • tabyl, from the janitor package
  • tab1, from epiDisplay
  • freq, from summarytools
  • CrossTable, from gmodels
  • freq, from questionr

Because I am fussy, I managed to find some slight personal niggle with all of them, so it’s hard to pick an overall personal winner for all circumstances. Several came very close. I would recommend looking at any of the janitor, summarytools and questionr package functions outlined below if you have similar requirements and tastes to me.

tabyl, from the janitor package

Documentation

library(janitor)
tabyl(data$Type, sort = TRUE)

janitor

This is a pretty good start! By default, it shows counts, percents, and percent of non-missing data. It can optionally sort in order of frequency. The output is tidy, and works with kable just fine. The only thing really missing is a cumulative percentage option. But it’s a great improvement over base table.

I do find myself constantly misspelling “tabyl” as “taybl” though, which is annoying, but not something I can really criticise anyone else for.
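If you do need the cumulative percentage, it’s easy enough to bolt on with dplyr. The column names below (n, percent) are those tabyl produced at the time of writing, so check them against your janitor version.

```r
library(janitor)
library(dplyr)

# Sort by frequency, then accumulate the percentage column
tabyl(data$Type) %>%
  arrange(desc(n)) %>%
  mutate(cum_percent = cumsum(percent))
```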

tab1, from the epiDisplay package

Documentation

library(epiDisplay)
tab1(data$Type, sort.group = "decreasing", cum.percent = TRUE)

epidisplayepidisplay2

This one is pretty fully featured. It even (optionally) generates a visual frequency chart output as you can see above. It shows the frequencies, proportions and cumulative proportions both with and without missing data. It can sort in order of frequency, and has a totals row so you know how many observations you have all in.

However it isn’t very tidy by default, and doesn’t work with knitr. I also don’t really like the column names it assigns, although one can certainly claim that’s pure personal preference.

A greater issue may be that the cumulative columns don’t seem to work as I would expect when the table is sorted, as in the above example. The first entry in the table is “E”, because that’s the largest category. However, it isn’t 100% of the non-missing dataset, as you might infer from the fifth numerical column. In reality it’s 31.7%, per column 4.

As far as I can tell, the function is working out the cumulative frequencies before sorting the table – so as category E is the last category in the data file it has calculated that by the time you reach the end of category E you have 100% of the non-missing data in hand. I can’t envisage a situation where you would want this behaviour, but I’m open to correction if anyone can.

freq, from the summarytools package

Documentation

library(summarytools)
summarytools::freq(data$Type, order = "freq")

summarytools

This looks pretty great. Has all the variations of counts, percents and missing-data output I want – here you can interpret the “% valid” column as “% of all non-missing”. Very readable in the console, and works well with Knitr. In fact it has some further nice formatting options that I wasn’t particularly looking for.

It is pretty much tidy, although it has a minor niggle in that the output always includes the total row. It’s often important to know your totals, but if you’re piping it to other tools or charts, you may have to use another command to filter that row out each time, as there doesn’t seem to be an obvious way to prevent it being included with the rest of the dataset when running it directly.

Update 2018-04-28: thanks to Roland in the comments below for pointing out that a new feature to disable the totals display has been added: set the “totals” parameter to false, and the totals row won’t show up, potentially making it easier to pass on for further analysis.

CrossTable, from the gmodels library

Documentation

library(gmodels)
CrossTable(data$Type)

crosstable

Here the results are displayed in a horizontal format, a bit like the base “table”. Here though, the proportions are clearly shown, albeit not with a cumulative version. It doesn’t highlight that there are missing values, and isn’t “tidy”. You can get it to display a vertical version (add the parameter max.width = 1 ) which is visually distinctive, but untidy in the usual R tidyverse sense.

It’s not a great tool for my particular requirements here, but most likely this is because, as you may guess from the command name, it’s not particularly designed for 1-way frequency tables. If you are crosstabulating multiple dimensions it may provide a powerful and visually accessible way to see counts, proportions and even run hypothesis tests.
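As a sketch of that use-case – note that data$category here stands in for a hypothetical second categorical field, which doesn’t exist in my little Type/Person ID example:

```r
library(gmodels)

# A 2-way crosstab with cell, row and column proportions,
# plus an optional chi-square test of independence
CrossTable(data$Type, data$category, chisq = TRUE)
```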

freq, from the questionr package

Documentation

library(questionr)
questionr::freq(data$Type, cum = TRUE, sort = "dec", total = TRUE)

questionr

Counts, percentages, cumulative percentages, missing values data, yes, all here! The table can optionally be sorted in descending frequency, and works well with kable.

It is mostly tidy, but also has an annoyance in that the category values themselves (A-E) are row labels rather than a standalone column. This means you may have to pop them into a new column for best use in any downstream tidy tools. That’s easy enough with e.g. dplyr’s add_rownames command, but that is another processing step to remember, which is not a huge selling point.

There is a total row at the bottom, but it’s optional, so just don’t use the “total” parameter if you plan to pass the data onwards in a way where you don’t want to risk double-counting your totals. There’s an “exclude” parameter if you want to remove any particular categories from analysis before performing the calculations as well as a couple of extra formatting options that might be handy.
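As a sketch of that extra step, promoting the row labels to a real column might look like this (rownames_to_column from the tibble package is the modern replacement for the deprecated add_rownames):

```r
library(questionr)
library(dplyr)
library(tibble)

# Promote the A-E row labels into a proper "type" column
questionr::freq(data$Type, cum = TRUE, sort = "dec") %>%
  as.data.frame() %>%
  rownames_to_column("type")
```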

My favourite R package for: correlation

R is a wonderful, flexible, if somewhat arcane tool for analytics of all kinds. Part of its power, yet also its ability to bewilder, comes from the fact that there are so many ways of doing the same, or similar, things. Many of these ways are instantly available thanks to many heroes of the R world creating and distributing free libraries to refine existing functionality and add new abilities.

Looking at a list from one of the most popular sources for these packages, CRAN, shows that its collection gets new entries on a several-times-per-day basis, and there are 11,407 of them at the time of writing.

With that intimidating stat in mind, I will keep a few notes on this blog as to my current favourite base or package-based methods for some common analytics tasks. Of course these may change regularly, based on new releases or my personal whims. But for now, let’s tackle correlations. Here I mean simple statistical correlations between 2 sets of data, the most famous one of which is likely the Pearson correlation coefficient, aka Pearson’s R.

What would I like to see in my ideal version of a correlation calculator? Here’s a few of my personal preferences in no particular order.

  1. Can deal with multiple correlation tests at once. For example, maybe I have 5 variables and I’d like to see the correlation between each one of them and each of the other 4 variables.
  2. Visualises the results nicely, for example in a highlighted correlation matrix. Default R often produces informative but somewhat uninspiring text output. I have got spoiled with the luxury of data visualisation tools so after a heavy day’s analysis I prefer to take advantage of the many ways dataviz can make analytic output easier to decipher for humans.
  3. If the output is indeed a dataviz, I have a slight preference for it to use ggplot charting all other things being equal. Ending up with a proper ggplot object is nice both in terms of the default visual settings vs some other forms of R chart, and also that you can then in theory use ggplot methods to adapt or add to it.
  4. Can produce p values, confidence intervals, or some other way of suggesting whether any correlations found are statistically significant or not.
  5. Can handle at least a couple of types of correlation calculations, the most common of which are probably Pearson correlation coefficient and Spearman’s rank correlation coefficient.

Default R has a couple of correlation commands built in to it. The most common is probably “cor”. Here’s an example of what it produces, using a test dataset named test_data of 5 variables, named a, b, c, d and e (which are in columns).

cor(test_data)

cor

So, it does multiple tests at once, and can handle Pearson, Spearman and Kendall correlation calculations, via changing the “method” parameter (which defaults to Pearson if you don’t specify, as in my example). But it doesn’t show the statistical significance of the correlations, and a potentially large text table of 8-place decimal numbers is not the sort of visualised output that would help me interpret the results in a particularly efficient way.

A second relevant default R command is “cor.test”. This one only allows you to make a single correlation comparison, for example between variable a and variable b.

cor.test(test_data$a, test_data$b)

cortest

So here we see it does return both a p value and a confidence interval to help us judge the significance of the correlation size noted. You can change the alternative hypothesis and confidence interval range via parameters. It can also do the same 3 types of correlation that “cor” supports. But, as noted, it can only compare two variables at once without further commands. And the output is again a bunch of text. That is really OK here, as you are focusing only on one comparison. But it’s going to be pretty tedious to run and decipher if you want to compare each one of a few variables against each of the others.

So, is there a package solution that makes me happy? As you might guess, yes, there are actually a few contenders. But my current favourite is probably “ggcorrplot”. The manual is here, and there’s a handy usage guide here.

Suffice to say:

  1. It allows you to compare several variables against each other at the same time.
  2. It visualises the variables in a colour-coded correlation matrix
  3. The visualisation is a ggplot
  4. It can produce p values, using the accompanying function cor_pmat(), which can then be shown on the visualisation in various ways.
  5. It uses the results from the built in cor() function, so can handle the same 3 types of correlation.

There are a bunch of options to select from, but here’s the default output:

# calculate correlations using cor and store the results
corrdata <- cor(test_data)

# use the package's cor_pmat function to calculate p-values for the correlations
p.mat <- cor_pmat(test_data)

# produce a nice highlighted correlation matrix
ggcorrplot(corrdata, title = "Correlation matrix for test data")

The results look like:

ggcordefault

You can see it produces a correlation matrix, colour coded as to the direction and strength of the correlations. It doesn’t show anything about the statistical significance. Kind of pretty for an overview glance, but it could be rather more informative.

I much prefer to use a couple of options that show the actual correlation values and the significance; the ones I most commonly use probably being this set.

ggcorrplot(corrdata, title = "Correlation matrix for test data", lab=TRUE, p.mat = p.mat, sig.level = .05)

ggcorrplot

Here, the correlation coefficients are superimposed on the grid, so you can check the strength of a correlation immediately rather than trying to compare its colour to the colour scale.

You can also see that some of the cells are crossed out (for example the correlation between variable c and variable e in the above). This means that the correlation detected is not considered to be significant at the 0.05 level. That level can be changed, or the insignificant correlations be totally hidden if you prefer to not get distracted by them in the first place.

Transactions by active subscribers formulae in Tableau

This blog returns back from the dead (dormant?) with a quick note-to-self on how to do something that sounds simple but proved slightly complicated in practice, using Tableau.

Here’s a scenario, although many others would fit the same pattern. Imagine you have a business that is subscription based, where people can subscribe and cancel whenever they wish. Whilst subscribed, your customer can buy products from you anytime they want.  They can’t buy products if not subscribed.

What you want to know is, for a given cohort of subscribers, how many of those people who are still subscribed purchased a product within their first, second…nth month?

So we want to be able to say things like: “out of the X people that started a subscription last year, Y% of those who were still subscribed for at least Z months bought a product in their Zth month”.

It’s the “who were still subscribed” part that made this a little tricky, at least with the datasource I was dealing with.

Here’s a trivially small example of what I had – a file that has 1 row per sale per customer.

Subscriber ID | Length of subscription | Month of subscription | Transaction type
1 | 5 | 1 | Sale
1 | 5 | 2 | Sale
1 | 5 | 3 | Sale
2 | 7 | 1 | Sale
2 | 7 | 6 | Sale
3 | 1 | 1 | Sale
4 | 8 | 1 | Sale
4 | 8 | 2 | Sale
4 | 8 | 4 | Sale
4 | 8 | 5 | Sale
5 | | 1 | Sale
5 | | 2 | Sale
5 | | 3 | Sale
5 | | 8 | Sale
5 | | 9 | Sale

For simplicity, let’s assume every customer has at least one sale. The columns tell you:

  • the ID number of the subscriber
  • the length of the subscription from start to finish, in months. If the length is blank, the subscription is still active today, so we don’t know how long it will end up being.
  • the month number of the product sale
  • a transaction type, which for our purposes is always “sale”

Example interpretation: subscriber ID 1 had a subscription that lasted 5 months. They purchased a product in month 1, 2 and 3 (but not 4 or 5).

It’d be easy to know that you had 5 people at the start (count distinct subscriber ID), and that you had 2 transactions in month 3 (count distinct subscriber ID where month of subscription = 3). But how many of those 5 people were still subscribed at that point?

Because this example is so small, you can easily do that by eye. You can see in the data table that we had one subscription, ID 3, that only had a subscription length of 1 month. Everyone else stayed longer than 3 months – so there were 4 subscriptions left at month 3.

The figure we want to know is what proportion of the active subscribers at month 3 bought a product. The correct answer is the number of subscriptions making a product purchase at month 3 divided by the number of subscriptions still active at month 3. Here, that’s 2 / 4 = 50%.

So how do we get that in Tableau, with millions of rows of data? As you might guess, one method involves the slightly-dreaded “table calculations”. Layout is usually important with table calculations. Here’s one way that works. We’ll build it up step by step, although you can of course combine many of these steps into one big fancy formula if you insist.

Firstly, I modified the data source (by unioning) so that when a subscription was cancelled it generated a “cancelled subscription” transaction.  That looked something like this after it was done.

Subscriber ID | Length of subscription | Month of subscription | Transaction type
1 | 5 | 1 | Sale
1 | 5 | 2 | Sale
1 | 5 | 3 | Sale
1 | 5 | 5 | Cancelled subscription
2 | 7 | 1 | Sale
2 | 7 | 6 | Sale
2 | 7 | 7 | Cancelled subscription
3 | 1 | 1 | Sale
3 | 1 | 1 | Cancelled subscription
4 | 8 | 1 | Sale
4 | 8 | 2 | Sale
4 | 8 | 4 | Sale
4 | 8 | 5 | Sale
4 | 8 | 8 | Cancelled subscription
5 | | 1 | Sale
5 | | 2 | Sale
5 | | 3 | Sale
5 | | 8 | Sale
5 | | 9 | Sale

Note there are the original sales transactions, and now a new “cancel” row for every subscription that was cancelled. In these transactions the “month of subscription” is set to the actual month the subscription was cancelled, which we know from the field “Length of subscription”.

Here are the formulae we’ll need to work out, for any given month, how many people were still active, and how many of those bought something:

  • The total count of subscribers in the cohort:
    Count of distinct subscribers in cohort

    { FIXED : [Count of distinct subscribers]}
  • The number of subscribers who cancelled in the given month:
    Count of distinct subscribers cancelling

     COUNTD(
     IF [Transaction type] = "Cancelled subscription"
     THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Derived from those two figures, the number of subscribers who are still active at the given month:
    Count of subscribers still active

    AVG([Count of distinct subscribers in cohort]) - (RUNNING_SUM([Count of distinct subscribers cancelling])- [Count of distinct subscribers cancelling])
  • The number of subscribers who made a purchase in the given month:
    Count of distinct subscribers making purchase

     COUNTD(
     IF [Transaction type] = "Sale" THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Finally, derived from the last two results, the proportion of subscribers who made a purchase as a percentage of those who are still active
    Proportion of distinct active subscribers making purchase

    [Count of distinct subscribers making purchase] / [Count of subscribers still active]
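As a sanity check on the logic (rather than the Tableau syntax), the same chain of calculations can be sketched in Python against the small cancel-augmented table above. Note that “cohort total minus (running sum of cancels minus this month’s cancels)” is just “cohort total minus cancellations in earlier months”:

```python
# the cancel-augmented example table: (subscriber_id, month, transaction_type)
rows = [
    (1, 1, "Sale"), (1, 2, "Sale"), (1, 3, "Sale"), (1, 5, "Cancelled subscription"),
    (2, 1, "Sale"), (2, 6, "Sale"), (2, 7, "Cancelled subscription"),
    (3, 1, "Sale"), (3, 1, "Cancelled subscription"),
    (4, 1, "Sale"), (4, 2, "Sale"), (4, 4, "Sale"), (4, 5, "Sale"),
    (4, 8, "Cancelled subscription"),
    (5, 1, "Sale"), (5, 2, "Sale"), (5, 3, "Sale"), (5, 8, "Sale"), (5, 9, "Sale"),
]

# the FIXED LoD: count of distinct subscribers in the whole cohort
cohort = len({sid for sid, _, _ in rows})

def active_at(month):
    # cohort minus everyone whose cancellation fell in an *earlier* month --
    # equivalent to cohort - (RUNNING_SUM(cancels) - cancels this month)
    cancelled_before = len({sid for sid, m, t in rows
                            if t == "Cancelled subscription" and m < month})
    return cohort - cancelled_before

def purchase_rate(month):
    # distinct purchasers this month / subscribers still active this month
    buyers = len({sid for sid, m, t in rows if t == "Sale" and m == month})
    return buyers / active_at(month)
```

Here purchase_rate(3) comes out at 0.5, matching the 2 / 4 = 50% worked out by eye earlier.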

Let’s check that the logic worked, by building a simple text table. Lay months on rows, and the above formulae as columns.

Table proof

That seems to match expectations. We’re certainly seeing the 50% of actives making a purchase in month 3 that we manually calculated above.

Plot a line chart with month of subscription on columns and proportion of distinct active subscribers making purchase on rows, and there we have the classic rebased propensity to purchase curve.

Propensity curve

(although this data being very small and very fake makes the curve look very peculiar!)

Note that we first experimented with this back in ye olde days of Tableau, before the incredible Level Of Detail calculations were available. I have found many cases where it’s worth re-visiting past table calculation work and considering if LoD expressions would work better, and this may well be one of them.