My favourite R package for: correlation

R is a wonderful, flexible, if somewhat arcane tool for analytics of all kinds. Part of its power, yet also its ability to bewilder, comes from the fact that there are so many ways of doing the same, or similar, things. Many of these ways are instantly available thanks to many heroes of the R world creating and distributing free libraries to refine existing functionality and add new abilities.

Looking at a list of one from the most popular sources for these packages, CRAN, shows that their particular collection gets new entries on a several-times-per-day basis, and there are 11,407 of them at the time of writing.

With that intimidating stat in mind, I will keep a few notes on this blog as to my current favourite base or package-based methods for some common analytics tasks. Of course these may change regularly, based on new releases or my personal whims. But for now, let’s tackle correlations. Here I mean simple statistical correlations between 2 sets of data, the most famous one of which is likely the Pearson correlation coefficient, aka Pearson’s R.

What would I like to see in my ideal version of a correlation calculator? Here’s a few of my personal preferences in no particular order.

  1. Can deal with multiple correlation tests at once. For example, maybe I have 5 variables and I’d like to see the correlation between each one of them with each of the other 4 variables).
  2. Visualises the results nicely, for example in a highlighted correlation matrix. Default R often produces informative but somewhat uninspiring text output. I have got spoiled with the luxury of data visualisation tools so after a heavy day’s analysis I prefer to take advantage of the many ways dataviz can make analytic output easier to decipher for humans.
  3. If the output is indeed a dataviz, I have a slight preference for it to use ggplot charting all other things being equal. Ending up with a proper ggplot object is nice both in terms of the default visual settings vs some other forms of R chart, and also that you can then in theory use ggplot methods to adapt or add to it.
  4. Can produce p values, confidence intervals, or some other way of suggesting whether any correlations found are statistically significant or not.
  5. Can handle at least a couple of types of correlation calculations, the most common of which are probably Pearson correlation coefficient and Spearman’s rank correlation coefficient.

Default R has a couple of correlation commands built in to it. The most common is probably “cor“. Here’s an example of what it produces, using a test dataset named test_data of 5 variables, named a, b, c, d and e (which are in columns .

cor(test_data)

cor

So, it does multiple tests at once, and can handle Pearson, Spearman and Kendall correlation calculations, via changing the “method” parameter (which defaults to Pearson if you don’t specify, as in my example). But it doesn’t show the statistical significance of the correlations, and a potentially large text table of 8-place decimal numbers is not the sort of visualised output that would help me interpret the results in a particularly efficient way.

A second relevant default R command is “cor.test“. This one only allows you to make a single correlation comparison, for example between variable a and variable b.

cor.test(test_data$a, test_data$b)

cortest

So here we see it does return both a p value and a confidence interval to help us judge the significance of the correlation size noted. You can change the alternative hypothesis and confidence interval range via parameters. It can also do the same 3 types of correlation that “cor” supports. But, as noted, it can only compare two variables at once without further commands. And the output is again a bunch of text. That is really OK here, as you are focusing only on one comparison. But it’s going to be pretty tedious to run and decipher if you want to compare each one of a few variables against each of the others.

So, is there a package solution that makes me happy? As you might guess, yes, there’s actually a few contenders. But my current favourite is probably “ggcorrplot“. The manual is here, and there’s a handy usage guide here.

Suffice to say:

  1. It allows you to compare several variables against each other at the same time.
  2. It visualises the variables in a colour-coded correlation matrix
  3. The visualisation is a ggplot
  4. It can produce p values, using the accompanying function cor_pmat(), which can then be shown on the visualisation in various ways.
  5. It uses the results from the built in cor() function, so can handle the same 3 types of correlation.

There’s a bunch of options to select from, but here’s the default output

# calculate correlations using cor and store the results
corrdata <- cor(test_data)

# use the package's cor_pmat function to calculate p-values for the correlations
p.mat <- cor_pmat(test_data)

# produce a nice highlighted correlation matrix
ggcorrplot(corrdata, title = "Correlation matrix for test data")

The results look like:

ggcordefault

You can see it produces a correlation matrix, colour coded as to the direction and strength of the correlations. It doesn’t show anything about the statistical significance. Kind of pretty for an overview glance, but it could be rather more informative.

I much prefer to use a couple of options that show the actual correlation values and the significance; the ones I most commonly use probably being this set.

ggcorrplot(corrdata, title = "Correlation matrix for test data", lab=TRUE, p.mat = p.mat, sig.level = .05)

ggcorrplot

Here, the correlation coefficients are superimposed on the grid, so you can check immediately the strength of the correlation rather than try and compare to the colour scale.

You can also see that some of the cells are crossed out (for example the correlation between variable c and variable e in the above). This means that the correlation detected is not considered to be significant at the 0.05 level. That level can be changed, or the insignificant correlations be totally hidden if you prefer to not get distracted by them in the first place.

Advertisements

Transactions by active subscribers formulae in Tableau

This blog returns back from the dead (dormant?) with a quick note-to-self on how to do something that sounds simple but proved slightly complicated in practice, using Tableau.

Here’s a scenario, although many others would fit the same pattern. Imagine you have a business that is subscription based, where people can subscribe and cancel whenever they wish. Whilst subscribed, your customer can buy products from you anytime they want. ¬†They can’t buy products if not subscribed.

What you want to know is, for a given cohort of subscribers, how many of those people who are still subscribed purchased a product within their first, second…nth month?

So we want to be able to say things like: “out of the X people that started a subscription last year, Y% of those who were still subscribed for at least Z months bought a product in their Zth month”.

It’s the “who were still subscribed” part that made this a little tricky, at least with the datasource I was dealing with.

Here’s a trivially small example of what I had – a file that has 1 row per sale per customer.

Subscriber ID Length of subscription Month of subscription Transaction type
1 5 1 Sale
1 5 2 Sale
1 5 3 Sale
2 7 1 Sale
2 7 6 Sale
3 1 1 Sale
4 8 1 Sale
4 8 2 Sale
4 8 4 Sale
4 8 5 Sale
5 1 Sale
5 2 Sale
5 3 Sale
5 8 Sale
5 9 Sale

For simplicity, let’s assume every customer has at least one sale. The columns tell you:

  • the ID number of the subscriber
  • the length of the subscription from start to finish, in months. If the length is blank then it means it’s still active today so we don’t know how long it will end up being.
  • the month number of the product sale
  • a transaction type, which for our purposes is always “sale”

Example interpretation: subscriber ID 1 had a subscription that lasted 5 months. They purchased a product in month 1, 2 and 3 (but not 4 or 5).

It’d be easy to know that you had 5 people at the start (count distinct subscriber ID), and that you had 2 transactions in month 3 (count distinct subscriber ID where month of subscription = 3). But how many of those 5 people were still subscribed at that point?

Because this example is so small, you can easily do that by eye. You can see in the data table that we had one subscription, ID 3, that only had a subscription length of 1 month. Everyone else stayed longer than 3 months – so there were 4 subscriptions left at month 3.

The figure we want to know is what proportion of the active subscribers at month 3 bought a product. The correct answer is the number of subscriptions making a product purchase at month 3 divided by the number of subscriptions still active at month 3. Here, that’ s 2 / 4 = 50%.

So how do we get that in Tableau, with millions of rows of data? As you might guess, one method involves the slightly-dreaded “table calculations“. Layout is usually important with table calculations. Here’s one way that works. We’ll build it up step by step, although you can of course combine many of these steps into one big fancy formula if you insist.

Firstly, I modified the data source (by unioning) so that when a subscription was cancelled it generated a “cancelled subscription” transaction. ¬†That looked something like this after it was done.

Subscriber ID Length of subscription Month of subscription Transaction type
1 5 1 Sale
1 5 2 Sale
1 5 3 Sale
1 5 5 Cancelled subscription
2 7 1 Sale
2 7 6 Sale
2 7 7 Cancelled subscription
3 1 1 Sale
3 1 1 Cancelled subscription
4 8 1 Sale
4 8 2 Sale
4 8 4 Sale
4 8 5 Sale
4 8 8 Cancelled subscription
5 1 Sale
5 2 Sale
5 3 Sale
5 8 Sale
5 9 Sale

Note there’s the original sales transactions and now a new “cancel” row for every subscription that was cancelled. In these transactions the “month of subscription” is set to the actual month the subscription was cancelled, which we know from the field “Length of subscription”

Here are the formulae we’ll need to work out, for any given month, how many people were still active, and how many of those bought something:

  • The total count of subscribers in the cohort:
    Count of distinct subscribers in cohort

    { FIXED : [Count of distinct subscribers]}
  • The number of subscribers who cancelled in the given month:
    Count of distinct subscribers cancelling

     COUNTD(
     IF [Transaction type] = "Cancelled subscription"
     THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Derived from those two figures, the number of subscribers who are still active at the given month:
    Count of subscribers still active

    AVG([Count of distinct subscribers in cohort]) - (RUNNING_SUM([Count of distinct subscribers cancelling])- [Count of distinct subscribers cancelling])
  • The number of subscribers who made a purchase in the given month:
    Count of distinct subscribers making purchase

     COUNTD(
     IF [Transaction type] = "Sale" THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Finally, derived from the last two results, the proportion of subscribers who made a purchase as a percentage of those who are still active
    Proportion of distinct active subscribers making purchase

    [Count of distinct subscribers making purchase] / [Count of subscribers still active]

Let’s check if it the logic worked, by building a simple text table. Lay months on rows, and the above formulae as columns.

Table proof

That seems to match expectations. We’re certainly seeing the 50% of actives making a purchase on month 3 that were manually calculated above.

Plot a line chart with month of subscription on columns and proportion of distinct active subscribers making purchase on rows, and there we have the classic rebased propensity to purchase curve.

Propensity curve

(although this data being very small and very fake makes the curve look very peculiar!)

Note that we first experimented with this back in ye olde days of Tableau, before the incredible Level Of Detail calculations were available. I have found many cases where it’s worth re-visiting past table calculation work and considering if LoD expressions would work better, and this may well be one of them.