My favourite R package for: correlation

R is a wonderful, flexible, if somewhat arcane tool for analytics of all kinds. Part of its power, yet also its ability to bewilder, comes from the fact that there are so many ways of doing the same, or similar, things. Many of these ways are instantly available thanks to many heroes of the R world creating and distributing free libraries to refine existing functionality and add new abilities.

Looking at a list from one of the most popular sources for these packages, CRAN, shows that its collection gets new entries several times per day, and that there are 11,407 of them at the time of writing.

With that intimidating stat in mind, I will keep a few notes on this blog as to my current favourite base or package-based methods for some common analytics tasks. Of course these may change regularly, based on new releases or my personal whims. But for now, let’s tackle correlations. Here I mean simple statistical correlations between 2 sets of data, the most famous one of which is likely the Pearson correlation coefficient, aka Pearson’s R.

What would I like to see in my ideal version of a correlation calculator? Here are a few of my personal preferences, in no particular order.

  1. Can deal with multiple correlation tests at once. For example, maybe I have 5 variables and I’d like to see the correlation between each one of them and each of the other 4 variables.
  2. Visualises the results nicely, for example in a highlighted correlation matrix. Default R often produces informative but somewhat uninspiring text output. I have got spoiled with the luxury of data visualisation tools so after a heavy day’s analysis I prefer to take advantage of the many ways dataviz can make analytic output easier to decipher for humans.
  3. If the output is indeed a dataviz, I have a slight preference for it to use ggplot charting all other things being equal. Ending up with a proper ggplot object is nice both in terms of the default visual settings vs some other forms of R chart, and also that you can then in theory use ggplot methods to adapt or add to it.
  4. Can produce p values, confidence intervals, or some other way of suggesting whether any correlations found are statistically significant or not.
  5. Can handle at least a couple of types of correlation calculations, the most common of which are probably Pearson correlation coefficient and Spearman’s rank correlation coefficient.

Default R has a couple of correlation commands built into it. The most common is probably “cor”. Here’s an example of what it produces, using a test dataset named test_data of 5 variables, named a, b, c, d and e (one per column).

cor(test_data)

cor

So, it does multiple tests at once, and can handle Pearson, Spearman and Kendall correlation calculations, via changing the “method” parameter (which defaults to Pearson if you don’t specify, as in my example). But it doesn’t show the statistical significance of the correlations, and a potentially large text table of 8-place decimal numbers is not the sort of visualised output that would help me interpret the results in a particularly efficient way.
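
For instance, switching to Spearman’s rank correlation is just a case of changing that parameter. A minimal sketch, using the same test_data as above:

# Spearman's rank correlation matrix for the same 5 variables
cor(test_data, method = "spearman")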

A second relevant default R command is “cor.test”. This one only allows you to make a single correlation comparison, for example between variable a and variable b.

cor.test(test_data$a, test_data$b)

cortest

So here we see it does return both a p value and a confidence interval to help us judge the significance of the correlation size noted. You can change the alternative hypothesis and confidence interval range via parameters. It can also do the same 3 types of correlation that “cor” supports. But, as noted, it can only compare two variables at once without further commands. And the output is again a bunch of text. That is really OK here, as you are focusing only on one comparison. But it’s going to be pretty tedious to run and decipher if you want to compare each one of a few variables against each of the others.
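
As a quick illustration of those parameters (a sketch using the same test_data as above), you could ask for a one-sided test with a 99% confidence interval like this:

# test for a positive correlation between a and b, with a 99% confidence interval
cor.test(test_data$a, test_data$b, alternative = "greater", conf.level = 0.99)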

So, is there a package solution that makes me happy? As you might guess, yes, there are actually a few contenders. But my current favourite is probably “ggcorrplot”. The manual is here, and there’s a handy usage guide here.

Suffice to say:

  1. It allows you to compare several variables against each other at the same time.
  2. It visualises the variables in a colour-coded correlation matrix.
  3. The visualisation is a ggplot.
  4. It can produce p values, using the accompanying function cor_pmat(), which can then be shown on the visualisation in various ways.
  5. It uses the results from the built in cor() function, so can handle the same 3 types of correlation.

There’s a bunch of options to select from, but here’s the default output

# calculate correlations using cor and store the results
corrdata <- cor(test_data)

# use the package's cor_pmat function to calculate p-values for the correlations
p.mat <- cor_pmat(test_data)

# produce a nice highlighted correlation matrix
ggcorrplot(corrdata, title = "Correlation matrix for test data")

The results look like:

ggcordefault

You can see it produces a correlation matrix, colour coded as to the direction and strength of the correlations. It doesn’t show anything about the statistical significance. Kind of pretty for an overview glance, but it could be rather more informative.

I much prefer to use a couple of options that show the actual correlation values and their significance; the set I most commonly use is probably this one:

ggcorrplot(corrdata, title = "Correlation matrix for test data", lab=TRUE, p.mat = p.mat, sig.level = .05)

ggcorrplot

Here, the correlation coefficients are superimposed on the grid, so you can check immediately the strength of the correlation rather than try and compare to the colour scale.

You can also see that some of the cells are crossed out (for example the correlation between variable c and variable e in the above). This means that the correlation detected is not considered to be significant at the 0.05 level. That level can be changed, or the insignificant correlations be totally hidden if you prefer to not get distracted by them in the first place.
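
If you’d rather the insignificant correlations disappeared entirely instead of being crossed out, my understanding of the ggcorrplot documentation is that the insig argument handles that (a sketch, so check the docs for your installed version):

# blank out any correlations not significant at the 0.05 level, rather than crossing them out
ggcorrplot(corrdata, title = "Correlation matrix for test data", lab = TRUE, p.mat = p.mat, sig.level = .05, insig = "blank")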


Transactions by active subscribers formulae in Tableau

This blog returns back from the dead (dormant?) with a quick note-to-self on how to do something that sounds simple but proved slightly complicated in practice, using Tableau.

Here’s a scenario, although many others would fit the same pattern. Imagine you have a business that is subscription based, where people can subscribe and cancel whenever they wish. Whilst subscribed, your customer can buy products from you anytime they want.  They can’t buy products if not subscribed.

What you want to know is, for a given cohort of subscribers, how many of those people who are still subscribed purchased a product within their first, second…nth month?

So we want to be able to say things like: “out of the X people that started a subscription last year, Y% of those who were still subscribed for at least Z months bought a product in their Zth month”.

It’s the “who were still subscribed” part that made this a little tricky, at least with the datasource I was dealing with.

Here’s a trivially small example of what I had – a file that has 1 row per sale per customer.

Subscriber ID | Length of subscription | Month of subscription | Transaction type
1 | 5 | 1 | Sale
1 | 5 | 2 | Sale
1 | 5 | 3 | Sale
2 | 7 | 1 | Sale
2 | 7 | 6 | Sale
3 | 1 | 1 | Sale
4 | 8 | 1 | Sale
4 | 8 | 2 | Sale
4 | 8 | 4 | Sale
4 | 8 | 5 | Sale
5 | (blank) | 1 | Sale
5 | (blank) | 2 | Sale
5 | (blank) | 3 | Sale
5 | (blank) | 8 | Sale
5 | (blank) | 9 | Sale

For simplicity, let’s assume every customer has at least one sale. The columns tell you:

  • the ID number of the subscriber
  • the length of the subscription from start to finish, in months. If the length is blank then it means it’s still active today so we don’t know how long it will end up being.
  • the month number of the product sale
  • a transaction type, which for our purposes is always “sale”

Example interpretation: subscriber ID 1 had a subscription that lasted 5 months. They purchased a product in month 1, 2 and 3 (but not 4 or 5).

It’d be easy to know that you had 5 people at the start (count distinct subscriber ID), and that you had 2 transactions in month 3 (count distinct subscriber ID where month of subscription = 3). But how many of those 5 people were still subscribed at that point?

Because this example is so small, you can easily do that by eye. You can see in the data table that we had one subscription, ID 3, that only had a subscription length of 1 month. Everyone else stayed longer than 3 months – so there were 4 subscriptions left at month 3.

The figure we want to know is what proportion of the active subscribers at month 3 bought a product. The correct answer is the number of subscriptions making a product purchase at month 3 divided by the number of subscriptions still active at month 3. Here, that’s 2 / 4 = 50%.

So how do we get that in Tableau, with millions of rows of data? As you might guess, one method involves the slightly-dreaded “table calculations”. Layout is usually important with table calculations. Here’s one way that works. We’ll build it up step by step, although you can of course combine many of these steps into one big fancy formula if you insist.

Firstly, I modified the data source (by unioning) so that when a subscription was cancelled it generated a “cancelled subscription” transaction.  That looked something like this after it was done.

Subscriber ID | Length of subscription | Month of subscription | Transaction type
1 | 5 | 1 | Sale
1 | 5 | 2 | Sale
1 | 5 | 3 | Sale
1 | 5 | 5 | Cancelled subscription
2 | 7 | 1 | Sale
2 | 7 | 6 | Sale
2 | 7 | 7 | Cancelled subscription
3 | 1 | 1 | Sale
3 | 1 | 1 | Cancelled subscription
4 | 8 | 1 | Sale
4 | 8 | 2 | Sale
4 | 8 | 4 | Sale
4 | 8 | 5 | Sale
4 | 8 | 8 | Cancelled subscription
5 | (blank) | 1 | Sale
5 | (blank) | 2 | Sale
5 | (blank) | 3 | Sale
5 | (blank) | 8 | Sale
5 | (blank) | 9 | Sale

Note there’s the original sales transactions and now a new “cancel” row for every subscription that was cancelled. In these transactions the “month of subscription” is set to the actual month the subscription was cancelled, which we know from the field “Length of subscription”

Here are the formulae we’ll need to work out, for any given month, how many people were still active, and how many of those bought something:

  • The total count of subscribers in the cohort:
    Count of distinct subscribers in cohort

    { FIXED : COUNTD([Subscriber ID]) }
  • The number of subscribers who cancelled in the given month:
    Count of distinct subscribers cancelling

     COUNTD(
     IF [Transaction type] = "Cancelled subscription"
     THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Derived from those two figures, the number of subscribers who are still active at the given month:
    Count of subscribers still active

    AVG([Count of distinct subscribers in cohort]) - (RUNNING_SUM([Count of distinct subscribers cancelling])- [Count of distinct subscribers cancelling])
  • The number of subscribers who made a purchase in the given month:
    Count of distinct subscribers making purchase

     COUNTD(
     IF [Transaction type] = "Sale" THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Finally, derived from the last two results, the proportion of subscribers who made a purchase as a percentage of those who are still active:
    Proportion of distinct active subscribers making purchase

    [Count of distinct subscribers making purchase] / [Count of subscribers still active]

Let’s check if the logic worked by building a simple text table. Lay months on rows, and the above formulae as columns.

Table proof

That seems to match expectations. We’re certainly seeing the 50% of actives making a purchase in month 3 that we manually calculated above.

Plot a line chart with month of subscription on columns and proportion of distinct active subscribers making purchase on rows, and there we have the classic rebased propensity to purchase curve.

Propensity curve

(although this data being very small and very fake makes the curve look very peculiar!)

Note that we first experimented with this back in ye olde days of Tableau, before the incredible Level Of Detail calculations were available. I have found many cases where it’s worth re-visiting past table calculation work and considering if LoD expressions would work better, and this may well be one of them.

 

 

Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies any amount of the history of, or the best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These specific transformations of data-into-diagram have stuck with us through the mists of time in order to become examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.

snowmap

(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence caused the authorities to remove the Broad Street pump handle; people could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about perhaps is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause ->  action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had actually already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? He’d actually already found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but it certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: Here then is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by people who live on it. In this case, when the land was lower (vs sea level for example) then cholera deaths seemed to increase.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera“. It included this table:

farrtable

The relationship seems quite clear; cholera deaths per 10k persons go up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a linechart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriology Reviews. Note the “observed mortality” line.

farrchart.gif

Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line concerning the calculated death rate according to “miasma theory”, which seems to have an outcome very similar on this metric to the actual cholera death rate. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later replaced with the knowledge of germs, but at the time the miasma theory was a strong contender for explaining the distribution of disease. This was probably helped because some potential actions one might take to reduce “miasma” evidently would overlap with those of dealing with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems  quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we discovered later, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed them in a reasonable way. He was testing a hypothesis that was based on the common sense at the time he was working, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here though we can think of Karl Popper’s view of scientific knowledge being derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume certainty that the dominant one is correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your own findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I was to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view with regards to the explanation of the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will probably place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, can happen in day-to-day business analysis. Preexisting knowledge changes over time, and differs between groups. Who hasn’t seen (or been) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data that was later revealed to have been affected by something entirely different?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to adding value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that overnight 100% of the public stopped buying anything at all from your previously highly successful store.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer dataviz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.

handle

As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely important to take action on the findings equally as fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom of the time. At least any modern-day analysts can take some solace in the knowledge that even our most highly regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened –  it may well contribute towards great things in the future.

Retrieving Adobe SiteCatalyst data with R

Adobe SiteCatalyst (part of Adobe Analytics) is a nicely comprehensive tool for tracking user interactions upon one’s website, app and more. However, in the past I’ve had a fair amount of trouble de-siloing its potentially immensely useful data into external tools, such that I could connect, link and process it for insights over and above those you can get within the default web tool (which, to be fair, is itself improving over time).

I’ve written in the past about my happiness when Alteryx released a data connector allowing one to access Sitecatalyst data from inside their own tool. That’s still great, but the tool is necessarily constrained to the specific tasks the creator designed it to do, and subject to the same API limits as everyone else is. I have no doubt that there are ways and means to get around that in Alteryx (after all, it speaks R). But sometimes, when trying to automate operations, coding in something like R might actually be easier…or at least cheaper!

With that in mind, after having a recent requirement to analyse web browsing data at an individual customer level, I successfully experimented with the awesome RSiteCatalyst package, and have some notes below as to the method which worked well for me.

Note that RSiteCatalyst is naturally subject to the usual Adobe API limits – the main one that causes me sadness being the inability to retrieve over 50k rows at a time – but, due to the incessant flexibility of R and the comprehensiveness of this package, I’ve not encountered a problem I couldn’t solve just yet.

So, how to set up?

Install the RSiteCatalyst package

First, open up R, or your favourite R environment, and install the RSiteCatalyst package (the first time you use it) or load the library (each new session).

if(!require(RSiteCatalyst)) install.packages("RSiteCatalyst")
library(RSiteCatalyst)

Log in to your SiteCatalyst account

You next need to authenticate against your Adobe Sitecatalyst installation, using the SCAuth function. There’s an old way involving a shared secret, and a new way using OAuth. The latter is to be preferred, but at the time I first looked at it there seemed to be a web service issue that prevented the OAuth process completing. At the moment then, I can confirm that the old method still works!

 key <- "<username:company>"  # placeholder: your SiteCatalyst web services username, in the form username:company
 secret <- "<shared secret>"  # placeholder: your shared secret (see https://sc.omniture.com/p/suite/1.3/index.html?a=User.GetAccountInfo )
 SCAuth(key, secret)

 

Retrieve metadata about your installation

Once you’re in, there’s a plethora of commands to retrieve useful metadata, and then to run and retrieve various types of report data. For several such commands, you’ll need to know the ID of the Adobe Report Suite concerned, which is fortunately as easy as:

suites <- GetReportSuites()

whereby you’ll receive a dataframe containing all available report suites by title and ID.

If you already know the title of the report suite you’re interested in then you can grab the ID directly with something like:

my.rsid <- suites[suites$site_title=="My favourite website",1]

You can then find out which elements are available within your report suite:

elements.available <- GetElements(my.rsid)

and, later on, which metrics are available for a given element:

metrics.available <- GetMetrics(my.rsid, elements = "<<myFavouriteElement>>")

Retrieve Adobe Analytics report data

There are a few different RSitecatalyst functions you can call, depending on the type of report you’re interested in. In each case they start with “queue”, as what you’re actually doing is submitting a data request to the Sitecatalyst reporting queue.

If your request is pretty quick, you can wait a few seconds and get the results sent back to you immediately in R. If it’s going to take a very long time, you can instead store a report request ID and then use the GetReport function to go back and get it later, once it’s finished.
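
I’ve mostly used the “wait for the results” flavour myself, but, if I’ve read the documentation correctly, the deferred pattern looks something like the sketch below, using the same sort of over-time report that’s described further down (the enqueueOnly parameter and the GetReport function are as per the package docs, so do double-check them against your version):

# submit the request but don't wait: just get a report ID back
report.id <- QueueOvertime(my.rsid,
                           date.from = "2016-01-01",
                           date.to = "2016-06-30",
                           metrics = c("pageviews", "visits"),
                           date.granularity = "day",
                           enqueueOnly = TRUE)

# ...some time later, once the queue has done its work...
dailyStats <- GetReport(report.id)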

The current list of queue functions, which are named such that Adobe aficionados will probably be able to guess which type of data they facilitate, is:

  • QueueDataWarehouse
  • QueueFallout
  • QueueOvertime
  • QueuePathing
  • QueueRanked
  • QueueSummary
  • QueueTrended

Here I’ll show just a couple of examples – but the full official documentation for all of them and much more besides is available at Cran.

Firstly, an “over time” report to get me the daily pageview and visit counts my site received in the first half of 2016.

Here’s how the documentation describes this type of report:

“A QueueOvertime report is a report where the only granularity allowed is time. This report allows for a single report suite, time granularity, multiple metrics, and a single segment”

An example would be a day by day count of total pageviews to your site.

Remember that above we set the variable “my.rsid” to the Report Suite ID of the report suite I am looking at. So:

dailyStats <- QueueOvertime(my.rsid,
                            date.from = "2016-01-01",
                            date.to = "2016-06-30",
                            metrics = c("pageviews","visits"),
                            date.granularity = "day"
                            )

The parameters are fairly self-evident, especially in combination with the documentation. There are many more available, including the ability to isolate by segment and to forecast data.

Your site will probably have some different metrics available to mine, depending on how the Adobe admin set it up, but the basics like page views and visits should be accessible pretty much anywhere.

What you’ll get back with the above command, i.e. the contents of the dailyStats variable after it has run, is a dataframe in this sort of format:

datetime | name | year | month | day | segment.id | segment.name | pageviews | visits
01/01/2016 | Fri. 1 Jan. 2016 | 2016 | 1 | 1 | | | 100 | 75
02/01/2016 | Sat. 2 Jan. 2016 | 2016 | 1 | 2 | | | 200 | 150
03/01/2016 | Sun. 3 Jan. 2016 | 2016 | 1 | 3 | | | 250 | 180
Which you can then go on to process as you would any other such snippet of data in R.
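
Just to show it really is ordinary R data, here’s a minimal sketch of rolling those daily figures up into monthly totals, using only the columns shown above:

# sum pageviews and visits by year and month
monthlyStats <- aggregate(cbind(pageviews, visits) ~ year + month, data = dailyStats, FUN = sum)
monthlyStats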

My second, slightly more complex, example concerns the QueueRanked function. It’s described like this:

A QueueRanked report is a report that shows the ranking of values for one or more elements relative to a metric, aggregated over the time period selected.

That’s a bit more vague, but was useful to me in my original goal of identifying specifically which customers logged into my website in December, and how many times they visited.

A key feature of this function is that you can ask for the results from rank [x] to rank [y], instead of just the top [n].

This is super-useful where, as I did, you expect to get over 50k rows. 50k is the maximum number of rows you can retrieve in one request via the Adobe API, which this R package uses. But R is full of typical programming-language features like loops, allowing one to iterate through the commands to retrieve, for instance, results 1-50,000, then results 50,001-100,000, then 100,001-150,000 and so on.

So, I built a loop that would generate these “ranked reports”, starting at row ‘i’ and giving me the next ‘count.step’ records, where count.step = 50000, the maximum I’m allowed to retrieve in one go.

Thus, I’d call the function repeatedly, each time asking for the next 50,000 records. At some point, when there were no more customers to download, I’d get a blank report sent back to me. At that point, I know I have everything so quit the loop.

I wanted to retrieve the ID of the customer using the website, which in my setup is stored in a custom element called “prop1”. All that sort of detail is controlled by your Adobe Sitecatalyst administrator, should you have exactly the same sort of requirement as I did – so best go and ask them which element to look in, as there’s no real chance your setup is identical to mine at that level.

Nonetheless, the code pattern below could likely be used without much modification in order to iterate through any SiteCatalyst data that exceeds the API row limits.

 

count.limit <- 500000 #the max number of records we're interested in
count.step <- 50000 #how many records to retrieve per request, must not exceed 50k
count.start <- 1 #which record number to start with
CustomerList <- NULL #a variable to store the results in
fromDate <- "2016-12-01"
toDate <- "2016-12-31"

for(i in seq(1, count.limit, by = count.step)) {
  print(paste("Requesting rows",i, "through", i + count.step - 1))

  tempCustomerList <- QueueRanked(my.rsid,
                          date.from = fromDate,
                          date.to = toDate,
                          metrics = "visits",
                          elements = "prop1",
                          top = count.step,
                          start = i
                          )

  if  (nrow(tempCustomerList) == 0 ) {   # no more rows were returned - presumably we have them all now
    print("Last batch had no rows, exiting loop")
    break
  }
   
  tempCustomerList$batch.start.row <- i

  CustomerList <- rbind(CustomerList, tempCustomerList)

}

 

After running this, you should end up with a “CustomerList” dataframe that looks something like this, where “name” is the value of prop1, in my case the customer ID:

name | url | visits | segment.id | segment.name | batch.start.row
ID123 | | 10 | | | 1
ID234 | | 5 | | | 1

which, again, you can process as though it was any standard type of R data.

Actually you can use variables, CTEs and other fancy SQL with Tableau after all

A few months ago, I blogged about how you can use Tableau parameters when connecting to many database datasources in order to exert the same sort of flexibility that SQL coders can build into their queries using SQL variables.

This was necessary because Tableau does not let you use SQL variables, common table expressions, temp table creation or other such fanciness when defining the data you wish to analyse, even via custom SQL. You can copy and paste a SQL query using those features that works fine in Microsoft’s SQL Server Management Studio or Oracle’s SQL Developer into Tableau’s custom SQL datasource box, and you’ll get nothing but errors.

But it turns out that there are actually ways to use these features in Tableau, as long as you only need to run them once per session – upon connection to the database.

Why do this? Well, if you have a well-designed data warehouse or other analytically-friendly datasource, hopefully you’ll never need to. However, if you’re accessing “sub-optimally” designed tables for your task, or large raw unprocessed operational data tables, or other such scenarios that consistently form one of the banes of an analyst’s life, then you might find manipulating or summarising data before you use it in Tableau makes life much easier, much faster, or both.

To do this, we’ll use the “Initial SQL” feature of Tableau. This is supported on some, but not all, of the database types Tableau can connect to. You’ll know if your data connection does support it, because an option to use it will appear if you click on the relevant connection at the top left of the data source screen.

capture

If you click that option, you’ll get a text box where you can go crazy with your SQL in order to set up and define whatever final data table you’d like to connect to. Note that you’ll be restricted to the features of your database and the permissions your login gives you – so no standard table creation if you don’t have write-permissions etc. etc. But, to take the example of SQL Server, in my experience most “normal” users can write temporary tables.

Tableau isn’t great at helping you write this initial SQL – so if you’re not super-proficient and intend to write something complicated then you might want to play with it in a friendlier SQL tool first and paste it into Tableau when you know it works, or ask a friendly database-guru to do it for you.

Below then is an example of how one can use variables, CTEs and temporary tables in order to pre-build a data table you can then analyse in Tableau, by virtue of pasting code into the initial SQL box. This code will be re-run and hence refreshed every time you open the workbook. But as the name “initial SQL” suggests, it will not be refreshed every time you create or modify a new chart if you’ve not re-connected to the datasource inbetween.

(For what it’s worth, this uses the default demo database in SQL server – AdventureWorks. It’s the equivalent of Tableau’s famous Superstore dataset, in the Microsoft world 🙂 )

DECLARE @granularity AS varchar(50);
SET @granularity = 'yearly';

WITH OrdersCTE AS
(
SELECT OrderDate, SalesOrderID, TotalDue
FROM [SalesLT].[SalesOrderHeader]
)

SELECT
CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar,YEAR(OrderDate))
WHEN @granularity = 'monthly' THEN CONVERT(varchar,YEAR(OrderDate)) + '-' + CONVERT(varchar,MONTH(OrderDate))
WHEN @granularity = 'daily' THEN CONVERT(date,OrderDate)
END AS Period,

COUNT(SalesOrderID) AS CountOfSales,
SUM(TotalDue) AS TotalDue
INTO #initialSQLDemo
FROM OrdersCTE

GROUP BY
CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar,YEAR(OrderDate))
WHEN @granularity = 'monthly' THEN CONVERT(varchar,YEAR(OrderDate)) + '-' + CONVERT(varchar,MONTH(OrderDate))
WHEN @granularity = 'daily' THEN CONVERT(date,OrderDate)
END;

Now, this isn’t supposed to be an SQL tutorial, so I’ll not explain in detail what the above does. But for those of you already familiar with (T)SQL, you’ll note that I set a variable as to how granular I want my dates to be, use a CTE to build a cut-down version of a sales table with fewer columns, and then aggregate my sales table to the level of date I asked for in my variable into a temporary table I called #initialSQLDemo.

In the above, it’s set to summarise sales by year. This means Tableau will only ever receive 1 row per year. You’ll not be able to drill down into more granular dates – but if this is detailed enough for what you want, then this might provide far better performance than if you connected to a table with 1 row per minute and, implicitly or explicitly, ask Tableau to constantly aggregate it in expensive ways.

Later on, perhaps I realise I need daily data. In which case, I can just change the second line above to:

SET @granularity = 'daily';

…which will expose data at a daily level to Tableau, and I’m good to go.

SQL / Adventureworks aficionados will probably realise that my precise example is very contrived, but hopefully it’s clear how the initial SQL feature works, and hence in the real world you’ll be able to think of cases that are truly useful. Note though that if your code does something that is slow, then you will experience this slowness whenever you open your workbook.

A quick note on temporary tables:

If you’re following along, you might notice something. Although I created a temp table called #initialSQLDemo (which, being temporary, will be deleted from the database as soon as you close your session), it never appears in the table list on the left of the Tableau data source screen.

capture

Why so? Well, in SQL Server, temporary tables are created in a database called “tempdb” that Tableau doesn’t seem to show. This is not a fatal deal though, as they’re still accessible via the “New Custom SQL” option shown above.

In my example then, I dragged New Custom SQL to the data pane and entered a simple “give me everything” into the resulting box.

capture

Now, there is a downside in that I understand using custom SQL can reduce performance in some cases, as Tableau is limited in how it can optimise queries. But a well-designed temporary table might in any case be intrinsically faster to use than an over-sized, ill-designed permanent table.

When venturing back into Tableau’s main viz-design screen, you’ll then see that your temporary table is treated just as any other table source would be. Tableau doesn’t care that it’s temporary.

capture

Although we should note that if you want to use the table in the future in a different Tableau workbook, you’d have to have it run the initial SQL again there, as standard temporary tables are not shareable and do not persist between database sessions.

 

Clustering categorical data with R

Clustering is one of the most common unsupervised machine learning tasks. In Wikipedia‘s current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some ability to cluster in them. For example, Alteryx has K-Centroids Analysis; R, Python, SPSS, Statistica and any other proper data-sciencey tools all likely have many methods – and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility. You can do the calculations in Excel, should you really want to (although why not cheat and use a nice addin if you want to save time?).

However, many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better for clustering objects that have quantitative numeric fields, rather than those that are categorical. I’m not going to delve into the details of why here, but, simplistically, they tend to be based on concepts like Euclidean distance – and in that domain, it’s conceptually difficult to say that [bird] is Euclideanly “closer” to [fish] than [animal]; vs the much more straightforward task of knowing that an income of £100k is nearer to one of £90k than it is to 50p. IBM has a bit more about that here.

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has a ton of libraries that might help you out. Below are a couple I’ve used, and a few notes as to the very basics of how to use them – not that it’s too difficult once you’ve found them. The art of selecting the optimum parameters for the very finest of clusters though is still yours to master, just like it is on most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see 🙂 ? A paper called ‘Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values‘ by Huang gives the gory details.

Luckily though, an R implementation is available within the klaR package. The klaR documentation is available in PDF format here and is certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID FieldA FieldB FieldC FieldD
1 0 0 0 1
2 0 0 0 0
3 0 0 0 1
4 1 1 0 0

Here’s how you might read it in, and cluster the records based on the contents of fields “FieldA”, “FieldB”, “FieldC”, and “FieldD”.

install.packages("klaR")
library(klaR)
setwd("C:/Users/Adam/CatCluster/kmodes")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-kmodes(data.to.cluster[,2:5], 3, iter.max = 10, weighted = FALSE ) #don't use the record ID as a clustering variable!

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want so you have a few variations to compare the quality or real-world utility of.
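
If you do want to compare a few such variations quickly, the withindiff component of the results (described below) gives one crude basis for doing so. A rough sketch:

# try 2 to 6 clusters and compare the total within-cluster simple-matching distance
for (k in 2:6) {
  candidate <- kmodes(data.to.cluster[, 2:5], k, iter.max = 10, weighted = FALSE)
  print(paste("clusters:", k, "- total within-cluster distance:", sum(candidate$withindiff)))
}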

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables
    in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a
    number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted version of this distance.

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster

Here’s an example of what it looks like when output to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
 FieldA FieldB FieldC FieldD
1 1 0 0 0
2 1 0 1 1
3 0 0 0 0

Clustering vector:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
 3 3 3 1 3 1 2 3 3 3 2 2 2 3 3 2 1 3 3 3

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters onto the original dataset, you can just add the cluster back onto your original dataset as a new column, and perhaps write it out as a file to analyse elsewhere, like this:

cluster.output <- cbind(data.to.cluster,cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)

 

The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again, a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters based on how the algorithms differ.

Here’s what you’d do to cluster the same data as above, and write it back out, this time with the Rock clusters appended. Note here that ideally you’re specifically passing in a matrix to the rockCluster function.

install.packages("cba")
library(cba)
setwd("C:/Users/Adam/CatCluster/rock")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-rockCluster(as.matrix(data.to.cluster[,2:5]), 3 )
cluster.output <- cbind(data.to.cluster,cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)

The full list of parameters to the relevant function, rockCluster is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.

This is the output, which is of class “rock”, when printed to the screen:

data: x 
 beta: 0.5 
theta: 0.5 
 fun: dist 
 args: list(method = "binary") 
 1 2 3 
14 5 1

The object is a list, and its most useful component is probably “cl”, which is a factor containing the assignments of clusters to your data.

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re free to visualise in R, or any other tool.
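
For example, having appended the Rock clusters as above, a very quick ggplot sketch of how many records ended up in each cluster might look like the below (it assumes the cluster.results object from the Rock example, whose “cl” component holds the assignments):

library(ggplot2)

# count how many records fall into each cluster and plot as a bar chart
cluster.sizes <- as.data.frame(table(cluster.results$cl))
names(cluster.sizes) <- c("cluster", "records")
ggplot(cluster.sizes, aes(x = cluster, y = records)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of records per cluster")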

Workaround for the Tableau “custom SQL” restriction on SQL Server variables

SQL Server (among other databases) has a handy feature for easy re-use of queries, in the guise of variables. You can declare variables and use them repeatedly in any query in the current session. That’s pretty handy for any complicated query forms you use repeatedly, where each time you might need to change some basic criteria.

As a trivial example: here’s how to select all the customers I have that live in postcode 12345

DECLARE @postcode varchar(5)
SET @postcode = '12345'

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

I can then run it another day for postcode 23456 by just changing the second line to

SET @postcode = '23456'

This example is a bit silly because you might as well just type 12345 or 23456 into the query itself – but imagine a more complex SQL statement, or set of coordinated SQL statements, with screenfuls of code, and you might see how it makes it much easier to maintain.

But, sad to say, you can’t use these sort of variables as part of the custom SQL feature in Tableau. If you try, then you’ll get this slightly unhelpful message:

[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near the keyword 'DECLARE'.
[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near ')'.

But rather than have to hardcode your queries and modify them for every extract you can use a very simple alternative: our friend, the Tableau parameter.

For each SQL variable, create a Tableau parameter of the same data type. You can put in a default value as the “Current value” when setting it up.

capture

Then, when connecting to your database, paste the original SQL statement you had into the Custom SQL box, removing the variable declaration and setting from the top.

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

capture

One at a time, highlight each SQL variable there, i.e. the @postcode part of the above, press the “Insert Parameter” button under the text field and select the appropriate Tableau parameter.

(There is actually the option to create a new parameter at this stage, if you didn’t do so in advance).

You’ll end up with a similar looking SQL statement, but with the variables now referred to in this sort of way:

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS]
WHERE POSTAL_CODE = <Parameters.Postcode>

And it’ll work!

You can then use the standard Tableau functionality to change the parameters in Tableau, and behind the scenes your SQL statement will update appropriately.

Note firstly, for performance, that each change of this sort of parameter is going to cause the query to be re-run on the database (unless it’s already been cached).

Secondly, if you have created a Tableau Data Extract from this datasource, then altering the parameter won’t change anything until you refresh the extract.

Finally, the Tableau parameter is simply a normal Tableau parameter, just like anything else. So there’s nothing to stop a Tableau user altering it, which depending on your setup might be either exactly what you want, or exactly what you don’t want. If the latter, you might consider having the datasource involving parameters in a separate workbook to the user-facing dashboard (publishing to Tableau Server to share), or at least making sure it’s not visible on the end-user dashboards if they have no reason to change it.

Quick sidenote: I wondered if the “Initial SQL” feature might work, to declare variables upon connection to the datasource which could then be used in custom SQL later. But sadly not; I guess it doesn’t count as the same session. If you try, you’ll see a message like:

[Microsoft][SQL Server Native Client 11.0][SQL Server]Must declare the scalar variable "@postcode"

A few notes on Tableau 10’s cross-database join feature

The new version of Tableau, number 10, got released whilst I was busy sunning myself on holiday. This meant there was at least something to look forward to in returning to the world of work 🙂

It has many, many new features. I suspect one of the most impactful, in this era of data lying around everywhere you look, may end up being cross-database joins.

This means that Tableau can now do “proper” joins (i.e. not blends) that span different databases, even different database technologies. So for instance, you can execute an inner join from one table from SQL Server onto data from another table in Redshift, and analyse the result as though it all came from the same place. Better yet, you can publish the resulting datasource as a Tableau Server datasource so everyone can join in the fun, without even having to care where it came from in the first place.

Below are a few notes / warnings from my early experimentations. Some may sound critical, but, overall, this is truly a great feature that will enable the less ETL-loving folks to get more done quicker, and/or without the resources of tech database specialists or software being needed.

  1. Excel counts as a database (as does any other supported file format). This means you can simulate database joins between multiple Excel files for instance without having to modify them in any way.
  2. You can use a custom SQL statement as the “table” you’re joining to or from in any join: i.e. you’re not restricted to just the formal tables on the database. This means that if you’re an SQL wizard who would prefer to have the database preprocess stuff before it hits Tableau, you can do that in the context of a cross-database join.
  3. Warning! I believe (from experience 🙂 ) that Tableau downloads all the data from the tables you choose to your local computer before performing the join. This is kind of predictable if you think it through, but perhaps not obvious. If you know what an Alteryx “in-database tool” is, then this is not like one of those.

    The joining happens on your computer, even if you don’t normally need to care about that. This means that if you’re inner joining a table of 10 rows in one database to 10 out of a possible 1 billion rows in another database to create an extract, Tableau will download 1 billion and 10 rows to your computer before returning you the post-join 10 rows as your datasource. You can apply data source filters to your tables before extracting, though, which might help alleviate some of the potential pain here.

  4. You can do left, inner and right joins very easily, joining on any number of fields as the key. But I didn’t see a facility for conditional joins, in the sense of saying “download all sales from Redshift for a customer whose info is in Oracle, where the sales are between the start and end date of the customer’s first 6 weeks according to Oracle”. You would have to let it download and show you all of the joined data and then use formulae in Tableau afterwards to filter out what you don’t want.
  5. Some data connection types aren’t allowed to be involved in cross-database joins at the moment. A Tableau blog post lists these as:
    • Tableau Server
    • Firebird
    • Google Analytics
    • Microsoft Analysis Services
    • Microsoft PowerPivot
    • Odata
    • Oracle Essbase
    • Salesforce
    • SAP BW
    • Splunk
    • Teradata OLAP Connector

      The big shame for me on that list is Tableau Server. Tableau Server datasources are a very cool way for non-DBAs to share analytically-useful data around the organisation (or for DBA professionals to provide a lovely structured, governed datasource ideal for Tableau users to use). I have however found that they can result in semi-silos, as exporting or manipulating them has certain limitations vs standard databases. If one could cross-database join these types of datasource one day, it would be an amazing way of adding extra flexibility to that feature.

  6. Performance may be super great, or it may be super slow, depending on your exact database setup.

But all those compliments and complaints aside, much credit is due to the makers of Tableau 10 for providing one fewer reason for analysts to load up other software, or pester other departments, before getting started on analysing all that lovely data.

Creating my first Tableau web data connector : part 3

At last, the final part of the trials and tribulations of creating my first Tableau Web Data Connector… Part 1 went through the prerequisites, and building what I might generously term my “user interface”. Part 2 was a struggle against the forces of web security. And in this part, we battle against the data itself, until the error-message dragon is truly slain.

So, in the last part, we ended up with a connector that could ask for the username of a BoardGameGeek user, connect to a CORS-enabled version of the API, and complain that whatever objects it found “are not valid as a React child”.

That makes sense, as I didn’t yet tell it what sort of data objects to expect.

As I’m following the Tableau web data connector tutorial, which accesses data from URLs like this one, I figured I’d look at precisely what data that URL returns, compare it to the tutorial code, and then, once I understood the relationship between the code and the data from the Yahoo URL, adapt the code to fit the format my chosen data arrives in.

Here’s the data one gets if you point your web browser to the URL that the tutorial is teaching you how to build a WDC from:

[Screenshot: the JSON returned by the tutorial’s Yahoo finance URL]

OK, stock-pickers will immediately recognise that it’s stock quote data, with dates, highs, lows, symbols and all the other lovely gubbins that come along with basic stock charts. It’s in JSON format, which is human-readable, but only in small doses.

So I need to understand the relationship between the above data, and this piece of code:

if (data.query.results) {
    var quotes = data.query.results.quote;
    var ii;
    // Loop over every quote record, keeping just the three fields of interest
    for (ii = 0; ii < quotes.length; ++ii) {
        var entry = {'Ticker': quotes[ii].Symbol,
                     'Day': quotes[ii].Date,
                     'Close': quotes[ii].Close};
        dataToReturn.push(entry);
    }
}

I can kind of see it by eye – there are obviously JSON entries for Symbol, Date, Close, query and so on. But can’t we make it easier?

Yes we can, because there’s such a thing as a web JSON viewer. Paste all that stock-related text into something like http://www.jsoneditoronline.org/ and you get a nice hierarchical visualisation of the JSON structure, like this:

[Screenshot: the same JSON shown in the online JSON editor’s tree view]

So if we assume that “data” is referring to the overall set of data itself, and use dot notation to traverse the hierarchy, we can see a line in the code that says:

var quotes = data.query.results.quote;

That would seem to fit into the JSON structure above, where below the “object” level you have a level called query, a sublevel called results, and a sublevel called quote.

[Screenshot: the query → results → quote levels of the JSON hierarchy]

The variable “quotes” that the code creates is therefore basically referring to everything at or below the “quote” level of the JSON hierarchy.

Then you get a bunch of records which are the data of interest itself. These are numbered 0 for the first datapoint, 1 for the next, and so on.
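To make that concrete, the Yahoo JSON is shaped something like this (heavily trimmed, with made-up values, keeping only the levels and fields the tutorial code actually refers to):

{
  "query": {
    "results": {
      "quote": [
        { "Symbol": "AAPL", "Date": "2016-08-01", "Close": "104.21" },
        { "Symbol": "AAPL", "Date": "2016-08-02", "Close": "105.79" }
      ]
    }
  }
}

So data.query.results.quote is the array of quote records that the code loops over.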

If you know a bit of basic programming, you might note that the WDC code has a “for loop” with a counter variable called “ii” that starts at 0 and increments by one each time the loop runs. That fits nicely with the idea of it iterating through datapoint 0, 1, …n until it gets to the end (which is effectively the length of the quotes dataset, i.e. quotes.length).

Each one of these JSON records then has a few attributes – and they include the ones mentioned in the code as being part of whichever “quotes” subrecord we’re on: “Symbol”, “Date” and “Close”.

var entry = {'Ticker': quotes[ii].Symbol,
             'Day': quotes[ii].Date,
             'Close': quotes[ii].Close};

[Screenshot: the Symbol, Date and Close attributes within one quote record]

Awesome. So where do “Ticker”, “Day” and “Close” on the left-hand side of those code lines come from?

Well, that’s up to us. Remember the “add code for defining columns” part of the tutorial?

myConnector.getColumnHeaders = function() {
    var fieldNames = ['Ticker', 'Day', 'Close'];
    var fieldTypes = ['string', 'date', 'float'];
    tableau.headersCallback(fieldNames, fieldTypes);
}

There they’ve defined the fieldnames they want to have Tableau display (which can be totally different from the JSON names we saw above in the datasource) and the field types.

Field types can be any from this list:

  • bool
  • date
  • datetime
  • float
  • int
  • string

So, let’s look at what’s returned when I make the call to my Boardgames datasource.

[Screenshot: the raw response from my BoardGameGeek API call]

OK, I can see some fields in there, but it’s kind of messy. So I pasted it back into the online JSON editor.

Here’s what the relevant part looks like when formatted nicely:

[Screenshot: the BoardGameGeek response formatted in the online JSON editor]

So, the interesting information there is probably the boardgame’s name, year published, URLs to its image and thumbnail, the various statuses (own, want, etc.) and the number of times it’s been played (numplays).

I therefore edited the “getColumnHeaders” code snippet to reflect the fields I wanted, and the type of data that would feature in them.

myConnector.getColumnHeaders = function() {
    var fieldNames = ['BoardgameName', 'YearPublished', 'ImageURL', 'ImageThumbnailURL',
                      'Plays', 'Owned', 'PreviouslyOwned', 'ForTrade', 'Want', 'WantToPlay',
                      'WantToBuy', 'Wishlist', 'Preordered', 'LastModified'];
    var fieldTypes = ['string', 'int', 'string', 'string',
                      'int', 'int', 'int', 'int', 'int', 'int',
                      'int', 'int', 'int', 'datetime'];
    tableau.headersCallback(fieldNames, fieldTypes);
}

Now that I’ve defined the fields, I can go back to the results-retrieving section of code (if (data.query.results)…) and, knowing the structure of the JSON generated by my API, parse out the data I want and assign it to the fields defined above.

I decided to call the collection of data I was building “games” rather than “quotes”, because that’s what it is. I next noted that each individual “game” within the JSON is listed in a hierarchy below the “item” entry which itself is below “items”.

[Screenshot: the items → item hierarchy in the JSON editor, showing a game called “Amyitis”]

(“Amyitis” is the name of a boardgame, rather than an allergy to people called Amy, believe it or not).

So, I assigned all the items.item data to “games”

if (data.items.item) {
    var games = data.items.item;

And copied the tutorial to loop through all the “items.item” entries, each of which is a boardgame, until we’d reached the end, i.e. when the number of times we’ve looped equals the length of the data table.

var ii;
for (ii = 0; ii < games.length; ++ii) {

Finally, it’s time to assign the relevant bit of data returned from the API to the variables I’d defined above.

[Screenshot: the JSON fields for a single game, as seen in the JSON editor]

At first I got a little confused, because the JSON output had a bunch of _ and $ entries that didn’t look anything like what was returned in the tutorial dataset. But it turns out that’s nothing to worry about: just treat them as though they were any other text.
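For illustration, a single game’s “name” and “status” nodes come back looking roughly like this (the values here are made up, and the shape is inferred from the code below; the underscore appears to hold an element’s text content and the dollar sign its attributes):

"name":   [ { "_": "Amyitis" } ],
"status": [ { "$": { "own": "1", "want": "0", "lastmodified": "2016-08-14 12:34:56" } } ]

Which is why the code below reaches for name[0]._ and status[0].$.own.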

var entry = {'BoardgameName': games[ii].name[0]._,
             'YearPublished': games[ii].yearpublished[0],
             'ImageURL': 'https:' + games[ii].image[0],
             'ImageThumbnailURL': 'https:' + games[ii].thumbnail[0],
             'Plays': games[ii].numplays[0],
             'Owned': games[ii].status[0].$.own,
             'PreviouslyOwned': games[ii].status[0].$.prevowned,
             'ForTrade': games[ii].status[0].$.fortrade,
             'Want': games[ii].status[0].$.want,
             'WantToPlay': games[ii].status[0].$.wanttoplay,
             'WantToBuy': games[ii].status[0].$.wanttobuy,
             'Wishlist': games[ii].status[0].$.wishlist,
             'Preordered': games[ii].status[0].$.preordered,
             'LastModified': games[ii].status[0].$.lastmodified
            };
dataToReturn.push(entry);

In the above, you can think of games[ii] as being an individual boardgame record. The “for loop” we defined above increments the “ii” variable each time round, so the code accesses games[0], games[1] and so on, which translates into data.items.item[0] etc., if you remember how we defined the games variable above.

Then, to find the boardgame’s name, we need to traverse into the “name” child of the boardgame itself, then look for the first entry below that (always the first one here, so we can refer to it as entry [0]), and look for the field that is shown as an underscore, _.

[Screenshot: the “name” node expanded, with the game’s name held in the “_” field]

 

Hence:

'BoardgameName': games[ii].name[0]._,

Rinse and repeat this for each element of interest, and you’ve collected your dataset ready for Tableau to use!
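For reference, here’s roughly how those fragments fit together inside the data-gathering section of the connector once all the closing brackets are in place. This is just a sketch assembled from the snippets above, abbreviated in the middle; the full list of status fields goes where the comment is.

if (data.items.item) {
    var games = data.items.item;
    var ii;
    for (ii = 0; ii < games.length; ++ii) {
        var entry = {'BoardgameName': games[ii].name[0]._,
                     'YearPublished': games[ii].yearpublished[0],
                     'ImageURL': 'https:' + games[ii].image[0],
                     'ImageThumbnailURL': 'https:' + games[ii].thumbnail[0],
                     'Plays': games[ii].numplays[0],
                     'Owned': games[ii].status[0].$.own,
                     // ...the remaining status fields, exactly as listed above...
                     'LastModified': games[ii].status[0].$.lastmodified};
        dataToReturn.push(entry);
    }
}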

Of course, the world is not perfect and, in reality, I did not manage to do the above without making the odd mistake here and there. I got blank fields sometimes, when I knew they shouldn’t be blank, or fields with the wrong data in them. As I was just using Notepad and the online simulator, I wasn’t really getting many useful error messages.

Lesson learned: you can use the code

tableau.log("YOUR MESSAGE HERE");

to display the message you write in the Google Chrome developer console. You might remember from part 2 that you can bring that up by pressing Ctrl Shift I in Google Chrome, and selecting the “Console” tab.

Why is that useful? Well, firstly, if it displays your message then it means that piece of code ran, so you can check you’re not skipping over any important section. And secondly, you can append text stored in variables to it. So for instance I could write:

tableau.log("the name of the game is " + games[ii].name[0]._);

And, if my code is correct, it should print out the name of the boardgame as it runs.

If I messed up, perhaps by using the wrong index or forgetting the underscore, it would output something different, and perhaps also an error message, guiding me towards the source of the problem.
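As a purely hypothetical example (this isn’t from the tutorial), you could combine tableau.log with a simple guard inside the loop, so that a game with a missing node gets flagged rather than silently producing a blank field:

// Hypothetical defensive check: log each game as it's parsed, and warn if the expected 'name' node is missing
if (games[ii].name && games[ii].name[0]) {
    tableau.log("Parsing game " + ii + ": " + games[ii].name[0]._);
} else {
    tableau.log("Game " + ii + " has no name node - check the JSON structure");
}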

That, as they say, is basically that! I ran it through the hosted Tableau Web Data Connector simulator again, this time ticking the box so it would “automatically continue to data gather phase”. And, lo and behold, below you can see that it has interpreted the information about that famous Amyitis game into a nice tabular format.

[Screenshot: the simulator showing the boardgame collection, including Amyitis, in tabular form]

Once you’ve got to that stage, you’ve already won. Just load up Tableau Desktop for real, select to connect to a web data connector and up should pop your interface, and later, your data.

 

Creating my first Tableau web data connector – part 2

Now for the next part of the creating-a-web-data-connector journey: previously, I had got all the software I needed to create a web data connector installed (well, to be fair Notepad comes with Windows so that wasn’t hard) and designed my user interface.

Now it was time to move on to Tableau tutorial section 5 – using the information retrieved from the user via my UI in order to have Tableau connect to the BoardGameGeek API and retrieve the relevant data for analysis.

I read through the Tableau WDC tutorial and, using some previous programming-other-stuff knowledge, realised that it was a mix of explaining the fundamentals of connecting to a generic web data connector, and explaining some very specific details of the Yahoo finance web data connector it uses as its example.

The first thing to do then was to determine which bits were essential for my simple attempt and which were only there to facilitate the specific Yahoo API.

  • The “code for defining columns” section is obviously important. I need columns to put my data in.
  • The Yahoo Query Language (YQL) parts, less so – my API has nothing to do with Yahoo.
  • The helper functions (buildUri, getFormattedDate, makeTwoDigits): no, these are just functions to help construct the YQL query needed to access their example API. It’s great to know that it’s possible to use helper functions like this if needed, but chances are, if I did need some, they’d not be the same ones unless my API was very similar to their Yahoo example.
  • The “code to get stock data”: well, I don’t want stock data, but I do want some sort of data, so obviously some of that is going to be relevant. I pasted it into my file as-is, noting from the code that some of it was clearly going to need to be replaced, given I don’t want boring finance data when exciting boardgames data is available instead.

It seemed obvious that providing the weblink to my API of choice – http://boardgamegeek.com/xmlapi2/collection?username=[whatever] – was going to be key. So I replaced their connectionUri code with a simple text variable of that, using the user-supplied BGG username I had got from the previous part.

connectionUri = 'http://boardgamegeek.com/xmlapi2/collection?username=' + bggUsername

 

(a URI is a Uniform Resource Identifier)
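So if a user had typed, say, “someuser” into the username box (a made-up example name), connectionUri would end up holding http://boardgamegeek.com/xmlapi2/collection?username=someuser, which is exactly the sort of URL you could paste into a browser to see the raw API response.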

Of course this wasn’t going to be enough to make it work – I hadn’t yet told Tableau what data I wanted it to receive and how to format it – but I figured I’d run it through the simulator and see what happened. Nothing especially good did. But I did learn about “CORS”.

Now, again, no particular error messages tend to pop up by default when things go fatally wrong in the simulator. It might look like it hung, or go blank, or similar, but there’s no friendly message.

Lesson learned: if you are working with the super-basic tools I had at hand, what you need is the built-in developer tools Chrome has, which you can bring up by pressing Ctrl Shift I together. Switch to the “Console” tab, and you might see an error message there to help you out, if you’re lucky.

Here’s what I saw:

[Screenshot: Chrome console showing an Access-Control-Allow-Origin error]

Now, it’s been a long time since I did any serious web developing, and I had never suffered an Access-Control-Allow-Origin problem before. A swift bit of Googling led me to this nice explanation, where I learnt that it’s a security feature “implemented in browsers to restrict interaction between documents (or scripts) that have different origins”. That makes sense; I own neither the BoardGameGeek site nor Github, I’m sad to say.

Of the 3 solutions that Jvaneyck suggests, it seemed like finding a JSONP source for my API instead of the XML one would be simplest, given I have no control over either server, nor any intention of building infrastructure. So, can I get the BoardGameGeek API in JSONP format? Well, another quick Google ™ led me to a post on the BoardGameGeek forum, where a kind user named strathmeyer had offered up an “Ajax-ready BGG API for any web programmers out there”.

The best thing is that I’ve enabled CORS headers, so you can use the API straight from the web browser

Yes, that was indeed the best thing! Access-Control-Allow-Origin issue solved, and all thanks to the work of someone else 🙂

Lesson learned: I later found that Tableau 1) has a whole page dedicated to this exact problem, and 2) there’s an example in their SDK of various methods to get around this issue when connecting to XML. So next time, I should probably RT original M before going Google crazy.

But for now, all I needed to do was change the URI I pointed to from my web data connector to:

connectionUri = 'http://bgg-api.herokuapp.com/api/v1/collection?username=' + bggUsername

I ran that through the simulator once more, of course not expecting it to work before I had defined what data I was actually after – but it did eliminate the previous error message so I knew things were moving on nicely.

So when it ran through, I pressed Ctrl Shift I back in Chrome to bring up the inspector, and saw this message.

[Screenshot: Chrome console reporting that a variable is not defined]

My variable is not defined?! Where did it go? Oops, my fault: I’d accidentally pasted a previous function inside this one. I needed to be more careful with my bracket-pasting finger. But it was simplicity itself to move the code back to where it should be, and re-simulate.

 

[Screenshot: Chrome console showing an “uncaught invariant violation” error]

OK, “uncaught invariant violation” is perhaps a little cryptic, but it sounds like it’s talking about having found something it didn’t expect. That could be my data (spoiler: it was). So now I needed to tell it what to expect and what to do with it.

Tune back in, same place, different time, for the exciting conclusion!