Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies the history of, or best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These specific transformations of data-into-diagram have stuck with us through the mists of time, becoming the examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.

[Image: John Snow’s cholera map of the Broad Street area]

(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence caused the authorities to remove the Broad Street pump handle, people could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about, perhaps, is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause -> action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? He had already found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but it certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: here, then, is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by the people who lived on it. Simply put, the lower the land (relative to sea level, for example), the more cholera deaths there seemed to be.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera“. It included this table:

[Table: Farr’s figures showing cholera mortality by elevation of the land]

The relationship seems quite clear: cholera deaths per 10k persons go up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a line chart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriology Reviews. Note the “observed mortality” line.

[Chart: observed cholera mortality by elevation alongside the rate predicted by miasma theory, from the 1961 keynote]

Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line showing the death rate calculated according to “miasma theory”, which tracks the actual cholera death rate very closely on this metric. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later replaced with the knowledge of germs, but at the time the miasma theory was a strong contender for explaining the distribution of disease. This was probably helped by the fact that some of the actions one might take to reduce “miasma” would evidently overlap with those that deal with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we discovered later, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed it in a reasonable way. He was testing a hypothesis that was based on the common sense of the time he was working in, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here though we can think of Karl Popper’s view of scientific knowledge being derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume with certainty that the dominant one is correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I were to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view on what explains the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will probably place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, happen in day-to-day business analysis. Preexisting knowledge changes over time, and differs between groups. Who hasn’t seen (or been) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data which was later revealed to have been affected by something entirely different?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to add value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that, overnight, 100% of the public stopped buying anything at all from your previously highly successful store.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer data viz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.
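For anyone unfamiliar with the technique, here’s a minimal, generic sketch of what this flavour of logistic regression looks like in R. To be clear, this is not Bingham et al.’s actual model or data – every name below (districts, deaths, population, elevation, water_company) is purely hypothetical, just to show the shape of the approach.

# Generic logistic regression sketch. 'districts' is a hypothetical data frame
# with one row per geographic area and made-up column names.
fit <- glm(cbind(deaths, population - deaths) ~ elevation + water_company,
           family = binomial,
           data = districts)
summary(fit) # the size and significance of each coefficient hints at which variables matter most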

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.

[Chart: Broad Street cholera attacks by date, showing the outbreak already declining before the pump handle was removed]

As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely just as important to act on the findings equally fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom of the time. At least any modern-day analysts can take some solace in the knowledge that even our most highly regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened – it may well contribute towards great things in the future.

Retrieving Adobe SiteCatalyst data with R

Adobe SiteCatalyst (part of Adobe Analytics) is a nicely comprehensive tool for tracking user interactions upon one’s website, app and more. However, in the past I’ve had a fair amount of trouble de-siloing its potentially immensely useful data into external tools, such that I could connect, link and process it for insights over and above those you can get within the default web tool (which, to be fair, is itself improving over time).

I’ve written in the past about my happiness when Alteryx released a data connector allowing one to access Sitecatalyst data from inside their own tool. That’s still great, but the tool is necessarily constrained to the specific tasks the creator designed it to do, and subject to the same API limits as everyone else is. I have no doubt that there are ways and means to get around that in Alteryx (after all, it speaks R). But sometimes, when trying to automate operations, coding in something like R might actually be easier…or at least cheaper!

With that in mind, after having a recent requirement to analyse web browsing data at an individual customer level, I successfully experimented with the awesome RSiteCatalyst package, and have some notes below as to the method which worked well for me.

Note that RSiteCatalyst is naturally subject to the usual Adobe API limits – the main one that causes me sadness being the inability to retrieve over 50k rows at a time – but, due to the incessant flexibility of R and the comprehensiveness of this package, I’ve not encountered a problem I couldn’t solve just yet.

So, how to set up?

Install the RSiteCatalyst package

First, open up R, or your favourite R environment, and install the RSiteCatalyst package (the first time you use it) or load the library (each new session).

if(!require(RSiteCatalyst)) install.packages("RSiteCatalyst")
library(RSiteCatalyst)

Log in to your SiteCatalyst account

You next need to authenticate against your Adobe Sitecatalyst installation, using the SCAuth function. There’s an old way involving a shared secret, and a new way using OAuth. The latter is to be preferred, but at the time I first looked at it there seemed to be a web service issue that prevented the OAuth process completing. At the moment then, I can confirm that the old method still works!

 key <- "username:company>"
 secret <- "https://sc.omniture.com/p/suite/1.3/index.html?a=User.GetAccountInfo >"
 SCAuth(key, secret)

 

Retrieve metadata about your installation

Once you’re in, there’s a plethora of commands to retrieve useful metadata, and then to run and retrieve various types of report data. For several such commands, you’ll need to know the ID of the Adobe Report Suite concerned, which is fortunately as easy as:

suites <- GetReportSuites()

whereby you’ll receive a dataframe containing all available report suites by title and ID.

If you already know the title of the report suite you’re interested in then you can grab the ID directly with something like:

my.rsid <- suites[suites$site_title=="My favourite website",1]

You can then find out which elements are available within your report suite:

elements.available <- GetElements(my.rsid)

and later on which metrics are available for a given element:

metrics.available <- GetMetrics(my.rsid, elements = "<<myFavouriteElement>>")

Retrieve Adobe Analytics report data

There are a few different RSiteCatalyst functions you can call, depending on the type of report you’re interested in. In each case they start with “Queue”, as what you’re actually doing is submitting a data request to the SiteCatalyst reporting queue.

If your request is pretty quick, you can wait a few seconds and get the results sent back to you immediately in R. If it’s going to take a very long time, you can instead store a report request ID and then use the GetReport function to go back and get it later, once it’s finished.
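As a rough sketch of that deferred pattern, using the QueueOvertime function that’s introduced properly below – note that the enqueueOnly argument is my understanding of how the package handles this, so do check the documentation for your version:

# Hedged sketch of the "submit now, collect later" pattern.
# enqueueOnly is assumed here to make the call return a report ID rather than the data itself.
report.id <- QueueOvertime(my.rsid,
                           date.from = "2016-01-01",
                           date.to = "2016-06-30",
                           metrics = "pageviews",
                           date.granularity = "day",
                           enqueueOnly = TRUE)

dailyPageviews <- GetReport(report.id) # retrieve the results once the report has finished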

The current list of queue functions, which are named such that Adobe aficionados will probably be able to guess which type of data they facilitate, is:

  • QueueDataWarehouse
  • QueueFallout
  • QueueOvertime
  • QueuePathing
  • QueueRanked
  • QueueSummary
  • QueueTrended

Here I’ll show just a couple of examples – but the full official documentation for all of them and much more besides is available at CRAN.

Firstly, an “overtime” report to get me the daily pageview and visit counts my site received in the first half of 2016.

Here’s how the documentation describes this type of report:

“A QueueOvertime report is a report where the only granularity allowed is time. This report allows for a single report suite, time granularity, multiple metrics, and a single segment”

An example would be a day by day count of total pageviews to your site.

Remember that above we set the variable “my.rsid” to the Report Suite ID of the report suite I am looking at. So:

dailyStats <- QueueOvertime(my.rsid,
                            date.from = "2016-01-01",
                            date.to = "2016-06-30",
                            metrics = c("pageviews","visits"),
                            date.granularity = "day"
                            )

The parameters are fairly self-evident, especially in combination with the documentation. There are many more available, including the ability to isolate by segment and to forecast data.

Your site will probably have some different metrics available to mine, depending on how the Adobe admin set it up, but the basics like page views and visits should be accessible pretty much anywhere.

What you’ll get back with the above command, i.e. the contents of the dailyStats variable after it has run, is a dataframe in this sort of format:

datetime    name              year  month  day  segment.id  segment.name  pageviews  visits
01/01/2016  Fri. 1 Jan. 2016  2016  1      1                              100        75
02/01/2016  Sat. 2 Jan. 2016  2016  1      2                              200        150
03/01/2016  Sun. 3 Jan. 2016  2016  1      3                              250        180

Which you can then go on to process as you would any other such snippet of data in R.
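For instance, a quick way to eyeball the daily trend – a minimal sketch, assuming the datetime column is already a date type or in a format that as.Date understands (adjust accordingly if yours differs):

# Quick sanity-check plot of daily pageviews over time
dailyStats$datetime <- as.Date(dailyStats$datetime)
plot(dailyStats$datetime, dailyStats$pageviews, type = "l",
     xlab = "Date", ylab = "Pageviews")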

My second, slightly more complex, example concerns the QueueRanked function. It’s described like this:

A QueueRanked report is a report that shows the ranking of values for one or more elements relative to a metric, aggregated over the time period selected.

That’s a bit more vague, but was useful to me in my original goal of identifying specifically which customers logged into my website in December, and how many times they visited.

A key feature of this function is that you can ask for the results from rank [x] to rank [y], instead of just the top [n].

This is super-useful where, like me, you expect to get over 50k rows. 50k is the maximum number of rows you can retrieve in one request via the Adobe API, which this R package uses. But R is full of typical programming language features like loops, thus allowing one to iterate through the commands to retrieve, for instance, results 1–50,000, then results 50,001–100,000, then 100,001–150,000 and so on.

So, I built a loop that would generate these “ranked reports”, starting at row ‘i’ and giving me the next ‘count.step’ records, where count.step = 50000, the maximum I’m allowed to retrieve in one go.

Thus, I’d call the function repeatedly, each time asking for the next 50,000 records. At some point, when there were no more customers to download, I’d get a blank report sent back to me. At that point, I know I have everything, so I quit the loop.

I wanted to retrieve the ID of the customer using the website, which in my setup is stored in a custom element called “prop1”. All that sort of detail is controlled by your Adobe Sitecatalyst administrator, should you have exactly the same sort of requirement as I did – so best go ask them which element to look in, as there’s no real chance your setup is identical to mine at that level.

Nonetheless, the code pattern below could likely be used without much modification in order to iterate through any SiteCatalyst data that exceeds the API row limits.

 

count.limit <- 500000 #the max number of records we're interested in
count.step <- 50000 #how many records to retrieve per request, must not exceed 50k
count.start <- 1 #which record number to start with
CustomerList <- NULL #a variable to store the results in
fromDate <- "2016-12-01"
toDate <- "2016-12-31"

for(i in seq(1, count.limit, by = count.step)) {
  print(paste("Requesting rows",i, "through", i + count.step - 1))

  tempCustomerList <- QueueRanked(my.rsid,
                          date.from = fromDate,
                          date.to = toDate,
                          metrics = "visits",
                          elements = "prop1",
                          top = count.step,
                          start = i
                          )

  if  (nrow(tempCustomerList) == 0 ) {   # no more rows were returned - presumably we have them all now
    print("Last batch had no rows, exiting loop")
    break
  }
   
  tempCustomerList$batch.start.row <- i

  CustomerList <- rbind(CustomerList, tempCustomerList)

}

 

After running this, you should end up with a “CustomerList” dataframe that looks something like this, where “name” is the value of prop1, in my case the customer ID:

name   url  visits  segment.id  segment.name  batch.start.row
ID123       10                                1
ID234       5                                 1

which, again, you can process as though it was any standard type of R data.

Actually you can use variables, CTEs and other fancy SQL with Tableau after all

A few months ago, I blogged about how you can use Tableau parameters when connecting to many database datasources in order to exert the same sort of flexibility that SQL coders can build into their queries using SQL variables.

This was necessary because Tableau does not let you use SQL variables, common table expressions, temp table creation or other such fanciness when defining the data you wish to analyse, even via custom SQL. You can copy and paste a SQL query using those features that works fine in Microsoft’s SQL Server Management Studio or Oracle’s SQL Developer into Tableau’s custom SQL datasource box, and you’ll get nothing but errors.

But it turns out that there are actually ways to use these features in Tableau, as long as you only need to run them once per session Рupon connection to the database.

Why do this? Well, if you have a well-designed data warehouse or other analytically-friendly datasource, hopefully you’ll never need to. However, if you’re accessing “sub-optimally” designed tables for your task, or large raw unprocessed operational data tables, or other such scenarios that consistently form one of the banes of an analyst’s life, then you might find manipulating or summarising data before you use it in Tableau makes life much easier, much faster, or both.

To do this, we’ll use the “Initial SQL” feature of Tableau. This is supported on some, but not all, of the database types Tableau can connect to. You’ll know if your data connection does support it, because an option to use it will appear if you click on the relevant connection at the top left of the data source screen.

[Screenshot: the Initial SQL option appearing for the data connection on Tableau’s data source screen]

If you click that option, you’ll get a text box where you can go crazy with your SQL in order to set up and define whatever final data table you’d like to connect to. Note that you’ll be restricted to the features of your database and the permissions your login gives you – so no standard table creation if you don’t have write-permissions etc. etc. But, to take the example of SQL Server, in my experience most “normal” users can write temporary tables.

Tableau isn’t great at helping you write this initial SQL – so if you’re not super-proficient and intend to write something complicated then you might want to play with it in a friendlier SQL tool first and paste it into Tableau when you know it works, or ask a friendly database-guru to do it for you.

Below then is an example of how one can use variables, CTEs and temporary tables in order to pre-build a data table you can then analyse in Tableau, by virtue of pasting code into the initial SQL box. This code will be re-run and hence refreshed every time you open the workbook. But as the name “initial SQL” suggests, it will not be refreshed every time you create or modify a new chart if you’ve not re-connected to the datasource inbetween.

(For what it’s worth, this uses the default demo database in SQL Server – AdventureWorks. It’s the equivalent of Tableau’s famous Superstore dataset, in the Microsoft world 🙂 )

DECLARE @granularity AS varchar(50);
SET @granularity = 'yearly';

WITH OrdersCTE AS
(
SELECT OrderDate, SalesOrderID, TotalDue
FROM [SalesLT].[SalesOrderHeader]
)

SELECT
CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar,YEAR(OrderDate))
WHEN @granularity = 'monthly' THEN CONVERT(varchar,YEAR(OrderDate)) + '-' + CONVERT(varchar,MONTH(OrderDate))
WHEN @granularity = 'daily' THEN CONVERT(varchar, OrderDate, 23) -- yyyy-mm-dd; kept as varchar so all CASE branches share one type
END AS Period,

COUNT(SalesOrderID) AS CountOfSales,
SUM(TotalDue) AS TotalDue
INTO #initialSQLDemo
FROM OrdersCTE

GROUP BY
CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar,YEAR(OrderDate))
WHEN @granularity = 'monthly' THEN CONVERT(varchar,YEAR(OrderDate)) + '-' + CONVERT(varchar,MONTH(OrderDate))
WHEN @granularity = 'daily' THEN CONVERT(varchar, OrderDate, 23)
END;

Now, this isn’t supposed to be an SQL tutorial, so I’ll not explain in detail what the above does. But for those of you already familiar with (T)SQL, you’ll note that I set a variable as to how granular I want my dates to be, use a CTE to build a cut-down version of a sales table with fewer columns, and then aggregate my sales table to the level of date I asked for in my variable into a temporary table I called #initialSQLDemo.

In the above, it’s set to summarise sales by year. This means Tableau will only ever receive 1 row per year. You’ll not be able to drill down into more granular dates – but if this is detailed enough for what you want, then this might provide far better performance than if you connected to a table with 1 row per minute and, implicitly or explicitly, asked Tableau to constantly aggregate it in expensive ways.

Later on, perhaps I realise I need daily data. In which case, I can just change the second line above to :

SET @granularity = 'daily';

…which will expose data at a daily level to Tableau, and I’m good to go.

SQL / Adventureworks aficionados will probably realise that my precise example is very contrived, but hopefully it’s clear how the initial SQL feature works, and hence in the real world you’ll be able to think of cases that are truly useful. Note though that if your code does something that is slow, then you will experience this slowness whenever you open your workbook.

A quick note on temporary tables:

If you’re following along, you might notice something. Although I created a temp table called #initialSQLDemo (which, being temporary, will be deleted from the database as soon as you close your session), it never appears in the Table list on the left of the Tableau data source screen.

[Screenshot: Tableau’s table list, which doesn’t show the #initialSQLDemo temp table]

Why so? Well, in SQL Server, temporary tables are created in a database called “tempdb” that Tableau doesn’t seem to show. This is not a fatal problem though, as they’re still accessible via the “New Custom SQL” option shown above.

In my example then, I dragged New Custom SQL to the data pane and entered a simple “give me everything” query into the resulting box.
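That query was just the obvious one – something along these lines:

SELECT * FROM #initialSQLDemo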

[Screenshot: the New Custom SQL dialog querying the #initialSQLDemo temporary table]

Now, there is a downside in that I understand using custom SQL can reduce performance in some cases, as Tableau is limited in how it can optimise queries. But a well-designed temporary table might in any case be intrinsically faster to use than an over-sized, ill-designed permanent table.

When venturing back into Tableau’s main viz-design screen, you’ll see that your temporary table is treated just as any other table source would be. Tableau doesn’t care that it’s temporary.

[Screenshot: the temporary table’s fields appearing as normal in Tableau’s viz-design screen]

Although we should note that if you want to use the table in the future in a different Tableau workbook, you’d have to have it run the initial SQL again there, as standard temporary tables are not shareable and do not persist between database sessions.

 

Clustering categorical data with R

Clustering is one of the most common unsupervised machine learning tasks. In Wikipedia‘s current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some ability to cluster in them. For example, Alteryx has K-Centroids Analysis. R, Python, SPSS, Statistica and any other proper data sciencey tools all likely have many methods – and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility. You can do the calculations in Excel, should you really want to (although why not cheat and use a nice addin if you want to save time?).

However, many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better for clustering objects that have quantitative numeric fields, rather than those that are categorical. I’m not going to delve into the details of why here, but, simplistically, they tend to be based on concepts like Euclidean distance – and in that domain, it’s conceptually difficult to say that [bird] is Euclideanly “closer” to [fish] than [animal]; vs the much more straightforward task of knowing that an income of £100k is nearer to one of £90k than it is to 50p. IBM has a bit more about that here.
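To make the categorical alternative concrete, here’s a tiny illustrative sketch (with entirely made-up records) of the “simple-matching” style of dissimilarity that the k-modes algorithm covered below relies on: simply count how many attributes two records disagree on.

# Two hypothetical categorical records - how dissimilar are they?
a <- c(colour = "red", habitat = "sea", legs = "0")
b <- c(colour = "red", habitat = "land", legs = "4")
sum(a != b) # simple-matching dissimilarity: they differ on 2 of the 3 attributes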

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has a ton of libraries that might help you out. Below are a couple I’ve used, and a few notes as to the very basics of how to use them – not that it’s too difficult once you’ve found them. The art of selecting the optimum parameters for the very finest of clusters though is still yours to master, just like it is on most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see 🙂 ? A paper called ‘Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values‘ by Huang gives the gory details.

Luckily though, an R implementation is available within the klaR package. The klaR documentation is available in PDF format here and is certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID FieldA FieldB FieldC FieldD
1 0 0 0 1
2 0 0 0 0
3 0 0 0 1
4 1 1 0 0

Here’s how you might read it in, and cluster the records based on the contents of fields “FieldA”, “FieldB”, “FieldC”, and “FieldD”.

install.packages("klaR")
library(klaR)
setwd("C:/Users/Adam/CatCluster/kmodes")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-kmodes(data.to.cluster[,2:5], 3, iter.max = 10, weighted = FALSE ) #don't use the record ID as a clustering variable!

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want so you have a few variations to compare the quality or real-world utility of.
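As a rough sketch of one way to do that comparison, you could re-run kmodes for several values of k and look at the total within-cluster simple-matching distance (the withindiff component described further down). More clusters will always reduce it, so look for diminishing returns rather than simply the minimum.

# Compare total within-cluster distance across several candidate cluster counts
for (k in 2:6) {
  res <- kmodes(data.to.cluster[, 2:5], k, iter.max = 10, weighted = FALSE)
  print(paste("k =", k, "- total within-cluster distance:", sum(res$withindiff)))
}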

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables
    in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a
    number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted version of this distance.

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster

Here’s an example of what it looks like when output to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
 FieldA FieldB FieldC FieldD
1 1 0 0 0
2 1 0 1 1
3 0 0 0 0

Clustering vector:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
 3 3 3 1 3 1 2 3 3 3 2 2 2 3 3 2 1 3 3 3

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters onto the original dataset, you can just add the cluster back onto your original dataset as a new column, and perhaps write it out as a file to analyse elsewhere, like this:

cluster.output <- cbind(data.to.cluster,cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)

 

The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again, a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters based on how the algorithms differ.

Here’s what you’d do to cluster the same data as above, and write it back out, this time with the Rock clusters appended. Note here that ideally you’re specifically passing in a matrix to the rockCluster function.

install.packages("cba")
library(cba)
setwd("C:/Users/Adam/CatCluster/rock")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-rockCluster(as.matrix(data.to.cluster[,2:5]), 3 )
cluster.output <- cbind(data.to.cluster,cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)

The full list of parameters to the relevant function, rockCluster is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.

This is the output, which is of class “rock”, when printed to the screen:

data: x 
 beta: 0.5 
theta: 0.5 
 fun: dist 
 args: list(method = "binary") 
 1 2 3 
14 5 1

The object is a list, and its most useful component is probably “cl”, which is a factor containing the assignments of clusters to your data.
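For example, a quick way to see how many records ended up in each ROCK cluster, using the same cluster.results object as above:

# Count of records assigned to each cluster
table(cluster.results$cl)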

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re free to visualise in R, or any other tool.

Workaround for the Tableau “custom SQL” restriction on SQL Server variables

SQL Server (among other databases) has a handy feature for easy re-use of queries, in the guise of variables. You can declare variables and use them repeatedly in any query in the current session. That’s pretty handy for any complicated query forms you use repeatedly, where each time¬†you might need to change some¬†basic criteria.

As a trivial example: here’s how to select all the customers I have that live in postcode 12345

DECLARE @postcode varchar(5)
SET @postcode = '12345'

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

I can then run it another day for postcode 23456 by just changing the second line to

SET @postcode = '23456'

This example is a bit silly because you might as well just type 12345 or 23456 into the query itself Рbut imagine a more complex SQL statement, or set of coordinated SQL statements, with screenfuls of code, and you might see how it makes it much easier to maintain.

But, sad to say, you can’t use this sort of variable as part of the custom SQL feature in Tableau. If you try, then you’ll get this slightly unhelpful message:

[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near the keyword 'DECLARE'.
[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near ')'.

But rather than have to hardcode your queries and modify them for every extract you can use a very simple alternative: our friend, the Tableau parameter.

For each SQL variable, create a Tableau parameter of the same data type. You can put in a default value as the “Current value” when setting it up.

[Screenshot: creating a Tableau parameter with the same data type as the SQL variable, and a default current value]

Then, when connecting to your database, paste in the original SQL statement you had into the Custom SQL box, but removing the variable declaration and setting from the top.
SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

[Screenshot: the query, minus the variable declaration, pasted into Tableau’s Custom SQL box]

One at a time, highlight each SQL variable there, i.e. the @postcode part of the above, press the “Insert Parameter” button under the text field and select the appropriate Tableau parameter.

(There is actually the option to create a new parameter at this stage, if you didn’t do so in advance).

You’ll end up with a similar looking SQL statement, but with the variables now referred to in this sort of way:

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS]
WHERE POSTAL_CODE = <Parameters.Postcode>

And it’ll work!

You can then use the standard Tableau functionality to change the parameters in Tableau, and behind the scenes your SQL statement will update appropriately.

Note firstly, for performance, that each change of this sort of parameter is going to cause the query to be re-run on the database (unless it’s already been cached).

Secondly, if you have created a Tableau Data Extract from this datasource, then altering the parameter won’t change anything until you refresh the extract.

Finally, the Tableau parameter is simply a normal Tableau parameter, just like anything else. So there’s nothing to stop a Tableau user altering it, which depending on your setup might be either exactly what you want, or exactly what you don’t want. If the latter, you might consider having the datasource involving parameters in a separate workbook to the user-facing dashboard (publishing to Tableau Server to share), or at least making sure it’s not visible on the end-user dashboards if they have no reason to change it.

Quick sidenote: I wondered if the “Initial SQL” feature might work, to declare variables upon connection to the datasource which could then be used in custom SQL later. But sadly not, I guess it doesn’t count as the same session. If you try, you’ll see a message like:
[Microsoft][SQL Server Native Client 11.0][SQL Server]Must declare the scalar variable "@postcode"

A few notes on Tableau 10’s cross-database join feature

The new version of Tableau, number 10, got released whilst I was busy sunning myself on holiday. This meant there was at least something to look forward to in returning to the world of work 🙂

It has many, many new features. I suspect one of the most impactful, in this era of data lying around everywhere you look, may end up being cross-database joins.

This means that Tableau can now do “proper” joins (i.e. not blends) that span different databases, even different database technologies. So for instance, you can execute an inner join from one table from SQL Server onto data from another table in Redshift, and analyse the result as though it all came from the same place. Better yet, you can publish the resulting datasource as a Tableau Server datasource so everyone can join in the fun, without even having to care where it came from in the first place.

Below are a few notes / warnings from my early experimentations. Some may sound critical, but, overall, this is truly a great feature that will enable the less ETL-loving folks to get more done quicker, and/or without the resources of tech database specialists or software being needed.

  1. Excel counts as a database (as does any other supported file format). This means you can simulate database joins between multiple Excel files for instance without having to modify them in any way.
  2. You can use a custom SQL statement as the “table” you’re joining to or from in any join: i.e. you’re not restricted to just the formal tables on the database. This means that if you’re an SQL wizard who would prefer to have the database preprocess stuff before it hits Tableau, you can do that in the context of a cross-database join.
  3. Warning! I believe (from experience 🙂 ) Tableau downloads all the data from the tables you choose to your local computer before performing the join. This is kind of predictable if you think it through, but not totally. If you know what an Alteryx “in-database tool” is, then this is not like one of those.

    The joining is on your computer, even if you don’t need to care about that. This means that if you’re inner joining a table of 10 rows in one database to 10 out of a possible 1 billion rows in another database to create an extract, Tableau will be downloading 1 billion and 10 rows to your computer before later returning you the post-join 10 rows as your datasource. You can apply data source filters to your tables before extracting though, which might help alleviate some of the potential pain here.

  4. You can do left, inner and right joins very easily, joining on any number of fields as the key. But I didn’t see a facility for conditional joins, in the sense of saying “download all sales from Redshift for a customer whose info is in Oracle where the sales are between the start and end date of the customer’s first 6 weeks according to Oracle”. You would have to let it download and show you all of the joined data and then use formulae in Tableau afterwards to filter out what you don’t want.
  5. Some data connection types aren’t allowed to be involved in cross-database joins at the moment. A Tableau blog post lists these as:
    • Tableau Server
    • Firebird
    • Google Analytics
    • Microsoft Analysis Services
    • Microsoft PowerPivot
    • Odata
    • Oracle Essbase
    • Salesforce
    • SAP BW
    • Splunk
    • Teradata OLAP Connector

      The big shame for me on that list is Tableau Server. Tableau Server datasources are a very cool way for non-DBAs to share analytically-useful data around the organisation (or for DBA professionals to provide a lovely structured, governed datasource ideal for Tableau users to use). I have however found that they can result in semi-silos, as exporting or manipulating them has certain limitations vs standard databases. If one could cross-database join these types of datasource one day, it would be an amazing way of adding extra flexibility to that feature.

  6. Performance may be super great, or it may be super slow, depending on your exact database setup.

But all those compliments and complaints aside, credit is very much due to the makers of Tableau 10 for providing one fewer reason for analysts to load up other software, or pester other departments, before getting started on analysing all your lovely data.

Creating my first Tableau web data connector : part 3

At last, the final part of the trials and tribulations of creating my first Tableau Web Data Connector… Part 1 went through the pre-requisites, and building what I might generously term my “user interface”. Part 2 was a struggle against the forces of web security. And in this part, we battle against the data itself, until the error-message dragon is truly slayed.

So, in the last part, we ended up with a connector that could ask for the username of a BoardGameGeek user, connect to a CORS-enabled version of the API, and complain that whatever objects it found “are not valid as a React child”.

That makes sense, as I didn’t yet tell it what sort of data objects to expect.

As I’m following the Tableau web data connector tutorial, which accesses data from URLs like this one, I figured I’d look at precisely what data that URL returns, compare it to the web data connector tutorial code, and then, when I understood the relationship between the code and the data from the Yahoo URL, I might be able to adapt the code to fit the format my chosen data arrives in.

Here’s the data one gets if you point your web browser to the URL that the tutorial is teaching you how to build a WDC from:

[Screenshot: the JSON returned by the tutorial’s Yahoo finance URL]

OK, stock-pickers will immediately recognise that it’s stock quote data, with dates, highs, lows, symbols and all the other lovely gubbins that come along with basic stock charts. It’s in JSON format, which is human-readable, but only in small doses.

So I need to understand the relationship between the above data, and this piece of code:

if (data.query.results) {
              var quotes = data.query.results.quote;
              var ii;
              for (ii = 0; ii < quotes.length; ++ii) {
                  var entry = {'Ticker': quotes[ii].Symbol,
                               'Day': quotes[ii].Date,
                               'Close': quotes[ii].Close};
                  dataToReturn.push(entry);
              }

I can kind of see it by eye – there’s obviously JSON entries for Symbol, Date, Close, query and so on. But can’t we make it easier?

Yes we can, because there’s such a thing as a web JSON viewer. Paste all that stock-related text into something like http://www.jsoneditoronline.org/ and you get a nice hierarchical visualisation of the JSON structure, like this:

[Screenshot: the Yahoo JSON rendered as a hierarchy in the online JSON editor]

So if we assume that “data” is referring to the overall set of data itself, and use dot notation to traverse the hierarchy, we can see a line in the code that says:

var quotes = data.query.results.quote;

That would seem to fit into the JSON structure above, where below the “object” level you have a level called query, a sublevel called results, and a sublevel called quote.

[Screenshot: the query > results > quote hierarchy in the JSON editor]

The variable “quotes” that the code creates is therefore basically referring to everything at/below the “quote” level of the JSON hierarchy.

Then you get a bunch of records which are the data of interest itself. These are numbered 0 for the first datapoint, 1 for the next, and so on.

If you know a bit of basic programming, you might note that the WDC code has a “for loop” with a counter variable called “ii” that is set to 0 at first and increments by one each time it runs. That seems to fit nicely in with the idea of it iterating through datapoint 0, 1, …n until it gets to the end (which is effectively the length of the quotes dataset, i.e. quotes.length).

Each one of these JSON records then has a few attributes – and they include the ones mentioned in the code as being part of whichever “quotes” subrecord we’re on; “Symbol”, “Date”, “Close”.

   var entry = {'Ticker': quotes[ii].Symbol,
                               'Day': quotes[ii].Date,
                               'Close': quotes[ii].Close};

[Screenshot: a quote record’s attributes, including Symbol, Date and Close]

Awesome. So where do “Ticker”, “Day” and “Close” on the left-hand side of those code lines come from?

Well, that’s up to us. Remember the “add code for defining columns” part of the tutorial?

myConnector.getColumnHeaders = function() {
    var fieldNames = ['Ticker', 'Day', 'Close'];
    var fieldTypes = ['string', 'date', 'float'];
    tableau.headersCallback(fieldNames, fieldTypes);
}

There they’ve defined the fieldnames they want to have Tableau display (which can be totally different from the JSON names we saw above in the datasource) and the field types.

Field types can be any from this list:

  • bool
  • date
  • datetime
  • float
  • int
  • string

So, let’s look at what’s returned when I make the call to my Boardgames datasource.

[Screenshot: the raw response from the BoardGameGeek collection API call]

OK, I can see some fields in there, but it’s kind of messy. So I pasted it back into the online JSON editor.

Here’s what the relevant part looks like when formatted nicely:

[Screenshot: the BoardGameGeek response formatted in the online JSON editor]

So, the interesting information there is probably the boardgame’s name, year published, URLs to its image and thumbnail, and the various statuses (own, want, etc.) and the number of times it’s been played (numplays).

I therefore edited the “getColumnHeaders” code snippet to reflect the fields I wanted, and the type of data that would feature in them.

myConnector.getColumnHeaders = function() {
		var fieldNames = ['BoardgameName','YearPublished','ImageURL','ImageThumbnailURL', 'Plays','Owned','PreviouslyOwned','ForTrade','Want','WantToPlay','WantToBuy','Wishlist','Preordered','LastModified'];
		var fieldTypes = ['string','int','string','string','int','int','int','int','int','int','int','int','int','datetime'];
		tableau.headersCallback(fieldNames, fieldTypes);
	}

Now I’ve defined the fields, I can go back to the retrieving results section of code (if (data.query.results)…) and, now knowing the structure of the JSON generated by my API, parse out and assign to the above variables the data I want.

I decided to call the collection of data I was building “games” rather than “quotes”, because that’s what it is. I next noted that each individual “game” within the JSON is listed in a hierarchy below the “item” entry which itself is below “items”.

[Screenshot: the items > item hierarchy in the BoardGameGeek JSON, with “Amyitis” as the first entry]

(“Amyitis” is the name of a boardgame, rather than an allergy to people called Amy, believe it or not).

So, I assigned all the items.item data to “games”

if (data.items.item) {
var games = data.items.item;

And copied the tutorial to loop through all the “items.item” entries, each of which is a boardgame, until we’d reached the end, i.e. when the number of times we’d looped is the same as the length of the data table.

var ii;
for (ii = 0; ii < games.length; ++ii) {

Finally, it’s time to assign the relevant bit of data returned from the API to the variables I’d defined above.

[Screenshot: the JSON fields of an individual boardgame record]

At first I got a little confused, because the JSON output had a bunch of _ and $ entries that didn’t seem similar to what was returned in the tutorial dataset. But it turns out that’s nothing to worry about. Just treat them as though they were any other text.

var entry = {'BoardgameName': games[ii].name[0]._,
'YearPublished': games[ii].yearpublished[0],
'ImageURL': 'https:' + games[ii].image[0],
'ImageThumbnailURL': 'https:' + games[ii].thumbnail[0],
'Plays': games[ii].numplays[0],
'Owned': games[ii].status[0].$.own,
'PreviouslyOwned': games[ii].status[0].$.prevowned,
'ForTrade': games[ii].status[0].$.fortrade,
'Want': games[ii].status[0].$.want,
'WantToPlay': games[ii].status[0].$.wanttoplay,
'WantToBuy': games[ii].status[0].$.wanttobuy,
'Wishlist': games[ii].status[0].$.wishlist,
'Preordered': games[ii].status[0].$.preordered,
'LastModified': games[ii].status[0].$.lastmodified
};
dataToReturn.push(entry);

In the above, you can think about games[ii] as being the individual boardgame record. The “for loop” we defined above substitutes each record into the ‘ii’ variable, so it’s accessing games[0], games[1] etc. which translates into data.items.item[0] and so on, if you remember how we defined the games variable above.

Then to find the boardgame’s name we need to traverse into the “name” child of the boardgame itself, then look for the first entry below that (always the first one here, so we can refer to that as entry [0]), and look for the field that is shown as an underscore, _.

[Screenshot: the name[0]._ field containing the boardgame’s name in the JSON]

 

Hence:

'BoardgameName': games[ii].name[0]._,

Rinse and repeat this for each element of interest, and you’ve collected your dataset ready for Tableau to use!

Of course, the world is not perfect, and, in reality, I did not manage to do the above without making the odd mistake here and there. I got blank fields sometimes, when I knew they shouldn’t be, or fields with the wrong data in. As I was just using Notepad and the online simulator, I wasn’t really getting many useful error messages.

Lesson learned: you can use the code

tableau.log("YOUR MESSAGE HERE");

to display the message you write in the Google Chrome developer console. You might remember from part 2 that you can bring that up by pressing Ctrl Shift I in Google Chrome, and selecting the “Console” tab.

Why is that useful? Well, firstly, if it displays your message then it means that piece of code ran, so you can check you’re not skipping over any important section. And secondly, you can append text stored in variables to it. So for instance I could write:

tableau.log("the name of the game is " + games[ii].name[0]._);

And, if my code is correct, it should print out the name of the boardgame as it runs.

If I messed up, perhaps using the wrong index, or forgetting the underscore, it will output something different, and perhaps also an error message, guiding me towards the source of the problem.

That, as they say, is basically that! I ran it through the hosted Tableau Web Data Connector simulator again, this time ticking the box so it would “automatically continue to data gather phase”. And, lo and behold, below you can see that it has interpreted the information about that famous Amyitis game into a nice tabular format.

(Screenshot: the simulator showing the collection data in tabular form.)

Once you’ve got to that stage, you’ve already won. Just load up Tableau Desktop for real, choose to connect to a web data connector, and up should pop your interface – and, later, your data.

 

Creating my first Tableau web data connector – part 2

Now for the next part of the creating-a-web-data-connector journey: previously, I had got all the software I needed to create a web data connector installed (well, to be fair Notepad comes with Windows so that wasn’t hard) and designed my user interface.

Now it was time to move on to Tableau tutorial section 5 – using the information retrieved from the user of my UI in order to have Tableau connect to the BoardGameGeek API and retrieve the relevant data for analysis.

I read through the Tableau WDC tutorial and, using some previous programming-other-stuff knowledge, realised that it was a mix of explaining the fundamentals of connecting to a generic web data connector and explaining some very specific details of the Yahoo finance web data connector used as its example.

The first thing to do then was to determine which bits were essential for my simple attempt and which were only there to facilitate the specific Yahoo API.

  • The “code for defining columns” section is obviously important – I need columns to put my data in (there’s a sketch of what mine ended up looking like just after this list).
  • The Yahoo Query Language (YQL) code, less so – my API has nothing to do with Yahoo.
  • The helper functions (buildUri, getFormattedDate, makeTwoDigits): no, these are just functions to help construct the YQL query needed to access their example API. It’s great to know that helper functions like this are possible if needed, but chances are, if I did need some, they’d not be the same ones unless my API was very similar to their Yahoo example.
  • The “code to get stock data”: well, I don’t want stock data, but I do want some sort of data, so some of that was obviously going to be relevant. I pasted it into my file as-is, noting that parts of it would clearly need replacing, given I don’t want boring finance data when exciting boardgames data is available instead.
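As a sketch of what that column-defining step ended up looking like for my connector – again assuming the version-1 API names from the tutorial (getColumnHeaders and tableau.headersCallback; later SDK versions differ), and with the types being nothing more than my plausible guesses – the field names simply mirror the columns I wanted for the boardgame data:

// Sketch only: version-1-era column definitions.
myConnector.getColumnHeaders = function () {
  var fieldNames = ['BoardgameName', 'YearPublished', 'ImageURL', 'ImageThumbnailURL', 'Plays',
                    'Owned', 'PreviouslyOwned', 'ForTrade', 'Want', 'WantToPlay',
                    'WantToBuy', 'Wishlist', 'Preordered', 'LastModified'];
  var fieldTypes = ['string', 'int', 'string', 'string', 'int',
                    'int', 'int', 'int', 'int', 'int',
                    'int', 'int', 'int', 'datetime'];
  tableau.headersCallback(fieldNames, fieldTypes);
};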

It seemed obvious that providing the weblink to my API of choice – http://boardgamegeek.com/xmlapi2/collection?username=[whatever] – was going to be key. So I replaced their connectionUri code with a simple text variable of that, using the user-supplied BGG username I had got from the previous part.

connectionUri = 'http://boardgamegeek.com/xmlapi2/collection?username=' + bggUsername

 

(a URI is a Uniform Resource Identifier)
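One small refinement I could have made here – it’s not in the tutorial, and I didn’t bother at the time – would be to encode the username in case it contains spaces or other characters that aren’t URL-safe:

connectionUri = 'http://boardgamegeek.com/xmlapi2/collection?username=' + encodeURIComponent(bggUsername)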

Of course this wasn’t going to be enough to make it work – I hadn’t yet told Tableau what data I wanted it to receive and how to format it – but I figured I’d run it through the simulator and see what happened. Nothing especially good did. But I did learn about “CORS”.

Now, again, no particular error messages tend to pop up by default, even when things go fatally wrong in the simulator. It might look like it has hung, or it might go blank or similar, but there’s no friendly message.

Lesson learned: if you are working with the super-basic tools I had at hand, what you need is the built-in developer tools Chrome has, which you can bring up by pressing Ctrl + Shift + I together. Switch to the “Console” tab, and you might see an error message there to help you out, if you’re lucky.

Here’s what I saw:

(Screenshot: the Access-Control-Allow-Origin error shown in the Chrome console.)

Now, it’s been a long time since I did any serious web development, and I had never suffered an Access-Control-Allow-Origin problem before. A swift bit of Googling led me to this nice explanation, where I learnt that it’s a security feature “implemented in browsers to restrict interaction between documents (or scripts) that have different origins”. That makes sense; I don’t own either the BoardGameGeek site or Github, I’m sad to say.

Of the 3 solutions that Jvaneyck suggests, it seemed that finding a JSONP source for my API instead of the XML one would be simplest, given I have no control over either server, nor any intention of building infrastructure. So, can I get the BoardGameGeek API in JSONP format? Well, another quick Google™ led me to a post on the BoardGameGeek forum, where a kind user named strathmeyer had offered up an “Ajax-ready BGG API for any web programmers out there”.

The best thing is that I’ve enabled CORS headers, so you can use the API straight from the web browser

Yes, that was indeed the best thing! Access-Control-Allow-Origin issue solved, and all thanks to the work of someone else 🙂
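If you want to check for yourself whether an endpoint sends those headers, one crude test (my own improvisation, not something from the tutorial) is to request it from the Chrome console of a page on a different origin. If the server doesn’t send a suitable Access-Control-Allow-Origin header, the browser blocks the response and the error handler fires instead:

// Rough-and-ready CORS check: paste into the console of a page on another origin.
var xhr = new XMLHttpRequest();
xhr.open('GET', 'http://bgg-api.herokuapp.com/api/v1/collection?username=Adam121');
xhr.onload = function () { console.log('CORS OK: received ' + xhr.responseText.length + ' characters'); };
xhr.onerror = function () { console.log('Blocked by the browser (or the request failed for some other reason)'); };
xhr.send();

You can also simply look at the response headers in the Network tab of the same developer tools.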

Lesson learned: I later found that Tableau 1) has a whole page dedicated to this exact problem, and 2) there’s an example in their SDK of various methods to get around this issue when connecting to XML. So next time, I should probably RT original M before going Google crazy.

But for now, all I needed to do was change the URI I pointed to from my web data connector to:

connectionUri = 'http://bgg-api.herokuapp.com/api/v1/collection?username=' + bggUsername

I ran that through the simulator once more, of course not expecting it to work before I had defined what data I was actually after – but it did eliminate the previous error message so I knew things were moving on nicely.

So when it had run through, I hit Ctrl + Shift + I back in Chrome to bring up the inspector, and saw this message.

(Screenshot: a “variable is not defined” error in the console.)

My variable is not defined?! Where did it go? Oops, my fault: I’d accidentally nested a previous function inside this one. I needed to be more careful with my bracket-pasting finger. But it was simplicity itself to move the code back to where it should be, and re-simulate.

 

(Screenshot: an “uncaught invariant violation” error in the console.)

OK, “uncaught invariant violation” is perhaps a little cryptic, but it sounds like it’s talking about having found something it didn’t expect. That could be my data (spoiler: it was). So now I needed to tell it what to expect and what to do with it.

Tune back in, same place, different time, for the exciting conclusion!

 

Creating my first Tableau web data connector – the highs and the lows: part 1

After having successfully (enough) completed my introduction to creating a real live Tableau web data connector, I wanted to follow the lead of one of the inspirations for this otherwise unnecessary effort – Chris Love’s Periscope on the subject of sharing Tableau-related failure as well as unbridled success – and document something about the less smooth aspects of the journey it took. If nothing else, these will be useful notes if I have to do this again for real in the future.

First of all, I gave myself an unnecessarily harsh setup 🙂. I was working on my company computer (shhhhh) as it has Tableau Desktop installed, and I figured that’d be important. The significance of that is that I don’t have the admin rights needed to install any new software on it. And I don’t do web development as part of the day job, so I was restricted to using only very generic tools. In practice, this meant Windows Notepad for the most part, and Google Chrome for testing.

I had downloaded the Tableau Web Data Connector software development kit. I was a little worried at claims that I’d need to install a web server to use it – something I wasn’t sure our work IT security overlords would appreciate – but in the end, I learned that you can successfully create a Tableau web data connector without having:

  • the web data connector SDK
  • a web server
  • admin rights
  • …Tableau (yes, seriously).

What I did need:

  • a web browser (I used Chrome)
  • a text editor (I used Windows Notepad, because, well, what more could you want?)

I had also decided to try and work it out without asking anyone anything. The Tableau user community is pretty amazing in terms of generously giving up answers, and I’m sure someone would have solved any problem I encountered so fast that all the fun would have gone.

My approach, then, was to follow the Tableau Web Data Connector tutorial. In this tutorial, they build a web data connector to a financial website to retrieve stock data. I figured for a first try I’d follow along and simply substitute in the details of the BoardGameGeek API I wanted to use instead of the Yahoo API.

I’m not going to go through the tutorial line by line, as you might as well just read the original. Instead, I’ll just highlight some points where I had problems, and what I had to do to overcome them.

The first thing I’d note in hindsight is that the tutorial is not actually “how to build the simplest possible web data connector”, as I had imagined it would be from its description as “basic”. It certainly is a relatively simple one (no authentication needed, etc.), but it does contain at least three functions that are only necessary because of the specific API they were connecting to. These are:

  • buildUri
  • getFormattedDate
  • makeTwoDigits

Pretty obvious when you see what they do – but it would be a mistake to imagine you should just copy the function structure of the basic connector and replace its contents with your own code. For the most part, you won’t need the above three functions at all. I felt the tutorial could make that clearer, especially for someone even more of a novice than me – someone who has never touched web programming (I’ve at least done some in the distant past).

But anyway, begin at the beginning. I raced happily through parts 1 and 2 of the tutorial, copying and pasting their base code. Nothing important to change there unless you want to give your version a more appropriate title.

Part 3 instructed me how to create a form for the user to enter information about their data. Again, I can imagine some data connectors won’t need to prompt the analyst for any information, in which case you could disregard this section too. However, I did actually need that feature, as I wanted to be able to enter a BoardGameGeek username and retrieve only the board games that person has in their collection.

In the meantime, I had discovered – joyfully – that there’s a hosted Tableau Web Data Connector simulator on Github; thank you so much! The reason it made me happy is that it meant I didn’t need to install a web server on my own computer, or do anything with the SDK that might have involved begging for admin rights over my laptop.

Once I had developed a connector in Notepad, I could just upload it somewhere (I chose Github Pages) and use the hosted simulator to see if it worked – which is the reason you don’t actually need a copy of Tableau in the first place. I’m sure it would have been far quicker to iterate with a local install, but it works just fine in the hosted version – as long as you beware the cache.

Lesson learned: it seems like there’s a lot of caching going on somewhere between or within the hosted Tableau Web Data Connector simulator and my web host (Github). Oftentimes I’d make changes to the code and reupload, but it didn’t seem to change the result shown in the simulator. Sometimes closing browsers, deleting cookies etc. helped – but not always. I’m sure there is some science to this that one could work out, but given I didn’t need to make a lot of edits in developing a relatively simple data connector, I basically settled for renaming my file “version 1.html”, “version 2.html” etc. whenever I made a modification, which guaranteed a lack of annoying caching action.

But anyway, defining the user interface per the tutorial went smoothly. I entered the web address where I’d uploaded my connector, “amedcalf.github.io/BGGConnector.html”, unticked the “automatically continue to data phase” box (as the data phase is something I defined later, in line with the tutorial), and ran the “interactive phase”.

Up popped my bare-bones form asking for a BoardGameGeek username. I filled it in and pressed OK, and it took me back to the simulator, which showed output indicating that it had indeed recognised the information I gave it. Success!

(Screenshot: the simulator showing the username it captured from my form.)

No data was returned just yet – but it clearly recognised what I had entered into the form just fine.

Lesson learned: although this wasn’t a problem for me at this stage, I later noticed that if there’s anything wrong in your code then you do not always get an error message per se. Sometimes the simulator just doesn’t close your UI, or it does but doesn’t return you any information – i.e. it acts like a faulty web page. In my experience that tended to be when I’d made a minor mistake in my code – a missing semicolon or bracket or similar. Once that was fixed, the simulator worked nicely again (if you want to go wild and use something other than Notepad – something that does syntax highlighting – you’ll probably not suffer this much). So you should assume that it’s your code, not the simulator, that is broken in such instances 🙂

It became clear that the user interface is defined in normal HTML – so if you are a nifty web designer/coder, or want to learn to be one, you can make it as beautiful as you like. That wasn’t the point of the exercise for me though, so for my effort I just used the default font and a plain default box to type information into.
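To give an idea of how little is needed, here’s a minimal sketch of the sort of form I mean – just a text box and a button. The ids and label text are made up for illustration; the wiring that actually hands the entered username over to Tableau follows the tutorial’s interactive-phase code.

<!-- Minimal sketch of the UI: one text box and one button. -->
<form>
  <label for="bggUsername">BoardGameGeek username:</label>
  <input type="text" id="bggUsername" />
  <input type="button" id="submitButton" value="Get my collection!" />
</form>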

So, that was the user interface “designed” and tested, which takes us up to part 4 of the Tableau tutorial.

Next up: time to actually access the data I wanted. This proved to be a little more difficult the first time through, though.

Journey complete: a BoardGameGeek collection Tableau web data connector

Several months (!) after setting myself the challenge of becoming familiar with how to create a Tableau Web Data Connector, the good news is… it’s done! Well, done enough that I now understand how to create web data connectors should I ever need one in reality, anyway.

As planned, it’s a connection via the BoardGameGeek API to your very own board game collection, if you have catalogued one at that site. For the uninitiated, BoardGameGeek is an amazing site if you’re the sort of person that likes boardgames, and, yes, cataloguing.

So, for starters, if you have indeed got a collection catalogued, or want to analyse someone else’s collection, then you can now point your Tableau Desktop software to a web data connector at http://amedcalf.github.io/BGGConnector.html

 

Important / annoying note (explained below): BEFORE you do this, go and visit the below link in your web browser to make sure your BGG collection is not going to be stuck in an API queue.

https://boardgamegeek.com/xmlapi2/collection?username=Adam121

replacing the “Adam121” with the username you’re interested in. If you get a message about queues, caches or other errors, wait a few seconds and try again, until you end up with a whole string of unpleasant-looking XML, something like this.

(Screenshot: the raw XML the BoardGameGeek API returns for a collection.)

Yes, this is not user-friendly and would not be acceptable in a “production” environment, but this is just for fun; I’ll explain more below.

Once you’ve done that, go back into Tableau.

Once you have entered http://amedcalf.github.io/BGGConnector.html into the web data connector datasource part of Tableau (see Tableau’s instructions here), you should get a basic prompt to enter your BoardGameGeek username. You can test it with “Adam121” if you like, without the quotes. Do that, and up should pop the relevant collection, ready for use in Tableau.

Here is a description of the fields you’ll be presented with:

  • Boardgame Name: Textual name of your board game
  • Image Thumbnail URL: text URL of a small image relating to the boardgame, usually its cover.
  • Image URL: same as above, but a big version.
  • Last Modified: date the record was last changed.
  • Year Published: the year the game was published.
  • Plays: how many times you logged that you played this game.

Then there’s a series of flags that are set to either 0 (false) or 1 (true). These are:

  • For Trade
  • Owned
  • Preordered
  • Previously Owned
  • Want
  • Want To Buy
  • Want To Play
  • Wishlist

and are all attributes you can input yourself per game with the BGG collections feature.

Whilst it is based on the BoardGameGeek API, I should also give maximum thanks to the BoardGameGeek user “Strathmeyer”, whose past repurposing of the BGG API into CORS-enabled JSON is what I ended up using as my source, rather than the API directly. When I come to document my journey, I’ll go through why.

One thing to note: if you try it on a username and it doesn’t work (perhaps giving a “TypeError”), then this is probably down to the caching of the API on the server. BoardGameGeek queues API calls for collections that haven’t been requested in a while, and the JSON source seems to cache the “in a queue” response, meaning that no data appears.

You can see if this is why you’re having problems by going to this link, putting the BoardGameGeek user ID you’re interested in after the last equals sign:

http://bgg-api.herokuapp.com/api/v1/collection?username=YOURUSERNAME

If it returns the message “Your request for this collection has been accepted and will be processed. Please try again later for access”, then you’re in the queue. And it seems to cache the fact that you’re in the queue, which is why you should go and visit https://boardgamegeek.com/xmlapi2/collection?username=YOURUSERNAME before you do anything else – even before visiting the herokuapp link.

I haven’t taken the time to come up with a nice solution for this, but this is what worked for me. Obviously it’s a very unsuitable requirement should this be the mainstay of someone’s production system, but I’m afraid I was just playing about to see what works 🙂

So what can you do with this, except poke around for fun?

Well, ever since the great-yet-terrible event that was the downsizing of my infeasibly oversized board games collection, I don’t really have more games than would fit on a single screen of BGG, to be honest, and I can’t usually find many people to play them with me anyway (insert sympathy sound effect). But if you have a bigger collection, you might find it useful to answer such questions as “How many board games do I own?”, “Which board game have I played the most?”, “How many board games do I want to buy?” and so on. If so, please enjoy!

Just to prove to myself that it works, here’s an example of a very simple, quite pointless dashboard that displays my collection and lets you click on the name of a game to see what its box looks like. Feel free to click through and play with it.

(Screenshot: an example dashboard built on the connector, showing my collection.)

What’s next? Well, the point of the exercise was not really to find a way to list the remnants of my board games collection – but rather to learn something new, and document something of the journey I went through doing so, warts and all. With that in mind, I’ll be writing a few notes as to the trials and tribulations I went through in creating this, for future reference if nothing else.