Clustering categorical data with R

Clustering is one of the most common unsupervised machine learning tasks. In Wikipedia’s current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some clustering ability built in. For example, Alteryx has K-Centroids Analysis; R, Python, SPSS, Statistica and any other proper data-sciencey tools all likely have many methods – and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility. You can do the calculations in Excel, should you really want to (although why not cheat and use a nice addin if you want to save time?).

However, many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better suited to objects with quantitative numeric fields rather than categorical ones. I’m not going to delve into the details of why here but, simplistically, they tend to be based on concepts like Euclidean distance – and in that domain, it’s conceptually difficult to say that [bird] is Euclideanly “closer” to [fish] than [animal]; vs the much more straightforward task of knowing that an income of £100k is nearer to one of £90k than it is to 50p. IBM has a bit more about that here.

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has a ton of libraries that might help you out. Below are a couple I’ve used, plus a few notes on the very basics of how to use them – not that it’s too difficult once you’ve found them. The art of selecting the optimum parameters for the very finest of clusters is still yours to master though, just as it is with most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see 🙂 ? A paper called ‘Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values‘ by Huang gives the gory details.

Luckily though, an R implementation is available within the klaR package. The klaR documentation is available in PDF format here and is certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID FieldA FieldB FieldC FieldD
1 0 0 0 1
2 0 0 0 0
3 0 0 0 1
4 1 1 0 0

Here’s how you might read it in, and cluster the records based on the contents of fields “FieldA”, “FieldB”, “FieldC”, and “FieldD”.

install.packages("klaR")
library(klaR)
setwd("C:/Users/Adam/CatCluster/kmodes")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <- kmodes(data.to.cluster[,2:5], 3, iter.max = 10, weighted = FALSE) # don't use the record ID as a clustering variable!

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want, giving you a few variations to compare for quality or real-world utility.
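For instance, here’s a minimal sketch – my own illustration rather than anything from the klaR docs – of trying a few values of k and comparing the total within-cluster simple-matching distance for each, using the data.to.cluster data frame loaded above:

set.seed(42) # kmodes picks random rows as its initial modes, so fix the seed for repeatability
for (k in 2:5) {
  res <- kmodes(data.to.cluster[,2:5], k, iter.max = 10, weighted = FALSE)
  cat("k =", k, "total within-cluster distance =", sum(res$withindiff), "\n")
}

Lower totals aren’t automatically “better” (more clusters will almost always shrink the distances), but it’s a quick way to eyeball a few candidates before settling on one.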

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables
    in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a
    number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted version of this distance.
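Since the modes argument can be a set of initial cluster modes rather than just a count, you can also seed the algorithm with starting points of your own choosing. A hedged sketch – the starting values below are entirely made up for illustration – using the four fields from the example dataset:

initial.modes <- data.frame(FieldA = c(0, 1, 0),
                            FieldB = c(0, 1, 0),
                            FieldC = c(0, 0, 1),
                            FieldD = c(1, 0, 0)) # three distinct, hand-picked starting modes
cluster.results <- kmodes(data.to.cluster[,2:5], modes = initial.modes, iter.max = 10, weighted = FALSE)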

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster
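So, for example, once you have the cluster.results object from the earlier call, you can pull these out individually:

cluster.results$size            # how many records ended up in each cluster
cluster.results$modes           # the modal value of each field, per cluster
table(cluster.results$cluster)  # the same sizes, derived from the assignment vector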

Here’s an example of what the whole object looks like when printed to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
 FieldA FieldB FieldC FieldD
1 1 0 0 0
2 1 0 1 1
3 0 0 0 0

Clustering vector:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
 3 3 3 1 3 1 2 3 3 3 2 2 2 3 3 2 1 3 3 3

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters to the original dataset, you can just add the cluster assignment as a new column, and perhaps write it out as a file to analyse elsewhere, like this:

cluster.output <- cbind(data.to.cluster,cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)

 

The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again, a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters reflecting how the algorithms differ.

Here’s what you’d do to cluster the same data as above and write it back out, this time with the Rock clusters appended. Note that ideally you pass a matrix into the rockCluster function – hence the as.matrix below.

install.packages("cba")
library(cba)
setwd("C:/Users/Adam/CatCluster/rock")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <- rockCluster(as.matrix(data.to.cluster[,2:5]), 3)
cluster.output <- cbind(data.to.cluster,cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)

The full list of parameters to the relevant function, rockCluster, is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.

This is the output, which is of class “rock”, when printed to the screen:

data: x 
 beta: 0.5 
theta: 0.5 
 fun: dist 
 args: list(method = "binary") 
 1 2 3 
14 5 1

The object is a list, and its most useful component is probably “cl”, which is a factor containing the assignments of clusters to your data.
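For instance, assuming the cluster.results object from the code above, a couple of quick things you might do with it:

table(cluster.results$cl)                          # how many records fell into each cluster
data.to.cluster$rock.cluster <- cluster.results$cl # or attach the assignments as a named column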

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re free to visualise in R, or any other tool.

Workaround for the Tableau “custom SQL” restriction on SQL Server variables

SQL Server (among other databases) has a handy feature for easy re-use of queries, in the guise of variables. You can declare variables and use them repeatedly in any query in the current session. That’s pretty handy for any complicated query forms you use repeatedly, where each time you might need to change some basic criteria.

As a trivial example: here’s how to select all the customers I have that live in postcode 12345

DECLARE @postcode varchar(5)
SET @postcode = '12345'

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

I can then run it another day for postcode 23456 by just changing the second line to

SET @postcode = '23456'

This example is a bit silly because you might as well just type 12345 or 23456 into the query itself – but imagine a more complex SQL statement, or set of coordinated SQL statements, with screenfuls of code, and you might see how it makes it much easier to maintain.

But, sad to say, you can’t use these sort of variables as part of the custom SQL feature in Tableau. If you try, then you’ll get this slightly unhelpful message:

[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near the keyword 'DECLARE'.
[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near ')'.

But rather than have to hardcode your queries and modify them for every extract you can use a very simple alternative: our friend, the Tableau parameter.

For each SQL variable, create a Tableau parameter of the same data type. You can put in a default value as the “Current value” when setting it up.

[Screenshot: creating the Tableau parameter]

Then, when connecting to your database, paste the original SQL statement you had into the Custom SQL box, but remove the variable declaration and SET statement from the top:
SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

[Screenshot: the Custom SQL box with the pasted statement]

One at a time, highlight each SQL variable there, i.e. the @postcode part of the above, press the “Insert Parameter” button under the text field and select the appropriate Tableau parameter.

(There is actually the option to create a new parameter at this stage, if you didn’t do so in advance).

You’ll end up with a similar looking SQL statement, but with the variables now referred to in this sort of way:

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS]
WHERE POSTAL_CODE = <Parameters.Postcode>

And it’ll work!

You can then use the standard Tableau functionality to change the parameters in Tableau, and behind the scenes your SQL statement will update appropriately.

Note firstly, for performance, that each change of this sort of parameter is going to cause the query to be re-run on the database (unless it’s already been cached).

Secondly, if you have created a Tableau Data Extract from this datasource, then altering the parameter won’t change anything until you refresh the extract.

Finally, the Tableau parameter is simply a normal Tableau parameter, just like anything else. So there’s nothing to stop a Tableau user altering it, which depending on your setup might be either exactly what you want, or exactly what you don’t want. If the latter, you might consider having the datasource involving parameters in a separate workbook to the user-facing dashboard (publishing to Tableau Server to share), or at least making sure it’s not visible on the end-user dashboards if they have no reason to change it.

Quick sidenote: I wondered if the “Initial SQL” feature might work, to declare variables upon connection to the datasource which could then be used in custom SQL later. But sadly not, I guess it doesn’t count as the same session. If you try, you’ll see a message like:
[Microsoft][SQL Server Native Client 11.0][SQL Server]Must declare the scalar variable "@postcode"

Free up-to-date UK postcode latitude longitude data

Unless your data comes pre-geocoded, if you’re trying to do some UK-based geospatial analysis you’ll probably need some easy way of translating addresses into latitude/longitude pairs or some similar co-ordinate system.

Whilst full-address geocoders are available, if you don’t actually need that level of precision then looking up a full postcode is often good enough and far faster (note to US readers: UK postcodes are way more precise than US zipcodes – BPH Postcodes quotes an average of about 15 properties per postcode).

A few years ago this data could be a little challenging to obtain – for free, anyway. But now there are various sources offering it up sans charge, in various formats and levels of up-to-dateness; some more useful than others.

My current favourite is probably Doogal, where Chris Bell has provided all sorts of address-based tools. Need to generate a random postcode or calculate the elevation of any route? He has it all.

Most interesting to me are the big CSVs of the UK’s postcodes, past and present, obtainable by pressing “download” at the top right of this page.

It’s full postcode to long/lat mapping and includes the below columns too, which allow for some very useful groupings or lookups.

  • Postcode
  • In Use?
  • Latitude
  • Longitude
  • Easting
  • Northing
  • GridRef
  • County
  • District
  • Ward
  • DistrictCode
  • WardCode
  • Country
  • CountyCode
  • Constituency
  • Introduced
  • Terminated
  • Parish
  • NationalPark
  • Population
  • Households
  • Built up area
  • Built up sub-division
  • Lower layer super output area
  • Rural/urban
  • Region
  • Altitude

From his blog and various appreciative comments, it sounds like it’s kept up to date nice and regularly, which is truly a service to mankind.
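If you want to use the file programmatically, reading it into R is trivial. A minimal sketch – the filename and the exact values of the “In Use?” flag are assumptions on my part, so check them against the file you actually download:

postcodes <- read.csv("postcodes.csv", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE) # check.names = FALSE keeps headers like "In Use?" intact
live <- postcodes[postcodes[["In Use?"]] == "Yes", c("Postcode", "Latitude", "Longitude")] # assuming a Yes/No flag
head(live)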

If you have any problems with his file, then some other alternatives would include the free version of Codepoint Open (although it provides Eastings/Northings rather than long/lat – it is possible to convert between the two, as sketched below), or a UK-Postcode API.
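On that conversion: one way to do it in R is via the sf package (sp/rgdal would also do the job) – a hedged sketch, assuming a data frame called my.data with Easting and Northing columns on the British National Grid (EPSG:27700):

library(sf)
pts <- st_as_sf(my.data, coords = c("Easting", "Northing"), crs = 27700) # British National Grid
pts.wgs84 <- st_transform(pts, crs = 4326)                               # reproject to WGS84
st_coordinates(pts.wgs84)                                                # longitude/latitude pairs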

All this is overkill if you only have a handful of such lookups to make – say fewer than 100 – as there are many web-based batch converters that’ll do the job for you very quickly, often to address-level accuracy. Doogal’s is here.

A few notes on Tableau 10’s cross-database join feature

The new version of Tableau, number 10, got released whilst I was busy sunning myself on holiday. This meant there was at least something to look forward to in returning to the world of work 🙂

It has many, many new features. I suspect one of the most impactful, in this era of data lying around everywhere you look, may end up being cross-database joins.

This means that Tableau can now do “proper” joins (i.e. not blends) that span different databases, even different database technologies. So for instance, you can execute an inner join from one table from SQL Server onto data from another table in Redshift, and analyse the result as though it all came from the same place. Better yet, you can publish the resulting datasource as a Tableau Server datasource so everyone can join in the fun, without even having to care where it came from in the first place.

Below are a few notes/warnings from my early experimentation. Some may sound critical but, overall, this is truly a great feature that will enable the less ETL-loving folks to get more done quicker, and/or without needing the resources of database specialists or extra software.

  1. Excel counts as a database (as does any other supported file format). This means you can simulate database joins between multiple Excel files for instance without having to modify them in any way.
  2. You can use a custom SQL statement as the “table” you’re joining to or from in any join: i.e. you’re not restricted to just the formal tables on the database. This means that if you’re an SQL wizard who would prefer to have the database preprocess stuff before it hits Tableau, you can do that in the context of a cross-database join.
  3. Warning! I believe (from experience 🙂 ) Tableau downloads all the data from the tables you choose to your local computer before performing the join. This is kind of predictable if you think it through, but not totally. If you know what an Alteryx “in-database tool” is, then this is not like one of those.

    The joining is on your computer, even if you don’t need to care about that. This means that if you’re inner joining a table of 10 rows in one database to 10 out of a possible 1 billion rows in another database to create an extract, Tableau will be downloading 1 billion and 10 rows to your computer before later returning you the post-join 10 rows as your datasource. You can apply data source filters to your tables before extracting though which might help alleviate some of the potential pain here.

  4. You can do left, inner and right joins very easily, joining on any number of fields as the key. But I didn’t see a facility for conditional joins, in the sense of saying “download all sales from Redshift for a customer whose info is in Oracle, where the sales are between the start and end date of the customer’s first 6 weeks according to Oracle”. You would have to let it download and show you all of the joined data and then use formulae in Tableau afterwards to filter out what you don’t want.
  5. Some data connection types aren’t allowed to be involved in cross database joins at the moment. A Tableau blog post lists these as:
    • Tableau Server
    • Firebird
    • Google Analytics
    • Microsoft Analysis Services
    • Microsoft PowerPivot
    • Odata
    • Oracle Essbase
    • Salesforce
    • SAP BW
    • Splunk
    • Teradata OLAP Connector

      The big shame for me on that list is Tableau Server. Tableau Server datasources are a very cool way for non-DBAs to share analytically-useful data around the organisation (or for DBA professionals to provide a lovely structured, governed datasource ideal for Tableau users). I have however found that they can result in semi-silos, as exporting or manipulating them has certain limitations vs standard databases. If one could cross-database join these types of datasource one day, it would be an amazing way of adding extra flexibility to that feature.

  6. Performance may be super great, or it may be super slow, depending on your exact database setup.

But all those compliments and complaints aside, a debt of gratitude is very much due to the makers of Tableau 10 for providing one fewer reason for analysts to load up other software, or pester other departments, before getting started on analysing all your lovely data.

Help decide who self-driving cars should kill

Automated self-driving cars are surely on their way. Given the direction of technological development, this seems a safe enough prediction to make – at least when taking the coward’s option of not specifying a time frame.

A self-driving car is, after all, a data processor, and we like to think that we’re getting better at dealing with data every day. Simplistically, in such a car sensors provide some data (e.g. “there is a pedestrian in front of the car”), some automated decision-making module comes up with an intervention (“best stop the car”), and a process is carried out to enact that decision (“put the brakes on”).

Here for example is a visualisation of what a test Google automated car “sees”.

[Image: visualisation of what a test Google automated car “sees”]

My hope and expectation is that, when they have reached a sophisticated enough level of operation and are at a certain threshold of prevalence, road travel will become safer.

Today’s road travel is not super-safe. According to the Association for Safe International Road Travel, around 1.3 million people die in road crashes each year – and 20-50 million more are injured or disabled. It’s the single leading cause of death amongst some younger demographics.

Perhaps automated vehicles could save some of these lives, and prevent many of the serious injuries. After all, a few years ago, The Royal Society for the Prevention of Accidents claimed that 95% of road accidents involve some human error, and 76% were solely due to human factors. There is a lot at stake here. And of course there are many more positive impacts (as well as some potential negatives) one might expect from this sort of automation beyond direct life-saving, which we’ll not go into here.

At this moment in time, humanity is getting closer to developing self-driving cars; perhaps surprisingly close to anyone who does not follow the topic. Certainly we do not have any totally automated car capable of (or authorised to be) driving every road safely at the moment, and that will probably remain true for a while yet. But, piece by piece, some manufacturers are automating at least some of the traditionally human aspects of driving, and several undoubtedly have their sights on full automation one day.

Some examples:

Landrover are shortly to be testing semi-autonomous cars that can communicate with other such cars around them.

The test fleet will be able to recognise cones and barriers using a forward facing 3D-scanning camera; brake automatically when it senses a potential collision in a traffic jam; talk to each other via radio signals and warn of upcoming hazards; and know when an ambulance, police car, or fire engine is approaching.

BMW already sells a suite of “driver assistance” features on some cars, including what they term intelligent parking, intelligent driving and intelligent vision. For people with my driving skill level (I’m not one of the statistically improbable 80% of people who think they are above average drivers), clearly the parking assistant is the most exciting: it both finds a space that your car would actually fit into, and then does the tricky parallel or perpendicular parking steering for you. Here it is in action:

Nissan are developing a “ProPilot” feature, which also aims to help you drive safely, change lanes automatically, navigate crossroads and park.

Tesla have probably the most famous “autopilot” system available right now. This includes features that will automatically keep your car in lane at a sensible speed, change lanes safely for you, alert the driver to unexpected dangers and park the car neatly for you. This is likely most of what you need for full automation on some simpler trips, although they are clear it’s a beta feature and that it is important you keep your hands on the steering wheel and remain observant when using it. Presumably preempting our inbuilt tendency towards laziness, it even goes so far as to sense when you haven’t touched the wheel for a while and tells you to concentrate; eventually coming to a stop if it can’t tell you’re still alive and engaged.

Here’s a couple of people totally disobeying the instructions, and hence nicely displaying its features.

And here’s how to auto-park a Tesla:

 

Uber seems particularly confident (when do they not?). Earlier this month, the Guardian reported that:

Uber passengers in Pittsburgh will be able to hail self-driving cars for the first time within the next few weeks as the taxi firm tests its future vision of transportation in the city. The company said on Thursday that an unspecified number of autonomous Ford Fusions will be available to pick up passengers as with normal Uber vehicles. The cars won’t exactly be driverless – they will have human drivers as backup – but they are the next step towards a fully automated fleet.

[Image: one of Uber’s self-driving test cars]

 

And of course Google have been developing a fully self-driving car for a few years now. Here’s a cheesy PR video to show their fun little pods in action.

But no matter how advanced these vehicles get, road accidents will inevitably happen.

In recent times there has been a fatality famously associated with the Tesla autopilot – although as Tesla are obviously at pains to point out, one should remember that it is technically a product in beta and they are clear that you should always concentrate on the road and be ready to take over manually; so this accident might, at best, be attributed to a mix of the autopilot and the human in reality.

However, there will always be some set of circumstances or seemingly unlikely event that neither human nor computer would be able to handle without someone getting injured or killed. Computers can’t beat physics, and if another car is heading up your one-way road, which happens to have a brick wall on one side and a high cliff on the other side, at 100 mph then some sort of bad incident is going to happen. The new question we have to ask ourselves in the era of automation is: exactly what incident should that be?

This obviously isn’t actually a new question. In the uncountable number of human-driven road incidents requiring some degree of driver intervention to avoid danger that happen each day, a human is deciding what to do. We just don’t codify it so formally. We don’t sit around planning it out in advance.

In the contrived scenario I described above, where you’re between a wall and a cliff with an oncoming car you can’t get around, perhaps you instinctively know what you’d do. Or perhaps you don’t – but if you are unfortunate enough to have it happen to you, you’ll likely do something. This may or may not be the same action as you’d rationally pick beforehand, given the scenario. We rely on a mixture of human instinct, driver training and reflexes to handle these situations, implicitly accepting that the price of over a million deaths a year is worth paying to be able to undergo road travel.

So imagine you’re the programmer of the automated car. Perhaps you believe you might eliminate just half of those deaths if you do your job correctly; which would of course be an awesome achievement. But the car still needs to know what to do if it finds itself between a rock and a hard place. How should it decide? In reality, this is complicated far further by the fact that there is a near-infinite number of scenarios and no-one can explicitly program for each one (hence the need for data-sciencey techniques that learn from experience rather than simple “if X then Y” code). But, simplistically, what “morals” should your car be programmed with when it comes to potentially deadly accidents?

  • Should it always try and save the driver? (akin to a human driver’s instinct for self-preservation, if that’s what you believe we have.)
  • Or should it concentrate on saving any passengers in the same car as the driver?
  • How about the other car driver involved?
  • Or any nearby, unrelated, pedestrians?
  • Or the cute puppy innocently strolling along this wall-cliff precipice?
  • Does it make a difference if the car is explicitly taking an option (“steer left and ram into the car on the opposite side of the road”) vs passively continuing to do what it is doing (“do nothing which will result in you hitting the pedestrian standing in front of the wall”).
    • You might think this isn’t a rational factor, but anyone who has studied the famous “trolley problem” thought experiment will realise people can be quite squeamish about this. In fact, this whole debate boils down to some extent as being a realisation of that very thought experiment.
  • Does it make a difference how many people are involved? Hitting a group of 4 pedestrians vs a car that has 1 occupant? Or vice versa?
  • What about interactions with probabilities? Often you can’t be 100% sure that an accident will result in a death. What if the choice is between a 90% chance of killing 1 person or a 45% chance of killing two people?
  • Does it make a difference what the people are doing? Perhaps the driver is ignoring the speed limit, or pedestrians are jaywalking somewhere they shouldn’t. Does that change anything?
  • Does it even perhaps make a difference as to who the people involved are? Are some people more important to save than others?

Well, the MIT Media Lab is now giving you the opportunity to feed into those sorts of decisions, via its Moral Machine website.

To quote:

From self-driving cars on public roads to self-piloting reusable rockets landing on self-sailing ships, machine intelligence is supporting or entirely taking over ever more complex human activities at an ever increasing pace. The greater autonomy given machine intelligence in these roles can result in situations where they have to make autonomous choices involving human life and limb. This calls for not just a clearer understanding of how humans make such choices, but also a clearer understanding of how humans perceive machine intelligence making such choices.

Effectively, they are crowd-sourcing life-and-death ethics. This is not to say that any car manufacturer will necessarily take the results into account, but at least they may learn what the responding humans (which we must note is far from a random sample of humanity) think they should do, and the level of certainty we feel about it.

Once you arrive, you’ll be presented with several scenarios, and asked what you think the car should do in that scenario. There will always be some death involved (although not always human death!). It’ll also give you a textual description of who and what is happening. It’s then up to you to pick out of the two options given which the car should do.

Here’s an example:

[Screenshot: an example Moral Machine scenario]

You see there that a child is crossing the road, although the walk signal is on red, so they should really have waited. The car can choose to hit the child who will then die, or it can choose to ram itself into an inconvenient obstacle whereby the child will live, but the driver will die. What should it do?

You get the picture; click through a bunch on those and not only does MIT gather a sense of humanity’s moral data on these issues, but you get to compare yourself to other respondents on axes such as “saving more lives”, “upholding the law” and so on. You’ll also find out if you have implied gender, age or “social value” preferences in who you choose to kill with your decisions.

This comparison report isn’t going to be overly scientific on an individual level (you only have a few scenarios to choose from apart from anything else) but it may be thought-provoking.

After all, networked cars of the future may well be able to consult the internet and use facts it finds there to aid decisions. A simple extension of Facebook’s ability to face-recognise you in your friends’ photos could theoretically lead to input variables in these decisions like “Hey, this guy only has 5 twitter friends, he’ll be less missed than this other one who has 5000!” or  “Hey, this lady has a particularly high Klout score (remember those?) so we should definitely save her!”.

You might think we’d never be so callous as to allow the production of a score regarding “who should live?”. Well, firstly, we have to. Having the car kill someone by not changing its direction or speed, when the option is there that it could do so, is still a life-and-death decision, even if it results in no new action.

Plus we already do use scores in domains that infer mortality. Perhaps stretching the comparison to its limits, here’s one example (and please do not take it that I necessarily approve or disapprove of its use, that’s a story for another day – it’s just the first one that leaps to mind).

The National Institute for Health and Care Excellence (NICE) provides guidance to the UK National Health Service on how to improve healthcare. The NHS, nationalised as it is (for the moment…beware our Government’s slow massacre of it though), still exists within the framework of capitalism and is held to account on sticking to a budget. It has to buy medicines from private companies and it can only afford so many. This implies that not everyone can have every treatment on the market. So how does it decide what treatments should be offered to who?

Under this framework, we can’t simply go on “give whatever is most likely to save this person’s life”, because some of the best treatments may cost so much that giving them to 10 people, of whom 90% will probably be cured, might mean that another 100 people who could have been treated at an 80% success rate will die, because there was no money left for the cheaper treatment.

So how does it work? Well, to over-simplify, they have famously used a data-driven process involving a quality-adjusted life year (QALY) metric.

A measure of the state of health of a person or group in which the benefits, in terms of length of life, are adjusted to reflect the quality of life. One QALY is equal to 1 year of life in perfect health.

QALYs are calculated by estimating the years of life remaining for a patient following a particular treatment or intervention and weighting each year with a quality-of-life score (on a 0 to 1 scale). It is often measured in terms of the person’s ability to carry out the activities of daily life, and freedom from pain and mental disturbance.

At least until a few years ago, they had guidelines that an intervention costing the NHS less than £20k per QALY gained was deemed cost-effective. It’s vital to note that this “cost effectiveness” was not the only factor that fed into whether the treatment should be offered or not, but it was one such factor.
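To make the arithmetic concrete, with purely invented numbers: a treatment costing £45,000 that is expected to give a patient 4 extra years at a quality-of-life weight of 0.75 delivers 3 QALYs, so

\[ \text{cost per QALY gained} = \frac{£45\,000}{4 \times 0.75} = £15\,000, \]

which would have fallen within that £20k guideline.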

This seemingly quite emotionless method of measurement sits ill with many people: how can you value life in money? Isn’t there a risk that it penalises older people? How do you evaluate “quality”? There are many potential debates, both philosophical and practical.

But if this measure isn’t to be used, then how should we decide how to divide up a limited number of resources when there’s not enough for everyone, and those who don’t get them may suffer, even die?

Likewise, if an automated car cannot keep everyone safe, just as a human-driven car has never been able to, then on which measure involving which data should we base the decision as to who to save on?

But even if we can settle on a consensus answer to that, and technology magically improves to the point where implementing it reliably is child’s play, actually getting these vehicles onto the road en masse is not likely to be simple. Yes, time to blame humans again.

Studies have already looked at the sort of questions that the Moral Machine website poses you. “The Social Dilemma of Autonomous Vehicles” by Bonnefon et al is a paper, published in the journal Science, in which the researchers ran their own surveys as to what people thought these cars should be programmed to do in terms of the balance between specifically protecting the driver vs minimising the total number of casualties, which may include other drivers, pedestrians, and so on.

In general respondents fitted what the researchers termed a utilitarian mindset: minimise the number of casualties overall, no need to try and save the driver at all costs.

In Study 1 (n = 182), 76% of participants thought that it would be more moral for AVs to sacrifice one passenger, rather than kill ten pedestrians (with a 95% confidence interval of 69—82). These same participants were later asked to rate which was the most moral way to program AVs, on a scale from 0 (protect the passenger at all costs) to 100 (minimize the number of casualties). They overwhelmingly expressed a moral preference for utilitarian AVs programmed to minimize the number of casualties (median = 85, Fig. 2a).

(This is also reflected in the results of the Moral Machine website at the time of writing.)

Hooray for the driving public; selfless to the last, every life matters, etc. etc. Or does it?

Well, later on, the survey tackled questions around not only what these vehicles should do in emergencies, but how comfortable respondents would personally be if vehicles did behave that way and, lastly, how likely they would be to buy one that exhibited that behaviour.

Of course, even in thought experiments, bad things seem worse if they’re likely to happen to you or those you love.

even though participants still agreed that utilitarian AVs were the most moral, they preferred the self-protective model for themselves.

Once more, it appears that people praise utilitarian, self-sacrificing AVs, and welcome them on the road, without actually wanting to buy one for themselves.

Humans, at least in that study, appear to have a fairly high consensus that minimising casualties is key in these decisions. But we also have a predictable tendency to be the sort of freeloaders that prefer for everybody else to follow a net-safety-promoting policy, as long as we don’t have to ourselves. This would seem to be a problem that it’s unlikely even the highest quality data or most advanced algorithm will solve for us at present.

Creating my first Tableau web data connector: part 3

At last, the final part of the trials and tribulations of creating my first Tableau Web Data Connector… Part 1 went through the pre-requisites, and building what I might generously term my “user interface”. Part 2 was a struggle against the forces of web security. And in this part, we battle against the data itself, until the error-message dragon is truly slayed.

So, in the last part, we ended up with a connector that could ask for the username of a BoardGameGeek user, connect to a CORS-enabled version of the API, and complain that whatever objects it found “are not valid as a React child”.

That makes sense, as I didn’t yet tell it what sort of data objects to expect.

As I’m following the Tableau web data connector tutorial, which accesses data from URLs like this one, I figured I’d look at what data that URL returns precisely, compare it to the web data connector tutorial code, and then, when I understand the relationship between the code and the data from the Yahoo URL, I might be able to adapt the code to fit the format my chosen data arrives in.

Here’s the data one gets if you point your web browser to the URL that the tutorial is teaching you how to build a WDC from:

[Screenshot: the JSON data returned by the tutorial’s Yahoo finance URL]

OK, stock-pickers will immediately recognise that it’s stock quote data, with dates, highs, lows, symbols and all the other lovely gubbins that come along with basic stock charts. It’s in JSON format, which is human-readable, but only in small doses.

So I need to understand the relationship between the above data, and this piece of code:

if (data.query.results) {
    var quotes = data.query.results.quote;
    var ii;
    for (ii = 0; ii < quotes.length; ++ii) {
        var entry = {'Ticker': quotes[ii].Symbol,
                     'Day': quotes[ii].Date,
                     'Close': quotes[ii].Close};
        dataToReturn.push(entry);
    }
}

I can kind of see it by eye – there are obviously JSON entries for Symbol, Date, Close, query and so on. But can’t we make it easier?

Yes we can, because there’s such a thing as a web JSON viewer. Paste all that stock-related text into something like http://www.jsoneditoronline.org/ and you get a nice hierarchical visualisation of the JSON structure, like this:

[Screenshot: the stock data viewed in the online JSON editor]

So if we assume that “data” is referring to the overall set of data itself, and use dot notation to traverse the hierarchy, we can see a line in the code that says:

var quotes = data.query.results.quote;

That would seem to fit into the JSON structure above, where below the “object” level you have a level called query, a sublevel called results, and a sublevel called quote.

[Screenshot: the query → results → quote levels in the JSON viewer]

The variable “quotes” that the code creates is therefore basically referring to everything at/below the “quote” level of the JSON hierarchy.

Then you get a bunch of records which are the data of interest itself. These are numbered 0 for the first datapoint, 1 for the next, and so on.

If you know a bit of basic programming, you might note that the WDC code has a “for loop” with a counter variable called “ii” that starts at 0 and increments by one each time it runs. That fits nicely with the idea of iterating through datapoint 0, 1, …n until it gets to the end (which is effectively the length of the quotes dataset, i.e. quotes.length).

Each one of these JSON records then has a few attributes – and they include the ones mentioned in the code as being part of whichever “quotes” subrecord we’re on; “Symbol”, “Date”, “Close”.

var entry = {'Ticker': quotes[ii].Symbol,
             'Day': quotes[ii].Date,
             'Close': quotes[ii].Close};

[Screenshot: the Symbol, Date and Close attributes of a quote record]

Awesome. So where do “Ticker”, “Day” and “Close” on the left-hand side of those code lines come from?

Well, that’s up to us. Remember the “add code for defining columns” part of the tutorial?

myConnector.getColumnHeaders = function() {
    var fieldNames = ['Ticker', 'Day', 'Close'];
    var fieldTypes = ['string', 'date', 'float'];
    tableau.headersCallback(fieldNames, fieldTypes);
}

There they’ve defined the fieldnames they want to have Tableau display (which can be totally different from the JSON names we saw above in the datasource) and the field types.

Field types can be any from this list:

  • bool
  • date
  • datetime
  • float
  • int
  • string

So, let’s look at what’s returned when I make the call to my Boardgames datasource.

[Screenshot: the raw response from the BoardGameGeek API call]

OK, I can see some fields in there, but it’s kind of messy. So I pasted it back into the online JSON editor.

Here’s what the relevant part looks like when formatted nicely:

[Screenshot: the BoardGameGeek response formatted in the JSON editor]

So, the interesting information there is probably the boardgame’s name, year published, URLs to its image and thumbnail, the various statuses (own, want, etc.) and the number of times it’s been played (numplays).

I therefore edited the “getColumnHeaders” code snippet to reflect the fields I wanted, and the type of data that would feature in them.

myConnector.getColumnHeaders = function() {
		var fieldNames = ['BoardgameName','YearPublished','ImageURL','ImageThumbnailURL', 'Plays','Owned','PreviouslyOwned','ForTrade','Want','WantToPlay','WantToBuy','Wishlist','Preordered','LastModified'];
		var fieldTypes = ['string','int','string','string','int','int','int','int','int','int','int','int','int','datetime'];
		tableau.headersCallback(fieldNames, fieldTypes);
	}

Now I’ve defined the fields, I can go back to the retrieving results section of code (if (data.query.results)…) and, now knowing the structure of the JSON generated by my API, parse out and assign to the above variables the data I want.

I decided to call the collection of data I was building “games” rather than “quotes”, because that’s what it is. I next noted that each individual “game” within the JSON is listed in a hierarchy below the “item” entry which itself is below “items”.

[Screenshot: the items → item hierarchy in the JSON]

(“Amyitis” is the name of a boardgame, rather than an allergy to people called Amy, believe it or not).

So, I assigned all the items.item data to “games”

if (data.items.item) {
var games = data.items.item;

And copied the tutorial to loop through all the “items.item” entries, each of which is a boardgame, until we’d reached the end – i.e. when the number of times we’ve looped equals the length of the data table.

var ii;
for (ii = 0; ii < games.length; ++ii) {

Finally, it’s time to assign the relevant bit of data returned from the API to the variables I’d defined above.

[Screenshot: the fields within an individual game record]

At first I got a little confused, because the JSON output had a bunch of _ and $ entries that didn’t seem similar to what was returned in the tutorial dataset. But it turns out that’s nothing to worry about. Just treat them as though they were any other text.

var entry = {'BoardgameName': games[ii].name[0]._,
'YearPublished': games[ii].yearpublished[0],
'ImageURL': 'https:' + games[ii].image[0],
'ImageThumbnailURL': 'https:' + games[ii].thumbnail[0],
'Plays': games[ii].numplays[0],
'Owned': games[ii].status[0].$.own,
'PreviouslyOwned': games[ii].status[0].$.prevowned,
'ForTrade': games[ii].status[0].$.fortrade,
'Want': games[ii].status[0].$.want,
'WantToPlay': games[ii].status[0].$.wanttoplay,
'WantToBuy': games[ii].status[0].$.wanttobuy,
'Wishlist': games[ii].status[0].$.wishlist,
'Preordered': games[ii].status[0].$.preordered,
'LastModified': games[ii].status[0].$.lastmodified
};
dataToReturn.push(entry);

In the above, you can think about games[ii] as being the individual boardgame record. The “for loop” we defined above substitutes each record into the ‘ii’ variable, so it’s accessing games[0], games[1] etc. which translates into data.items.item[0] and so on, if you remember how we defined the games variable above.

Then to find the boardgame’s name we need to traverse into the “name” child of the boardgame itself, then look for the first entry below that (always the first one here, so we can refer to it as entry [0]), and look for the field that is shown as an underscore, _.

[Screenshot: the name entry and its underscore field in the JSON viewer]

 

Hence:

'BoardgameName': games[ii].name[0]._,

Rinse and repeat this for each element of interest, and you’ve collected your dataset ready for Tableau to use!

Of course, the world is not perfect, and, in reality, I did not manage to do the above without making the odd mistake here and there. I got blank fields sometimes, when I knew they shouldn’t be, or fields with the wrong data in. As I was just using Notepad and the online simulator, I wasn’t really getting many useful error messages.

Lesson learned: you can use the code

tableau.log("YOUR MESSAGE HERE");

to display the message you write in the Google Chrome developer console. You might remember from part 2 that you can bring that up by pressing Ctrl Shift I in Google Chrome, and selecting the “Console” tab.

Why is that useful? Well, firstly, if it displays your message then it means that piece of code ran, so you can check you’re not skipping over any important section. And secondly, you can append text stored in variables to it. So for instance I could write:

tableau.log("the name of the game is " + games[ii].name[0]._);

And, if my code is correct, it should print out the name of the boardgame as it runs.

If I messed up, perhaps using the wrong index, or forgetting the underscore, it will output something different, and perhaps also an error message, guiding me towards the source of the problem.

That, as they say, is basically that! I ran it through the hosted Tableau Web Data Connector simulator again, this time ticking the box so it will “automatically continue to data gather phase”. And, lo and behold, below you can see that it has interpreted the information about that famous Amyitis game into a nice tabular format.

[Screenshot: the simulator’s data gather phase showing the boardgame data in tabular form]

Once you’ve got to that stage, you’ve already won. Just load up Tableau Desktop for real, select to connect to a web data connector and up should pop your interface, and later, your data.

 

Creating my first Tableau web data connector – part 2

Now for the next part of the creating-a-web-data-connector journey: previously, I had got all the software I needed to create a web data connector installed (well, to be fair Notepad comes with Windows so that wasn’t hard) and designed my user interface.

Now it was time to move on to Tableau tutorial section 5 – using the information retrieved from the user of my UI in order to have Tableau connect to the BoardGameGeek API and retrieve the relevant data for analysis.

I read through the Tableau WDC tutorial, and, using some previous programming-other-stuff knowledge, realised that this tutorial was a mix of explaining the fundamentals of connecting to a generic web data connector, and explaining some very specific details of the example Yahoo finance web data connector the example showed.

The first thing to do then was to determine which bits were essential for my simple attempt and which were only there to facilitate the specific Yahoo API.

  • The “code for defining columns” section is obviously important. I need columns to put my data in.
  • The Yahoo Query Language (YQL) parts less so – my API is nothing to do with Yahoo.
  • The helper functions (buildUri, getFormattedDate,makeTwoDigits), no, these are just functions to help construct the YQL query needed to access their example API. It’s great to know that it’s possible to use helper functions like this if needed, but chances are, if I did need some, they’d not be the same unless my API was very similar to their Yahoo example.
  • The “code to get stock data”, well, I don’t want stock data, but I do want some sort of data so obviously some of that’s going to be relevant, so I pasted that into my file as-is, noting from the code that some of it was clearly going to need to be replaced given I don’t want boring finance data when exciting boardgames data is available instead.

It seemed obvious that providing the weblink to my API of choice – http://boardgamegeek.com/xmlapi2/collection?username=[whatever] – was going to be key. So I replaced their connectionUri code with a simple text variable of that, using the user-supplied BGG username I had got from the previous part.

connectionUri = 'http://boardgamegeek.com/xmlapi2/collection?username=' + bggUsername

 

(a URI is a Uniform Resource Identifier)

Of course this wasn’t going to be enough to make it work – I have not yet told Tableau what data I want it to receive and how to format it – but figured I’d run it through the simulator and see what happened. Nothing especially good did. But I did learn about “CORS”.

Now, again, no particular error messages tend to pop up by default even when things go fatally wrong in the simulator. It might look like it hung, it might go blank or similar, but no friendly message.

Lesson learned: if you are working with the super-basic tools I had at hand, what you need is the built-in developer tools Chrome has, which you can bring up by pressing Ctrl Shift I together. Switch to the “Console” tab, and you might see an error message there to help you out, if you’re lucky.

Here’s what I saw:

[Screenshot: the Chrome console showing an Access-Control-Allow-Origin error]

Now, it’s been a long time since I did any serious web developing, and I had never suffered an Access-Control-Allow-Origin problem before. A swift bit of Googling led me to this nice explanation, where I learnt that it’s a security feature “implemented in browsers to restrict interaction between documents (or scripts) that have different origins”. That makes sense; I own neither the BoardGameGeek site nor Github, I’m sad to say.

Of the 3 solutions that Jvaneyck suggests, it seemed like if I could find a JSONP source for my API instead of the XML one then that would be simplest, given I have no control over either server nor any intention of building infrastructure. So, can I get the BoardGameGeek API in JSONP format? Well, another quick Google ™ led me to a post on the BoardGameGeek forum, where a kind user named strathmeyer had offered up an “Ajax-ready BGG API for any web programmers out there”.

The best thing is that I’ve enabled CORS headers, so you can use the API straight from the web browser

Yes, here indeed that was the best thing! Access-Control-Allow-Origin issue solved, and all thanks to the work of someone else 🙂

Lesson learned: I later found that Tableau 1) has a whole page dedicated to this exact problem, and 2) there’s an example in their SDK of various methods to get around this issue when connecting to XML. So next time, I should probably RTFM before going Google crazy.

But for now, all I needed to do was change the URI I pointed to from my web data connector to:

connectionUri = 'http://bgg-api.herokuapp.com/api/v1/collection?username=' + bggUsername

I ran that through the simulator once more, of course not expecting it to work before I had defined what data I was actually after – but it did eliminate the previous error message so I knew things were moving on nicely.

So when it ran through, it was Ctrl Shift I back in Chrome to bring up the inspector, where I saw this message.

[Screenshot: the Chrome console showing a “variable is not defined” error]

My variable is not defined?! Where did it go? Oops, my fault, I’d embedded a previous function in this one. I needed to be more careful with my bracket-pasting finger. But it was simplicity itself to move the code back to where it should be, and re-simulate.

 

[Screenshot: the Chrome console showing an “uncaught invariant violation” error]

OK, “uncaught invariant violation” is perhaps a little cryptic, but it sounds like it’s talking about having found something it didn’t expect. That could be my data (spoiler: it was). So now I needed to tell it what to expect and what to do with it.

Tune back in, same place, different time, for the exciting conclusion!

 

The Tableau #MakeoverMonday doesn’t need to be complicated

For a while, a couple of  key members of the insatiably effervescent Tableau community, Andy Cotgreave and Andy Kriebel, have been running a “Makeover Monday” activity. Read more and get involved here – but a simplistic summary would be that they distribute a nicely processed dataset on a topic of the day that relates to someone else’s existing visualisation, and all the rest of us Tableau fans can have a go at making our own chart, dashboard or similar to share back with the community so we can inspire and learn from each other.

It’s a great idea, and generates a whole bunch of interesting entries each week. But Andy K noticed that each Monday’s dataset was getting way more downloads than the number of charts later uploaded, and opened a discussion as to why.

There are of course many possible reasons, but one that came through strongly was that, whilst they were interested in the principle, people didn’t think they had the time to produce something comparable to some of the masterpieces that frequent the submissions. That’s a sentiment I wholeheartedly agree with, and, in retrospect – albeit subconsciously – why I never gave it a go myself.

Chris Love, someone who likely interacts with far more Tableau users than most of us do, makes the same point in his post on the benefits of Keeping It Simple Stupid. I believe it was written before the current MakeoverMonday discussions began in earnest, but was certainly very prescient in its applications to this question.

Despite this awesome community many new users I speak to are often put off sharing their work because of the high level of vizzes out there. They worry their work simply isn’t up to scratch because it doesn’t offer the same level of complexity.

 

To be clear, the original Makeover Monday guidelines did make the point that it’s quite proper to just spend an hour fiddling around with the data. But firstly, after a hard day battling against the dark forces of poor data quality and data-free decisions at work, it can be a struggle to keep on trucking for another hour, however fun it would be in other contexts.

And that’s if you can persuade your family that they should let you keep tapping away for another hour doing what, from the outside, looks kind of like you forgot to finish work. In fact a lot of the worship I have for the zens is how they fit what they do into their lives.

But, beyond that, an hour is not going to be enough to “compete” with the best of what you see other people doing in terms of presentation quality.

I like to think I’m quite adept with Tableau (hey, I have a qualification and everything :-)), but I doubt I could create and validate something like this beauty using an unfamiliar dataset on an unfamiliar topic in under an hour.

 

It’s beautiful; the authors of this and many other Monday Makeovers clearly have an immense amount of skill and vision. It is fascinating to see both the design ideas and technical implementation required to coerce Tableau into doing certain non-native things. I love seeing this stuff, and very much hope it continues.

But if one is not prepared to commit the sort of time needed to do that regularly to this activity, then one has to try and get over the psychological difficulty of sharing a piece of work which one perceives is likely to be thought of as “worse” than what’s already there. This is through no fault of the MakeoverMonday chiefs, who make it very clear that producing a NYT infographic each week is not the aim here – but I certainly see why it’s a deterrent from more of the data-downloaders uploading their work. And it’s great to see that topic being directly addressed.

After all, for those of us who use Tableau for the day-to-day joys of business, we probably don’t rush off and produce something like this wonderful piece every time some product owner comes along to ask us an “urgent” question.

Instead, we spend a few minutes making a line chart that gives them some insight into the answer to their question. We upload an interactive bar chart, with default Tableau colours and fonts, to let them explore a bit deeper, and so on. We sit in a meeting and dynamically provide an answer that enables live decision-making which, before we had tools like this, would have had to wait a couple of weeks for a csv report. Real value is generated, and people are sometimes even impressed, despite the fact that we didn’t include hand-drawn iconography, gradient-filled with the company colours.

Something like this perhaps:

Yes, it’s “simple”, it’s unlikely to go Tableau-viral, but it makes a key story held within that data very clear to see. And its far more typical of the day-to-day Tableau use I see in the workplace.

For the average business question, we probably do not spend a few hours researching and designing a beautiful colour scheme in order to perform the underlying maths needed to make a dashboard combining a hexmap, a Sankey chart and a network graph in a tool that is not primarily designed to do any of those things directly.

No-one doubts that you can cajole Tableau into such artistry, that there is sometimes real value obtainable by doing so, or that those who carry it out may be creative geniuses – but unless they have a day job very different from that of mine and my colleagues, then I suspect it’s not their day-to-day either. It’s probably more an expression of their talent and passion for the Tableau product.

Pragmatically, if I need to make, for instance, a quick network chart for “business”, then, all other things being equal, I’m afraid I’m more likely to get out a tool that’s designed to do that than take a bit more time to work out how to implement it in Tableau, no matter how much I love it (by the way, Gephi is my tool of choice for that – it is nowhere near as user-friendly as Tableau, but it is specifically designed for that sort of graph visualisation; also, recent versions of Alteryx can do the basics). Honestly, it’s rare for me that these more unusual charts need to be part of a standard dashboard; our organisation is simply not at a level of viz-maturity where these diagrams are the most useful for most people in the intended audience, if indeed they are for many organisations.

And if you’re a professional whose job is creating awesome newspaper-style infographics, then I suspect that, more often than not, you’re not using Tableau as the tool that provides the final output either. That’s not its key strength in my view, and it’s not how they sell it – although they are justly proud of the design-thought that goes into the software in general. But if the paper WSJ is your target audience, you might be better off using a more custom, design-focused tool, like Adobe Illustrator (and Coursera will teach you that specific use-case, if you’re interested).

I hope nothing here will cause offence. I do understand the excitement, and I admire anyone’s efforts to push the boundaries of the tool – I have done so myself, spending far more time than is strictly necessary by any theoretical metric of “insights generated per hour” to make something that looks cool, whether in or out of work. For a certain kind of person it’s fun, it’s a nice challenge, it’s a change from a blue line on top of an orange line, and sometimes it might even produce a revelation that really does change the world in some way.

This work surely needs to be done; adherents to (a bastardised version of) Thomas Kuhn’s theory of scientific revolutions might even claim this “pushing to the limits” as one way of engendering the mini-crisis necessary to drive real progress in the field. I’m sure some of the valuable Tableau “ideas”, which partly feed the development of the software, have come from people pushing the envelope, finding value, and realising there should be an easier way to generate it.

There’s also the issue of engagement: depending on your aim, optimising your work for being shared worldwide may be more important to you than optimising it for efficiency, or even clarity and accuracy. This may sound like heresy, and it may even touch on ethical issues, but I suspect a survey of the most well-known visualisations outside of the data community would reveal a discontinuity with the ideals of Stephen Few et al!

But it may also be intimidating to the weary data voyager, when deciding whether to participate in this sort of Tableau community activity, if it seems like everyone else produces Da Vinci masterpieces on demand.

Now, I can’t prove this with data right now, sorry, but I just don’t think that can be the case. You may see a lot of fancy and amazing things on the internet – but that’s the nature of how stuff gets shared around; it’s a key component of virality. If you create a default line chart, it may actually be the best answer to a given question, but outside a small community actively interested in the subject domain at hand, it’s not necessarily going to get much notice. I mean, you could probably find someone who made a Very Good Decision based on one of those ghastly Excel 2003 default charts with the horrendous grey background, if you try hard enough.

[Image: a default Excel 2003 chart, grey background and all. Never forget…]

So, anyway, time to put my money where my mouth is and actually participate in MakeoverMonday. I don’t need to spend even an hour making something if I don’t want to, right?  (after all, I’ve used up all my time writing the above!)

Tableau is sold with emphasis on its speed of data sense-making, with claims of enabling you to produce something reasonably intelligible 10-100x faster than other tools. If we buy into that hype, then spending 10 minutes of Tableau time (necessitating making one less cup of tea, perhaps) should enable me to produce something that could have taken up to around 17 hours to produce in Excel.

OK, that might be taking the marketing rather too literally, but the point is hopefully clear. For #MakeoverMonday, some people may concentrate on how far they can push Tableau outside of its comfort zone, others may focus on how they can integrate the latest best practice in visual design, whereas here I will concentrate on whether I can make anything intelligible in the time it takes to wait for a coffee in Starbucks (on a bad day) – the “10 minute” viz.

So here’s my first “baked in just 10 minutes” viz, on the latest MakeoverMonday topic – the growth of the population of Bermuda. Nothing fancy: the 10 minutes ran out just as I was changing the fonts for the chart titles, so it still has the nasty default tooltips – but it’s a readable chart that tells you something about the population change in Bermuda over time. Click through for the slightly interactive version.

[Image: Bermuda population growth chart]


Creating my first Tableau web data connector – the highs and the lows: part 1

After having successfully (enough) completed my introduction to creating a real live Tableau web data connector, I wanted to follow the lead of one of the inspirations for this otherwise unnecessary effort – Chris Love’s Periscope on the subject of sharing Tableau-related failure as well as unbridled success – and document something about the less smooth aspects of the journey. If nothing else, these will be useful notes if I have to do this again for real in the future.

First of all, I gave myself an unnecessarily harsh setup 🙂 . I was working on my company computer (shhhhh) as it has Tableau Desktop installed, and I figured that would be important. The catch is that I don’t have the admin rights needed to install any new software on it. And as I don’t do web development as part of the day job, I was restricted to using only very generic tools: in practice, Windows Notepad for the most part, and Google Chrome for testing.

I had downloaded the Tableau Web Data Connector software development kit, and I was a little worried by claims that I’d need to install a web server to use it – something I’m not sure our work IT security overlords would appreciate. But in the end, I learned that you can successfully create a Tableau web data connector without having:

  • the web data connector SDK
  • a web server
  • admin rights
  • …Tableau (yes, seriously).

What I did need:

  • a web browser (I used Chrome)
  • a text editor (I used Windows Notepad, because, well, what more could you want?)

I had also decided to try and work it out without asking anyone anything. The Tableau user community is pretty amazing in terms of generously giving up answers, and I’m sure someone would have solved any problem I encountered so fast all the fun would have gone.

My approach, then, was to follow the Tableau Web Data Connector tutorial. In this tutorial, they build a web data connector to a financial website to retrieve stock data. I figured that for a first try I’d follow along, and simply swap in the details of the BoardGameGeek API I wanted to use in place of the Yahoo API.
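To make that concrete: the “details” in question are really just the endpoint being called. As far as I can tell, BoardGameGeek exposes a user’s collection through its public XML API at a URL along the lines of the sketch below. The variable names here are just my own placeholders, and the real request-building happens later, in the data-gathering phase.

// Hypothetical sketch only: roughly the sort of URL I'd be swapping in for the
// tutorial's Yahoo stock one. BoardGameGeek's XML API returns a user's game
// collection for a given username.
var bggUsername = "someBggUser"; // placeholder; the real value comes from the form later
var bggUrl = "https://boardgamegeek.com/xmlapi2/collection?username=" +
             encodeURIComponent(bggUsername);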

I’m not going to go through the tutorial line by line, as you might as well just read the original. Instead I’ll just highlight some points where I had problems, and what I had to do to overcome them.

The first thing I’d note in hindsight is that the tutorial is not actually “how to build the simplest possible web data connector”, as I had imagined it would be from its description as “basic”. It certainly is a relatively simple one (no authentication needed, etc.), but it does contain at least 3 functions that are only necessary because of the specific API they were connecting to. These are:

  • buildUri
  • getFormattedDate
  • makeTwoDigits

Pretty obvious when you see what they do – but it would be a mistake to imagine you should copy the function structure of the basic connector wholesale and just fill in your own code. For the most part, you won’t need the above 3 functions at all. I felt the tutorial could make that clearer, especially for someone even more of a novice than me, someone who has never touched web programming (I’ve done some in the distant past).
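For what it’s worth, here’s a very rough sketch of the sort of bare-bones skeleton I mean: just the registration plus the schema and data functions. This assumes the current (2.x) version of the WDC JavaScript API that the tutorial is written against; the table and column names are hypothetical placeholders of mine rather than anything from the tutorial, and the whole thing sits inside the connector’s HTML page alongside the Tableau WDC library script.

(function() {
    // Create the connector object
    var myConnector = tableau.makeConnector();

    // Describe the single table this connector will return
    myConnector.getSchema = function(schemaCallback) {
        var cols = [
            { id: "gameName", dataType: tableau.dataTypeEnum.string },
            { id: "yearPublished", dataType: tableau.dataTypeEnum.int }
        ];
        schemaCallback([{
            id: "bggCollection",
            alias: "BoardGameGeek collection",
            columns: cols
        }]);
    };

    // Fetch the data and hand it to Tableau, row by row
    myConnector.getData = function(table, doneCallback) {
        // The real call to the BoardGameGeek API goes here (that's for part 2);
        // a placeholder row keeps this skeleton self-contained.
        table.appendRows([{ gameName: "An example game", yearPublished: 2016 }]);
        doneCallback();
    };

    tableau.registerConnector(myConnector);
})();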

But anyway, begin at the beginning. I raced happily through parts 1 and 2 of the tutorial, copying and pasting their base code. Nothing important to change there unless you want to give your version a more appropriate title.

Part 3 instructed me how to create a form on which the user can enter information about the data they want. Again, I can imagine some data connectors won’t need to prompt the analyst for any information, in which case you could disregard this section. However, I did actually need that feature, as I wanted to be able to enter a BoardGameGeek username and retrieve only the board games that person has in their collection.
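As a sketch of what that boils down to (again assuming the WDC 2.x API, with element names that are my own examples rather than the tutorial’s): the form itself is plain HTML, and a small click handler stores whatever the user typed in tableau.connectionData (a string Tableau passes through to the later data-gathering phase) before handing control back to Tableau via tableau.submit().

<!-- Minimal sketch of the interactive-phase form; element names are mine -->
<form>
  <label for="bggUsername">BoardGameGeek username:</label>
  <input type="text" id="bggUsername" />
  <button type="button" id="submitButton">Get collection</button>
</form>

<script>
  // Stash the entered username where the data phase can read it back out,
  // give the connection a friendly name, and return control to Tableau.
  document.getElementById("submitButton").addEventListener("click", function() {
    tableau.connectionData = document.getElementById("bggUsername").value;
    tableau.connectionName = "BoardGameGeek collection";
    tableau.submit();
  });
</script>

Later on, getData can read the username back out of tableau.connectionData to build the API request.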

In the meantime, I had discovered – joyfully – that there’s a hosted Tableau Web Data Connector simulator on Github; thank you so much! The reason it made me happy is that it meant I didn’t need to install a web server on my own computer, or do anything with the SDK that might have involved begging for admin rights over my laptop.

Once I’d developed a connector in Notepad, I could just upload it somewhere (I chose Github Pages) and use the hosted simulator to see whether it worked – which is the reason you don’t actually need a copy of Tableau in the first place. I’m sure it would have been far quicker to iterate if I had installed everything locally, but it works just fine in a hosted version – as long as you beware the cache.

Lesson learned: there seems to be a lot of caching going on somewhere between or within the hosted Tableau Web Data Connector simulator and my web host (Github). Often I’d make changes to the code and re-upload it, but the result shown in the simulator didn’t seem to change. Sometimes closing browsers, deleting cookies etc. helped – but not always. I’m sure there is some science to this that one could work out, but given I didn’t need to make many edits to develop a relatively simple data connector, I settled for renaming my file “version 1.html”, “version 2.html” and so on whenever I made a modification, which guaranteed a lack of annoying caching action.

But anyway, defining the user interface per the tutorial went smoothly. I entered the web address where I’d uploaded my connector: “amedcalf.github.io/BGGConnector.html”, unticked the “automatically continue to data phase” box (as the data phase is something I defined later, in line with the tutorial), and ran the “interactive phase”.

Up popped my bare-bones form asking for a BoardGameGeek username. I filled it in and pressed OK, and it took me back to the simulator, which indicated that it had indeed recognised the information I gave it. Success!

[Screenshot: the simulator after submitting the form]

No data was returned just yet – but it clearly recognised what I had entered into the form just fine.

Lesson learned: although this wasn’t a problem for me at this stage, I later noticed that if there’s anything wrong in your code, you don’t always get an error message per se. Sometimes the simulator just doesn’t close your UI, or it does but doesn’t return any information – i.e. it acts like a faulty web page. In my experience that tended to be because I’d made a minor mistake in my code – a missing semicolon or bracket or similar. Once that was fixed, the simulator worked nicely again (if you want to go wild and use something other than Notepad, something that does syntax highlighting, you’ll probably suffer this far less). So you should assume that it’s your code, not the simulator, that is broken in such instances 🙂

It became clear that the user interface is defined in normal HTML – so if you are a nifty web designer/coder, or want to learn to be one, you can make it as beautiful as you like. That wasn’t the point of the exercise for me though, so my effort was a plain default font and a default box to type information into.

So, that was the user interface “designed” and tested, which takes us up to part 4 of the Tableau tutorial.

Next up: time to actually access the data I wanted. This proved to be a little more difficult first time through though.

#VisualizeNoMalaria: Let’s all help build an anti-Malaria dataset

As well as just being plain old fun, data can also be an enabler for “good” in the world. Several organisations are clearly aware of this; both Tableau and Alteryx now have wings specifically for doing good. There are whole organisations set up to promote beneficial uses of data, such as DataKind, and a bunch of people write reports on the topic – for example Nesta’s report “Data for good“.

And it’s not hard to get involved. Here’s a simple task you can do in a few minutes (or a few weeks if you have the time) from the comfort of your home, thanks to a collaboration between Tableau, PATH and the Zambian government: Help them map Zambian buildings.

Why so? For the cause of eliminating the scourge of malaria from Zambia. In order to effectively target resources at malaria hotspots (and, in future, to predict where the disease might flare up), they’re

developing maps that improve our understanding of the existing topology—both the natural and man-made structures that are hospitable to malaria. The team can use this information to respond quickly with medicine to follow up and treat individual malaria cases. The team can also deploy resources such as indoor spraying and bed nets to effectively protect families living in the immediate vicinity.

Zambia isn’t like Manhattan. There’s no nice straightforward grid of streets that even a crazy tourist could understand with minimal training, and no 3D-Google-Earth-style building-level resource available. The task at hand is therefore to establish, from satellite photos, a detailed map of where buildings – and hence people – are. One day, no doubt, an AI will be employed for this job, but right now it remains one for us humans.

Full instructions are in the Tableau blog post, but honestly, it’s pretty easy:

  • If you don’t already have an OpenStreetMap user account, make a free one here.
  • Go to http://tasks.hotosm.org/project/1985 and log in with the OpenStreetMap account
  • Click a square of map, choose “edit in iD editor”, scan around the map looking for buildings, and have fun drawing boxes on top of them.

It may not be a particularly fascinating activity over the long term, but it’s more fun than a game of Threes – and you’ll be helping to build a dataset that may one day save a serious number of lives, amongst other potential uses.

Well done to all concerned for making it so easy! And if you’ve never poked around the fantastic collaborative project that is OpenStreetMap itself, there’s a bunch of interesting stuff available there for the geographically-inclined data analyst.