Do good and bad viz choices exist?

Browsing the wonderful timeline of Twitter one evening, I noted an interesting discussion on subjects including Tableau Public, best practice, chart choices and dataviz critique. It’s perhaps too long to go into here, but this tweet from Chris Love caught my eye.

Not being particularly adept at summarising my thoughts in 140 characters, I wanted to explore some thoughts around the subject here. Overall, I would concur with the sentiment as expressed – particularly when it had to be crammed into such a small space, and taken out of context as I have here 🙂

But, to take the first premise, whilst there are probably no viz types that are inherently terrible or universally awesome, I think one can argue that there are good or bad viz choices in many situations. It might be the case in some instances that there’s no best or worst viz choice (although I think we may find that there often is, at least out of the limited selection most people are inclined to use). Here I am imagining something akin to a data-viz version of Harris’ “moral landscape”; it may not be clear what the best chart is, but there will be local maxima that are unquestionably better for purpose than some surrounding valleys.

So, how do we decide what the best, or at least a good, viz choice is? Well, it surely comes down to intention. What is the aim of the author?

This is not necessarily self-evident, although I would suggest defaulting to something like “clearly communicating an interesting insight based on an amalgamation of datapoints” as a common one. But there are others:

  • providing a mechanism to allow end-users to explore large datasets which may or may not contain insights,
  • providing propaganda to back up an argument,
  • or selling a lot of books or artwork

to name a few.

The reason we need to understand the intention is that it should be the measure of whether the viz is good or bad.

Imagine my aim is to communicate, to an audience of ten may-as-well-be-clones business managers, that 10% of my customers are so unprofitable that we would be better off without them – note that the details of the audience are very important here too.

I’ll go away and draw 2 different visualisations of the same data (perhaps a bar chart and, hey, why not, a 3-d hexmap radial chart 🙂 ). I’ll then give version 1 to five of the managers, and version 2 to the other five. Half an hour later, I’ll quiz them on what they learned. Simplistically, I shall feel satisfied that whichever one of them generated the correct understanding in the most managers was the better viz in this instance.
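As a rough illustration of how the quiz results might be compared, here is a minimal R sketch. The counts are entirely made up, and reaching for Fisher’s exact test is my own assumption about a sensible check for such a tiny sample, not part of the thought experiment itself.

```r
# Hypothetical quiz results: how many of the five managers in each group
# came away with the correct understanding of the insight.
results <- matrix(
  c(5, 0,   # bar chart: 5 understood, 0 did not
    1, 4),  # radial chart: 1 understood, 4 did not
  nrow = 2, byrow = TRUE,
  dimnames = list(viz = c("bar", "radial"), understood = c("yes", "no"))
)

# Share of each group that understood the message
share.correct <- results[, "yes"] / rowSums(results)

# Fisher's exact test copes with very small counts like these
test <- fisher.test(results)

print(share.correct)
print(test$p.value)
```

With only ten people any result will be fragile, which is one reason the formal research mentioned next matters.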

Yes yes, this isn’t a perfect double-blind controlled experiment, but hopefully the point is apparent. “Proper” formal research on optimising data visualisation is certainly done, and very necessary it is too. There are far too many examples to list, but classics in the field might include the paper “Graphical Perception” by Cleveland and McGill, which helped us understand which types of charts were conducive to being visually decoded accurately by us humans and our built-in limitations.

Commercially, companies like IBM or Autodesk or Google have research departments tackling related questions. In academia, there are groups like the University of Washington Interactive Data Lab (which, interestingly enough, started out as the Stanford Visualization Group, whose work on “Polaris” was later released commercially as none other than Tableau software).

If you’re looking for ideas to contribute to on this front, Stephen Few maintains a list of some research he’d like to see done on the subject in future, and no doubt there are infinitely many more possibilities if none of those pique your curiosity.

But the point is: for certain given aims, it is often possible to use experimental procedures and the resulting data, to say, as surely as we can say many things, visualisation A is better than visualisation B at achieving its aim.

But let’s not go too far in expressing certainty here! There are several things to note, all contributing to the fact that very often there is not one best viz for a single dataset – context is key.

  • What is the aim of the viz? We covered that one already. Using a set of attractive colours may be more important than correct labelling on axes if you’re wanting to sell a poster for instance. Certain types of chart make for easier and more accurate types of particular comparisons than others. If you’re trying to learn or teach how to create a particular type of uber-creative chart in a certain tool, then you’re going to rather fail to accomplish that if you end up making a bar chart.
  • Who is the audience? For example, some charts can convey a lot of information in a small space; for instance box-and-whisker plots. An analyst or statistician will probably very happily receive these plots to understand and compare distributions and other descriptive stats in the right circumstances. I love them. However, extensive experience tells me that, no, the average person in the street does not. They are far less intuitive than bar or line charts to the non-analytically inclined/trained. However inefficient you might regard it, a table and 3 histograms might communicate the insight to them more successfully than a boxplot would. If they show an interest, by all means take the time to explain how to read a box plot; extol the virtues of the data-based lifestyle we all know; rejoice in being able to teach a fellow human a useful new piece of knowledge. But, in reality, your short-term job is more likely to be to communicate an important insight rather than provide an A-level statistics course – and if you don’t do well at fulfilling what you’re being employed to do, then you might not be employed to do it for all that long.

As well as there being no single best viz type in a generic sense, there’s also no one universally worst viz type. If there was, the datarati would just ban it. Which, I guess, some people are inclined to do – but, sorry, pie charts still exist. And they’re still at least “locally-good” in some contexts – like this one (source: everywhere on the internet):

[Image: pie chart]

But, hey, you don’t have the time to run multiple experiments on multiple audiences. Let’s imagine you also are quite new to the game, with very little personal experience. How would you know which viz type to pick? Well, this is going to be a pretty boring answer, sorry – and there’s more to elaborate on later – but one way relates to the fact that, just like in any other field, there are actually “experts” in data viz. And outside of Michael Gove’s deluded rants, we should acknowledge they usually have some value.

In 1928, Bertrand Russell wrote an essay called ‘On the Value of Scepticism’, where he laid out 3 guidelines for life in general.

 (1) that when the experts are agreed, the opposite opinion cannot be held to be certain;

(2) that when they are not agreed, no opinion can be regarded as certain by a non-expert;

and (3) that when they all hold that no sufficient grounds for a positive opinion exist, the ordinary man would do well to suspend his judgment.

So, we can bastardise these a bit to give them a dataviz context. If you’re really unsure of what viz to pick, then refer to some set of experts (to which we must acknowledge there’s subjectivity in picking…perhaps more on this in future).

If “experts” mostly think that data of type D used to convey an insight of type I to an audience of type A for purpose P is best represented in a line chart, then that’s probably the way to go if you don’t have substantial reason to believe otherwise. Russell would say that at least you can’t be held as being “certainly wrong” in your decision, even if your boss complains. Likewise, if there’s honestly no concurrence in opinion, then have a go and take your pick of the suggestions – again, no-one should tell you off for doing something unquestionably wrong!

For example, my bias is towards feeling that, when communicating “standard” insights efficiently via charts to a literate but non-expert audience, you can’t go too far wrong in reading some of Stephen Few’s books. Harsh and austere they may seem at times, but I believe them to be based on quality research in fields such as human perception as well as experience in the field.

But that’s not to say that his well founded, well presented guidelines are always right. Just because 90% of the time you might be most successful in representing a certain type of time series as a line chart doesn’t mean that you always will be. Remember also, you may have a totally different aim to the audience at whom Mr Few aims his books, in which case you cannot assume at all that the same best-practice standards would apply.

And, despite the above guidelines, because (amongst other reasons) not all possible information is ever available to us at any given time, sometimes experts are simply wrong. It turns out that the earth probably isn’t the centre of the universe, despite what you’d probably hear if you went back to experts from a millennium ago. You should just take care to find some decent reason to doubt the prevailing expertise, rather than simply ignoring it.

What we deem as the relative “goodness” of data viz techniques is also surely not static over time. For one, not all forms of data visualisation have existed since the dawn of mankind.

The aforementioned box and whisker plot is held to have been invented by John Tukey. He was only born in 1915, so if I were to travel back 200 years in time with my perfectly presented plot, then it’s unlikely I’d find many people who find it intuitive to interpret. Hence, if my aim were to communicate insights quickly and clearly, then on the balance of probabilities this would probably be a bad attempt. It may not be the worst attempt, as the concept is still valid and hence could likely be explained to some inhabitants of the time – but in terms of bang for buck, there’d no doubt be higher peaks in the “communicating data insights quickly” landscape available to me nearby.

We should also remember that time hasn’t stopped. Contrary to Francis Fukuyama’s famous essay and book, we probably haven’t reached the end of history even politically just yet, and we most certainly haven’t done so in the world of data. Given the rate of usable data creation, it might be that we’ve only dipped our toe in so far. So, what we think is best practice today may likely not be the same a hundred years hence; some of it may not be so even next year.

Some, but not all, obstacles or opportunities surround technology. Already the world has moved very quickly from graph paper, to desktop PCs, to people carrying around super-computers that only have small screens in their pockets. The most effective, most efficient, ways to communicate data insights will differ in each case. As an example I’m very familiar with, Tableau clearly acknowledged this in its latest release, which includes facilities for displaying data differently depending on what device it’s being viewed on. Not that we need to throw the baby out with the bathwater, but even our hero Mr Tukey may not have had the iPhone 7 in mind when considering optimum data presentation.

Smartwatches have also appeared, although they are not so mainstream at the moment. How do you communicate data stories when you have literally an inch of screen to play with? Is it possible? Almost certainly so, but probably not in the same way as on a 32 inch screen; and are the personal characteristics and needs of smartwatch users anyway the same as those of the audience who view vizzes on a larger screen?

And what if Amazon (Echo), Google (Home) and others are right to think that in the future a substantial amount of our information based interactions may be done verbally, to a box that sits on the kitchen counter and doesn’t even have a screen? What does “data visualisation” mean in this context? Is it even a thing? But a lot of the questions I might want to ask my future good friend Alexa might well be questions that can only be answered by some transformation and re-presentation in audio form of data.

I already can verbally ask my phone to provide me some forms of dataviz. In the below example, it shows me a chart and a summary table. It also provides me a very brief audio summary for the occasions where I can’t view the screen, shown in the bold text above the chart. But, I can’t say I’ve heard of a huge amount of discussion about how to optimise the audio part of the “viz” for insight. Perhaps there should be.

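[Image: screenshot of a phone answering a spoken query with a chart, a summary table and a brief text summary]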

Technology aside though, the field should not rest on its laurels; the line chart may or may not ever die, but experimentation and new ideas should always be welcomed. I’d argue that we may be able to prove in many cases that, today, for a given audience, for a given aim, with a given dataset, out of the various visualisations we most commonly have access to, one is demonstrably better than another, and that we can back that up via the scientific method.

But what if there’s an even better one out there we never even thought of? What if there is some form of time series that is best visualised in a pie chart? OK, it may seem pretty unlikely but, as per other fields of scientific endeavour, we shouldn’t stop people testing their hypotheses – as long as they remain ethical – or the march of progress may be severely hampered.

Plus, we might all be out of a job. If we fall into the trap of thinking the best of our knowledge today is the best of all knowledge that will ever be available, that the haphazard messy inefficiencies of creativity are a distraction from the proven-efficient execution of the task at hand, then it’ll not be too long before a lot of the typical role of a basic data analyst is swallowed up in the impending march of our robotic overlords.

Remember, a key job of a lot of data-people is really to answer important questions, not to draw charts. You do the second in order to facilitate the first, but your personal approach to insight generation is often in actuality a means to another end.

Your customer wants to know “in what month were my sales highest?”. And, lo and behold, when I open a spreadsheet in the sort of technology that many people treat as the norm these days, Google Sheets, I find that I can simply type or speak the question “What month were my sales highest?” and it tells me very clearly, for free, immediately, without employing anyone to do anything or waiting for someone to get back from their holiday.

[Image: screenshot of Google Sheets answering the question “What month were my sales highest?”]

Yes, that feature only copes with pretty simplistic analysis at the moment, and you have to be careful how you phrase your questions – but the results are only going to get better over time, and spread into more and more products. Microsoft Power BI already has a basic natural language feature, and Tableau is at a minimum researching it. Just wait until this is all hooked up to the various technological “cognitive services” which are already on offer in some form or other. A reliable, auto-generated answer to “what will my sales be next week if I launch a new product category today?” may free up a few more people to spend time with their family, euphemistically or otherwise.

So in the name of progress, we can and should, per Chris’ original tweet, be open to giving and receiving constructive criticism, whether positive or negative. There is value in this, even in the unlikely event that we have already hit on the single best, universal, way of representing a particular dataset for all time.

Recall John Stuart Mill’s famous essay, “On Liberty” (published in 1859, yes, even before the boxplot existed). It’s so very quotable for many parts of life, but let’s take for example a paragraph from chapter two, regarding the “liberty of thought and discussion”. Why shouldn’t we ban opinions, even when we believe we know them to be bad opinions?

But the peculiar evil of silencing the expression of an opinion is, that it is robbing the human race; posterity as well as the existing generation; those who dissent from the opinion, still more than those who hold it.

If the opinion is right, they are deprived of the opportunity of exchanging error for truth: if wrong, they lose, what is almost as great a benefit, the clearer perception and livelier impression of truth, produced by its collision with error.

Are pie charts good for a specific combination of time series data, audience and aim?

Well – assuming a particularly charitable view of human discourse – after rational discussion we will either establish that yes, they actually are, in which case the naysayers can “exchange error for truth” to the benefit of our entire field.

Or, if the consensus view of “no way” holds strong, then, having been tested, we will have reinforced the reason why this is in both the minds of the questioner, and ourselves – hence helping us remember the good reasons why we hold our opinions, and ensuring we never lapse into the depths of pseudo-religious dogma.


Remember the exciting new features Tableau demoed at #data15 – have we got them yet?

As we get closer to the thrills of this year’s Tableau Conference (#data16), I wanted to look back at one of the most fun parts of last year’s conference – the “devs on stage” section. That’s the part where Tableau employees announce and demonstrate some of the new features that they’re working on. No guarantees are made as to whether they’ll ever see the light of day, let alone be in the next release – but, in reality, the audience gets excited enough that there’d probably be a riot if none of them ever turned up.

Having made some notes of what was shown in last year’s conference (which was imaginatively entitled #data15), I decided to review the list and see how many of those features have turned up so far. After all, it’s all very well to announce fun new stuff to a crowd of 10,000 over-excited analysts…but does Tableau tend to follow through on it? Let’s check!

(Please bear in mind that these are just the features I found significant enough to scrawl down through the jet-lag; it’s not necessarily a comprehensive review of what was on show.)

Improvements in the Data category:

Feature | Does it exist yet?
Improvements to the automatic data cleanup feature recently released that can import Excel type files that are formatted in an otherwise painful way for analysis | Yes – Tableau 9.2 brought features like “sub-table detection” to its data interpreter feature
Can now understand hundreds of different date formats | Hmm…I’m not sure. I’ve not had any problems with dates, but then again I was lucky enough never to have many!
The Data Source screen will now allow Tableau to natively “union” data (as in SQL UNION), as well as join it, just by clicking and dragging. | Yes – Tableau 9.3 allows drag and drop unioning. But only on Excel and text files. Here’s hoping they expand the scope of that to databases in the future.
Cross-database joins | Yes, cross-database joins are in Tableau 10.

Improvements in the Visualisation category:

Feature | Does it exist yet?
Enhancements to the text table visualisation | Yes – Tableau 9.2 brought the ability to show totals at the top of columns, and 9.3 allowed excluding totals from colour-coding.
Data highlighter | Yes – Tableau 10 includes the highlighter feature.
New native geospatial geographies | Yes – 9.2 and 9.3 both added or updated some geographies.
A connector to allow connection to spatial data files | No – I don’t think I’ve seen this one anywhere.
Custom geographic territory creation | Yes – Tableau 10 has a couple of methods to let you do that.
Integration with Mapbox | Yes – Tableau 9.2 lets you use Mapbox maps.
Tooltips can now contain worksheets themselves. | No – not seen this yet.

Improvements in the Analysis category:

Feature | Does it exist yet?
Automatic outlier detection | No
Automatic cluster detection | Yes, that’s a new Tableau 10 feature
You can “use” reference lines / bands now for things beyond just static display | Hmm…I don’t recall seeing any changes in this area. No?

Improvements in the Self-Service category:

Feature | Does it exist yet?
There will be a custom server homepage for each user | Not sure – the look and feel of the home page has changed, and the user can mark favourites etc. but I have not noticed huge changes in customisation from previous versions.
There will be analytics on the workbooks themselves | Yes – Tableau 9.3 brought content analytics to workbooks on server. Some metadata is shown in the content lists directly, plus you can sort by view count.
Searching will become better | Yes – also came with Tableau 9.3. Search shows you the most popular results first, with indicators as to usage.
Version control | Yes – Tableau 9.3 brought workbook revision history for server, and Tableau 10 enhanced it.
Improvements to security UI | Yes – not 100% sure which version, but the security UI changed. New features were also added, such as setting and locking project permissions in 9.2.
A web interface for managing the Tableau server | Not sure about this one, but I don’t recall seeing it anywhere. I’d venture “no”, but am open to correction!

Improvements in the Dashboarding category:

Feature | Does it exist yet?
Improvements to web editing | Yes – most versions of Tableau since then have brought improvements here. In Tableau 10 you can create complete dashboards from scratch via the web.
Global formatting | Yes, this came in Tableau 10.
Cross datasource filtering | Yes, this super-popular feature also came with Tableau 10.
Device preview | Yes, this is available in Tableau 10.
Device specific dashboards. | Yes, also from Tableau 10.

Improvements in the Mobile category:

Feature | Does it exist yet?
A Tableau iPhone app | Yes – download it here. An Android app was also released recently.
iPad app – Vizable | Was actually launched at #data15, so yes, it’s here.

Summary

Hey, a decent result! Most of the features demonstrated last year are already in the latest official release.

And for some of those that aren’t, such as outlier detection, it feels like a framework has been put in place for the possible later integration of them. In that particular case, you can imagine it being located in the same place, and working in the same way, as the already-released clustering function.

There are perhaps a couple that it’s slightly sad to see haven’t made it just yet – I’m mainly thinking of embedded vizzes in tooltips here. From the celebratory cheers, that was pretty popular with the assembled crowds when demoed in 2015, so it’ll be interesting to see whether any mention of development on that front is noted in this year’s talks.

There are also some features released that I’d like to see grow in scope – the union feature would be the obvious one for me. I’d love to see the ability to easily union database tables beyond Excel/text sources. And now we have cross-database joins, perhaps even unioning between different technology stacks.

Bonus points due: In my 2015 notes, I had mentioned that a feature I had heard a lot of colleague-interest in, that was not mentioned at all in the keynote, was data driven alerting; the ability to be notified only if your KPI goes wild for instance. Sales managers might get bored of checking their dashboards each day just to see if sales were down when 95% of the time everything is fine, so why not just send them an email when that event actually occurs?

Well, the exciting news on that front is that some steps towards that have been announced for Tableau 10.1, which is in beta now so will surely be released quite soon.

Described as “conditional subscriptions”, the feature will allow you to “receive email updates when data is present in your viz”. That’s perhaps a slight abstraction from the most obvious form of data-driven alerting. But it’s easy to see that, with a bit of thought, analysts will be able to build vizzes that give exactly the sort of alerting functionality my colleagues, and many many others in the wider world, have been asking for. Thanks for that, developer heroes!

 

Clustering categorical data with R

Clustering is one of the most common unsupervised machine learning tasks. In Wikipedia’s current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some ability to cluster in them. For example, Alteryx has K-Centroids Analysis. R, Python, SPSS, Statistica and any other proper data sciencey tools all likely have many methods – and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility. You can do the calculations in Excel, should you really want to (although why not cheat and use a nice addin if you want to save time?).

However, many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better for clustering objects that have quantitative numeric fields, rather than those that are categorical. I’m not going to delve into the details of why here, but, simplistically, they tend to be based on concepts like Euclidean distance – and in that domain, it’s conceptually difficult to say that [bird] is Euclideanly “closer” to [fish] than [animal]; vs the much more straightforward task of knowing that an income of £100k is nearer to one of £90k than it is to 50p. IBM has a bit more about that here.
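To make that a little more concrete, here is a small base R sketch. The records and field names are hypothetical, invented purely for illustration. It shows what happens if you naively turn category labels into numbers and take Euclidean distances, next to a simple count of mismatching fields.

```r
# Three hypothetical records, each described by two categorical fields.
records <- data.frame(
  kind    = c("bird", "fish", "animal"),
  habitat = c("air", "water", "land"),
  stringsAsFactors = TRUE
)
rownames(records) <- records$kind

# Naive approach: swap each label for its integer factor code
# (assigned alphabetically) and measure Euclidean distance.
codes <- data.matrix(records)
euclid <- dist(codes)

# Simple matching: count the fields in which two records differ.
labels <- as.matrix(records)
simple.matching <- function(a, b) sum(labels[a, ] != labels[b, ])

print(euclid)
print(simple.matching("bird", "fish"))
print(simple.matching("bird", "animal"))
```

The Euclidean version rates bird as closer to animal than to fish, but only because of where the labels happen to fall alphabetically. The mismatch count treats every pair as equally different, which is at least honest about what we know.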

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has a ton of libraries that might help you out. Below are a couple I’ve used, and a few notes as to the very basics of how to use them – not that it’s too difficult once you’ve found them. The art of selecting the optimum parameters for the very finest of clusters though is still yours to master, just like it is on most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see 🙂 ? A paper called ‘Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values’ by Huang gives the gory details.

Luckily though, an R implementation is available within the klaR package. The klaR documentation is available in PDF format here and is certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID FieldA FieldB FieldC FieldD
1 0 0 0 1
2 0 0 0 0
3 0 0 0 1
4 1 1 0 0

Here’s how you might read it in, and cluster the records based on the contents of fields “FieldA”, “FieldB”, “FieldC”, and “FieldD”.

# Install once; after that only the library() call is needed
install.packages("klaR")
library(klaR)
setwd("C:/Users/Adam/CatCluster/kmodes")
data.to.cluster <- read.csv("dataset.csv", header = TRUE, sep = ",")
# Cluster on columns 2 to 5 only: don't use the record ID as a clustering variable!
cluster.results <- kmodes(data.to.cluster[, 2:5], 3, iter.max = 10, weighted = FALSE)

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want so you have a few variations to compare the quality or real-world utility of.
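If you want to try a few values for the number of clusters in one go, something like the following sketch might do. The toy data is made up to stand in for FieldA to FieldD, it assumes klaR is already installed, and using the summed withindiff as a rough yardstick is my own shortcut rather than anything the package recommends.

```r
library(klaR)  # assumes the package is already installed

# Made-up binary data: three underlying profiles plus one noisy field.
set.seed(42)
toy <- data.frame(
  FieldA = rep(c(0, 1, 1), each = 20),
  FieldB = rep(c(0, 0, 1), each = 20),
  FieldC = rep(c(0, 1, 0), each = 20),
  FieldD = rbinom(60, 1, 0.5)
)

# Fit k-modes for 2, 3 and 4 clusters and total up the
# within-cluster simple-matching distances for each fit.
cluster.counts <- 2:4
total.withindiff <- sapply(cluster.counts, function(k) {
  fit <- kmodes(toy, k, iter.max = 10, weighted = FALSE)
  sum(fit$withindiff)
})
names(total.withindiff) <- cluster.counts

print(total.withindiff)
```

Lower totals mean tighter clusters, but the total will tend to fall as you add clusters, so it is a guide for comparison rather than an answer. The initial modes are chosen at random, so results can vary between runs unless you fix the seed.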

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted version of this distance.
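For a feel of what “modes” and “simple-matching distance” mean in practice, here is a tiny base R sketch of the two core steps. The modes, record and cluster members are hypothetical values chosen for illustration, and this is only my reading of the general idea, not klaR’s actual internals.

```r
# Hypothetical cluster modes: one row per cluster, one column per field.
modes <- rbind(c(1, 0, 0, 0),
               c(1, 0, 1, 1),
               c(0, 0, 0, 0))
colnames(modes) <- c("FieldA", "FieldB", "FieldC", "FieldD")

# A hypothetical record to allocate to a cluster.
record <- c(FieldA = 0, FieldB = 0, FieldC = 0, FieldD = 1)

# Simple-matching distance: the number of fields where record and mode differ.
distances <- apply(modes, 1, function(m) sum(m != record))
nearest <- which.min(distances)

# Updating a mode: take the most common value of each field among the members.
members <- rbind(c(0, 0, 0, 1),
                 c(0, 0, 0, 0),
                 c(0, 0, 0, 1))
new.mode <- apply(members, 2, function(col) as.numeric(names(which.max(table(col)))))

print(distances)
print(nearest)
print(new.mode)
```

The record differs from the third mode in only one field, so that is where it would land. Repeating the allocate-then-update cycle until nothing changes (or iter.max is reached) is the gist of the algorithm.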

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster

Here’s an example of what it looks like when output to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
  FieldA FieldB FieldC FieldD
1      1      0      0      0
2      1      0      1      1
3      0      0      0      0

Clustering vector:
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 3  3  3  1  3  1  2  3  3  3  2  2  2  3  3  2  1  3  3  3 

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters to the original dataset, you can just add them as a new column, and perhaps write the result out as a file to analyse elsewhere, like this:

# Append the cluster assignments as a new, clearly named column
cluster.output <- cbind(data.to.cluster, cluster = cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)
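Before shipping the file off elsewhere, it can be worth a quick look at what each cluster actually contains. The sketch below uses a small made-up data frame standing in for cluster.output, and assumes the appended column is called cluster; adjust the names to match whatever you ended up with.

```r
# Hypothetical stand-in for cluster.output: the four fields plus a cluster column.
cluster.output <- data.frame(
  RecordID = 1:6,
  FieldA   = c(0, 0, 0, 1, 1, 1),
  FieldB   = c(0, 0, 0, 1, 1, 0),
  FieldC   = c(0, 0, 0, 0, 0, 1),
  FieldD   = c(1, 0, 1, 0, 0, 1),
  cluster  = c(3, 3, 3, 1, 1, 2)
)

# How many records ended up in each cluster?
cluster.sizes <- table(cluster.output$cluster)

# For 0/1 fields, the mean is the share of records in the cluster with a 1.
field.share <- aggregate(cbind(FieldA, FieldB, FieldC, FieldD) ~ cluster,
                         data = cluster.output, FUN = mean)

print(cluster.sizes)
print(field.share)
```

A profile like this is often the quickest way to judge whether the clusters mean anything in the real world.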

 

The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again, a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters based on how the algorithms differ.

Here’s what you’d do to cluster the same data as above, and write it back out, this time with the Rock clusters appended. Note here that ideally you’re specifically passing in a matrix to the rockCluster function.

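# Install once; after that only the library() call is needed
install.packages("cba")
library(cba)
setwd("C:/Users/Adam/CatCluster/rock")
data.to.cluster <- read.csv("dataset.csv", header = TRUE, sep = ",")
# rockCluster wants a matrix, so convert the clustering columns explicitly
cluster.results <- rockCluster(as.matrix(data.to.cluster[, 2:5]), 3)
# cl is a factor holding the cluster assigned to each row
cluster.output <- cbind(data.to.cluster, cluster = cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)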

The full list of parameters to the relevant function, rockCluster is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.
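To get an intuition for what theta and the “binary” distance are doing, here is a base R sketch of the neighbour-and-link idea as I understand it from the ROCK paper. The four records are hypothetical, and details such as whether a record counts as its own neighbour are my assumptions; the cba implementation may well differ.

```r
# Four hypothetical binary records.
x <- rbind(r1 = c(1, 1, 0, 0),
           r2 = c(1, 1, 1, 0),
           r3 = c(1, 0, 1, 0),
           r4 = c(0, 0, 0, 1))

theta <- 0.5
beta  <- 1 - theta  # the default distance threshold in the signature above

# The "binary" method of dist() is one minus the Jaccard similarity.
d <- as.matrix(dist(x, method = "binary"))

# Two records are neighbours if they are within the distance threshold.
adj <- (d <= beta) * 1
diag(adj) <- 0  # assumption: a record is not counted as its own neighbour

# The link count between two records is their number of common neighbours.
links <- adj %*% adj

print(round(d, 2))
print(links)
```

Here r1 and r3 are not close enough to be neighbours, but both neighbour r2, so they share one link. Clustering on links rather than raw distances is the trick that distinguishes ROCK from the more usual approaches.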

This is the output, which is of class “rock”, when printed to the screen:

 data: x 
 beta: 0.5 
theta: 0.5 
  fun: dist 
 args: list(method = "binary") 
 1  2  3 
14  5  1 

The object is a list, and its most useful component is probably “cl”, which is a factor containing the assignments of clusters to your data.

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re free to visualise in R, or any other tool.