The Datasaurus: a monstrous Anscombe for the 21st century

Most people trained in the ways of data visualisation will be very familiar with Anscombe’s Quartet. For the uninitiated, it’s a set of 4 fairly simple looking X-Y scatterplots that look like this.

Anscombe's Quartet

What’s so great about those then? Well, the reason data vizzers get excited starts to become clear when you realise that the dotted grey lines I have superimposed on each small chart are in fact the mean average of X and Y in each case. And they’re basically the same for each chart.

The identikit summary stats go beyond mere averages. In fact, the variance of both X and Y (and hence the standard deviation) is also pretty much the same in every chart. As is the correlation coefficient of X and Y, and the regression line that would be the line of best fit if you were to generate a linear model based on each of those 4 datasets.
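
If you want to check this for yourself, R happens to ship with the quartet as the built-in anscombe data frame, so verifying the summary statistics takes only a few lines:

# Anscombe's quartet is built into R as the `anscombe` data frame,
# with columns x1-x4 and y1-y4
sapply(anscombe, mean)    # the x means are all 9, the y means all ~7.5
sapply(anscombe, var)     # the x variances are all 11, the y variances ~4.1

# correlation and fitted regression line for each of the four pairs
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  print(cor(x, y))          # ~0.816 every time
  print(coef(lm(y ~ x)))    # intercept ~3, slope ~0.5 every time
}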

The point is to show the true power of data visualisation. There are a bunch of clever-sounding summary stats (r-squared is a good one) that some nefarious statisticians might like to baffle the unaware with – but they are often so heavily summarised that they can leave you with an entirely misleading impression of the data, especially if you are not also an adept statistician.

For example, if someone tells you that their fancy predictive model demonstrates that the relationship between x and y can be expressed as “y = 3 + 0.5x”, then you have no way of knowing whether the dataset the model was trained on was that from Anscombe 1, for which it may well be a good model, or Anscombe 2, for which it is not, or Anscombe 3 and 4, where the outliers make that model sub-par in reality – to the point where a school child issued with a sheet of graph paper could probably make a better one.

Yes, analytics end-users, demand pictures! OK, there are so many possible summary stats out there that someone expert in collating and mentally visualising a hand-picked collection of 30 decimal numbers could perhaps form a decent idea of the distribution of a given set of data – but, unless that’s a skill you already have (clue: if the word “kurtosis” isn’t intuitive to you, you don’t, and it’s nothing to be ashamed of), why spend years learning to mentally visualise such things when you could just go ahead and actually visualise them?

But anyway, the quartet was originally created by Francis Anscombe in 1973. Now, a few decades later, it’s time for an even more exciting scatterplot collection, courtesy of Justin Matejka and George Fitzmaurice, taken from their paper “Same Stats, Different Graphs”.

They’ve taken the time to create the Datasaurus Dozen. Here they are:

Datasaurus Dozen.png

What what? A star dataset has the same summary statistics as a bunch of lines, an X, a circle or a bunch of other patterns that look a bit like a migraine is coming on?

Yes indeed. Again, these 12 charts all have the same (well, extremely similar) X & Y means, the same X & Y standard deviations and variances, and also the same X & Y linear correlations.

12 charts are obviously more dramatic than 4, and the Datasaurus dozen certainly has a bunch of prettier shapes, but why did they call it Datasaurus? Purely click-bait? Actually no (well, maybe, but there is a valid reason as well!).

Because the 13th of the dozen (a baker’s dozen?) is the chart illustrated below. Please note that if you found Jurassic Park to be unbearably terrifying you should probably close your eyes immediately.

Datasaurus Main

Raa! And yes, this fearsome vision from the distant past also has an X mean of 54.26, a Y mean of 47.83, an X standard deviation of 16.76, a Y standard deviation of 26.93 and a correlation coefficient of -0.06, just like his twelve siblings above.

If it’s hard to believe, or you just want to play a bit, then the individual datapoints that I put into Tableau to generate the above set of charts are available in this Google sheet – or a basic interactive viz version can be found on Tableau Public here.
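
If you’d rather check the numbers in R than in Tableau, a minimal sketch along the following lines should do it. This assumes the datasauRus CRAN package, which bundles the same points as a data frame called datasaurus_dozen:

# a sketch assuming the datasauRus package; install.packages("datasauRus") if needed
library(datasauRus)

# means and standard deviations of x and y for each of the thirteen shapes
aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = mean)
aggregate(cbind(x, y) ~ dataset, data = datasaurus_dozen, FUN = sd)

# correlation of x and y within each shape
by(datasaurus_dozen, datasaurus_dozen$dataset, function(d) cor(d$x, d$y))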

Full credit is due to Wikipedia for the Anscombe dataset and Autodesk Research for the Datasaurus data.

data.world: the place to go for your open data needs?

Somewhere in my outrageously long list of data-related links to check out I found “data.world“. Not only is that a nice URL, it also contains a worthy service that I can imagine being genuinely useful in future, if it takes off like it should. At first glance, it’s a platform for hosting data – seemingly biased towards the “open” variant of data, although I see they also offer to host private data – with some social bits and pieces overlaid.

What’s interesting about this particular portal to me over and above a bunch of other sites with a mass of open data available on them is:

  1. Anyone can upload any conventional dataset (well, that they are legally allowed to do) – so right now it contains anything from World Bank GDP info through to a list of medieval battles, and much more besides. Therefore it presumably seeks to be a host for all the world’s useful data, rather than that of a certain topic or producer. Caveat user etc. presumably applies, but the vision is nice.
  2. You can actually do things with the data on the site itself. For example, you can join one set of data to another hosted on the site, even if it’s from a totally different project from a totally different author, directly on the site. You can run queries or see simple visualisations.
  3. It’s very easy to get the data out, and hence use it in other tools should you want to do more complicated stuff later on.
  4. It’ll also host data documentation and sample queries (for example, SQL that you can run live) to provide contextual information and shortcuts for analysts who need to use data that they might not be intimately familiar with.
  5. There’s a social side. It also allows chat conversations between authors, users and collaborators. You can see what’s already been asked or answered about each dataset. You can “follow” individual people, or curated collections of subject-based datasets.

So let’s test a few features out with a simple example.

The overriding concept is that of a dataset. A dataset is more than a table; it can include many tables and a bunch of other sorts of non-data files that aid with the use of the data itself – for instance documentation, notebooks or images. Each user can create datasets, name and describe them appropriately, and decide whether they should be public or private.

Here’s one I prepared earlier (containing a month of my Fitbit step count data, as it happens).

Capture

You can make your dataset open or private, attach a license to be explicit about its re-use, and add tags to aid discovery. You can even add data via a URL, and later refresh it if the contents of that URL change.

As you can see, after import it shows a preview of the data it read in at the bottom of the screen.  If there were multiple files, you’d be able to filter or sort them to find the one you want.

If you hit the little “i” icon next to any field name, you get a quick summary visualisation and data description, dependent on data type. This is very useful to get a quick overview of what your field contains, and if it was read in correctly. In my view, this sort of thing should be a standard feature in most analytical tools (it already is in some).

Capture

I believe tags, field names and descriptions are searchable – so if you do a nice job with those then it’ll help people find what you’re sharing.

Once you’ve uploaded or discovered a data table of interest, there are several common actions available.

You can, for instance, “explore” the data. This expands the data table to take up most of the screen, enabling easier sorting, filtering and so on. More interestingly, you can open a chart view where you can make basic charts to understand your data in more detail.

Now, this isn’t going to replace your dedicated visualisation tool – it has only the most basic of customisations available at the moment – but it handles simple exploration requirements in a way that is substantially less time consuming than downloading and importing your data into another tool.

It even suggests some charts you might like to make, allowing 1-click creation. On my data, for example, it offered to make me a chart of “Count of records by Date” or “Count of records by Steps”. It seems to take note of the data types, for instance defaulting to a line chart for the count by date, and a histogram for the count by steps.

Here’s the sort of output the 1-click option gives you:

Capture

OK, that’s not a chart you’re going to send to Nature right away, but it does quickly show the range of my data, lets me check for impossible outliers, and gives some quick insight into the distribution. Apparently I commonly do between about 5000 and 7500 steps…and I don’t make the default Fitbit 10k steps target very often. Oops.

These charts can then immediately be downloaded or shared as png or pdf, with automatically generated URLs like https://data.world/api/chart/export/d601307c3e790e5d05aa17773f81bd6446cdd148941b89b243d9b78c866ccc3b.png

Here I would quite like a 1-click feature to save and publish any particularly interesting chart alongside the dataset itself – but I understand why that’s probably not a priority unless the charting aspect becomes more of a dedicated visualisation feature rather than a quick-explore mechanic.

For now, you could always export the graphic and include it as, for example, an image file in the dataset. Here for example is  a dataset where the author has taken the time to provide a great description with some findings and related charts to the set of tables they uploaded.

One type of artefact you can save online with the dataset is queries. Yes, you can query your files live on the site, with either (a variant of) SQL or SPARQL. Most people are probably more familiar with SQL, so let’s start there.

Starting a new query will give you a basic SELECT * LIMIT query, but you’re free to use many (but not all) standard SQL features to change up your dataset into a view that’s useful to you.

Let’s see, did I ever make my 10k step goal in December? If so, on which days?

Capture

Apparently I did, on a whopping four days, the dates of which are outlined above. I guess I had a busy Christmas Eve.

These results then behave just like a data table, so they can then be exported, linked to or visualised as a chart.
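
If you’d rather pull the exported file into R and ask the same question locally, it only takes a couple of lines – though note that the file name and column names here are assumptions about my particular export:

# hypothetical file and column names for the exported step data
steps <- read.csv("fitbit_steps.csv", stringsAsFactors = FALSE)
steps$Date <- as.Date(steps$Date)

# which December days hit the 10k target?
subset(steps, Steps >= 10000 & format(Date, "%Y-%m") == "2016-12")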

Once you’re happy with your query, if you think it’s useful for the future you can save it, or if it might aid other people, then you can publish it. A published query remains with the dataset, so next time someone comes to look at the dataset, they’ll see a list of the queries saved which they can re-use or adapt for their own needs. No more need for hundreds of people to transform a common dataset in exactly the same way again and again!

Interestingly, you can directly query between different datasets in the same query, irrespective of data table, data set, or author. Specifying the schemas feels a little fiddly at the moment, but it’s perfectly doable once you understand the system (although there’s no doubt room for future UI improvement here).

Imagine for instance that, for no conceivable reason, I was curious as to which celebrities sadly died on the days I met my 10k steps goal. Using a combination of my dataset and the 2016 celebrity deaths list from popculture, I can query like this:

Capture.PNG

…only to learn the sad news that a giant panda called Pan Pan expired during one of my goal-meeting days.

Of course, these query results can be published, shared, saved, explored and so on just like we saw previously.

Now, that’s a silly example, but the idea of not only being able to download open data, but also having subject matter experts combine and publish useful data models as a one-time effort for data consumers to re-use in future, is an attractive one. Together with the ability to upload documentation, images or even analytical notebooks, you can see how this could become an invaluable resource of data and experience – even within a single organisation, let alone as a global repository of open data.

Of course, as with most aggregation or social sites, there’s a network effect: how useful this site ends up being depends on factors such as how many people make active use of it, how much data is uploaded to it and what the quality of the data is.

If one day it grew to the point where it was the default place to look for public data, without becoming a nightmare to find those snippets of data gold amongst its prospectively huge collection, it would potentially be an incredibly useful service.

The “nightmare to find” aspect is not a trivial point – there are already several open data portals (for instance government-based ones) which offer a whole load of nice datasets, but it is often hard to find data with the exact content and granularity you’re after even when you know it exists – and those sites are often quite domain-limited, which in some ways makes the job easier. At data.world there is already a global search (which includes the ability to search specifically on recency, table name or field name if you wish), plus tags and curated collections, which I think shows the site takes the issue seriously.

For analyst confidence, some way of understanding data quality would also be useful. The previews of field types and contents are already useful here. Social features, to try and surface a concept similar to “institutional knowledge”, might also be overlaid. There’s already a basic “like” facility. Of course this can be a challenging issue for any data catalogue that, almost by definition, needs to offer upload access to all.

For browser-haters, it isn’t necessary to use the site directly in order to make use of its contents. There’s already an API which gives you the ability to programmatically upload, query and download data. This opens up some interesting future possibilities. Perhaps, if data.world does indeed become a top place to look for the data of the world, your analytics software of choice might in future include a feature such that you can effectively search a global data catalogue from the comfort of your chart-making screen, with a 1-click import once you’ve found your goal. ETL / metadata tools could provide an easy way to publish the results of your manipulations, and so on.
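
As a rough illustration of what programmatic access looks like from R, here’s a sketch using the httr package. The endpoint path, dataset name and query below are my assumptions for illustration only – check the official API documentation for the real details:

# a hedged sketch: the endpoint, dataset id and query are illustrative, not definitive
library(httr)

token <- Sys.getenv("DATA_WORLD_TOKEN")    # an API token from your account settings

resp <- GET(
  "https://api.data.world/v0/sql/myusername/my-fitbit-data",   # hypothetical dataset
  query = list(query = "SELECT * FROM steps WHERE steps >= 10000"),
  add_headers(Authorization = paste("Bearer", token))
)

results <- content(resp, as = "parsed")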

The site is only in preview mode at present, so it’s not something to stake your life on. But I really like the concept, and the execution so far is way beyond some other efforts I’ve seen in the past. If I create a public dataset I’d like to share, I would certainly feel happy to distribute it and all supporting documents and queries via this platform. So best of luck to data.world in the, let’s say, “ambitious” mission of bringing together the world’s freely available – yet often sadly undiscoverable – data in a way that encourages people to actually make valuable use of it.

Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies any amount of the history of, or the best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These specific transformations of data-into-diagram have stuck with us through the mists of time in order to become examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.

snowmap

(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence caused the authorities to remove the Broad Street pump handle, people could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about perhaps is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause ->  action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had actually already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? He’d actually already found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but it certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: here then is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by people who live on it. In this case, when the land was lower (vs sea level for example) then cholera deaths seemed to increase.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera“. It included this table:

farrtable

The relationship seems quite clear: cholera deaths per 10k persons go up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a linechart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriology Reviews. Note the “observed mortality” line.

farrchart.gif

Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line concerning the calculated death rate according to “miasma theory”, which seems to have an outcome very similar on this metric to the actual cholera death rate. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later replaced with the knowledge of germs, but at the time the miasma theory was a strong contender for explaining the distribution of disease. This was probably helped because some potential actions one might take to reduce “miasma” evidently would overlap with those of dealing with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems  quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we discover later, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed them in a reasonable way. He was testing a hypothesis based on the common sense of the time he was working in, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here though we can think of Karl Popper’s view that scientific knowledge is derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume with certainty that the dominant one is correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I was to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view with regards to the explanation of the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will probably place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, occur in day-to-day business analysis. Pre-existing knowledge changes over time, and differs between groups. Who hasn’t seen (or had the experience of being) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data which was later revealed to have been affected by something entirely different?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to adding value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that overnight 100% of the public stopped buying anything at all from your previously highly successful store, for instance.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Although it’s now closed, back in the deep depths of computer data viz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.
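
For anyone curious what that looks like in practice, fitting a logistic regression is a one-liner in R these days. The sketch below runs on a purely invented data frame – it is not Farr’s data or a reproduction of Bingham et al.’s model:

# hypothetical district-level data: cholera deaths, population at risk,
# elevation in feet, and a crude water-supply indicator
districts <- data.frame(
  deaths     = c(12, 45, 3, 60, 8, 30),
  population = c(10000, 12000, 9000, 15000, 11000, 10000),
  elevation  = c(80, 10, 120, 5, 90, 20),
  water      = c("clean", "dirty", "clean", "dirty", "clean", "dirty")
)

# model the probability of dying from cholera as a function of both variables
model <- glm(cbind(deaths, population - deaths) ~ elevation + water,
             family = binomial, data = districts)
summary(model)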

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomena of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.

handle

As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely important to take action on the findings equally as fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom at the time. At least modern-day analysts can take some solace in the knowledge that even our highest-regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened –  it may well contribute towards great things in the future.

Books I read in 2016

Reading is one of the favoured hobbies in the DabblingWithData household. In 2016 my beloved fiance invited me to participate in the Goodreads Reading Challenge. It’s simple enough – you set a target and then see if you can read that many books.

The challenge does have its detractors; you can see that an obsession with it will perversely incentivise reading “Spot the Dog” over “Lord of the Rings“. But if you participate in good spirits, then you end up building a fun log of your reading which, if nothing else, gives you enough data that you’ll remember at least the titles of what you read in years hence.

I don’t quite recall where the figure came from, but I had my 2016 challenge set at 50 books. Fifty, you might say, that’s nearly one a week! Surely not possible – or so I thought. I note however that my chief competitor, following a successful year, has set this year’s target to 100, so apparently it’s very possible for some people.

Anyway, Goodreads has both a CSV export of the books you log as having read in the competition, and an API. I therefore thought I’d have a little explore of what I managed to read. Who knows, perhaps it’ll help improve my 2017 score!
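
As a starting point, the export is easy enough to poke at directly in R before doing anything fancier. The column names below are what I believe the Goodreads CSV uses, so treat them as assumptions:

# read the Goodreads library export (file and column names assumed)
books <- read.csv("goodreads_library_export.csv", stringsAsFactors = FALSE)
books$Date.Read <- as.Date(books$Date.Read)

read.2016 <- subset(books, Exclusive.Shelf == "read" &
                           format(Date.Read, "%Y") == "2016")

nrow(read.2016)                                # how many books did I finish?
sum(read.2016$Number.of.Pages, na.rm = TRUE)   # total pages ploughed through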

Please click through for slightly more interactive versions of any chart, or follow this link directly. Most data is taken directly from Goodreads, with a little editing by hand.

How much did I read.png

Oh no, I missed my target 😦 Yes, fifty books proved too challenging for me in 2016 – although I got 80% of the way there, which I don’t think is too terrible. My 2017 target remains at fifty.

The cumulative chart shows a nice boost towards the end of August, which was summer holiday time for me. This has led me to conclude the following actionable step: have more holidays.

I was happy to see that I hadn’t subconsciously tried to cheat too much by reading only short books. From the nearly 14k page-equivalents I ploughed through, the single most voluminous book was Anathem. Anathem is a mix of sci-fi and philosophy, full of slightly made-up words just to slow you down further – an actual human:alien glossary is generously included in the back of the book.

The shortest was the Ladybird Book of the Meeting. This was essential reading for work purposes of course, and re-taught me eternal truths such as “Meetings are important because they give everyone a chance to talk about work. Which is easier than doing it”.

Most of my books were in the 200-400 page range – although of course different books make very different use of a “page”.

So what did I read about?

what-did-i-read

Science fiction is #1 by book volume. I have an affinity for most things that have been deemed geeky through history (and perhaps you do too, if you got this far in!), so this isn’t all that surprising.

Philosophy at #2 is a relatively new habit, at least as a concerted effort. I felt that I’d got into the habit of concentrating too much on data (heresy I know), technology and related subjects in previous years’ reading – so thought I’d broaden my horizons a bit by looking into, well, what Google tells me is merely the study of “the fundamental nature of knowledge, reality, and existence”. It’s very interesting, I promise. Although it can be pretty slow to read, as every other sentence you risk ending up staring at the ceiling wondering whether the universe exists, and other such critical issues. Joking aside, the study of epistemology, reality and so on might not be a bad idea for analysty types.

Lower down we’ve got the cheap thriller and detective novels that are somewhat more relaxing, not requiring either a glossary or a headache tablet.

I was a little surprised at what a low proportion of my books were read in eBook format. For most – not all – books, I think eReaders give a much superior reading experience to ye olde paper. This I’m aware is a controversial minority  opinion but I’ll stick to it and point you towards a recent rant on the Hello Internet podcast to explain why.

 

So I’d have guessed an 80-90% eBook rate – but a fair number of paper books actually slipped in. Typically I suspect these are ones I borrowed, or ones that aren’t available in eBook format. Some of Asimov’s books, of which I read a few this year, for instance, are usually not available on Kindle.

On which subject, authors. Most included authors only fed my book habit once last year, although the afore-mentioned Asimov got his hooks into me. This was somewhat aided by the discovery of a cluster of his less well-known books fortuitously being available for 50p each at a charity sale. But if any readers are interested in predictive analytics and haven’t read the Foundation Trilogy, I’d fully recommend even a full price copy for an insight into what the world might have to cope with if your confusion matrix ever showed perfection in all domains.

Sam Harris was the second most read. That fits in with the philosophy theme. He’s also one of the rare people who can at times express opinions that intuitively I do not agree with at all, but does it in a way such that the train of thought that led him to his conclusions is apparent and often quite reasonable. He is, I’m aware, a controversial character on most sides of any political spectrum for one reason or another.

Back to format – I started dabbling with audio books, although at first did not get on so well with them; there’s a certain amount of concentration needed which comes easier to me when visual-reading than audio-reading. But I’m trying again this year, and it’s going better – practice makes perfect?

The “eBook /Audio” category refers to a couple of lecture series from the Great Courses  which give you  a set of half hour lectures to listen to, and an accompanying book to follow along with. These are not free but they cover a much wider range of topics than the average online MOOC seems to (plus you don’t feel bad about not doing assignments – there are none).

Lastly, the GoodReads rating. Do I read books that other people think are great choices? Well, without knowing the background distribution of ratings, and taking into account the number of reviews and from whom, it’s hard to do much except assume a relative ranking when the sample gets large enough.

It does look like my books are on the positive side of the 5-point scale, although definitely not amongst Goodreads’ most popular. Right now, that list starts with The Hunger Games, which I have read and enjoyed, but not in 2016. Looking down the global popularity list, I do see quite a few I’ve had a go at in the past, but, at first sight, almost none that I regret passing over in favour of my actual choices this year!

For the really interested readers out there, you can see the full list of my books and links to the relevant Goodreads pages on the last tab of the viz.

5 Power BI features that might make Tableau users a little jealous

New year, new blog post, new tool version to play with! It’s clear that the field of data-related stuff progresses extremely rapidly at present, and hence it behoves those of us of an analyst bent to, now and then, go explore tools that we don’t use day-to-day. We may already have our favourites in each category, but, unless we’ve done a recent review, it’s quite possible the lesser-loved packages have developed a whole new bunch of goodies since the last checkup.

With that in mind, I’ve taken a look at the latest version of Microsoft Power BI. It’s billed in this manner by its creators:

Power BI transforms your company’s data into rich visuals for you to collect and organize so you can focus on what matters to you.

It’s therefore an obvious competitor for software like Tableau, Qlikview, chart.io, and many others, and largely can replace Microsoft’s previous PowerView offering, which was accessed directly via Excel. In a similar way to the Tableau suite, there’s a Power BI desktop package that analysts install locally on their computer primarily to manipulate data and construct visuals, and a web-based Power BI service that allows for publication and distribution of the resulting file. Actually the online service is pretty powerful in terms of allowing you to create reports and dashboards via the web, and includes a few other nifty features designed to improve the usability of this software genre – so even some analysts might get a lot out of the web-based version alone.

A lot of Power BI is actually free of charge to use, although there is an enhanced “Pro” edition at around US$10 a month, replete with plenty of more enterprisey features as you can see on their comparison chart. If you’re working somewhere with an Office 365 subscription, you might find you already have access to Power BI, even if you didn’t know about it. So, there’s not much to stop you having a play with it if you’re even remotely interested.

Anyhow, this post is not to review Power BI overall, but rather to point out 5 features that stood out to me as not being present in my current dataviz software of choice, Tableau. These therefore aren’t necessarily the general “5 best features of Power BI” – both Tableau and Power BI can create a pretty line chart, so it’s not really worth pointing that out in this context. My choices should then really be considered from the context of someone already deeply familiar with what Tableau or other competitors already offer.

Also note that software packages aren’t supposed to be feature-identical; many programs aimed at solving the same sort of problems may be completely different in their philosophy of design. Adding some features necessitates a cost in terms of whether other features can be supported. This then is not a request to Tableau and competitors to copy these features. But I do vehemently think it’s useful for day-to-day data practitioners to remain aware of what software features are out there in the wild today, just in case it gives you a better option to solve a particular problem you encounter one day.

As a spoiler: for what it’s worth, my dive into Power BI hasn’t resulted in me throwing my lovely copy of Tableau away, not a chance; you can pry that from my cold dead hands etc. There’s a certain fluidity in Tableau, especially when used for adhoc analysis, that I’ve not yet encountered in its more obvious competitors, which seems very conducive to digging for insights.

But it has led me to believe that the Microsoft offering has improved substantially since the time years ago I used to battle against v1 PowerPivot (which itself was great for some specific data manipulation activities…but eventually I got tired of the out-of-memory errors!). And, especially due to the way it’s licensed – to be blunt, far cheaper than Tableau for some configurations – it’ll remain in my mind when considering tools for future projects.

So, in no particular order, here’s some bits and pieces that piqued my curiosity:

1: Focus mode

Let’s start with a simple one. Dashboards typically contain several charts or tables that are designed to provide insight upon a given topic. Ideally the combination of content that makes up a dashboard should usually fit on a single screen, and an overall impression of “is it good or bad news?” should be available at a glance.

In designing dashboards, especially those that are useful for multiple audiences, there’s often therefore a tension between providing enough visualisations such that every user has the information they need, vs making the screen so cluttered or hard to navigate through that no user enjoys the experience of trying to decipher 1-inch square charts whatsoever.

For cases where a particular chart on a dashboard is of interest to a user, Power BI has a “focus” mode that allows the observer to zoom in and interact with that single chart on a dashboard or report on a near-fullscreen basis, without requiring any extra development work on the part of the analyst.

It’s a simple enough concept – the user just clicks a button on whichever visualisation they’re interested in, and it zooms in to fill up most of the screen until they click out of it. It keeps its original interactivity, plus displays some extra meta-information that might be useful (last refresh time etc.). But the main point is it becomes big enough to potentially help generate deeper insights for a particularly interested end user in a way that a little 1 inch square chart shoved at the bottom of a dashboard might struggle to do, even if the 1 inch version is more appropriate for the average dashboard viewer.

If that description isn’t clear, then it’s probably best seen in video form.

2: Data driven alerts

Regular readers might have established that I’m a big fan of alerting, when it comes to trying to promote data driven decision making. I’m fairly convinced that many dashboards come with a form of “engagement decay”, where the stakeholder is initially obsessively excited with their ability to access data. But as time goes on they get quite bored of checking to see if everything’s OK – especially if everything usually is OK – and hence stop taking the time to consult a potentially valuable source of decision making.

So, for these types of busy execs, and anyone else wanting to optimise productivity, I like alerts. Just have the dashboard send some sort of notification whenever there’s actually something “interesting” to see.

Sure enough, Power BI has the capacity to alert the user upon certain KPI events, via its own web-based notification centre or, more usefully, email or phone app.

powerbialert

The implementation is pretty simple and somewhat restrictive at the moment. Alerts can only be set up on “numeric tiles featuring cards, KPIs, and gauges”, the alert triggers are basic above X or below X type affairs, and you’re restricted to being alerted once an hour or once a day. So there’s a lot of potential room for development – I’d like to see statistical triggers for instance – “alert me if something unusual happens”.

The good news for Tableau users is that Tableau has promised a similar feature will be coming to their software in the future (and to some extent an analyst can create similar functionality even now with the recently added “don’t send email if view is empty” option). But if you want a nice simple “send me an email whenever my sales drop below £10,000” feature that non-analytical folks can easily use, then Power BI can do that right now.

3: Custom visualisations

All mainstream dataviz products should be able to squeeze out the tried-and-tested basic varieties of visuals: line chart, bar chart, scatterplot et al. And >= 90% of the time this is enough – in fact usually the best approach for clarity. But sometimes, for better or worse, that’s not sufficient for certain use-cases. You can see this tension surfacing within the Tableau community where, despite the large number of proven chart types it can handle, there is an even larger number of blogs, reference documents et al. as to what form one has to coerce your data into in order to simulate more esoteric visualisation types within software that has not been natively designed to produce them.

A couple of common examples in recent times would include Sankey charts or hexagonal binning. Yes, you can construct these types of viz in Tableau and other competing products – but it requires a bit of workaroundy pre-work, and entirely interrupts the naturalistic method of exploring data that these tools seek to provide. For example, an average user wishing to construct a Sankey chart in Tableau, may want to search out and thoroughly read one or many of a profusion of useful posts, including those here, here, here, and here and several more places throughout the wilds of the web.

It’s very cool that these resources exist – but imagine if instead of having to rely on researching and recreating clever people’s ingenious workarounds, an expert could just provide a one-click solution to your problem. Or you could share your genius more directly with your peers.

Power BI presents an API where an advanced user can create their own visualisation types. These then integrate within Power BI’s toolbox, as though Microsoft had provided them in the base package. Hence data vizzers of all skill levels can use that type of visual without the need for any programming or mathematical workarounds. It should be noted that the procedure for creating these does require learning a superset of JavaScript called TypeScript, which would certainly not be expected of most Power BI audiences.

But this barrier is alleviated via the existence of a public gallery of these visualisations that Microsoft maintains, which allows generous developers to share their creations world-wide. A Power BI user wouldn’t have to think about the mathematical properties underlying a Sankey plot – they could just download a Sankey chart type add-in such as this one.

sankey.PNG

Now, this open access does introduce some risks of course. Thanks to Spiderman, we all know what great power comes with. And even on the public custom visuals gallery, you’ll see some entries that, well, let’s say Stephen Few might object to.

pyramid

Bonus feature: you can also display native R graphics in your Power BI dashboard, with some limitations.
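
For instance, when you add an R visual, the fields you drag in arrive in your script as a data frame called dataset, so a custom plot can be a few lines of ggplot2. The field names below are assumptions about whatever data you happen to have dragged in:

# inside a Power BI R visual; `dataset` is the data frame Power BI supplies
library(ggplot2)

ggplot(dataset, aes(x = Date, y = Steps)) +     # hypothetical field names
  geom_line() +
  geom_hline(yintercept = 10000, linetype = "dashed") +
  labs(title = "Daily steps vs the 10k target")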

4: “Pin anything to dashboard” for non-analyst end users

To understand this one, you need to know something about the Power BI object types. Simply that a “report” is made out of a “dataset”, and a “dashboard” is usually, but not exclusively, made out of components of reports*. A dataviz expert can publish any combination of those (or even publish a mixed set of them as a content pack, which any interested users can download to use with a few clicks – another potentially nifty idea!).

(* Tableau users – you can then think of a report as a worksheet, but a worksheet that can support multiple vizzes with arbitrary placement.)

Reports are what they sound like: the electronic equivalent of a notebook with between zero and many data visualisations on each page concerning a particular topic. Note though an important limitation: each report is restricted to a single datasource. In Power BI you create reports with the simple drag and drop of charting components and configurations, after selecting the appropriate datasource. Charts stick around, in interactive form, wherever you drag them to, almost as though you were making a PowerPoint slide. No “containers” needed, Tableau fans 🙂

Dashboards however have a more fixed format, always appearing as a set of tiles, each with a different item in. There’s no restriction on data sources, but there are some restrictions on functionality, such as no cross-filtering between independent tiles. A dashboard tile can be any viz from any report, a whole report itself (which can then cross-filter within the scope of the report) or some miscellaneous other stuff including “live” Excel workbooks, static images, and even answers to natural language questions you may have asked in the fancy Q&A functionality (“what were our sales last month?”).

So, what’s this about non-analysts? Well, a difference between Power BI dashboards and those from some other tools is that even people considered as being solely viz consumers can legitimately create their own dashboards. A non-analytical end-user can choose to pin any individual chart from any individual report (or the other types of items listed above) to a new dashboard and hence create a smorgasbord showing exactly the parts of each report / pre-made dashboard they are actually interested in all on one page. After all, the individual viz consumer is by definition best placed to know what’s most important to them.


This is perhaps one approach to solving the problem that often in reality the analyst is designing a dashboard for a multi-person audience, within which each individual has slightly different needs. Each user might be interested in a different 3 of the 5 charts in your dashboard. Here, each user could then choose to pin their favourite 3 to their own start-up page, or any other dashboard they have control over, together with their favourite data table from another report and most loved Excel workbook, if they insist.

How this actually plays out in practice with novice users would be interesting to see. I think a certain type of non-analyst power user would find this pretty useful, and it’s a more realistic concept of “even non-analysts can make dashboards with no training” than a lot of these types of tools foolishly promise.

5: More powerful data manipulation tools

This one is more for advanced users. Power BI lets you manipulate the data (you might even say business-user “ETL”) before you start employing it in your visualisations. Most dashboarding tools likely let you do this to some extent – Tableau recently improved its ability to union data for instance, together with some cleaning features, and it’s had joining and blending for a while. You can also write VizQL formulae to produce calculations at the time of connecting to data.

Power BI’s query editor seems to be more powerful than many, with a couple of particular nice features.

Firstly, it uses a language called ‘M’, which is specifically designed with data mashups in mind. Once you’ve obtained your data with the query editor, you can then go on to use the DAX language (designed for data analysis, and whose CALCULATE() function earned a soft spot in my heart on previous projects) throughout Power BI when working on data you already have access to.

The query editor is fully web-data enabled; even scraping data right off appropriately formatted web pages without any scripting work at all. Here’s the Microsoft team grabbing and applying a few transforms to IMDB data.

One query-editor feature I particularly like somewhat addresses a disadvantage that some of these user-friendly manipulation tools have vs scripting languages like R: reproducibility.

In Power BI, as you go through and apply countless modifications to your incoming dataset, a list of “applied steps” appears to the side of your data pane. Here’s an example from the getting started guide.

appliedsteps

It’s a chronological list of everything you’ve done to manipulate the data, and you also have the ability to go back and delete or edit the steps as you please. No more wondering “how on earth did I get the data into this format?” after an hour of fiddling around transforming data.

There are plenty of built-in options for cleaning up mucky data, including unpivoting, reordering, replacing values and a fill-down type operation that fills down data until it next sees a value in the same column – which handles those annoying Excel sheets where each group of rows only has its name filled in on the top row. Unioning and joining are of course very possible, and you’ll have access to a relationships diagram view, for anyone who fancies having a look at, or modifying, how tables relate to each other.

Analysts are not limited to connecting to existing data either. Non-DBA types can create new tables directly in Power BI and type or paste data directly into them if you wish (although I’d be wary of over-using this feature…be sure to future-proof your work!). You can also upload your standard Excel workbooks directly to the service for web Power BI to access to its underlying data.

If Power BI already has the data tables you want, but they’re just formatted suboptimally or over-granular, then you can use DAX to create calculated tables whereby you use the contents of other imported tables to build your own in-memory virtual table. This might allow you to, for instance, reduce your use of intermediate database temporary tables for some operations, perhaps performing some 1-time aggregation before analysing for instance.

Retrieving Adobe SiteCatalyst data with R

Adobe SiteCatalyst (part of Adobe Analytics) is a nicely comprehensive tool for tracking user interactions upon one’s website, app and more. However, in the past I’ve had a fair amount of trouble de-siloing its potentially immensely useful data into external tools, such that I could connect, link and process it for insights over and above those you can get within the default web tool (which, to be fair, is itself improving over time).

I’ve written in the past about my happiness when Alteryx released a data connector allowing one to access Sitecatalyst data from inside their own tool. That’s still great, but the tool is necessarily constrained to the specific tasks the creator designed it to do, and subject to the same API limits as everyone else is. I have no doubt that there are ways and means to get around that in Alteryx (after all, it speaks R). But sometimes, when trying to automate operations, coding in something like R might actually be easier…or at least cheaper!

With that in mind, after a recent requirement to analyse web browsing data at an individual customer level, I successfully experimented with the awesome RSiteCatalyst package, and have some notes below as to the method which worked well for me.

Note that RSiteCatalyst is naturally subject to the usual Adobe API limits – the main one that causes me sadness being the inability to retrieve over 50k rows at a time – but, due to the incessant flexibility of R and the comprehensiveness of this package, I’ve not encountered a problem I couldn’t solve just yet.

So, how to set up?

Install the RSiteCatalyst package

First, open up R, or your favourite R environment, and install the RSiteCatalyst package (the first time you use it) or load the library (each new session).

if(!require(RSiteCatalyst)) install.packages("RSiteCatalyst")
library(RSiteCatalyst)

Log in to your SiteCatalyst account

You next need to authenticate against your Adobe SiteCatalyst installation, using the SCAuth function. There’s an old way involving a shared secret, and a new way using OAuth. The latter is to be preferred, but at the time I first looked at it there seemed to be a web service issue that prevented the OAuth process from completing. For now, then, I can confirm that the old method still works!

key <- "<username:company>"   # your API user name, in the form username:company
secret <- "<shared secret>"   # the shared secret associated with that user
SCAuth(key, secret)

 

Retrieve metadata about your installation

Once you’re in, there’s a plethora of commands to retrieve useful metadata, and then to run and retrieve various types of report data. For several such commands, you’ll need to know the ID of the Adobe Report Suite concerned, which is fortunately as easy as:

suites <- GetReportSuites()

whereby you’ll receive a dataframe containing all available report suites by title and ID.

If you already know the title of the report suite you’re interested in then you can grab the ID directly with something like:

my.rsid <- suites[suites$site_title=="My favourite website",1]

You can then find out which elements are available within your report suite:

elements.available <- GetElements(my.rsid)

and later on which metrics are available for a given element

metrics.available <- GetMetrics(my.rsid, elements = "<<myFavouriteElement>>")

Retrieve Adobe Analytics report data

There are a few different RSiteCatalyst functions you can call, depending on the type of report you’re interested in. In each case their names start with “Queue”, as what you’re actually doing is submitting a data request to the SiteCatalyst reporting queue.

If your request is pretty quick, you can wait a few seconds and get the results sent back to you immediately in R. If it’s going to take a very long time, you can instead store a report request ID and then use the GetReport function to go back and get it later, once it’s finished.
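As a minimal sketch of that second pattern – using QueueOvertime (described below) as the example, and noting that the exact argument name for “just queue it” behaviour is my assumption from memory (I believe it’s enqueueOnly, but do check the package documentation) – it would look something like this:

report.id <- QueueOvertime(my.rsid,
                           date.from = "2016-01-01",
                           date.to = "2016-06-30",
                           metrics = "pageviews",
                           enqueueOnly = TRUE  # assumed argument: return a report ID instead of waiting for the data
                           )

# ...come back later, once the queue has done its work, and collect the results
dailyPageviews <- GetReport(report.id)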

The current list of queue functions, which are named such that Adobe aficionados will probably be able to guess which type of data they facilitate, is:

  • QueueDataWarehouse
  • QueueFallout
  • QueueOvertime
  • QueuePathing
  • QueueRanked
  • QueueSummary
  • QueueTrended

Here I’ll show just a couple of examples – but the full official documentation for all of them, and much more besides, is available on CRAN.

Firstly, an “overtime” report to get me the daily pageview and visit counts my site received in the first half of 2016.

Here’s how the documentation describes this type of report:

“A QueueOvertime report is a report where the only granularity allowed is time. This report allows for a single report suite, time granularity, multiple metrics, and a single segment”

An example would be a day by day count of total pageviews to your site.

Remember that above we set the variable “my.rsid” to the ID of the report suite we’re looking at. So:

dailyStats <- QueueOvertime(my.rsid,
                            date.from = "2016-01-01",
                            date.to = "2016-06-30",
                            metrics = c("pageviews","visits"),
                            date.granularity = "day"
                            )

The parameters are fairly self-evident, especially in combination with the documentation. There are many more available, including the ability to isolate by segment and to forecast data.
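Purely as an illustration of those extra options – and with the loud caveat that the argument names shown here (segment.id and anomaly.detection) and the segment ID itself are my assumptions rather than anything covered above, so confirm them via ?QueueOvertime before relying on them – a segmented, forecast-augmented version of the previous request might look something like:

dailySegmentStats <- QueueOvertime(my.rsid,
                                   date.from = "2016-01-01",
                                   date.to = "2016-06-30",
                                   metrics = c("pageviews","visits"),
                                   date.granularity = "day",
                                   segment.id = "<<mySegmentID>>",  # assumed: restrict results to a saved segment
                                   anomaly.detection = TRUE         # assumed: also return Adobe's expected/forecast values
                                   )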

Your site will probably have some different metrics available to mine, depending on how the Adobe admin set it up, but the basics like page views and visits should be accessible pretty much anywhere.

What you’ll get back with the above command, i.e. the contents of the dailyStats variable after it has run, is a dataframe in this sort of format:

datetime     name              year  month  day  segment.id  segment.name  pageviews  visits
01/01/2016   Fri. 1 Jan. 2016  2016  1      1                              100        75
02/01/2016   Sat. 2 Jan. 2016  2016  1      2                              200        150
03/01/2016   Sun. 3 Jan. 2016  2016  1      3                              250        180

Which you can then go on to process as you would any other such snippet of data in R.
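For instance – purely as an illustration in base R, and assuming the datetime column arrives as a date type (if it comes back as character, convert it first with as.Date()) – you could quickly chart the trend and total things up:

# a quick look at the daily pageview trend
plot(dailyStats$datetime, dailyStats$pageviews, type = "l",
     xlab = "Date", ylab = "Pageviews")

# total visits over the period requested
sum(dailyStats$visits)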

My second, slightly more complex, example concerns the QueueRanked function. It’s described like this:

A QueueRanked report is a report that shows the ranking of values for one or more elements relative to a metric, aggregated over the time period selected.

That’s a bit more vague, but was useful to me in my original goal of identifying specifically which customers logged into my website in December, and how many times they visited.

A key feature of this function is that you can ask for the results from rank [x] to rank [y], instead of just the top [n].

This is super-useful where, as in my case, you expect to get over 50k rows. 50k is the maximum number of rows you can retrieve in one request via the Adobe API, which this R package uses. But R is full of typical programming language features like loops, allowing one to iterate through the commands to retrieve, for instance, results 1–50,000, then results 50,001–100,000, then 100,001–150,000 and so on.

So, I built a loop that would generate these “ranked reports”, starting at row ‘i’ and giving me the next ‘count.step’ records, where count.step = 50000, the maximum I’m allowed to retrieve in one go.

Thus, I’d call the function repeatedly, each time asking for the next 50,000 records. At some point, when there were no more customers to download, I’d get a blank report sent back to me. At that point, I know I have everything, so I quit the loop.

I wanted to retrieve the ID of the customer using the website, which in my setup is stored in a custom element called “prop1”. All that sort of detail is controlled by your Adobe SiteCatalyst administrator, should you have exactly the same sort of requirement as I did – so best go ask them which element to look in, as there’s no real chance your setup is identical to mine at that level.

Nonetheless, the code pattern below could likely be used without much modification in order to iterate through any SiteCatalyst data that exceeds the API row limits.

 

count.limit <- 500000 #the max number of records we're interested in
count.step <- 50000 #how many records to retrieve per request, must not exceed 50k
count.start <- 1 #which record number to start with
CustomerList <- NULL #a variable to store the results in
fromDate <- "2016-12-01"
toDate <- "2016-12-31"

for(i in seq(count.start, count.limit, by = count.step)) {
  print(paste("Requesting rows",i, "through", i + count.step - 1))

  tempCustomerList <- QueueRanked(my.rsid,
                          date.from = fromDate,
                          date.to = toDate,
                          metrics = "visits",
                          elements = "prop1",
                          top = count.step,
                          start = i
                          )

  if  (nrow(tempCustomerList) == 0 ) {   # no more rows were returned - presumably we have them all now
    print("Last batch had no rows, exiting loop")
    break
  }
   
  tempCustomerList$batch.start.row <- i

  CustomerList <- rbind(CustomerList, tempCustomerList)

}

 

After running this, you should end up with a “CustomerList” dataframe that looks something like this, where “name” is the value of prop1, in my case the customer ID:

name   url  visits  segment.id  segment.name  batch.start.row
ID123       10                                1
ID234       5                                 1

which, again, you can process as though it was any standard type of R data.
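As a minimal illustration, using nothing beyond base R, you could total up the visits per customer across all the retrieved batches and peek at the most frequent visitors:

# sum visits per customer across every batch retrieved
visitsPerCustomer <- aggregate(visits ~ name, data = CustomerList, FUN = sum)

# show the most frequent visitors first
head(visitsPerCustomer[order(-visitsPerCustomer$visits), ])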

Actually you can use variables, CTEs and other fancy SQL with Tableau after all

A few months ago, I blogged about how you can use Tableau parameters when connecting to many database datasources in order to exert the same sort of flexibility that SQL coders can build into their queries using SQL variables.

This was necessary because Tableau does not let you use SQL variables, common table expressions, temp table creation or other such fanciness when defining the data you wish to analyse, even via custom SQL. You can copy and paste a SQL query using those features that works fine in Microsoft’s SQL Server Management Studio or Oracle’s SQL Developer into Tableau’s custom SQL datasource box, and you’ll get nothing but errors.

But it turns out that there are actually ways to use these features in Tableau, as long as you only need to run them once per session – upon connection to the database.

Why do this? Well, if you have a well-designed data warehouse or other analytically-friendly datasource, hopefully you’ll never need to. However, if you’re accessing “sub-optimally” designed tables for your task, or large raw unprocessed operational data tables, or other such scenarios that consistently form one of the banes of an analyst’s life, then you might find that manipulating or summarising data before you use it in Tableau makes life much easier, much faster, or both.

To do this, we’ll use the “Initial SQL” feature of Tableau. This is supported on some, but not all, of the database types Tableau can connect to. You’ll know if your data connection does support it, because an option to use it will appear if you click on the relevant connection at the top left of the data source screen.

capture

If you click that option, you’ll get a text box where you can go crazy with your SQL in order to set up and define whatever final data table you’d like to connect to. Note that you’ll be restricted to the features of your database and the permissions your login gives you – so no standard table creation if you don’t have write-permissions etc. etc. But, to take the example of SQL Server, in my experience most “normal” users can write temporary tables.

Tableau isn’t great at helping you write this initial SQL – so if you’re not super-proficient and intend to write something complicated then you might want to play with it in a friendlier SQL tool first and paste it into Tableau when you know it works, or ask a friendly database-guru to do it for you.

Below then is an example of how one can use variables, CTEs and temporary tables in order to pre-build a data table you can then analyse in Tableau, by virtue of pasting code into the initial SQL box. This code will be re-run, and hence the results refreshed, every time you open the workbook. But, as the name “initial SQL” suggests, it will not be re-run every time you create or modify a chart, unless you’ve re-connected to the datasource in between.

(For what it’s worth, this uses the default demo database in SQL server – AdventureWorks. It’s the equivalent of Tableau’s famous Superstore dataset, in the Microsoft world 🙂 )

DECLARE @granularity AS varchar(50);
SET @granularity = 'yearly';

WITH OrdersCTE AS
(
    SELECT OrderDate, SalesOrderID, TotalDue
    FROM [SalesLT].[SalesOrderHeader]
)

SELECT
    CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar,YEAR(OrderDate))
         WHEN @granularity = 'monthly' THEN CONVERT(varchar,YEAR(OrderDate)) + '-' + CONVERT(varchar,MONTH(OrderDate))
         WHEN @granularity = 'daily' THEN CONVERT(date,OrderDate)
    END AS Period,
    COUNT(SalesOrderID) AS CountOfSales,
    SUM(TotalDue) AS TotalDue
INTO #initialSQLDemo
FROM OrdersCTE
GROUP BY
    CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar,YEAR(OrderDate))
         WHEN @granularity = 'monthly' THEN CONVERT(varchar,YEAR(OrderDate)) + '-' + CONVERT(varchar,MONTH(OrderDate))
         WHEN @granularity = 'daily' THEN CONVERT(date,OrderDate)
    END;

Now, this isn’t supposed to be an SQL tutorial, so I’ll not explain in detail what the above does. But for those of you already familiar with (T)SQL, you’ll note that I set a variable as to how granular I want my dates to be, use a CTE to build a cut-down version of a sales table with fewer columns, and then aggregate my sales table to the level of date I asked for in my variable into a temporary table I called #initialSQLDemo.

In the above, it’s set to summarise sales by year. This means Tableau will only ever receive 1 row per year. You’ll not be able to drill down into more granular dates – but if this is detailed enough for what you want, then it might provide far better performance than if you connected to a table with 1 row per minute and, implicitly or explicitly, asked Tableau to constantly aggregate it in expensive ways.

Later on, perhaps I realise I need daily data. In which case, I can just change the second line above to :

SET @granularity = 'daily';

…which will expose data at a daily level to Tableau, and I’m good to go.

SQL / Adventureworks aficionados will probably realise that my precise example is very contrived, but hopefully it’s clear how the initial SQL feature works, and hence in the real world you’ll be able to think of cases that are truly useful. Note though that if your code does something that is slow, then you will experience this slowness whenever you open your workbook.

A quick note on temporary tables:

If you’re following along, you might notice something. Although I created a temp table called #initialSQLDemo (which, being temporary, will be deleted from the database as soon as you close your session), it never appears in the table list on the left of the Tableau data source screen.

capture

Why so? Well, in SQL Server, temporary tables are created in a database called “tempdb” that Tableau doesn’t seem to show. This isn’t a fatal problem though, as they’re still accessible via the “New Custom SQL” option shown above.

In my example then, I dragged New Custom SQL to the data pane and entered a simple “give me everything” query into the resulting box.

capture

Now, there is a downside in that I understand using custom SQL can reduce performance in some cases, as Tableau is limited in how it can optimise such queries. But a well-designed temporary table might in any case be intrinsically faster to use than an over-sized, ill-designed permanent table.

When venturing back into Tableau’s main viz-design screen, you’ll see that your temporary table is treated just as any other table source would be. Tableau doesn’t care that it’s temporary.

capture

Although we should note that if you want to use the table in the future in a different Tableau workbook, you’d have to have it run the initial SQL again there, as standard temporary tables are not shareable and do not persist between database sessions.

 

Future features coming to Tableau 10.2 and beyond – that they didn’t blog about

Having slowly de-jetlagged from this year’s (fantastic and huge) Tableau conference, I’d settled down to write up my notes regarding the always-thrilling “what new features are on the cards?” sessions, only to note that Tableau have already done a pretty good job of summarising it on their own blog here, here and here.

There’s little point in my replicating that list verbatim, but I did notice a few things from the keynote announcements that weren’t immediately obvious in Tableau’s blog posts. I have listed some of those below for reference. Most are just fine details, but one or two seem more major to me.

Per the conference, I’ll divide this up into “probably coming soon” vs “1-3 year vision”.

Coming soon:

Select from tooltip – a feature that will no doubt seem like it’s always been there as soon as we get it.

We can already customise tooltips to show pertinent information about a data point that doesn’t influence the viz itself. For example, if we’re scatter-plot analysing sales and profit per customer, perhaps we’d like the tooltip shown on hover to say whether the customer is a recent customer or a long-term one.

In today’s world, as you hover over a particular customer’s datapoint, the tooltip indeed may tell you that it’s a recent customer. But what’s the pattern in the other datapoints that are also recent customers?

In tomorrow’s world you’ll be able to click where it tells you “recent customer” and all the other “recent customers” in the viz will be highlighted. You can get the same end result today with the highlighter tool, but this will likely be far more convenient in certain situations.

A couple of new web-authoring features, to add to the list on the official blog post.

  1. You can create storypoints on the web
  2. You’ll be able to enable full-screen mode on the web

Legends per measure: this might not sound all that revolutionary, but when you think it through, it enables this sort of classic viz: a highlighted table on multiple measures – where each measure is highlighted independently of the others.

legendpermeasure.PNG
Having average sales of £10,000 no longer has to mean that the high customer age of 100 in the same table is highlighted as though it were tiny.

Yes, there are workarounds to make something that looks similar to the above today – but it’s one of those features where I’ve found that people yet to be convinced of Tableau’s merits react negatively when it turns out not to be a simple operation, especially once they compare it to other tools (Excel…). Whilst recreating what you made in another tool is often exactly the wrong approach to adopting a new one, this type of display is one of the few I see a good case for making easy to create.

In the 1-3 year future:

Tableau’s blog does talk about the new super-fast data engine, Hyper, but doesn’t dwell on one cool feature that was demoed on stage.

Creating a Tableau extract is sometimes a slow process. Yes, Hyper should make it faster, but at the end of the day there are factors like remote database performance and network speed that might mean there’s simply no practical way to speed it up.  Today you’re forced to sit and stare at the extract creation process until it’s done.

Hyper, though, can do its extract-making process in the background, and let you use it piece-by-piece, as it becomes available.

So if you’re making an extract of sales from the last 10 years, but so far only the information from the last 5 years has arrived to the extract creation engine, you can already start visualising what happened in the last 5 years. Of course you’ll not be able to see years 6-10 at the moment, as it’s still winging its way to you through the wifi. But you can rest safe in the knowledge that once the rest of the data has arrived it’ll automatically update your charts to show the full 10 year range. No more excuses for long lunches, sorry!

It seems to me that this, and features like incremental refresh, also open the door to enabling near real-time analysis within an extract.

Geographic augmentation – Tableau can plot raw latitude and longitude points with ease. But in practice, they are just x-y points shown over a background display; there’s no analytical concept present that point x,y is part of the state of Texas whereas point y,z is within New York. But there will be. Apparently we will be able to roll up long/lat pairs to geographic components like zip, state, and so on, even when the respective dimension doesn’t appear in the data.

Web authoring – the end goal is apparently that you’ll be able to do pretty much everything you can do publishing-wise in Tableau Desktop on the web. In recent times, each iteration has added more and more features – but in the longer term, the aim is to get to absolute parity.

We were reassured that this doesn’t mean the desktop product is going away; it’s simply a different avenue of usage, and the two technologies will auto-sync, so that you could start authoring in your desktop app, then log into the website from a different computer and find your work there waiting for you, without the need to formally publish it.

It will be interesting to see whether, and how, this affects licensing and pricing as today there is a large price differential between for instance a Tableau Online account and Tableau Desktop Professional, at least in year one.

And finally, some collaboration features on Tableau server.

The big one, for me, is discussions (aka comments).  Right alongside any viz when published will be a discussion pane. The intention is that people will be able to comment, ask questions, explain what’s shown and so on.

But, doesn’t Tableau Server already have this? Well, yes, it does have comments, but in my experience they have not been greatly useful to many people.

The most problematic issue in my view has been the lack of notifications. That is to say, a few months after publishing a delightful dashboard, a user might have a question about what they’re seeing and correctly pop a comment on the page displaying the viz. Great.

But the dashboard author, or whichever SME might actually be able to answer the question, isn’t notified in any way.  If they happen to see that someone commented by chance, then great, they can reply (note that the questioner will not be notified that someone left them an answer though). But, unless we mandate everyone in the organisation to manually check comments on every dashboard they have access to every day, that’s rather unlikely to be the case.

And just opening the dashboard up may not even be enough, as today they tend to be displayed “below the fold” for any medium-large sized dashboard. So comments go unanswered, and people get grumpy and stop commenting, or never notice that they can even comment.

The new system however will include @user functionality, which will email the user when a comment or question has been directed at them. I’m also hoping that you’ll be able to somehow subscribe to dashboards, projects or the server such that you get notified of any comments you’re entitled to see, whether or not you’re mentioned in them.

As they had it in the demo at least, the comments also show on the right hand side of the dashboard rather than below it – which, given desktop users tend to have wide rather than tall screens, should make them more visible. They’ll also be present in the mobile app in future.

Furthermore, each time a comment is made, the server will store and show the state of the visualisation at that time, so that future readers can see exactly what the commenter was looking at when they made their comments. This will be great for the very many dashboards that are set up to autorefresh or allow view customisation.

Conversation.PNG

(My future comment wishlist #1: ability to comment on an individual datapoint, and have that comment shown wherever that datapoint is seen).

Lastly, sandboxes. Right now, my personal experience has been that there’s not a huge incentive to publish work-in-progress to a Tableau server in most cases. Depending on your organisation’s security setup, anything you publish might automatically become public before you’re ready, and even if not, then unless you’re pretty careful with individual permissions it can be the case that you accidentally share your file too widely, or not widely enough, and/or end up with a complex network of individually-permissioned files that are easy to get mixed up.

Besides, if you always operate from the same computer, there’s little advantage (outside of backups) to publishing it if you’re not ready for someone else to look at it. But now, with all this clever versioning, recommendy, commenty, data-alerty stuff, it becomes much more interesting to do so.

So, there will apparently be a user sandbox; a private area on the server where each Tableau user can upload and work on their files, safe in the knowledge that what they do there is private – plus they can customise which dashboards, metrics and so on are shown when they enter their sandbox.

But, better yet, team sandboxes! So, in one click, you’ll be able to promote your dashboard-in-progress to a place where just your local analytics team can see it, for instance, and get their comments, feedback and help developing it, without having to fiddle around with setting up pseudo-projects or separate server installations for your team.

Furthermore, there was mention of a team activity newsfeed, so you’ll be able to see what your immediate team members have been up to in the team sandbox since you last took a peek. This should help keep awareness of what each team member is working on high, further enhancing the possibilities for collaboration and reducing the likelihood of duplicate work.

Finally, it’s mentioned on Tableau’s blogs, but I wanted to extend a huge cheer and many thanks for the forthcoming data-driven alerting feature! Lack of this style of alerting, and insufficient collaboration features, were the two most common complaints I have heard about Tableau Server from people considering the purchase of something that can be decidedly non-trivial in cost. Other vendors have actually gone so far as to sell add-on products to try and add these features to Tableau Server, many of which are no doubt very good – but it’s simply impossible to integrate them into the overall Tableau install as seamlessly as Tableau themselves could do.

Now we’re in 2016, where the average Very Important And Busy Executive feels like they don’t have time to open up a dashboard to see where things stand, it’s a common and obvious feature request to want to be alerted only when there is actually something to worry about – which may then result in opening the dashboard proper to explore what’s going on. And, I have no doubt, creative analysts are going to find any number of uses to put it to beyond the obvious “let me know if my sales are poor today”.

(My future data-driven alert wishlist #1: please include a trigger to the effect of “if this metric has an unusual value”, meaning one based on a statistical calculation derived from historic variance/standard deviation etc., rather than having to put a flat >£xxxx in as the criterion).

What people claim to believe: Hillary Clinton edition

Back to political opinion polls today I’m afraid. Yep, the UK’s Brexit is all done and dusted (haha) but now our overseas friends seem to be facing what might be an even more unlikely choice in the grand US presidential election 2016.

Luckily, the pollsters are on hand to guide us through the inner minds and intentions of the voters-to-be. At last glance, it was looking pretty good for a Clinton victory – although be not complacent, ye Democrats, given the lack of success in the field of polling with regards to the aforementioned Brexit, or indeed the 2015 General Election here in the UK.

Below is perhaps my favourite most terrifying poll of recent times. It’s a recent poll carried out by the organisation “Public Policy Polling” concerning residents of the state of Florida. As usual, they asked several questions about the respondents’ characteristics and viewpoints, which lets us divide up the responses into those coming from Clinton supporters vs those coming from Trump supporters.

There are many insidious facts one could elucidate here on both sides, but given that at the moment the main polls are very much in favour of a Clinton win (but see the previous comment re complacency…), let’s pick out some that might hold relevance in a world where Clinton semi-landslides to victory.

Firstly, it shouldn’t particularly matter, but one can’t help but notice that Clinton is of the female persuasion. But, hey, rational voters look at policies, competence, experience or similar attributes, so a basic demographic fact alone doesn’t matter, right?

Wrong: the survey shows that just 69% of all respondents thought that gender didn’t make a difference. And, predictably, twice as many thought that the US would be better off with a male president as thought it would be better off with a female president. The effect is notably strongest among Trump supporters, where nearly 20x the proportion of people think the US would be better with a male president than with a female one.

manorwoman

Now, I can imagine some kind of halo effect where it’s hard for people to totally differentiate “my favourite candidate is a man and I can’t imagine having a favourite candidate that is not like him” from “my favourite candidate is a man but the fact he happens to be a man is incidental”.

But the fact that nearly 40% of Trump supporters here claim that, generically, the president should be a man (implying that if it were Ms Trump vs Mr Clinton, they might vote differently) seems a potentially stronger signal of inequality than that, especially when compared to the weaker bias among Clinton supporters towards preferring a woman – which is equally illogical, but at least has a lower incidence. We can note a pro-male bias in the “not sure” population too.

Of course we don’t actually have an example of what the US is like when it has a female president, because none of the 43 serving presidents to date have been women.

But apparently we do already know part of what Hillary Clinton is presidentially responsible for. “Coincidentally” (hmm…) her husband was one of the previous 43 male presidents, and apparently the majority of Trump supporters think it’s perfectly right to hold her responsible for his “behaviour”.

Yep, anything he did, for good or bad (which, let’s face it, is probably biased towards the bad for those people who support the opposing party and/or don’t appreciate cheating spouses) is in some sense his wife’s fault, for the Trumpians.

responsible

But if she’s so obviously bad, then why does she actually poll quite well, at the time of writing? Well, of course there can be only one reason. The whole election is a fraud. And given we haven’t actually had the election yet, I guess the allegation must also entail that poll respondents are also lying about their intentions, and/or that all the publishers of polls are equally as corrupt as the electoral system of the US.

rigged

Yes, THREE-QUARTERS of Trump supporters polled here apparently believe that if, as seems quite likely, Clinton wins then it can only be because the election was rigged. The whole democratic process is a sham. The US has fallen prey to semi-visible forces of uber-powerful corruption. We should presumably therefore ignore the result and give Trump the golden throne (to fit inside his golden house). Choice of winner aside, this is a pretty scary indictment of the respect that citizens feel for their own democratic system. This is not to say whether they are right or wrong to feel this way; to us Brits, I think it sometimes seems that money has an even greater hold over some theoretically democratic outcomes in the US than it does over here – but that so many have so little regard for the system is surely…a concern.

But wait, it’s not just that she may hypothetically commit electoral fraud in the near future. She has apparently already committed crimes serious enough that she should already be locked up in prison.

prison

Over EIGHTY PERCENT of Trump supporters polled here think she should literally go to prison; and this isn’t predicated on her winning. Well, I’m sure there’s no shortage of bad things that can be laid at her door – she has, after all, been serving at a high level of politics for a while already and, without being an expert, it seems there are many serious allegations that people lay at the Clintons’ feet. But it’s perhaps quite surprising that the large majority of her opponent’s supporters want to throw someone who is likely to be their next president in jail. I don’t think even the Blair war-crimes movement ever got quite that far!

Unless…well. I’m only sad they didn’t ask the same question about Trump. Perhaps we could be more at ease if at least the same proportion of people thought he should be locked up. An oft overlooked fact is that analysis is often meaningless without some sort of carefully-chosen comparison. Perhaps there’s a baseline figure of people that think any given prominent politician should be jailed (but I’ve not seen research on that).

It’s hard to imagine, though, that the fact Trump has himself actually appeared to threaten her with jail doesn’t play some role here with his supporters. It is apparently unprecedented for a major party nominee to have said publicly that his opponent should be jailed – but say it he did, most famously during their second presidential debate. As the Guardian reports:

Trump, embracing the spirit of the “lock her up” mob chants at his rallies, threatened: “If I win I am going to instruct my attorney general to get a special prosecutor to look into your situation – there has never been so many lies and so much deception,” he threatened.

Clinton said it was “awfully good” that someone with the temperament of Trump was not in charge of the law in the country, provoking another Trump jab: “Because you’d be in jail.”

Eric Holder, who once was the US attorney general, didn’t really seem to like that plan.

So we’ve established that in the eyes of the average Florida Trump supporter polled here that if Clinton wins then the whole shebang was fraudulent, she already should have been locked up in prison, and, besides, the fact that she’s a women should probably ban her from applying to the office of the president in the first place. That’s a strong indictment. But, of course, there’s another level to explore.

Is Hillary Clinton a malevolent paranormal entity, intent on destroying humankind?

demon

Erm…2 out of every 5 Trump supporters here think yes, she definitely is an actual demon. And the majority aren’t sure that she is not an actual demon.

Even among those “not sure” who they support, only just over 50% are sure she’s not an actual demon. It’s also entertaining to contemplate the c. 10% of her own supporters who think she might be demonic yet still fancy her as president.

The lower figures might be down to some variant of the excellent SlateStarCodex’s concept of the “Lizardman’s Constant”, which can perhaps be summed up as: there’s a lower-bound percentage of people who will believe, or claim to believe, any polled sentiment.

But there they benchmark that at around 4%, and ten times that proportion of Trump supporters here respond that they are certain that Clinton is a literal demon. There are many ways to introduce biases that lead to this sort of result, which SlateStarCodex does go over. But 40% is…big…if this poll is even remotely respectable.

So, where has this idea that she’s a demon come from? Have Trump supporters as a collective seen some special evidence that proves this must be true, that somehow the rest of us have overlooked? Surely each individual doesn’t randomly become subject to these thoughts, which even believers would probably term an unusual state of affairs – is there no smoke without fire? (pun intended)

Well, perhaps it has something to do with the fact that a subset of famous-enough people have stated that she is.

Trump himself did refer to her as a devil, although in fairness that just maybe possibly might be an unfortunate turn of phrase, if we want to be charitable. After all, to his credit, evidence suggests he’s not great at following a script (or at least not one you’d imagine a typical political spinner would write).

Perhaps more pertinent, for a certain subsection of viewers anyway, is presenter Alex Jones of “Infowars” fame (a website that apparently gets more monthly visitors than e.g. the Economist or Newsweek), he of whom Trump says “your reputation is amazing…I will not let you down”, who did go on a bit of a rant on this subject.

MediaMatters have kindly transcribed:

She is an abject, psychopathic, demon from Hell that as soon as she gets into power is going to try to destroy the planet. I’m sure of that, and people around her say she’s so dark now, and so evil, and so possessed that they are having nightmares, they’re freaking out… I mean this woman is dangerous, ladies and gentleman. I’m telling you, she is a demon. This is Biblical.

There’s so much more if you’re into that sort of stuff; see it all on this video, including the physical evidence he presents of Clinton’s demonness (spoiler alert: she smells bad, and Obama is obviously one too because sometimes flies land on him).

Unfortunately I’m not aware of time series data on perception of Clinton’s level of demonicness – so I’m afraid there’s no temporal analysis to present on causal factors here.

At first glance some of this might seem kind of amusing in a macabre way – especially to us foreigners, for whom the local political process is hugely less pleasant or equitable than it should be, but doesn’t usually come with claims of supernatural possession. But the outcome may not be so funny. In the likely (but not certain) event that Clinton wins, Florida at least seems to have a significant bunch of people who think the whole debacle was rigged, and that Clinton should have a gender change, an exorcism and a long spell in jail before even being considered for the presidency.

Update 1: this sort of stuff probably doesn’t help matters – from former Congressman / radio host Joe Walsh:

Update 2: the polls are a lot closer now than they were when I started writing.

Do good and bad viz choices exist?

Browsing the wonderful timeline of Twitter one evening, I noted an interesting discussion on subjects including Tableau Public, best practice, chart choices and dataviz critique. It’s perhaps too long to go into here, but this tweet from Chris Love caught my eye.

Not being particularly adept at summarising my thoughts into 140 characters, I wanted to explore some thoughts around the subject here. Overall, I would concur with the sentiment as expressed – particularly given it had to be crammed into such a small space, and taken out of context as I have done here 🙂

But, to take the first premise, whilst there are probably no viz types that are inherently terrible or universally awesome, I think one can argue that there are good or bad viz choices in many situations. It might be the case in some instances that there’s no best or worst viz choice (although I think we may find that there often is, at least out of the limited selection most people are inclined to use). Here I am imagining something akin to a data-viz version of Harris’ “moral landscape“; it may not be clear what the best chart is, but there will be local maximums that are unquestionably better for purpose than some surrounding valleys.

So, how do we decide what the best, or at least a good, viz choice is? Well, it surely comes down to intention. What is the aim of the author?

This is not necessarily self-evident, although I would suggest defaulting to something like “clearly communicating an interesting insight based on an amalgamation of datapoints” as a common one. But there are others:

  • providing a mechanism to allow end-users to explore large datasets which may or may not contain insights,
  • providing propaganda to back up an argument,
  • or selling a lot of books or artwork

to name a few.

The reason we need to understand the intention is because that should be the measure of whether the viz is good or bad.

Imagine my aim is to communicate to an audience of ten may-as-well-be-clones business managers that 10% of my customers are so unprofitable that we would be better off without them – note that the details of the audience are very important here too.

I’ll go away and draw 2 different visualisations of the same data (perhaps a bar chart and, hey, why not, a 3-d hexmap radial chart 🙂 ). I’ll then give version 1 to five of the managers, and version 2 to the other five. Half an hour later, I’ll quiz them on what they learned. Simplistically, I shall feel satisfied that whichever version generated the correct understanding in the most managers was the better viz in this instance.

Yes yes, this isn’t a perfect double-blind controlled experiment, but hopefully the point is apparent. “Proper” formal research on optimising data visualisation is certainly done, and very necessary it is too. There are far too many examples to list, but classics in the field might include the paper “Graphical Perception” by Cleveland and McGill, which helped us understand which types of charts are conducive to being visually decoded accurately by us humans, with our built-in limitations.

Commercially, companies like IBM or Autodesk or Google have research departments tackling related questions. In academia, there are groups like the University of Washington Interactive Data Lab (which, interestingly enough, started out as the Stanford Visualization Group, whose work on “Polaris” was later released commercially as none other than Tableau software).

If you’re looking for ideas to contribute to on this front, Stephen Few maintains a list of some research he’d like to see done on the subject in future, and no doubt there are infinitely many more possibilities if none of those pique your curiosity.

But the point is: for certain given aims, it is often possible to use experimental procedures and the resulting data, to say, as surely as we can say many things, visualisation A is better than visualisation B at achieving its aim.

But let’s not go too far in expressing certainty here! There are several things to note, all contributing to the fact that very often there is not one best viz for a single dataset – context is key.

  • What is the aim of the viz? We covered that one already. Using a set of attractive colours may be more important than correct labelling of axes if you’re wanting to sell a poster, for instance. Certain types of chart make particular comparisons easier and more accurate than others do. And if you’re trying to learn or teach how to create a particular type of uber-creative chart in a certain tool, then you’re going to rather fail to accomplish that if you end up making a bar chart.
  • Who is the audience? For example, some charts can convey a lot of information in a small space; for instance box-and-whisker plots. An analyst or statistician will probably very happily receive these plots to understand and compare distributions and other descriptive stats in the right circumstances. I love them. However, extensive experience tells me that, no, the average person in the street does not. They are far less intuitive than bar or line charts to the non-analytically inclined/trained. However inefficient you might regard it, a table and 3 histograms might communicate the insight to them more successfully than a boxplot would. If they show an interest, by all means take the time to explain how to read a box plot; extol the virtues of the data-based lifestyle we all know; rejoice in being able to teach a fellow human a useful new piece of knowledge. But, in reality, your short-term job is more likely to be to communicate an important insight than to provide an A-level statistics course – and if you don’t do well at fulfilling what you’re being employed to do, then you might not be employed to do it for all that long.

As well as there being no single best viz type in a generic sense, there’s also no one universally worst viz type. If there was, the datarati would just ban it. Which, I guess, some people are inclined to do – but, sorry, pie charts still exist. And they’re still at least “locally-good” in some contexts – like this one (source: everywhere on the internet):

pie

But, hey, you don’t have the time to run multiple experiments on multiple audiences. Let’s imagine you’re also quite new to the game, with very little personal experience. How would you know which viz type to pick? Well, this is going to be a pretty boring answer, sorry – and there’s more to elaborate on later – but one way relates to the fact that, just like in any other field, there are actually “experts” in data viz. And outside of Michael Gove’s deluded rants, we should acknowledge that they usually have some value.

In 1928, Bertrand Russell wrote an essay called ‘On the Value of Scepticism‘, where he laid out 3 guidelines for life in general.

 (1) that when the experts are agreed, the opposite opinion cannot be held to be certain;

(2) that when they are not agreed, no opinion can be regarded as certain by a non-expert;

and (3) that when they all hold that no sufficient grounds for a positive opinion exist, the ordinary man would do well to suspend his judgment.

So, we can bastardise these a bit to give it a dataviz context. If you’re really unsure of what viz to pick, then refer to some set of experts (to which we must acknowledge there’s subjectivity in picking…perhaps more on this in future).

If “experts” mostly think that data of type D used to convey an insight of type I to an audience of type A for purpose P is best represented in a line chart, then that’s probably the way to go if you don’t have substantial reason to believe otherwise. Russell would say that at least you can’t be held as being “certainly wrong” in your decision, even if your boss complains. Likewise, if there’s honestly no concurrence of opinion, then have a go and take your pick of the suggestions – again, no-one should tell you off, as you did nothing unquestionably wrong!

For example, my bias is towards feeling that, when communicating “standard” insights efficiently via charts to a literate but non-expert audience, you can’t go too far wrong in reading some of Stephen Few’s books. Harsh and austere they may seem at times, but I believe them to be based on quality research in fields such as human perception as well as experience in the field.

But that’s not to say that his well-founded, well-presented guidelines are always right. Just because 90% of the time you might be most successful in representing a certain type of time series as a line chart doesn’t mean that you always will be. Remember also, you may have a totally different aim or audience to those Mr Few aims his books at, in which case you cannot assume at all that the same best-practice standards would apply.

And, despite the above guidelines, because (amongst other reasons) not all possible information is ever available to us at any given time, sometimes experts are simply wrong. It turns out that the earth probably isn’t the centre of the universe, despite what you’d probably have heard if you went back to the experts of a millennium ago. You should just take care to find some decent reason to doubt the prevailing expertise, rather than simply ignoring it.

What we deem as the relative “goodness” of data viz techniques is also surely not static over time. For one, not all forms of data visualisation have existed since the dawn of mankind.

The aforementioned box and whisker plot is held to have been invented by John Tukey. He was only born in 1915, so if I were to travel back 200 years in time with my perfectly presented plot, it’s unlikely I’d find many people who would find it intuitive to interpret. Hence, if my aim was to communicate insights quickly and clearly, then on the balance of probabilities this would be a bad attempt. It may not be the worst attempt, as the concept is still valid and hence could likely be explained to some inhabitants of the time – but in terms of bang for buck, there’d no doubt be higher peaks in the “communicating data insights quickly” landscape available to me nearby.

We should also remember that time hasn’t stopped. Contrary to Francis Fukuyama’s famous essay and book, we probably haven’t reached the end of history even politically just yet, and we most certainly haven’t done so in the world of data. Given the rate of usable data creation, it might be that we’ve only dipped our toe in so far. So, what we think is best practice today may likely not be the same a hundred years hence; some of it may not be so even next year.

Some, but not all, obstacles or opportunities surround technology. Already the world has moved very quickly from graph paper, to desktop PCs, to people carrying around super-computers with small screens in their pockets. The most effective, most efficient ways to communicate data insights will differ in each case. As an example I’m very familiar with, the Tableau software application clearly acknowledged this in its last release, which includes facilities for displaying data differently depending on what device it’s being viewed on. Not that we need to throw the baby out with the bathwater, but even our hero Mr Tukey may not have had the iPhone 7 in mind when considering optimum data presentation.

Smartwatches have also appeared, albeit they are not so mainstream at the moment. How do you communicate data stories when you have literally an inch of screen to play with? Is it possible? Almost certainly so, but probably not in the same way as on a 32 inch screen; and are the personal characteristics and needs of smartwatch users the same as those of the audience who views vizzes on a larger screen anyway?

And what if Amazon (Echo), Google (Home) and others are right to think that in the future a substantial amount of our information-based interactions may be done verbally, to a box that sits on the kitchen counter and doesn’t even have a screen? What does “data visualisation” mean in this context? Is it even a thing? Yet a lot of the questions I might want to ask my future good friend Alexa might well be questions that can only be answered by some transformation and re-presentation of data in audio form.

I already can verbally ask my phone to provide me some forms of dataviz. In the below example, it shows me a chart and a summary table. It also provides me a very brief audio summary for the occasions where I can’t view the screen, shown in the bold text above the chart. But, I can’t say I’ve heard of a huge amount of discussion about how to optimise the audio part of the “viz” for insight. Perhaps there should be.

image

Technology aside though, the field should not rest on its laurels; the line chart may or may not ever die, but experimentation and new ideas should always be welcomed. I’d argue that we may be able to prove  in many cases that, today, for a given audience, for a given aim, with a given dataset, out of the various visualisations we most commonly have access to, that one is demonstrably better than another, and that we can back that up via the scientific method.

But what if there’s an even better one out there we never even thought of? What if there is some form of time series that is best visualised in a pie chart? OK, it may seem pretty unlikely but, as per other fields of scientific endeavour, we shouldn’t stop people testing their hypotheses – as long as they remain ethical – or the march of progress may be severely hampered.

Plus, we might all be out of a job. If we fall into the trap of thinking the best of our knowledge today is the best of all knowledge that will ever be available, that the haphazard messy inefficiencies of creativity are a distraction from the proven-efficient execution of the task at hand, then it’ll not be too long before a lot of the typical role of a basic data analyst is swallowed up in the impending march of our robotic overlords.

Remember, a key job of a lot of data-people is really to answer important questions, not to draw charts. You do the second in order to facilitate the first, but your personal approach to insight generation is often in actuality a means to another end.

Your customer wants to know “in what month were my sales highest?”. And, lo and behold, when I open a spreadsheet in the sort of technology that many people treat as the norm these days, Google sheets, I find that I can simply type or speak in the question “What month were my sales highest?” and it tells me very clearly, for free, immediately, without employing anyone to do anything or waiting for someone to get back from their holiday.

capture

Yes, that feature only copes with pretty simplistic analysis at the moment, and you have to be careful how you phrase your questions – but the results are only going to get better over time, and spread into more and more products. Microsoft PowerBI already has a basic natural language feature, and Tableau is at a minimum researching into it. Just wait until this is all hooked up to the various technological “cognitive services” which are already on offer in some form or other. A reliable, auto-generated answer to “what will my sales be next week if I launch a new product category today?” may free up a few more people to spend time with their family, euphemistically or otherwise.

So in the name of progress, we can and should, per Chris’ original tweet, be open to giving and receiving constructive criticism, whether positive or negative. There is value in this, even in the unlikely event that we have already hit on the single best, universal way of representing a particular dataset for all time.

Recall John Stuart Mill’s famous essay, “On Liberty” (written in 1859 – yes, even before the boxplot existed). It’s so very quotable for many parts of life, but let’s take for example a paragraph from chapter two, regarding the “liberty of thought and discussion”. Why shouldn’t we ban opinions, even when we believe we know them to be bad opinions?

But the peculiar evil of silencing the expression of an opinion is, that it is robbing the human race; posterity as well as the existing generation; those who dissent from the opinion, still more than those who hold it.

If the opinion is right, they are deprived of the opportunity of exchanging error for truth: if wrong, they lose, what is almost as great a benefit, the clearer perception and livelier impression of truth, produced by its collision with error.

Are pie charts good for a specific combination of time series data, audience and aim?

Well – assuming a particularly charitable view of human discourse –  after rational discussion we will either establish that yes, they actually are, in which case the naysayers can “exchange error for truth” to the benefit of our entire field.

Or, if the consensus view of “no way” holds strong, then, having been tested, we will have reinforced the reason why this is in both the minds of the questioner, and ourselves – hence helping us remember the good reasons why we hold our opinions, and ensuring we never lapse into the depths of pseudo-religious dogma.