Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies any amount of the history of, or best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These particular transformations of data-into-diagram have endured to become the examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.

[Image: Snow’s map of the Broad Street cholera outbreak]

(High-resolution versions available here).

Adding the geospatial element to the visualisation revealed geographic clusters, which provided evidence that use of a specific local drinking-water source – the now-famous Broad Street public well – was the key common factor among sufferers in this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence persuaded the authorities to remove the Broad Street pump handle. People could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about perhaps is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause ->  action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – would more lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had already used the same dataset himself to tackle the same question analytically – why are there relatively more cholera deaths in some places than others? – and had found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but that certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: here is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean it’s worthless for someone else to re-analyse it – even if the first analyst has already established a conclusion. This is especially important when the stakes are high, and when the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by the people who lived on it: the lower the land (relative to sea level, for example), the higher the cholera death rate appeared to be.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera”. It included this table:

[Table: Farr’s cholera deaths per 10,000 inhabitants, by elevation of the land]

The relationship seems quite clear: cholera deaths per 10,000 persons go up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a line chart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriological Reviews. Note the “observed mortality” line.

[Chart: observed cholera mortality by elevation, alongside the mortality calculated from miasma theory]

Based on that data, his elevation theory seems a plausible candidate, right?
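To make the shape of that relationship concrete, here is a minimal sketch (in Python) of fitting the kind of inverse-elevation curve Farr had in mind. The numbers are illustrative stand-ins rather than Farr’s published figures, and the functional form is just one plausible reading of “mortality rises as elevation falls”.

```python
# Illustrative only: made-up (elevation, death rate) pairs that mimic the
# shape of Farr's table, fitted with an inverse-elevation curve.
import numpy as np
from scipy.optimize import curve_fit

elevation_ft = np.array([10, 30, 50, 70, 90, 150, 350], dtype=float)
deaths_per_10k = np.array([110, 60, 35, 27, 22, 15, 7], dtype=float)

def inverse_elevation(e, a, b):
    # Mortality assumed proportional to 1 / (elevation + a constant).
    return a / (e + b)

(a, b), _ = curve_fit(inverse_elevation, elevation_ft, deaths_per_10k, p0=(1000.0, 5.0))
print(f"deaths per 10k ≈ {a:.0f} / (elevation + {b:.1f})")
print(f"predicted at 20 ft: {inverse_elevation(20.0, a, b):.0f} per 10k")
```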

You might notice that the re-vizzed chart also contains a line showing the death rate calculated according to “miasma theory”, which on this metric tracks the actual cholera death rate rather closely. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later displaced by germ theory, but at the time miasma was a strong contender for explaining the distribution of disease. Its standing was probably helped by the fact that some of the actions one might take to reduce “miasma” overlap with those that genuinely deal with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we discover later, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed it in a reasonable way. He was testing a hypothesis based on the common sense of the time he was working in, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.
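A tiny illustration of the point: two series generated completely independently of each other, but which both happen to trend over time, will show a strong correlation even though neither has anything to do with the other. The variable names below are invented for the example.

```python
# Two independently generated series that merely share an upward trend.
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(2000, 2020)

cheese_consumption = 30 + 0.8 * (years - 2000) + rng.normal(0, 1.0, years.size)
films_released = 400 + 12.0 * (years - 2000) + rng.normal(0, 15.0, years.size)

r = np.corrcoef(cheese_consumption, films_released)[0, 1]
print(f"Correlation between two unrelated trending series: {r:.2f}")  # typically > 0.9
```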

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here though we can think of Karl Popper’s view of scientific knowledge being derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume the dominant one is certainly correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I were to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view with regard to what explains the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will probably place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, occur in day-to-day business analysis. Pre-existing knowledge changes over time, and differs between groups. Who hasn’t seen (or been) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data which was later revealed to have been affected by something entirely different?

For my part, I would suggest learning what’s normal, and applying double scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to adding value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that overnight 100% of the public stopped buying anything at all from your previously highly successful store, for instance.

Again, here is an argument for sharing your data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer dataviz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.
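The paper itself contains the real model and data; the following is only a toy sketch, with invented district-level numbers, of the general technique – a grouped (deaths out of population) logistic regression of cholera mortality on elevation and a crude water-supply indicator.

```python
# Toy sketch of a grouped logistic regression: invented data, not the
# figures used by Bingham et al.
import numpy as np
import pandas as pd
import statsmodels.api as sm

districts = pd.DataFrame({
    "elevation_ft":   [5, 15, 30, 50, 70, 100, 175, 350],
    "suspect_water":  [1, 1, 1, 0, 1, 0, 0, 0],   # 1 = served by a suspect supply
    "population":     [14000, 12000, 15000, 9000, 11000, 10000, 7000, 6000],
    "cholera_deaths": [160, 120, 90, 25, 40, 14, 6, 3],
})

X = sm.add_constant(districts[["elevation_ft", "suspect_water"]])
# Passing (deaths, survivors) as the outcome lets statsmodels fit grouped binomial data.
y = np.column_stack([districts["cholera_deaths"],
                     districts["population"] - districts["cholera_deaths"]])

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.summary())
```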

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.

[Chart: daily cholera attacks around Broad Street in 1854, with the date the pump handle was removed marked]

As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse something urgently, then it’s likely just as important to act on the findings equally fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom of the time. At least modern-day analysts can take some solace in the knowledge that even our most highly regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened –  it may well contribute towards great things in the future.


You made a chart. So what?

In the latest fascinating Periscope video from Chris Love, the conversation centred on a question that can be summarised as “Do data visualisations need a ‘so what’?”.

There are many ways of rephrasing this: one could ask whether it is (always) the responsibility of the viz author to highlight the story that their visualisation shows. Or can a data visualisation be truly worthy of high merit even if it doesn’t lead the viewer to a conclusion?

This topic resonates strongly with me: part of my day job involves maintaining a reference library of the results from the analytical research or investigation we do. We publish this widely within our organisation, so that any employee who has cause or interest in what we found in the past can help themselves to the results. The title we happened to give the library is “So what?”.

Although the detailed results of our work may be reported in many different formats, each library entry has a templated front page that includes the same sections for each study:

  1. The title of the work.
  2. The question that the work intended to address.
  3. A summary of the scope and dataset that went into the analysis.
  4. A list of the main findings.
  5. And finally, the all-important “So what?” section.

Note the distinction between findings (e.g. “customers who don’t buy anything for 50 days are likely to never buy anything again”) and the so what (“we recommend you call the customer when it has been 40 days since you saw them if you wish to increase your sales by 10%”).
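For illustration, here is a hypothetical sketch of how a front page like that could be captured as a structure – the field names below are mine, not our actual library’s schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema mirroring the five templated sections described above.
@dataclass
class StudyFrontPage:
    title: str
    question: str                # the question the work intended to address
    scope_and_data: str          # summary of the scope and dataset analysed
    findings: List[str] = field(default_factory=list)
    so_what: List[str] = field(default_factory=list)   # recommended actions

entry = StudyFrontPage(
    title="Customer lapse analysis",
    question="After how many inactive days is a customer unlikely to return?",
    scope_and_data="12 months of transactions across all stores",
    findings=["Customers who buy nothing for 50 days are likely to never buy anything again."],
    so_what=["Call customers at 40 days of inactivity if you wish to increase sales by 10%."],
)
print(entry.so_what[0])
```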

The simple answer

With the above in mind, my position is probably quite obvious. If you are going to demand a binary yes/no answer as to whether a data visualisation should have a “so what?”, then my simplistic generalisation would be that the answer is yes, it should.

Most of the time, especially in a business context, the main intention behind the whole analytics pipeline is to provide some nugget of information that will lead to a specific decision or action being taken. If the data visualisation doesn’t lead to (or preferably even spoon-feed) a conclusion then there is a high risk that the audience might feel that they wasted their time in looking at it.

In reality though, the black-and-white answer portrayed above is naturally a series of various shades of grey.

A slightly more refined answer

Two key considerations are paramount to deciding whether a particular viz has to have a “so what” to be valuable.

The audience

Please note that I write this from the perspective of visualisations aimed at communities that are not necessarily all data scientist type professionals. If your intended audience is a startup data company populated entirely by computer science PhDs who live and breathe dataviz, then the answers may differ. But for most of us, hobbyists or pros, this is not the audience we have, or seek.

A rule of thumb here then might be:

  • If your audience consists entirely of other analysts, then no, it is not essential to have a “so what?” aspect to your viz. However under many circumstances it still would be extremely useful to do so.
  • If your audience includes non-analysts, particularly those people who might term themselves “busy executives” or claim that they “don’t need data to make decisions” (ugh) then it is in general absolutely essential that your viz points towards a “so what”, if a viz is indeed what you intend to deliver.

Why is it OK to lose the “so what” for analysts? Well, only because these people are probably very capable of using a well-designed viz to generate their own conclusions in an analytically safe way. It’s not that they don’t need a “so what”: they almost certainly do – it’s just that you can feel more secure that, whilst not producing it yourself, you can rely on them to do that aspect of the work properly.

They might even be better than you at interpreting the results, if for instance they have extensive subject domain knowledge that you don’t. Interpretation of data is almost always a mix of analytical proficiency and domain-specific knowledge.

Even the best technical analyst cannot have knowledge of all domains. This is why it’s generally not a good idea to let a brand spanking new super-IQ multiple-PhD analyst join an existing company and sit alone in a dark computer-filled room for a year before anyone discusses with them what kind of analysis would actually add maximum value to your world.

The lack of an explicit “so what?” ruins many great dashboards

I’m going to go a step further and say that in many cases – especially in non-data-focussed organisations – “general” dashboards turn out to be not very useful.

This may be a controversial statement in a world where every analytical software provider sells fancy new ways to make dashboards, every consultant can build ten for you quicksmart, and every “stakeholder” falls over in amazement when they see that they can view and interact with several facets of data at once in a way that was never possible with their tedious plain .csv files.

But a pattern I have often seen is:

  1. Someone sees or suggests a fancy new tool that claims dashboarding as one of its abilities (and this is not to denigrate any tool; this happens plenty even with my favourite tool du jour, Tableau).
  2. A VIP loves the theoretical power of what they see and decides they need a dashboard “on sales” for example.
  3. An analyst happily creates a “sales dashboard” – usually based on what they think the VIP should probably want to see, given that “sales” is not a very fully fleshed out description of anything.
  4. The sponsor VIP is very happy and views it as soon as it’s accessible.
  5. They may even go and visit it every day for the first week, rejoicing in the up-to-date, integrated, comprehensive, colourful data. Joy!
  6. The administrator checks the server logs next month and realises that no-one in the entire company has opened the sales dashboard since week 1.
  7. The analyst is sad.

Why? Everyone (sort of…arguably…) did their job. But, after the novelty wore off, the decision maker probably got bored or “too busy” to open the dashboard every day. At best, perhaps they ask an analytical type to monitor what’s going on with the dashboard. At worst, perhaps they go back to making up decisions based on the random decay of radioactive isotopes, or something similar.

They got “too busy” because, after they had waited for the dashboard to load up, they’d see a few nice charts with interactive filters to go through in order to try to determine whether there was anything they should actually go and do in the real world based on what it showed.

Sales are a bit up in Canada vs yesterday, hooray! Yesterday they were a bit down, boo! Do I need to do something about this? Who knows? Do I want to fiddle around with 50 variations of a chart to try to work it out? No, it’s not my job and quite possibly I don’t have the time or expertise (and nor should I need it) to do that, sayeth the VIP.

So are dashboards useless? Of course not. But they have to be implemented with the reality of the audience capability, interest and use-case in mind. Most dashboards (at least those that are not solely for analysts to explore) should start with

  • At least 1 clear pre-defined question to address; and
  • 1 clear pre-defined action that might realistically take place based on the answer to the question.

But I don’t want a computer running my business!

Shouldn’t you check that it would definitely be a bad idea before saying that? 🙂

But seriously, the above is not to say one has necessarily to commit blindly to taking the pre-defined action – not every organisation is ready for, or suited to, prescriptive analytics.

However, if there is no way at all that an answer that a dashboard provides could possibly lead to influencing an action, then is it really worth one’s time working on it, at least in a business context?

  • “Sales dashboard” is not a question or an action.
  • “Am I getting fewer sales this year than last year?” is a question.
  • “If I am getting fewer sales this year then I will spend more on marketing this year” is an action.
  • “What form of marketing gave the best ROI last year?” is a question.
  • “If I need to do more marketing this year then I’ll advertise using the method that gave the greatest ROI last year” is an action.

The list of questions doesn’t need to be exhaustive, in fact it usually can’t be. If someone can use a dashboard to answer 100 questions not even imagined at the time of creation, then great. Indeed this is one of the potential strengths of a well-designed dashboard – but there should be at least 1 question in mind before it is created.

Why does checking my dashboard bore me?

Note that in that example above, the listed actions actually imply that the dashboard user is only interested in the results shown on the dashboard under one particular condition: if the sales this year are lower than last year.

For 99 days in a row they might check the dashboard and see that the sales are higher this year, and hence do nothing. On the 100th day, perhaps there was a dramatic fall, so that day is the day when the appropriate advertising action is considered.

However, consider how many people will actually persist in checking the dashboard for 100 days in a row when 99% of the time the check results in no new action.

I myself am obviously very analytically inclined, happy to interpret data and (I like to think) efficient at doing so, and yet even I have automated rules in my Outlook email client to immediately delete, unread, almost every “daily report” that gets emailed to me automatically (ssssh, don’t tell anyone, that’s just between you and me). Even the simple act of double-clicking to open the attachment is too much effort in comparison with the expected value of seeing the report contents on an average day.

In this sort of circumstance, what might enable a dashboard to be truly useful is the concept of alerting.

A possible use case is as follows:

  1. A sales dashboard aimed at answering the question of whether we are getting fewer sales this year is set up.
  2. Every day, alerting software routinely checks this data, and emails the VIP (only) if it shows that yes, sales have fallen. The email also provides a direct web link to the targeted sales dashboard.
  3. When the VIP receives this email, knowing that there is something “interesting” to see, they may well be concerned enough to open the dashboard and, to the best of their ability, use whatever context is available there to decide on their next action.
  4. If the information they need isn’t there, or they don’t have the time / expertise / inclination to interpret it, then of course they will legitimately request some more work from their analyst. But at least here we see that “data” provided a trigger that has alerted a relevant decision maker that they need to…make a decision, and made it easy for them to use the dashboard tool at their disposal specifically on the day that they are likely to gain value from doing so.
  5. Everyone is happy (well, except about the poor sales).

There is an implicit “so what” in the scenario above.
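As a rough sketch of how step 2 of that scenario could be wired up – the data-access function, dashboard URL and email address below are placeholders, and a real deployment would schedule the check and send an actual email (or use the BI tool’s own alerting) rather than printing one:

```python
# Placeholder data access: in reality this would query the warehouse or the
# BI tool's API for year-to-date sales.
def sales_year_to_date(year: int) -> float:
    return {2015: 1_250_000.0, 2016: 1_100_000.0}.get(year, 0.0)

def check_sales_alert(this_year: int, dashboard_url: str, recipient: str) -> None:
    """Notify the VIP only when sales have fallen versus the same point last year."""
    current = sales_year_to_date(this_year)
    previous = sales_year_to_date(this_year - 1)
    if current < previous:
        message = (
            f"To: {recipient}\n"
            f"Subject: Sales are down on last year\n\n"
            f"Year-to-date sales are {current:,.0f} against {previous:,.0f} last year.\n"
            f"Dashboard: {dashboard_url}\n"
        )
        print(message)  # swap for a real send (e.g. via smtplib) in production

# Run daily by a scheduler; only a "bad news" day produces an email.
check_sales_alert(2016, "https://bi.example.com/sales", "vip@example.com")
```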

Main findings

  • Sales are lower than last year.
  • Last year, TV adverts produced tremendous ROI.

So what?

  • To make sure the sales keep growing, consider buying some advertising.
  • To be safest, use a method that was proven effective last year, TV.

But aren’t there some occasions that a “so what” isn’t needed?

Yes, rules of thumb have exceptions. There are some scenarios in which one might legitimately consider not producing an explicit “so what”.

Here are a few I could think of quickly.

1: Exploratory analysis: maybe you just got access to a dataset but you don’t really know what it contains, what the typical patterns and distributions are, or what its scope is. Building a few visualisations on top of it is a great way for an analyst to get a quick understanding of the potential of what they have, and, later, what sort of “so what?” questions could potentially be asked.

2: Data quality testing: in a similar vein to the above, you can often use histograms, line charts and so on to get a quick idea of whether your data is complete and correct. If your viz shows that all your current customers were born in the 19th century then something is probably wrong.
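As a small sketch of that kind of check – a hypothetical customer extract, with a quick look at the distribution of birth years before anything downstream trusts the data:

```python
import pandas as pd

# Hypothetical extract; in reality this would be read from the customer table.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "birth_year":  [1985, 1991, 1872, 2001, 1968],   # 1872 looks suspicious
})

print(customers["birth_year"].describe())   # or .hist() for a quick visual check
suspect = customers[customers["birth_year"] < 1900]
if not suspect.empty:
    print(f"{len(suspect)} customer(s) apparently born in the 19th century - check the load.")
```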

3: Getting inspiration: got too much time on your hands and can’t think of some other work to do? (!!!) You could pick a dataset, or set of datasets, and spend some time graphing their attributes and looking for interesting patterns, outliers, and so on that could form the basis of interesting questions.

  • Why does x correlate with y?
  • Why does x look like a Gaussian distribution whereas y looks like a gamma distribution?
  • Why does store X sell the most of product Y?

This doesn’t have to be done on an individual basis. An interactive dataviz might be a great basis for a group brainstorming discussion, whether within a group of analysts or a far wider audience of interested parties.

4: Learning technical skills: perhaps you are trialling new analysis software or techniques, or trying to improve your existing skills. Working with data you’re already familiar with in new tools is a great way to learn them; perhaps even recreating something you did elsewhere if it’s relevant. The aim here is to increase your skillset, not derive new insights.

5: “How to” guides for others to follow: whether formal training or blog posts (showing fancy extreme edge cases others can marvel at perhaps?), maybe your emphasis is not on what the data actually contains in a subject domain sense, but rather a demonstration of how to use a certain generic analytical feature or technique. Here the data is just a placeholder to provide a practical example for others to follow.

6: You’re an artist: perhaps you’re not actually trying to use data as a tool to generate insight, but rather to create art. This is no lesser a task than classic data analysis, but it’s a very different one, with very different priorities. Think for example of Nathalie Miebach, whose website’s tagline is:

“translating science data into sculpture, installations and musical scores.”

[Image: one of Nathalie Miebach’s data-based sculptures]

This might be fine art, but it does not try to lead to business insight.

7: You want to focus on promoting your work and become famous :-): a controversial one perhaps; but it is not always plain old bar charts that happen to show the greatest insights that get shared around the land of Twitter, Facebook and other such attention-requesting mediums.

If your goal is generically to get “coverage” – perhaps to increase advertising revenue based on CPM or to become more well-known for your work – and you feel that you have to choose between generating a true insight and making something that looks highly attractive, then the latter might actually be a better bet.

But you should acknowledge what you’re doing; perhaps the skills you demonstrate in doing this are closer to those of the aforementioned “data artist” than “data analyst”.

I have a sneaking suspicion for instance that – not to re-raise a never-ending debate! – David McCandless’ books are probably picked up in higher volumes than Stephen Few’s when both are presented together in a bookshop.

  • McCandless’ “Information is Beautiful”, a series of pretty, sometimes fascinating, infographics, many of which have little in the way of conclusions, is currently ranked #1788 in Amazon UK books.
  • Stephen Few’s “Show me the numbers”, a more hardcore text on best practice in presenting information, sits at #7952, with a cover consisting of very unglamorous bar and line charts.

This is not to compare one to the other in terms of worthwhileness; they are aimed at totally different audiences whose desire to have a book in the “data visualisation” category is motivated by very different reasons.

Even within the specialist dataviz-analyst community that Tableau has, I note that around half of the visualisations Tableau picks as its public “Viz of the day” are variations on geographical maps.

Geo-maps tend to look “fancier” and more enticing than bar charts, even though they are applicable only to analysis of a very specific type of data, and can provide only certain types of insights. For most organisations, whilst there is often relevance in geospatial analysis, I suspect that “geo-maps” analytics forms far less than 50% of total analytical output.

It’s therefore very unlikely that the winning “Viz of the day” entries reflect how Tableau is actually used most of the time. Hence you might conclude that, if you want to be in the running for that particular honour, you should bias your work towards the sort of attention-grabbing graphics that maps often provide, irrespective of whether another form might generate a similar or stronger “so what?” output.

8: Regulatory / reporting requirements: in some circumstances you might be bound by regulation or other authority to produce certain analytical reports, irrespective of whether you think they add value or provide insight. Think for instance of the fields of accounting, publicly traded companies, healthcare companies, investment products and so on.

9: Your job is explicitly, literally, to “visualise data”. It’s possible to imagine, perhaps in a large business department, employing someone whose job is to repeatedly convert data, for instance, from text tables into a best-practice chart form, without going further. It would be another person’s job to derive the “so what?”.

You could think of this as a horizontal slice of the analytics pipeline vs the “beginning-to-end” vertical pipeline. After all, analysts often rely on other people with different skills (e.g. IT) to do a preparatory phase of data analysis – the data provision/manipulation itself (including extract, transform and load operations). They could equally rely on other people to do the conclusion-forming stage.

Many companies do seem to have a de facto version of this setup by employing people to “create reports”. By this, they may mean something akin to blindly getting up-to-date data into a certain agreed template or dashboard format that managers are supposed to use to derive decisions from.

However, unless your managers happen to be keen analysts, or your organisation is extraordinarily predictable, I tend to be concerned about the efficiency and reliability of this method for anything other than, for example, the regulatory purposes mentioned above. It’s hard to imagine someone else consistently gaining optimal insights from a chart they had no control over designing, without a large amount of overhead-inducing iteration between chart-creator and insight-finder. Let’s face it – most non-quant managers, if they’re honest, would prefer a bullet-point summary of findings to a 10-tab Excel workbook full of charts.

There may be many more such scenarios; do let me know in the comments!

Hang on, isn’t there a “so what” in some of the above?

Did you notice the semantic trickery in the above “no so what” viz reasons? In fact, most do either have an implicit “so what” or are simply facilitating the later creation of one.

Items 1, 2 & 3 could be considered as part of the data preparation phase of the analytics pipeline. It would be unlikely (and unwanted) for the products of them to be the end of the analysis. Almost certainly, they’re a step 1 to a further analysis. An implicit “so what” here is either that the data is safe to proceed with, or it is not.

The output of these approaches can also be useful for establishing baselines for metrics, even if this isn’t the intended use at this point. For instance, if your exploration reveals the average customer purchases £5 of products, this may be useful down the line to compare next year’s sales to. Did your later interventions improve sales or not?
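A trivial sketch of the idea, with invented figures: record the baseline when you first explore the data, then compare against it once an intervention has had time to work.

```python
# Invented figures: average spend per customer before and after an intervention.
baseline_avg_spend = 5.00    # recorded during the initial exploration (GBP)
latest_avg_spend = 5.60      # the same metric, measured a year later

change = (latest_avg_spend - baseline_avg_spend) / baseline_avg_spend
print(f"Average spend has moved {change:+.1%} against the exploration-time baseline.")
```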

Items 4 & 5 come down to being technical training for either yourself or for others. Once trained, you’re likely to be off analysing “so what?” scenarios next. If we’re looking to contrive a “so what?” here, it might be “so I am ready to put my skills to good use tackling real questions”.

Item 6 is unique. The data visualisation itself may never be useful as a “so what?” to anyone. It was never intended to be. It’s for a totally different audience, who would no more ask “so what?” of data-inspired art than they would of Da Vinci’s ‘Mona Lisa’.

Item 7 again might be considered data-use for the sake of something other than intrinsic “analysis”. This type of work might well have an explicit “so what?”, which could even be part of its allure. But since that isn’t the primary reason the visualisation was created, it might not. Sometimes it could be considered a variant of #6 with a specific goal.

It may also itself be a tool that generates useful data. If viewcount is what matters to the creator, then they may be tracking pageviews on their own “so what”-enabled dashboard in order to determine what sort of output creates the most value for them.

Items 8 and 9 are mid-parts of the analytics pipeline. Although you may not be explicitly defining a “so what”, you’re enabling someone else to come up with their own later.

For better or worse, mandatory reporting regulations exist because someone perceived a reason for them. A chart of fund performance is supposed to help inform potential clients whether they would like to invest or not, not simply to provide a nice curved shape.

And if your job is to create “standard” reports or charts, then almost certainly someone else is completing the later step of interpreting them to form their own “so what?”. Or at least they are supposed to be.

To conclude: (why) are we valuable?

Fiddling around with data may be somewhere between Big Bang level geekery and the sexiest job of the 21st century, and holds a personal fascination for some of us. But if we want someone to employ us to do it, or to add value in some other way to the world, we should remember why data as a vocation exists. For the average data analyst, it’s not to make a series of lines that look pleasant (although it’s always nice when that happens).

To quote the viz-lord Stephen Few, in his book “Now You See It“:

We analyse information not for its own sake but so we can make informed decisions. Good decisions are based on understanding. Analysis is performed to arrive at clear, accurate, relevant, and thorough understanding.

(my emphasis added)

Outside of that book, he frequently uses the term “data sensemaking”, which is a good description of what organisations tend to want from their data analysts, even if they don’t know to phrase it in that manner. It must be stressed again: many “busy execs” are far happier with a few bullet points or alerts on potential issues than with a set of even the most beautiful, best-practice visualisations.

When one exists within the analyst community, it can be hard to remember that not everyone enjoys “data”. Even many of those who are intrigued may not yet have had the time, privilege or education that leads towards quick, accurate interpretation of data. It can be frustrating, or even impossible, for a non-quant specialist to try and understand the real-world implication of an abstract representation of some measure: they simply don’t want to, or can’t – and shouldn’t need to – hunt for their own takeaways in many cases.

When a crime is committed, we hope a professional detective will put together the clues and provide the real-world interpretation that allows us to successfully confront the criminal in court. When non-trivial data appears, we should hope that a professional analyst is on hand to put together the data-clues and provide a real-world interpretation that lets us successfully confront whatever issue is at hand.

Bonus addendum: some “so whats” are worse than no “so whats”

Before we go, there is perhaps one extra risk of “so whatting” a viz that should be considered. Producing a conclusion that could lead to action tends to necessitate taking a position; essentially you move from presenting a picture to arguing for the implications of what it shows.

Much data can provide multiple distinct answers to the same question if it is manipulated enough. There are indeed lies, damned lies and statistics, and dataviz inspired variations of all three.

If the analyst approaches the “so what?” aspect with bias, then human psychology is such that they may be inclined to provide an awesome conclusion that just coincidentally happens to match their pre-analysis viewpoint; cf. “confirmation bias”. Of course, many organisations effectively employ people, or subcontract out work, for this exact reason, but that is generally not an ethically fantastic, or professionally fulfilling, position (and whole other organisations exist to debunk such guff).

It’s pretty much impossible to provide even a basic chart without the risk of bias. Data analysis is surely part art, mixed amongst the maths and science – one can of course debate the precise split. But a data vizzer has inherently made some explicit choices: what source of data to use, how much data, which type of visualisation, which comparisons to make, the format of chart and much more – all of which can, consciously or not, induce bias in the audience.

Many best-practice “rules” of dataviz, and analytics in general, are in fact designed to reduce the risk of this. That is a key reason why it’s worth learning them. Outside of those memorisable basics though, it’s often interesting to try to test the opposing view to what you’re presenting as your “so what?”.

Perhaps this year you have a higher proportion of female customers than last year. So what? “Our 10 year strategy to redesign our product to be especially attractive to women has been successful, we deserve a bonus”? Well, perhaps, but what if:

  • Last year had a weirdly low proportion of female purchasers vs normal and you’re just seeing basic regression to the mean?
  • Or, for the past 9 years the proportion of women buying your product has plummeted 10% every year; only to increase 2% in the latest year. Does that make your 10 year strategy a success?
  • Or this year was the first year you advertised in Cosmo, instead of FHM. Have other changes produced a variable that confounds your results?
  • Or men have stopped buying your product whilst women continue to buy it at exactly the same rate…does that count as success?

The right data displayed in the right way can help you eliminate or confirm these and other possibilities.

For any decision where the benefit likely outweighs the cost, it’s worth doing the exercise of trying to disprove your first intuition, in order to provide comfort that you are supporting the best quality of decision making – not to mention reducing the risk that some joker with half a spreadsheet invalidates your finely crafted interpretation of your charts.