Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies the history of, or best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These transformations of data into diagram have stuck with us through the mists of time, becoming the examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.


(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – this data-driven evidence persuaded the authorities to remove the Broad Street pump handle. People could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about, perhaps, is that this wasn’t the first data-driven analysis to confront the problem. Any practising data analyst will be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause -> action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had already used the same dataset himself to tackle the same question analytically – why are there relatively more cholera deaths in some places than others? – and had found what appeared to be an answer. It later turned out that his conclusion wasn’t correct, but that certainly wasn’t obvious at the time. In fact, his theory likely seemed more intuitively correct than Snow’s back then.

Lesson 2: here, then, is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean it’s worthless for someone else to re-analyse it – even if the first analyst has established a conclusion. This is especially important when the stakes are high and the answer in hand hasn’t been “proven” by any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by the people who lived on it: the lower the land (relative to sea level, for example), the more cholera deaths there seemed to be.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera”. It included this table:


The relationship seems quite clear: cholera deaths per 10,000 persons rise dramatically as the elevation of the land falls.

Here’s the same data, this time visualised as a line chart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriological Reviews. Note the “observed mortality” line.


Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line showing the death rate calculated according to “miasma theory”, which on this metric tracks the actual cholera death rate very closely. Miasma was a leading theory of disease spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later displaced by the knowledge of germs, but at the time miasma was a strong contender for explaining the distribution of disease. This was probably helped by the fact that some of the actions one might take to reduce “miasma” evidently overlap with those for dealing with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As we discovered later, miasma theory wasn’t correct, and it certainly didn’t offer the optimal answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed it in a reasonable way. He was testing a hypothesis based on the common sense of the time he was working in, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we observed is explained by a common theory. Here, though, we can think of Karl Popper’s view that scientific knowledge is derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume the dominant one is correct until we have come up with a way of proving the case either way. Sometimes it’s a worthwhile exercise to try to disprove your own findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I were to ask a present-day analyst (unfamiliar with the case) to look at Farr’s data and offer a view on what explains the differences in cholera death rates, it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even using precisely the same analytical approach, they would suggest miasma theory as the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, occur in day-to-day business analysis. Pre-existing knowledge changes over time, and differs between groups. Who hasn’t seen (or been) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data that was later revealed to have been affected by something else entirely?

For my part, I would suggest learning what’s normal, and applying double scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to adding value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or that your store manager is misreporting data, than that overnight 100% of the public stopped buying anything at all from your previously highly successful store.

Again, here is an argument for sharing one’s data, holding discussions with people outside your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer dataviz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).
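To give a flavour of the technique Bingham et al. applied, here is a minimal logistic regression fitted by plain gradient descent. To be clear, the data below is entirely fabricated for illustration – it is not Farr’s dataset, and the variable names and coefficients are my own inventions:

```python
import numpy as np

# Entirely fabricated illustration -- NOT Farr's data or Bingham et al.'s model.
# Each observation: elevation (hundreds of feet) and a suspect-water-supply flag.
rng = np.random.default_rng(0)
n = 500
elevation = rng.uniform(0, 3.5, n)    # hundreds of feet above the river
dirty_water = rng.integers(0, 2, n)   # 1 = served by a hypothetical suspect supplier

# Simulate deaths so that water supply drives most of the risk, elevation a little
true_logit = -2.0 + 2.5 * dirty_water - 0.4 * elevation
died = rng.random(n) < 1 / (1 + np.exp(-true_logit))

# Fit logistic regression by gradient descent on the (convex) log-loss
X = np.column_stack([np.ones(n), elevation, dirty_water])
y = died.astype(float)
beta = np.zeros(3)
for _ in range(20_000):
    preds = 1 / (1 + np.exp(-X @ beta))
    beta -= 0.1 * (X.T @ (preds - y) / n)

# The water-supply coefficient should come out far larger than elevation's
print(beta)
```

Even in this toy setup, each variable shows an association with mortality on its own; fitting them jointly is what reveals which carries the larger independent effect – the same style of reasoning the 2004 re-analysis used.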

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He got the authorities to remove the handle of the contaminated pump, preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854”, published in 1867. The chart shows, quite vividly, that by the date the pump handle was removed, the local cholera epidemic it drove was likely largely over.


As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely just as important to take action on the findings equally fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult here, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom of the time. At least modern-day analysts can take some solace in the knowledge that even our most highly regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened – it may well contribute towards great things in the future.

Kaggle now offers free public dataset and script combos

Kaggle, a company most famous for facilitating competitions that allow organisations to solicit the help of teams of data scientists to solve their problems in return for a nice big prize, recently introduced a new section useful even for the less competitive types: “Kaggle Datasets”.

Here they host “high quality public datasets” you can access for free. But what is especially nice is that, as well as the data download itself, they host any scripts, code and results that people have already written to handle them, plus some general discussion.

For example, on the “World Food Facts” page you can see a script that “ByronVergoesHouwens” wrote to see which countries ate the most sugar, and also the chart that script produced. In fact you can even execute scripts online, thanks to their “Kaggle Scripts” product.

It looks like the datasets will be added to regularly, but right now the list is:

  • Amazon Fine Food Reviews
  • Twitter US Airline Sentiment
  • SF Salaries
  • First GOP debate Twitter Sentiment
  • 2013 American Community Survey
  • US Baby Names
  • May 2015 Reddit Comments
  • 2015 Notebook UX Survey
  • NIPS 2015 Papers
  • Iris (yes, the one you will have seen many times already if you’ve read ANY books/tutorials on clustering in R or similar!)
  • Meta Kaggle
  • Health Insurance Marketplace
  • US Dept of Education: College Scoreboard
  • Ocean Ship Logbooks (1750-1850)
  • World Development Indicators
  • World Food Facts
  • Hilary Clinton’s Emails (sounds fun…:-))

Beware! Killer robots swim among us

In a further sign of humanity’s inevitable journey towards dystopia, live trials of an autonomous sea-based killer robot made the news recently. If all goes well, it could be released into the wild within a couple of months.

Here’s a picture. Notice its cute little foldy-out arm at the bottom, which happens to contain the necessary ingredients to provide a lethal injection to its prey.



Luckily for us, this is the COTSbot, which, in a backwards version of nominative determinism, has a type of starfish called “Crown Of Thorns Starfish” as its sole target.


The issue with this type of starfish is that they have got a bit out of hand around the Great Barrier Reef. Apparently, at a certain population level they live in happy synergy with the reef. But when the population increases to its present size (quite possibly a result of human farming techniques), they start causing a lot of damage to the reef.

Hence the Australian Government wants rid of them. It’s a bit fiddly to have divers perform the necessary operation, so some Queensland University of Technology roboticists have developed a killer robot.

The notable feature of the COTSbot is that it may (??) be the first robot that autonomously decides whether it should kill a lifeform or not.

It drives itself around the reef for up to eight hours per session, using its computer vision and a plethora of processing and data science techniques to look for the correct starfishes, wherever they may be hiding, and perform a lethal injection into them. No human is needed to make the kill / don’t kill decision.

Want to see what it looks like in practice? Check out the heads-up-display:


If that looks kind of familiar to you, perhaps you’re remembering this?

terminator1 HUD

Although that one is based on technology from the year 2029 and is part of a machine that looks more like this.


(Don’t panic, this one probably won’t be around for a good 13 years yet – well, bar the time-travel side of things.)

Back to present day: in fact, for the non-squeamish, you can watch a video of the COTS-destroyer in action below.

How does it work then?

A paper by Dayoub et al., presented at the IEEE/RSJ International Conference on Intelligent Robots and Systems, explains the approach.

Firstly, it should be noted that the challenge of recognising these starfish is considerable. The paper informs us that, whilst COTS look like starfish when laid out on flat terrain, they tend to wrap themselves around or hide in coral – so it’s not as simple as looking for nice star shapes. Furthermore, they vary in colour, look different depending on how deep they are, and have thorns that can have the same sort of visual texture as the coral they live in (go evolution). The researchers therefore assess the features of the COTS via various clever techniques detailed in the paper.

Once the features have been extracted, a random forest classifier, which has been trained on thousands of known photos of starfish and not-starfish, is used to determine whether what it can see through its camera should be exterminated or not.

A random forest classifier is a popular data science classification technique, essentially being an aggregation of decision trees.

Decision trees are one of the more human-understandable classification techniques. Simplistically, you could imagine a single tree as providing branches to follow depending on certain variables – branches it automatically machine-learns from having previously processed a stack of inputs that it has been told are either one thing (a starfish) or another thing (not a starfish).

Behind the scenes, an overly simple version of a tree (with slight overtones of doomsday added for dramatic effect) might have a form similar to this:


The random forest classifier takes a new image and runs many different decision trees over it – each tree has been trained independently and hence is likely to have established different rules, and potentially therefore make different decisions. The “forest” then looks at the decision from each of its trees, and, in a fit of machine-learning democracy, takes the most popular decision as the final outcome.
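As a toy sketch of that voting process (the “trees” and feature names below are entirely invented for illustration – a real forest learns its rules from training data rather than having them hand-written):

```python
from collections import Counter

# Hand-written stand-ins for trained decision trees. Each inspects a dict of
# (invented) image features and casts an independent vote.
def tree_1(f):
    return "COTS" if f["thorn_texture"] > 0.7 else "not COTS"

def tree_2(f):
    return "COTS" if f["colour_score"] > 0.5 and f["arm_count"] >= 10 else "not COTS"

def tree_3(f):
    return "COTS" if f["star_shape"] > 0.3 or f["thorn_texture"] > 0.9 else "not COTS"

def forest_predict(features, trees):
    """Each tree votes independently; the most popular decision wins."""
    votes = Counter(t(features) for t in trees)
    return votes.most_common(1)[0][0]

sighting = {"thorn_texture": 0.8, "colour_score": 0.4,
            "arm_count": 14, "star_shape": 0.35}
print(forest_predict(sighting, [tree_1, tree_2, tree_3]))  # → COTS (2 votes to 1)
```

Here tree_2 disagrees with the other two, but the majority carries the decision – which is exactly why a forest of independently trained trees tends to be more robust than any single tree.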

The researchers claim to have approached 99.9% accuracy in this detection – to the point where it will even refuse to go after 3D-printed COTS, preferring the product that nature provides.

Although probably not the type of killer robot that the Campaign to Stop Killer Robots campaigns against, or whose implications the UN debates, if it is the first autonomous killer robot it can still conjure up the beginnings of some ethical dilemmas – even beyond the killing of the starfish. After all, deliberate eradication or introduction of species to prevent other problems has not always gone well, even in the pre-robotic stage of history – but one assumes this has been considered in depth before we got to this point!

Although 99.9% accuracy is highly impressive, it’s not 100%. Very few non-trivial classification models can ever truly claim 100% over the vast range of complex scenarios that the real world presents. Data-based classifications, predictions and so on are almost always a compromise between concepts like precision vs recall, sensitivity vs specificity, type 1 vs type 2 errors, accuracy vs power, and whatever other names exist for the general idea that a decision model may:

  • Identify something that is not a COTS as a COTS (and try to kill it)
  • Identify a real COTS as not being a COTS (and leave it alone to plunder the reef)

Deciding on the acceptable balance between these two types of error is an important part of designing models. Without knowing the details, it sounds like the researchers here sensibly erred on the side of caution, such that if the robot isn’t very sure it will send a photo to a human and await a decision.

It’s also the case that the intention is not to have the robot kill every single COTS, which suggests that false negatives might be less damaging than false positives. One should also note that it’s not going to be connected to the internet, making it hard for the average hacker to remotely take it over and go on a tourist-injection mission or similar.
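A sketch of what such caution-first decision logic might look like – the threshold values and action names here are my guesses for illustration, not the published design:

```python
# Hypothetical caution-first decision rule: only inject when the classifier
# is very confident; defer borderline cases to a human operator.
def decide(p_cots: float, inject_threshold: float = 0.999) -> str:
    if p_cots >= inject_threshold:
        return "inject"
    elif p_cots >= 0.5:
        return "photo to human"   # probably a COTS, but let a person confirm
    else:
        return "leave alone"

print(decide(0.9995))  # → inject
print(decide(0.80))    # → photo to human
print(decide(0.10))    # → leave alone
```

Raising the injection threshold trades false positives (injecting coral or the wrong species) for false negatives (missed starfish) – exactly the balance discussed above.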

However, given that it’s envisaged that one day a fleet of 100 COTSbots, each armed with 200 lethal shots, might crawl the reef for 8 hours per session, it’s very possible a wrong decision will be made at some point.

Happily, it’s unlikely to accidentally classify a human as a starfish and inject them with poison (plus, although I’m too lazy to look it up, I imagine that a starfish dose of starfish poison is not enough to kill a human) – the risk the researchers see is more that the injection needle may be damaged if the COTSbot tries to inject a bit of coral.

Nonetheless, a precedent may have been set for a fleet of autonomous killer robot drones. If it works out well, perhaps it starts moving the needle slightly towards the world of handily-acronymed “Lethal Autonomous Weapons Systems” that the US Defense Advanced Research Projects Agency is supposedly working on today.

If that fills you with unpleasant stress, there’s no need to worry for the moment. Take a moment of light relief and watch this video of how good the 2015 entrants to the DARPA robotics challenge were at traversing human terrain (or perhaps stumbling back from the local student bar).

More data is not always better data

Like a lot of data-fans, I have something of a tendency to “collect” data just in case it will become useful one day. Vendors are feeding that addiction with constant talk of swimming through blissful “data lakes” and related tools, notably Hadoop and its brethren.

Furthermore, as the production of data grows exponentially, the cost of storing it moves incrementally closer to zero. As an example, Amazon will store whatever you want for $0.007 per GB per month in its Glacier product, if you don’t need to retrieve it very often or very fast.

That is to say, the amount of data that within my relatively short memory would have needed 728 high-density floppy disks to save, you can now store for about a British ha’penny – a coin so irrelevantly valueless in the grand scheme of buying things that it was withdrawn from circulation over 30 years ago, long before my memory begins.

But just because we can store all these 1s and 0s, should we? Below I have started to consider some downsides of data hoarding.

The too-obvious risks: large amounts of data expose you to more hacking and privacy risks

There are two huge risks, so obvious that I will pretty much skip over them. First, the risk of a hack, with the potential for some malevolent being to end up with your personal details. Just last week we heard the news that over 150,000 people’s personal details were accessed illegitimately from the computer systems of a mobile phone provider, including nearly 16k bank account details. A couple of years ago, Target famously had up to 70 million of its customers’ details stolen. If this data hadn’t been stored, the hacks obviously couldn’t have taken place.

Often related to such behaviour, there’s also the issue of accidental privacy breaches. These might be simple mistakes – like the time the UK government accidentally lost a couple of CDs holding personal data on every child in the UK, for instance – but the consequences may be not dissimilar to a hack if the data gets into the wrong hands.

Less dramatically, even in that most curated of safe software gardens, Apple’s App Store, 256 apps were removed last month when it was found they were collecting and transmitting personal data that they should not have been.

These examples are not meant to imply there was something innately wrong in storing these particular pieces of data. Maybe Target needs to store those customer details to run its operations in the way it finds most productive, and the Government probably needs to store information on children (although perhaps not in the sorting offices of Royal Mail). However, every time data is stored that doesn’t yet have a use, one should bear in mind there is a non-zero risk it may leak.

Hacking and privacy breaches are easy and obvious concepts though. More interesting here are the other downsides of storing too much data that do not require the efforts of an evildoer to produce adverse consequences. What else might we consider?

Below I list a few; I’m sure I’ve missed many more. For some of these, it’s important to remember that the data itself may not generate the majority of the risk, but rather enables something negative that is later done with it. Methodological issues, though, are something even the most data-addicted analyst can address in their own work.

Large amounts of data may encourage bad analysis

A lot of data analysis is targeted at understanding associations with, or causes of, a target variable based on a set of input variables. Surely the more input variables we have, the better the models we can produce and the better our model will reflect the world?

Well, it depends on how you measure it. Mathematically, yes, more variables do tend to lead to a model with a better fit to the data that was used to train it.

And guess what, we now have huge lakes full of crazy amounts of varied data and computer algorithms that can process it. If you want to predict your future sales, why not sack your data scientists and just have your supercomputer run your past sales data against every one of these billions of available variables until it deduces a model that predicts it perfectly?

This doesn’t work for a number of reasons, not least because, for generating actually useful insights to support optimal decision making, using too many variables tends to lead to the curse of overfitting.

This is where you involve so many variables as predictors that your model is too specific to the precise historical data you trained it on, and therefore misleads you as to the real drivers behind your target predicted variables.

Wikipedia has a fun example:

As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It’s easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes; but this model will not generalize at all to new data, because those past times will never occur again.

In other words, if you give me your past sales data at a customer/transaction level, then I can build you a model that will perfectly “predict” that past data. It would look something like:


Wow, look at my R-squared of 1. This is an unbeatable model. Hooray!

At least until we actually want to use it to predict what will happen tomorrow when a new customer comes with a new name and a new date, at which point…catastrophe.

Although it’s a silly example, the point is that the problem came from the fact we exposed the model to too much data. Is it really relevant that someone’s name is Jane Doe? Did we need to store that for the purposes of the model? No – it actually made the resulting analysis worse.
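For concreteness, here is what that silly “perfect” model amounts to in code – a lookup table over invented past transactions (the names, timestamps and values below are all made up):

```python
# A deliberately silly "model" that memorises its training data: a perfect
# historical fit, and zero ability to generalise. All values are invented.
past_sales = {
    ("Jane Doe", "2015-11-02 09:14"): 24.99,
    ("John Smith", "2015-11-02 10:02"): 7.50,
    ("Jane Doe", "2015-11-03 16:41"): 103.00,
}

def overfit_predict(customer, timestamp):
    # "R-squared of 1" on the training set -- it just looks the answer up
    return past_sales[(customer, timestamp)]

print(overfit_predict("Jane Doe", "2015-11-02 09:14"))  # → 24.99, "predicted" perfectly

# ...and catastrophe on any genuinely new observation:
try:
    overfit_predict("New Customer", "2015-11-04 12:00")
except KeyError:
    print("the model has no idea")
```

A real overfitted model fails more subtly than a KeyError, of course – it confidently returns answers that are artefacts of the training sample – but the underlying problem is the same memorisation of irrelevant detail.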

Simplifying to a univariate situation, the “correlation doesn’t imply causation” trope remains eternally true.

First, there’s the issue of how we measure correlation. A classic for linear relationships is Pearson’s ‘r’.

This gives you a single value between -1 and 1, where 1 is perfect positive correlation, 0 is no correlation and -1 is perfect negative correlation. You will often see results presented as “our analysis shows that X and Y are correlated with an r of 0.8”. Sounds good, but what does this tell us?

Anscombe’s Quartet, possibly my favourite ever set of charts (doesn’t everyone have one?), shows us that it doesn’t tell us all that much.

All four of these charts have an “r” of around 0.8, and they all have the same mathematical linear regression trendline.

Anscombe's quartet

(thanks Wikipedia)

But do you believe that the linear correlation is appropriate for all four cases? Is the blue line a fair reflection of the reality of the trend in all datasets?

Anscombe’s Quartet is not actually about using too much data, but rather about relying too heavily on statistical summarisation. It’s an example of where data visualisation is key to checking the validity of a potentially computer-generated equation.
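You can verify the summarisation point yourself: a plain-Python Pearson’s r applied to Anscombe’s four published datasets (values from his 1973 paper) returns near-identical results of roughly 0.816 for all of them, despite the wildly different shapes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Anscombe's Quartet: datasets I-III share x values; IV has its own
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
xs = [x123, x123, x123, x4]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
for i, (x, y) in enumerate(zip(xs, ys), 1):
    print(f"dataset {i}: r = {pearson_r(x, y):.3f}")  # all ≈ 0.816
```

Identical summary statistics, four completely different stories – which is why plotting the data always beats trusting a single number.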

But visualisation alone isn’t sufficient either, when we are using too much data.

Let’s imagine we’re a public service that needs to come up with a way to predict deaths by drowning. So we throw all the data we could possibly obtain at it and see what sticks.

Here’s one positive result from the fabulous Spurious Correlations site.

Spurious correlations

OMG! The data shows Nicolas Cage causes drownings! Ban Nicolas Cage!

A quick dose of common sense suggests that Nicolas Cage films are probably not all versions of the videotape seen in The Ring, which kills you shortly after viewing it.

Nicolas Cage films and drowning statistics may mathematically fit but, if we are taking our work seriously, there was little expected value in storing and using this data for this purpose. Collecting and analysing too much data led us to a (probably 🙂) incorrect theory.

Part of the issue is that predictive analytics is not even as simple as looking for a needle in a haystack. Instead we’re looking for a particular bit of hay in a haystack, which may look the same as any other bit of hay.

Most variables are represented as a sequence of numbers and, whether by coincidence or cross-correlation, a meaningless sequence of numbers can look very much like a meaningful one in the particular sample being tested.
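A quick way to convince yourself of this: trawl enough purely random “variables” and one of them will correlate impressively with whatever target you pick. A sketch (every number here is random by construction, so any correlation found is guaranteed spurious):

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    # Pearson's product-moment correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

random.seed(1)
target = [random.random() for _ in range(8)]  # say, 8 yearly data points

# Trawl 10,000 random, meaningless series for the best-looking match.
best = max(pearson_r([random.random() for _ in range(8)], target)
           for _ in range(10_000))
print(best)  # pure noise, yet an impressive-looking r
```

With short series and thousands of candidate columns, a high r on at least one of them is close to inevitable.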

It doesn’t help that there are several legitimate statistical tests that arguably lose an aspect of their usefulness in a world where we have big data.

Back in the day, perhaps we could test 20 people’s reaction to a photo by getting them into a lab and paying them for their opinion. We would use inferential statistics to work out whether their opinions were meaningful enough to be extrapolated to a larger population – what can we infer from our 20-person results?

Now we can in theory test 1 billion people’s reaction to a photo by having Facebook perform an experiment (although that might not go down too well with some people). All other things being equal, we can infer a whole lot more from testing 1 billion people than testing 20.

There are therefore many hypothesis tests designed to check whether what looks like a difference between two or more groups of “things” – for instance, the average score out of 10 that women give a photo vs the average score men do – is indeed a real difference or just down to random chance. Classics include Student’s t-test, ANOVA, Mann-Whitney and many more.

The output of these often gives a measure of the probability that the observed result is “real” rather than due to random fluctuations or noise. After all, most variables measured in real life have some natural variation – just because a particular woman is taller than a particular man, we should not infer that all women are taller than all men. It could just be that you didn’t test enough women and men to get an idea of the reality of the situation, given the natural variance in human height.

This output is often expressed as a “p value”: the probability of seeing a difference at least as extreme as the one observed if there were, in reality, no difference between the groups.

P = 0.05 is a common benchmark. Be careful with the interpretation though: it does not mean there is a 95% probability that your hypothesis is true. It means that, if there were no real effect at all, you would still see a result this extreme about 5% of the time.
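One way to see what p = 0.05 buys you is to simulate experiments where the null hypothesis is true by construction, and count how often a test still comes back “significant”. A sketch using a simple two-sample z-test (normal approximation with known standard deviation, rather than a full t-test):

```python
import random
from math import erf, sqrt

def z_test_p(sample_a, sample_b, sd):
    # Two-sided p-value for a two-sample z-test with known sd.
    n = len(sample_a)
    diff = sum(sample_a) / n - sum(sample_b) / n
    z = abs(diff) / (sd * sqrt(2 / n))
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(2)
trials, false_positives = 2_000, 0
for _ in range(trials):
    # Both groups drawn from the SAME distribution: any "difference"
    # a test finds here is noise by definition.
    a = [random.gauss(100, 15) for _ in range(50)]
    b = [random.gauss(100, 15) for _ in range(50)]
    if z_test_p(a, b, 15) < 0.05:
        false_positives += 1
print(false_positives / trials)  # hovers around 0.05, as designed
```

Roughly 1 in 20 of these null experiments “succeeds” – which matters enormously once you start running many of them.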

Sounds good, but there are at least two issues to be wary of now that you may be running these tests over vast numbers of rows or columns of data.

Hypothesis tests over a lot of rows (subjects)

Looking at the formula for these tests, you will see that the actual difference between groups necessary to meet that benchmark for “this is real” gets lower as your sample size gets bigger. Perhaps the test will tell you that a difference of 1 IQ point is significant when you’ve tested a hypothesis over a million subjects in your data, whereas it wouldn’t have done so if you had only tested it over 100.
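To make the 1-IQ-point example concrete, here is a sketch using a two-sample z-test as a normal-approximation stand-in for the t-test, assuming IQ has a standard deviation of 15:

```python
from math import erf, sqrt

def z_test_p(mean_diff, sd, n_per_group):
    # Two-sided p-value for a two-sample z-test with known sd.
    se = sd * sqrt(2 / n_per_group)
    z = abs(mean_diff) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

sd_iq = 15
# The identical 1-point difference, at two very different sample sizes:
print(round(z_test_p(1, sd_iq, 100), 2))      # 0.64 -- nowhere near significant
print(z_test_p(1, sd_iq, 1_000_000) < 0.001)  # True -- "significant" at n = 1m
```

Nothing about the effect changed between the two lines; only the sample size did.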

So is the test wrong in either case? Should we artificially limit how many subjects we put into a test? No, of course not (unless there’s another good reason to do so), but some common sense is needed when interpreting it in a practical sense.

In medical circles, as well as statistical significance, there’s a concept of clinical significance. OK, we might have established that things that do X are more likely to do Y, but does that matter? Is that 1 IQ point difference actually something we care about or not? If your advert produces a £0.01 increase in average basket value, however “significant”, do you care?

Maybe you do, if you have a billion baskets a day. Maybe you don’t if you have 10. These are not questions you can answer in the abstract – but should be considered on a case by case basis by a domain expert.

Just because you had enough data available to detect some sort of effect does not mean that you should rush to your boss as soon as you see a p=0.05 result.

Hypothesis tests over a lot of columns (variables)

We saw above that a common standard is that we can claim our hypothesis is supported when we are 95% certain the result wasn’t due to chance.

To switch this around: we are happy to claim something is true even if there’s a 5% probability it was due to chance.

That seems fairly safe as a one-off, but imagine we’re testing 20 variables in this experiment, and apply the same rule to say we’re satisfied if any of them meets this criterion.

There’s a 5% chance you come back with the wrong answer on your first variable.
Then there’s 5% chance you come back with the wrong answer on your second variable.
…and so on until there’s a 5% chance you come back with the wrong answer on your twentieth variable.

One can calculate that after 20 tests there is a 64% chance of a variable being shown to be significant, even if we know in advance for sure that none of them are.
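That 64% figure falls straight out of the complement rule – the chance of no false positive across 20 independent tests at the 5% level is 0.95^20:

```python
alpha, m = 0.05, 20
# Probability that at least one of m independent tests at level alpha
# comes back "significant" when no real effect exists anywhere:
p_any_false_positive = 1 - (1 - alpha) ** m
print(round(p_any_false_positive, 2))  # 0.64
```

By 100 variables the same formula puts the chance of at least one spurious "discovery" above 99%.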

Imagine if we had gone back to the idea of throwing a massive datalake’s worth of variables against two groups to try and find out, out of “everything we know about”, what explains the differences in these groups? A computer could quite happily throw a million variables against the wall in such an exercise.

If each test is allowed to produce a false positive 5% of the time, and you have a million such tests, one can quickly see that you will (almost) certainly get a bunch of results that – had they been conducted as single, well-constructed hypothesis tests – might have been considered statistically significant, but in this scenario are really just predictable-in-existence, random-in-placement false indicators.

This sort of behaviour is what people refer to as “p-hacking”, and it is seemingly somewhat rife within academic literature too, where journals prefer to publish positive results (“X did cause Y”) rather than negative ones (“X did not cause Y”), even though both are often equally useful insights.

An article in Nature reports on a real-world example of this:

…he…demonstrated that creative p-hacking, carried out over one “beer-fueled” weekend, could be used to ‘prove’ that eating chocolate leads to weight loss, reduced cholesterol levels and improved well-being (see go.nature.com/blkpke). They gathered 18 different measurements – including weight, blood protein levels and sleep quality – on 15 people, a handful of whom had eaten some extra chocolate for a few weeks. With that many comparisons, the odds were better than 50–50 that at least one of them would look statistically significant just by chance. As it turns out, three of them did – and the team cherry-picked only those to report.

(Unless of course chocolate does lead to weight loss, which would be pretty cool.)

The same article also refers to a similar phenomenon as “Texas sharpshooting”, being based on a similarity to “an inept marksman who fires a random pattern of bullets at the side of a barn, draws a target around the biggest clump of bullet holes, and points proudly at his success.”

It’s quite possible to p-hack in a world where big data doesn’t exist. However, the relevance here is that one needs to avoid the temptation of collecting, storing and throwing a whole bunch of “miscellaneous” data at a problem when looking for correlations, models and so on, and then reporting that you found some genuine insight whenever a certain statistical test happens to reach a certain number.

There are explicit procedures and statistical tests, also coming from the field of statistical inference, designed to alert you when you’re at risk of this. These fall under the category of controlling the “familywise error rate” of a set of hypotheses, with perhaps the most famous being the Bonferroni correction.
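As a sketch of how the Bonferroni correction works in practice (20 tests at the 5% level, matching the earlier example): to keep the familywise error rate near 5% across m tests, demand p < 0.05/m from each individual test.

```python
alpha, m = 0.05, 20
per_test_threshold = alpha / m
print(round(per_test_threshold, 4))  # 0.0025

# Familywise error rate under the corrected threshold, assuming
# independent tests: back under control, close to the original alpha.
fwer = 1 - (1 - per_test_threshold) ** m
print(round(fwer, 3))  # 0.049
```

The cost is power: genuine effects now need much stronger evidence to register, which is why gentler procedures (e.g. Holm or false-discovery-rate methods) are often preferred.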

Going back to the scientific method is also important; replication matters. It’s good practice when building a statistical model to keep back some data to test it on, data that has not been anywhere near the model-building process itself. If you have enough of this, then, should you believe you’ve found that people born on a rainy day are more likely to buy high-value items, you can at least test that one particular theory specifically, instead of testing an arbitrary number of potential theories that happen to crop up due to spurious relationships within your training dataset.
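A minimal sketch of carving out such a holdout set (row indices standing in for real records):

```python
import random

random.seed(0)
rows = list(range(10_000))  # stand-ins for customer records
random.shuffle(rows)

# Keep 20% well away from model building; consult it once, at the
# end, to check a single pre-registered hypothesis.
split = int(len(rows) * 0.8)
training, holdout = rows[:split], rows[split:]

print(len(training), len(holdout))            # 8000 2000
print(set(training) & set(holdout) == set())  # True: no leakage
```

The discipline is as important as the split: if you peek at the holdout while choosing hypotheses, it silently becomes training data.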

Large amounts of data may reinforce the world’s injustices

A previous post on this site dealt with this issue in more detail, so we’ll only skim it here. But suffice it to say that there are common methodological flaws in data collection, processing, statistical models and the resulting “data driven” execution that produce suboptimal outcomes when measured in terms of fairness or justice by many reasonable people.

Two highlights to note are that:

Machine learning models…learn

Many statistical models are specifically designed to learn from historical reality. My past post had an example where a computer was fed data on historical hiring decisions – which resulted in it “innocently” learning to discriminate against women and people with foreign-sounding names.

Imagine if gender data hadn’t been stored for the applicants though – after all, should it really be a factor influencing applicant success in this way? No; so why was it collected for this task? Perhaps there was an important reason – but if not, then adding this data effectively provided the model with the tools to recreate gender discrimination.

The “foreign-sounding names” point provides a warning though – perhaps the computer was not explicitly fed with ethnicity, but it was supplied with data that effectively proxied for it. Again, the computer has just learnt to do what humans did in the past (and indeed the present, which is why “name-blind recruitment” is a thing).
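A deliberately tiny, entirely hypothetical sketch of the proxy effect: the protected attribute is never stored, yet a correlated field reproduces the biased historical decisions exactly:

```python
# Historical decisions were biased. The model never sees "ethnicity",
# but a correlated field ("name_origin") lets it reconstruct the bias.
past_applicants = [
    {"name_origin": "foreign", "grades": 88, "panel_invited": False},
    {"name_origin": "foreign", "grades": 91, "panel_invited": False},
    {"name_origin": "local",   "grades": 84, "panel_invited": True},
    {"name_origin": "local",   "grades": 86, "panel_invited": True},
]

# Any learner asked to reproduce the panel's decisions will discover
# that "name_origin" predicts them perfectly -- better than grades do.
def learned_rule(applicant):
    return applicant["name_origin"] == "local"

accuracy = sum(learned_rule(a) == a["panel_invited"]
               for a in past_applicants) / len(past_applicants)
print(accuracy)  # 1.0: the bias is faithfully "learned", no ethnicity data needed
```

Note that in this toy set grades actually point the other way: the biased outcome is the best signal available, so the model takes it.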

Implementing data driven models requires…data

“Data driven” decisions can only be made where data exists. Again, in my previous post, we saw that using “objective” data from phone apps, or from vehicles equipped to automatically transmit the locations of potholes needing repair, in fact produced a bias towards focussing repair resources on affluent areas – because poorer areas were less likely to have the same number of people with the resources and interest to fiddle around with the expensive Land Rovers and fancy phones that collected this information.

This phenomenon is something like a tech/data version of the “WEIRD” issue that has been noted in psychology research – where a hugely disproportionate amount of human research has, for reasons other than deliberate malice, inadvertently concentrated on understanding Western, Educated people from Industrialised, Rich and Democratic countries – often American college students, who are probably not entirely representative of the entirety of humanity.

It could be argued that this one is a risk associated with having too little data rather than too much. But, given that it is impossible to obtain data on literally every aspect of every entity in the world, these sorts of examples should also be considered cases where using partial data for a variable risks producing worse results than using no data for that variable at all.

One should be wary of deciding that you might as well collect whatever data you happen to have access to as the basis of an analysis that will be extrapolated to people outside of your cohort’s parameters. Perhaps you will spend all that time and effort collecting data just because you can, only to end up with a model that produces worse real-world outcomes than not having that data at all.

Large amounts of data may destroy the planet

Dramatic perhaps, but for those of us who identify with the overwhelming scientific consensus that climate change is real, dangerous and affected by human activities, there is a mechanism here.

Storing lots of data requires lots of machines. Powering lots of machines (and especially their required cooling equipment) needs lots of electricity. Lots of electricity means lots of energy generation, which – given that, even in the developed world, it is overwhelmingly generated in environmentally unfriendly ways – tends to mean lots of pollution and other such factors that could eventually contribute towards a potential planetary disaster.

Back in 2013, The Register reported on a study that suggested that “IT” was responsible for about 10% of electricity usage worldwide. These sorts of measures are rather hard to make accurately in aggregate, and this one includes the energy needed both to create the devices and to manage the distribution of information to them.

On the pure data side, the US Natural Resources Defense Council reported that, in 2013, data centres in the US alone used an estimated 91 billion kilowatt-hours of energy. They then went on to project some figures:

Data center electricity consumption is projected to increase to roughly 140 billion kilowatt-hours annually by 2020, the equivalent annual output of 50 power plants, costing American businesses $13 billion annually in electricity bills and emitting nearly 100 million metric tons of carbon pollution per year.

To be fair, some of the larger firms are trying to remedy this, whether for environmental purposes or cost cutting.

Facebook have a few famous examples, including building a data centre powered by hydroelectricity, situated in a chilly sub-arctic area of Sweden which provides the big bonus of free cooling. This was expected to save 70% of the “normal” amount of energy for cooling such a facility.

More recently, their engineers have been developing methods to keep the vast quantity of rarely viewed photos that users uploaded years ago (and never looked at again) offline, on machines that are powered down, or even on Blu-ray disks.

Given that a lot of Facebook’s data isn’t being viewed by anyone at any single point in time, they can use people’s behavioural data in order to predict which other types of data might be needed live, leaving the rest turned off. It might sound a bit like Googling Google, but as Ars Technica reports:

“We have a system that allows 80-90 percent of disks turned off,” Weihl said. “…you can predict when photos are going to be needed – like when a user is scrolling through photos chronologically, you can see you’ll need to load up images soon. You could make a decision to turn off not just the backups, but all the copies of older stuff, and keep only the primaries of recent stuff spinning.”

It’s not only the big players to consider though. The NRDC noted that the majority of data centre power needs were actually for “small, medium, and large corporate data centers as well as in the multi-tenant data centers to which a growing number of companies outsource their data center needs”, which on the whole are substantially less efficient than Facebook’s arctic zone.

So, every time you hoard data you don’t need, whether in big corporate Hadoops or Facebook photo galleries, you’re probably contributing towards unnecessary environmental pollution.

Optimum solution? Don’t save it. More realistic solution? Some interested organisations have done evaluations and produced consumer-friendly guides, such as the Greenpeace “Click Clean Scorecard”, to suggest which storage service you might like to use if the environment is a key concern for you.

As a spoiler, Apple and Facebook did very well in the latest edition. Amazon Web Services did not.

Final thoughts

I’m a data believer: I drink the Kool-Aid that claims data can save the world – or at least, if not save it directly, help us make decisions that produce better results. And in general, having more information is better than having less, as long as it is usable and used correctly – or at least there’s some conceptual reason to imagine one day it might be. But it’s an interesting exercise to play Devil’s Advocate and imagine why more data is not always a good thing.

Some of the issues above are less about the existence of the data, and more about the way that having access to it can tempt analysts or their managers into bad practice.

Other issues surround the size of the data, and the fact that it’s often simply not necessary to use as much data as you can. An article in Wired suggests one should add “viability” and “value” to the 3 Vs of big data (velocity, volume and variety, for anyone who hasn’t lived through the painful cloud of buzzwords in recentish times).

An Information Age writer asks:

What’s the marginal impact on a predictive model’s accuracy if it runs on five million rows versus 10 billion rows?

The answer is often “very little”, other than it will:

  • use more resources (time, money, the planet) to gather and store,
  • take far more time and CPU power to process; and
  • risk producing overfitted, biased or otherwise poor models if best practice is not followed.

From restaurant-snobbery to racism: some perils of data-driven decision-making

Wired recently wrote a piece explaining how OpenTable, a leading “reserve a restaurant over the internet” service, is starting to permit customers to pay for their meal via an app at their leisure, rather than flag down a waiter and awkwardly fiddle around with credit cards.

There’s an obvious convenience to this for the restaurant patron but, as with most useful “free” services, the cost is one’s personal data. Right now, it is possible, at least unofficially, for the data-interested to access some OpenTable restaurant data – but soon it may become a lot more personal.

Wired writes:

Among information the company purports to collect: name and contact information, current and prior reservation details, order history, dining preferences, demographics and precise location data. The company pairs such user data with information from “Third Party Platforms.” The wording here is purposefully vague, but it is certainly plausible that the company could use outside research firms to cobble together a whole host of personal information like income, age and spending habits.

For users who make payments via the mobile app, OpenTable reserves the right to share its customer dossier with the restaurant “for its own purposes.”

In a utopian world, this could be great. It might be nice to turn up to a restaurant where they already know who you are and which your favourite table is, have the drinks you might want waiting on ice, and be given a personalised menu highlighting your favourites and omitting any foods to which you are allergic.

This sort of technology-experience is already in use in a few places. For instance, in the magical land of Disneyworld, the Disney MagicBand allows you to pre-order your food and have it ready by the time you turn up.

From another Wired article:

If you’re wearing your Disney MagicBand and you’ve made a reservation, a host will greet you at the drawbridge and already know your name – Welcome Mr. Tanner! She’ll be followed by another smiling person – Sit anywhere you like! Neither will mention that, by some mysterious power, your food will find you.

The hostess, on her modified iPhone, received a signal when the family was just a few paces away. Tanner family inbound! The kitchen also queued up: Two French onion soups, two roast beef sandwiches! When they sat down, a radio receiver in the table picked up the signals from their MagicBands and triangulated their location using another receiver in the ceiling. The server – as in waitperson, not computer array – knew what they ordered before they even approached the restaurant and knew where they were sitting.

But the first Wired article also highlights a less delightful side to the experience.

Restaurants have always doled out preferential treatment to the “best” customers – but now, they’ll be able to brand them with a specific dollar sign the second they walk in the door.

A concern is whether knowledge of a customer’s normal habits will be used to score them on some metric that leads to differential treatment. Maybe if they know upon approach that you buy cheap dishes, aren’t likely to have excessive social influence or – most horrific of all – tip badly, you will get a worse level of service than if they scored you in the opposite way.

Some might argue you can see the consequences of this sort of “pre-knowledge” in a very basic way already. Discount voucher sites like Groupon offer cheap deals on meals that often require one to identify to the waiter that you will be using such a voucher in advance.

There are very many anecdotal reports (mild example) that this can lead to worse service (and very many other reports that this might well be because voucher users don’t realise that the server should probably get a tip based on the non-voucher price!).

The Brainzooming Group summarises some more formal research:

Additional research revealed a direct link between the use of Groupon and a negative service experience. The above graph is from a study conducted by Cornell researchers who studied over 16,000 Groupon Deals in 20 US cities between January and July this year. The study found, among other things, that Groupon users averaged a 10% lower rating than those who didn’t use Groupon.

However, it’s clearly not the case that “restaurant prejudice” would be a new thing – it existed well before OpenTable thought about opening up its data treasure trove. It happened even in ye olde times before Groupon. In fact, the author of the original Wired article quoted at the top was themselves an assistant to a “VIP” of the sort who never had trouble getting even the toughest of tables, with his secret phone numbers and possibly a bit of name infamy.

My first boss, an infamous man-about-town, kept a three-ring binder of unpublished phone numbers unlocking some of the toughest tables in the city.

The concern here – as with a lot of big data concerns – is not that a brand new problem is introduced, but that an old one is enabled at a massively enhanced scale.

Now, instead of a judgement that is basically a 2-way 0.1% “top VIP” vs 99.9% “normal person” categorisation, we could have “5* tipper who earns £8k a month and just paid off her mortgage” vs “unlucky downtrodden victim of capitalism who normally only spends the bare minimum and drinks tap water” scored for each and every one of the OpenTable users individually.

But at the end of the day, how much should we care? A suboptimal restaurant experience is hardly the worst thing life can throw at most people. And some would say that, if the data is reliable and the models are accurate (hmmm…), such businesses have every right to treat their customers in line with their behaviour, so long as they remain within consumer law.

Of course, this type of data-driven behaviour is not limited to restaurant-going – and there are risks of a far less recreational nature to consider.

Insurance is one such example. Insurance actuaries, analysts and the like have, for more years than most of us, used data to determine the services and prices they will offer to people. The basic theory is: if there is more chance that this person will make an expensive insurance claim than average then they will be charged more (or refused) for the insurance product.

There is obvious business sense in this, but the need to rely on segmentation, generalised patterns and so on means that someone with little relevant history who happens to fall into a “bad group” may be unfairly penalised.

The classic example bandied around here in the UK (where we have no mass-market need for health insurance yet) is car insurance.

Gender was traditionally an influencing factor in insurance premiums. Male drivers would be charged more than female drivers on the premise that they may on average take more risks, apparently get into more accidents, are less likely to wear seat belts, are more likely to get caught driving drunk, and so on.

At the population level this might therefore seem fair. At the individual level, less so in some cases – some men are safer drivers than some women, but of course the insurer has no way at first to determine which ones, so it charges all men more. Well, it did until that became illegal.

As per the BBC in 2012:

the European Court of Justice has decided that insurers will no longer be allowed to take the gender of their customers into account when setting their insurance premiums.

There are many arguments as to the fairness of this ruling which we won’t go into here, other than to note that it did not please insurance companies overly – so some seem to be using a different sort of data to semi-bypass that ruling.

According to Stephen McDonald from the Newcastle University Business School, it looks like insurance companies are now using a proxy for gender: occupation.

Some jobs may legitimately have higher risks regarding driving than others. Also, even in 2015, it still happens that some jobs have a higher proportion of males or females than others. Can you see where we’re going with this?

Some examples from Stephen’s study:

The occupations represent different mixes of males to females, with social workers and dental nurses being mostly female, civil engineers and plumbers mostly male, and solicitors and sports and leisure assistants being gender neutral.

And how is this relevant to insurance premiums? Well:

Comparing prices from before and after the change of law, he finds that after the ban:

  • Prices are the same for men and women.
  • But prices become relatively lower for young (21 year old) drivers in occupations mostly held by women, but higher for those in predominantly male occupations. For example, premiums increased by an estimated 13% for 21 year old civil engineers, but decreased by 10% for dental nurses of the same age.

To summarise: (especially young) people who tell insurance companies that they have a job which is commonly done by males are charged more than those who inform the same company that they have a job commonly done by females.

So, for anyone who really has no idea what career they’d like to follow – if you like driving and like money, perhaps dental nurse is the way to go!

Before we leave driving, let’s have a look at those dangerous motorist obstacles: potholes. A few years ago, a (somewhat dubiously sourced) survey estimated that hitting potholes caused nearly £500 million worth of damage to the UK’s car population in a year. The associated swerving and other consequential manoeuvres can even be a matter of life and death in some cases.

So they surely need fixing – but how can we locate where they are, so the relevant council can move in and fix them?

In reality, few motorists are likely bothered enough to take the time to call or visit the authorities, even if they know who the relevant one is. It being the year 2015, the fashionable answer in many cases is of course to use an app. “Fill that hole” is one example (now backed by the Government) where you can quickly and easily report any pothole dangers.

Street Bump goes one step further, and tries to use a combination of your smartphone’s accelerometer and GPS sensors to automatically detect when you hit a pothole and then report it directly, avoiding the need for a user to manually log it.

But how you really want to be doing this doesn’t involve faffing around with a smartphone at all. Why not just get one of Jaguar’s latest developments – a Land Rover that not only detects and reports potholes automatically whilst driving but also could potentially alert the driver of the vehicle in advance to give them time to avoid them, tell other nearby cars about their existence and even – as driving becomes ever more automated – take over the controls and safely navigate around them automatically.

So let’s all get brand new Jaguars and pothole problems will be forever solved! Except we won’t, will we? And why not? Well, the #1 reason may be that cutting-edge car tech from Jaguar is not going to be cheap. It will probably be beyond the price range of most of us (for now). Just as smartphones are out of the price, knowledge or “interested in learning about” range of a still-significant proportion of the population. Which leads to a further “data risk”, especially when we are talking about important public services.

As Tim Harford notes in the Financial Times:

Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. Street Bump offers us “N = All” in the sense that every bump from every enabled phone can be recorded. That is not the same thing as recording every pothole.

There’s an obvious loop here. Areas that are more economically deprived, or where there is less digital literacy, are less likely to have people busily reporting potholes on their fancy smartphones, let alone their cutting edge Jags. But there’s little evidence that they have fewer potholes – just fewer reports.
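A toy simulation of that loop, with invented ownership rates: both areas have identical potholes, but the report counts track smartphone ownership instead:

```python
import random

random.seed(42)
POTHOLES_PER_AREA = 200  # the same real problem in both areas
smartphone_rate = {"affluent": 0.9, "deprived": 0.3}

# A pothole only gets reported if the driver who hits it happens to
# carry a suitably equipped smartphone.
reports = {
    area: sum(random.random() < rate for _ in range(POTHOLES_PER_AREA))
    for area, rate in smartphone_rate.items()
}
print(reports)  # the affluent area appears to "need" roughly 3x the repairs
```

Rank areas by report count and the repair budget flows to where the phones are, not where the potholes are.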

To steal a famous saying, here’s a case where “absence of evidence is not evidence of absence”.

If, however, we wrongly assume that areas with the most digital pothole reports are those with the most potholes and prioritise repairs accordingly, then one can imagine a situation where the most deprived areas get the least maintenance and become ever more troubled and undesirable, as their reputation as “bad areas” spirals further out. Those advocates of the broken windows theory might believe it could even lead to higher crime.

Potholes might seem like a relatively trivial example (if you’ve never hit one), but in an era where even the most vulnerable are being forced by the Government to magic up a computer and learn how to use the internet to access the welfare they desperately need to live – despite some surveys suggesting this will range between hard and impossible for many of those who most need them – it’s not hard to imagine more serious reinforcements of inequality along these lines.

Or even matters of life and death.

Three years ago, Hurricane Sandy struck New York. It was big and bad enough that many people dialling 911 to request urgent help could not get through. Luckily, though, the NY Fire Department had a Twitter account and a member of staff to monitor it for urgent requests. Although, in this case, they explicitly requested that people not tweet for emergency help, this fine employee did only what anyone with some humanity would do – and passed on these urgent requests to the front-line rescuers.

“My friends’ parents who are trapped on #StatenIsland are at 238 Weed Ave. Water almost up to 2nd flr.,” posted Michael Luo, an investigative reporter for The New York Times.

“I have contacted dispatch,” Rahimi responded within minutes. “They will try to send help as soon as they can.”

Fantastic stuff – social media activity potentially saving lives. But for those people unfortunate enough not to have a smartphone fancy enough to tweet from, or who lack the knowledge or desire to do so, it potentially introduces an innate bias in resource deployment that favours some people over others for no legitimate reason.

Moving on, but staying with the subject of reinforcing inequalities: many predictive data models derive their answers by learning from, and adapting to, what happened in the past. If nothing else, it’s the obvious way to test such systems: if they can correctly predict what did happen, they may stand a good chance of predicting what will happen.

This means, however, that data-driven efforts are not immune from reinforcing historical prejudice either.

Let’s take a couple of serious biggies: race and gender discrimination.

Surely big, open, accessible data is exactly the tool needed to combat the horrible biases of some in society? Not necessarily; it depends entirely upon application.

As long ago as 1988, the Commission for Racial Equality (since merged into the Equality and Human Rights Commission) found St George’s Hospital Medical School guilty of racial and gender discrimination in choosing whom to admit to its school. Some rogue racist recruiter? No, not directly anyway.

‘…a computer program used in the initial screening of applicants for places at the school unfairly discriminated against women and people with non-European sounding names’.


‘The program was written after careful analysis of the way in which staff were making these choices and was modified until…it was giving a 90-95% correlation with the gradings of the [human] selection panel’

The data and ruleset that led to this bias weren’t as explicit as one might at first imagine. The computer was never given the race of the applicants – it wasn’t even recorded on the application form. However, it effectively “learned” to proxy it from surname and place of birth. The overall effect was to reduce the chances of applicants who appeared female or foreign being included on the “invite to interview” list.

‘Women and those from racial minorities had a reduced chance of being interviewed independent of academic considerations’
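The mechanism is easy to reproduce in a toy sketch. Everything below is synthetic – made-up feature names and a simulated biased panel – but it shows how a learner that is never given the protected attribute can still rebuild the historical bias from a correlated proxy:

```python
import random

random.seed(0)

# Entirely synthetic data with made-up feature names. The protected
# attribute ("minority") is never shown to the learner, but the proxy
# ("non_euro_surname") correlates strongly with it.
rows = []
for _ in range(5000):
    minority = random.random() < 0.3
    non_euro_surname = random.random() < (0.85 if minority else 0.05)
    grades = random.gauss(70, 10)
    # Biased historical panel: identical grades, but an effective
    # 15-point penalty applied to minority applicants.
    admitted = grades - (15 if minority else 0) > 65
    rows.append((non_euro_surname, grades, admitted))

# A naive "learner": for each surname group, pick the grade cut-off
# that best reproduces the panel's past decisions.
def best_cutoff(group_rows):
    def errors(cutoff):
        return sum((grade > cutoff) != admitted
                   for _, grade, admitted in group_rows)
    return min(range(40, 100), key=errors)

euro = [r for r in rows if not r[0]]
non_euro = [r for r in rows if r[0]]

print("euro-surname cutoff:", best_cutoff(euro))
print("non-euro-surname cutoff:", best_cutoff(non_euro))
```

On this synthetic data the learned grade cut-off for the non-European-surname group comes out markedly higher, despite “minority” never being an input to the learner.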

This wasn’t a programmer or data scientist (or whatever they were called in 1988!) secretly typing in a prejudiced line of code, and nor did the computer suddenly decide it didn’t like foreigners or women. The computer was not racist; but the algorithms it ran reflected a process that probably was.

‘This point is important: the program was not introducing new bias but merely reflecting that already in the system’

Fast forward more than a quarter of a century, and machine learning algorithms can now run over outrageously large amounts of data, testing and learning which of thousands of variables can automate accurate decision-making that in the past took a slow, expensive human to do.

As shown above, it’s not as simple as removing “race” or other likely prejudices from the variable set.

2 years ago, Kosinski et al. published a paper in the Proceedings of the National Academy of Sciences that simply looked at what people had pressed ‘Like’ on in Facebook. This is information that is often publicly available, attributable to an individual, and accessible for data-related uses.

Using data from nearly 60,000 volunteers, they produced a regression model that generated some very personal insights.

‘The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait “Openness”, prediction accuracy is close to the test–retest accuracy of a standard personality test’

Yes, from you pressing ‘Like’ a few times on Facebook, this model purports to be able to determine with reasonable accuracy your physical and mental traits.

The researchers were kind enough to publish a supplement that showed some of the more predictive “likes” – including such unintuitive gems as an association between high IQ and liking curly fries.
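Ranking the “most predictive” likes is, at heart, just measuring the association between each like and the trait. Here is a toy sketch with invented users and likes (nothing here is from the actual study):

```python
# Toy illustration of ranking Facebook likes by how predictive they are
# of a binary trait. Users, likes and the trait are all invented.
users = [
    ({"curly_fries", "science"}, 1),
    ({"curly_fries", "thunderstorms"}, 1),
    ({"curly_fries"}, 1),
    ({"reality_tv"}, 0),
    ({"reality_tv", "thunderstorms"}, 0),
    ({"reality_tv", "science"}, 0),
]

base_rate = sum(trait for _, trait in users) / len(users)

def lift(like):
    """How much more common the trait is among users with this like."""
    with_like = [trait for likes, trait in users if like in likes]
    return (sum(with_like) / len(with_like)) / base_rate if with_like else 0.0

all_likes = set().union(*(likes for likes, _ in users))
ranked = sorted(all_likes, key=lift, reverse=True)
print("most predictive like:", ranked[0])
```

The real study used a regression model over a far larger matrix of likes, but the intuition is the same: any like whose audience skews heavily towards one group becomes predictive of membership of that group.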

But are businesses, employers and other authorities really going to use Facebook data for important decisions? Undoubtedly. They already do.

Loan companies are an obvious example. A quick Google will reveal many organisations purporting to credit score, or have a model for credit scoring, that depends at least partially on your digital data stream.

To pick one at random, Venturebeat reports on a company called Kreditech.

‘By analyzing a pool of publicly available online information, Kreditech can predict how creditworthy you are in a matter of seconds.

The big data technology doesn’t require any external credit bureau data. Instead, it relies on Facebook, e-commerce shopping behavior, mobile phone usage, and location data.’

And whilst I have no idea about the workings of their model, even if race is not a specific variable in the decision, should it choose to segment based on, for instance, people who have liked “Bonfires” on Facebook then, per the Facebook Likes study above, it will de facto be adjusting for race (liking bonfires being apparently more predictive of being white than black).

Why pick the credit score example? Because some big finance companies have bad form on this sort of prejudice – the US DOJ for instance reached a $335 million settlement a few years ago because:

‘Countrywide discriminated by charging more than 200,000 African-American and Hispanic borrowers higher fees and interest rates than non-Hispanic white borrowers in both its retail and wholesale lending. The complaint alleges that these borrowers were charged higher fees and interest rates because of their race or national origin, and not because of the borrowers’ creditworthiness or other objective criteria related to borrower risk.’

Whatever the motivating factor behind those original prejudiced decisions, if this dataset of past lending decisions is used to teach a machine how to credit-score automatically, one can see the same risk of unjust outcomes arising – just as St George’s inadvertently reproduced racism and sexism in its data-driven recruitment filter.

Even cold, hard capitalists with no interest in social conscience should note: by not taking the time to consider what your model is actually doing to generate its scores, you could jeopardise your own profits.

As noted above, some of the potential problems stem from the classic ‘use the past to predict the future’ methods that underpin a lot of predictive work. Those familiar with Clayton Christensen’s much-heralded book “The Innovator’s Dilemma” will immediately see a problem.

From Wikipedia:

‘Christensen’s book suggests that successful companies can put too much emphasis on customers’ current needs, and fail to adopt new technology or business models that will meet their customers’ unstated or future needs.’

Limiting your marketing to people just because they resemble those who were previously customers may artificially restrict your business to a homogeneous, perhaps ever-declining, population. That is probably not where your opportunities for radical growth lie.

So what’s to be done about this?

Unfortunately, when presented with “computer says no” type consumer scoring, most people are not necessarily going to understand why the computer said no. Sometimes it is literally impossible to determine that through any realistic method. Most of the responsibility therefore has to lie with those developing such models.

I recently had the pleasure of attending a fascinating talk by Hilary Mason, the Founder of Fast Forward Labs and previously a data scientist with bit.ly.

In it, and also in the book ‘Data Driven‘ she co-authored, she went through a set of questions that she likes to be asked about any data science project. One was ‘What’s the most evil thing that can be done with this?’

Partly this was to encourage more open thinking – and they do advise against asking it if you actually work with people who are evil! – but she also noted that:

‘One of the challenges with data is the power that it can unleash for both good and bad, and data scientists may find themselves making decisions that have ethical consequences. It is essential to recognize that just because you can, doesn’t mean you should.’

Openness is an important, if tricky, issue. Where possible, good practice should mandate that predictive-model builders provide clear, understandable documentation of how exactly their products work.

There are at least two tricky factors to take into account here though:

  1. If this is a commercial model, companies are not going to want to reveal the details of their secret sauce. Experian, for instance, may publish “some factors that may affect credit scores”, but they are never going to publish the full model workings, for obvious reasons. However, this does not mean that Experian’s data scientists should not be producing at least clear and explicit documentation for their fellow employees, under NDAs if necessary.
  2. Some types of model are simply more impenetrable than others. A decision tree is quite easy to represent in a way that non-data-scientists can understand; a neural network is very much harder, yet may sometimes produce a far more accurate model. Either way, it’s not hard to document what went into the model, even if you can’t fully explain what came out!
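To make the contrast concrete, here is a toy, hand-written decision tree for a hypothetical loan decision (the rules and thresholds are invented). The entire model reads as plain statements a non-specialist could audit – which is exactly what a trained neural network’s weight matrices do not offer:

```python
# A toy, hand-written decision tree for a hypothetical loan decision
# (invented thresholds). The whole model is legible as plain rules.
def decide(income, years_at_address):
    if income > 30000:
        return "approve"
    if years_at_address > 5:
        return "approve"  # long-standing residents get some leeway
    return "decline"

# The same logic, dumped as human-readable documentation:
RULES = """\
if   income > 30000          -> approve
elif years_at_address > 5    -> approve
else                         -> decline
"""

print(decide(25000, 7))  # a borderline case an auditor can trace by eye
```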

Although this is somewhat idealistic, it would be nice if the users of such models were also enabled, and proactively made the effort, to understand them as far as possible.

The report on St. George’s prejudiced recruitment model above made this clear:

‘A major criticism of the staff at St. George’s was that many had no idea of the contents of the program and those who did failed to report the bias.’

In reality, one can’t determine from the article whether this is a fair criticism of the users, or actually something that should be aimed elsewhere. But it would not be impossible for any given organisation to internally promote understanding of such systems.

The St George’s reports also suggest one possible monitoring approach. It’s good practice to regularly monitor your model’s output for scoring accuracy – but one can also monitor the consequences of the decisions it’s making, irrespective of accuracy. This is especially worthwhile if you are, for legal or ethical reasons, particularly concerned about certain biases, even above pure model accuracy.

Worried that your model is outputting racist decisions? Well, if you can, why not correlate its results with race and look for patterns? Even before designing it, you could remove any variables you have determined to have a certain degree of cross-correlation with race, if you want to be really careful. But know that – depending on the task at hand – this might jeopardise model accuracy and hence be a hard sell in many environments.
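Both checks can be sketched in a few lines. The data, the variable names (likes_bonfires and so on) and the 0.3 correlation threshold below are all invented for illustration:

```python
import random

random.seed(1)

def pearson(xs, ys):
    """Plain Pearson correlation, no external libraries needed."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Synthetic audit data: the protected attribute was never a model input,
# but is recorded separately, purely for monitoring purposes.
group = [1.0 if random.random() < 0.3 else 0.0 for _ in range(1000)]

# A hypothetical model whose scores happen to track the protected
# attribute via some proxy it picked up during training.
scores = [random.gauss(0.4 if g else 0.6, 0.1) for g in group]
print("score vs protected attribute r =", round(pearson(group, scores), 2))

# Pre-modelling screen: drop candidate features that cross-correlate
# with the protected attribute beyond an agreed threshold.
features = {
    "likes_bonfires": [random.gauss(0.2 if g else 0.7, 0.2) for g in group],
    "page_views": [random.gauss(0.5, 0.2) for _ in group],
}
THRESHOLD = 0.3
kept = [name for name, vals in features.items()
        if abs(pearson(group, vals)) < THRESHOLD]
print("features kept:", kept)
```

The first check audits what the model is actually doing to different groups; the second screens out obvious proxies before training – at a possible cost in accuracy, as noted above.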

No-one would dispute that the main point of many of these predictive models is to optimise for accuracy. But be aware of exactly what “accuracy” means: history is full of prejudice, and models can certainly reproduce it.

If you’re a luxury brand trying to target the customers with the most spare money then an obvious variable might be “pay received”. But don’t forget, according to the Fawcett Society:

In modern Britain, despite the 1970 Equal Pay Act, women still earn less than men. The gender pay gap remains the clearest and most dramatic example of economic inequality for women today.

So optimising for pay may have some correlation with favouring men over women. Does this matter to you? If, in your context, it does then tread with care!

If you’re a police department, trying to target people likely to commit the most serious of crimes, then “length of prison sentence” might be a sensible input. But beware – the Wall Street Journal reports:

Prison sentences of black men were nearly 20% longer than those of white men for similar crimes in recent years, an analysis by the U.S. Sentencing Commission found.

So optimising for length of prison sentence may be unfairly biasing your “crime score” to select black people over white people.

So, try to evade potential “judge bias” by using just the fact of arrest? Sorry, that probably won’t work much better. Here’s one example why, from Jonathan Rothwell of the Brookings Institution:

Blacks remain far more likely than whites to be arrested for selling drugs (3.6 times more likely) or possessing drugs (2.5 times more likely).

Here’s the real shock: whites are actually more likely than blacks to sell drugs and about as likely to consume them.

So optimising for historical arrest statistics may also entail unfairly selecting black people over white people, because – before big data came near to police departments – some already were unfairly selecting black people over white people for other reasons. Does this matter to you? If so, once more, tread with care!

There’s probably no simple, generic solution to avoid these issues. We are already at the point where most human users don’t really understand, in any depth, what produces the output of most data models.

The point can only be never to assume that, even if a model uses a mathematically rigorous algorithm, an unbiased methodology and data considered a fair view of reality, its outcome will respect “equal rights”. There is, in some cases, an implicit conflict between optimising for model accuracy – which may involve internally recreating parts of the world we don’t especially like – and using data for the common good.

Stephen Few’s new book “Signal” is out

Stephen Few’s latest, “Signal: Understanding what matters in a world of noise”, has just been released – or at least it has in the US; it seems to be stuck on pre-order on Amazon UK at present.

Not many reviews seem to be floating around just yet, but the topic is ultra-fascinating:

In this age of so-called Big Data, organizations are scrambling to implement new software and hardware to increase the amount of data they collect and store.

However, in doing so they are unwittingly making it harder to find the needles of useful information in the rapidly growing mounds of hay.

If you don’t know how to differentiate signals from noise, adding more noise only makes things worse. When we rely on data for making decisions, how do we tell what qualifies as a signal and what is merely noise?

In and of itself, data is neither. Assuming that data is accurate, it is merely a collection of facts. When a fact is true and useful, only then is it a signal. When it’s not, it’s noise. It’s that simple.

In Signal, Stephen Few provides the straightforward, practical instruction in everyday signal detection that has been lacking until now. Using data visualization methods, he teaches how to apply statistics to gain a comprehensive understanding of one’s data and adapts the techniques of Statistical Process Control in new ways to detect not just changes in the metrics but also changes in the patterns that characterize data.

Data science vs rude Lego

Data science moves onwards each day, helping (perhaps) solve more and more of the world’s problems. But apparently there’s at least one issue for which we don’t yet have a great machine-learning/AI solution – identifying penises made out of Lego.

Indeed this is apparently the problem that plagued the potential-Minecraft-beater “Lego Universe” nearly 5 years ago.

The internet is awash with re-tweets of ex-Lego-Universe developer Megan Fox’s amusing stories from yesteryear. Thanks to Exquisite Tweets for collecting.

Funny story – we were asked to make dong detection software for LEGO Universe too. We found it to be utterly impossible at any scale.

Players would hide the dongs where the filtering couldn’t see, or make them only visible from one angle / make multi-part penis sculptures…

They actually had a huge moderation team that got a bunch of screenshots of every model, every property. Entirely whitelist-based building.

YOU could build whatever you wanted, but strangers could never see your builds until we’d had the team do a penis sweep on it.

It was all automated, but the human moderators were IIRC the single biggest cost center for LEGO Universe’s operational costs. Or close to.

To be fair, this was a few years ago, and progress on image-recognition data science has not stopped since.

Lego itself just released “Lego Worlds” recently which seems to be a similar type of thing – whether they have solved the problem I do not know.

Humanity does seem to be making decent progress on such tasks in general. Microsoft Research recently published a paper, “Delving deep into rectifiers”, detailing their algorithmic achievement: perhaps the first program to classify images in the ImageNet Large Scale Visual Recognition Challenge 2012 more accurately than the competing human managed.

In the consumer space, both Flickr and, very recently, Google have opened up features that allow anyone to upload large numbers of photographs (or, in Google’s case, apparently an unlimited number) and then keyword-search for “dog”, “Billy”, “Paris” etc. to show all your photos of dogs, of Billy, or taken in Paris, without having to provide any manual tagging or contextual information.

Flickr’s attempt has been around a bit longer and has caused a little controversy – as everyone in the field of data will know, the sort of machine-learning classification processes this extremely hard problem requires have no inbuilt sense of politeness or decency.

Misclassifying this photo of Auschwitz as “sport”, as reported by the Guardian, is surely just a confused algorithm rather than a deliberate attempt to offend.

Flickr staff are open that mistakes will be made and that there is an inbuilt process to learn from them – but it’s obvious why a “normal” viewer can find these classification errors offensive, especially when they might relate to photos of their children for instance.

This surely poses a dilemma for the sort of companies that provide these services. The idea behind them is a great one, and pretty essential in days when we all take thousands of photos a year and need some way to retrieve the few we are particularly interested in – but how understanding present-day consumers will be towards the mistakes inherent in the process, particularly at the start of any such effort, remains to be seen.

In any case I’m sure it won’t be long before someone tests how good Google Photo is at autotagging Lego genitalia (or much worse…).

Behind the scenes of the FiveThirtyEight UK general election forecasting model

Here in the UK we’re about to go to the polls to elect some sort of government in just a few weeks. Nate Silver’s FiveThirtyEight team are naturally on the case in providing their famously accurate election forecasts. They were kind enough to once again explain the methodology being used, in this blog post by Ben Lauderdale.

Go there and read it in full for a clear and interesting explanation, but in super-quick summary: it starts with their famed method of analysing poll results over time, adjusting for the historical bias each poll has shown versus reality, in terms both of source and of time left before the election.

What the average poll says now is not the best guess of what will happen in the subsequent election…We can estimate how the relative weight to put on the polls changes as elections approach and use that in our forecast.
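The core idea can be sketched as a weighted average. All the numbers below – poll shares, pollster names, “house effects” and the 14-day half-life – are invented for illustration; this is not FiveThirtyEight’s actual model:

```python
# Hypothetical polls: (pollster, days_before_election, reported_share).
polls = [
    ("PollsterA", 40, 0.34),
    ("PollsterB", 25, 0.36),
    ("PollsterA", 10, 0.35),
    ("PollsterC", 3, 0.33),
]

# Made-up historical "house effects": how far each pollster has tended
# to overstate this party in past elections.
house_effect = {"PollsterA": +0.01, "PollsterB": -0.02, "PollsterC": 0.0}

def weight(days_before):
    # Polls nearer to election day count for more (14-day half-life).
    return 0.5 ** (days_before / 14)

adjusted = [(share - house_effect[p], weight(d)) for p, d, share in polls]
forecast = sum(s * w for s, w in adjusted) / sum(w for _, w in adjusted)
print(f"forecast share: {forecast:.3f}")
```

Each poll is first corrected for its source’s historical lean, then down-weighted the further it sits from election day – the two adjustments the quote above describes.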

But it soon becomes more complex. In their view, due to the increasing influence on the results of parties that have a low national vote share but high regional variance, applying a uniform swing to the whole country based on national polls doesn’t work.

However, constituency-level polls are not frequent or numerous enough to include in the above. They did manage to get some but, these being relatively sparse, they are developing a model around them.

We use a multilevel regression model to describe how vote intention at the constituency level depends on a variety of factors, including region, incumbency, constituency demographics and results in the last election. We then reconcile the constituency-level vote intentions we get from this data with the national-level forecast that we constructed using the national polls, by applying a swing model that we built from the historical record of constituency vote share swings from election to election.

I’m looking forward very much to seeing how it goes, even if I’m not greatly keen on the result they predict today! Follow their predictions here.

Their full description of the model includes a lesson on the importance of phrasing survey questions. Apparently people do not answer “If there was a general election tomorrow, which party would you vote for?” in the same way as “Thinking specifically about your own parliamentary constituency at the next general election and the candidates who are likely to stand for election to Westminster there, which party’s candidate do you think you will vote for in your own constituency?”.

The most toxic place on Reddit

Reddit, the “front page of the internet” – and a network I hardly ever dare enter for fear of being sucked into reading hundreds of comments for hours on highly pointless yet entertaining things – has had its share of controversies over the years.

The site is structurally divided into “subreddits”, which one can imagine as simple, quite old-school forums where anyone can leave links and comments, and anyone else can upvote or downvote them according to whether they approve or not.

Reddit users were themselves busily engaged in a chat about “which popular subreddit has a really toxic community” when Ben Bell of Idibon (a company big into text analysis) decided to tackle the same question with a touch of data science.

But what is “toxic”? Here’s their definition.

Ad hominem attack: a comment that directly attacks another Redditor (e.g. “your mother was a hamster and your father smelt of elderberries”) or otherwise shows contempt/disagrees in a completely non-constructive manner (e.g. “GASP are they trying CENSOR your FREE SPEECH??? I weep for you /s”)

Overt bigotry:  the use of bigoted (racist/sexist/homophobic etc.) language, whether targeting any particular individual or more generally, which would make members of the referenced group feel highly uncomfortable

Now, text sentiment analysis isn’t all that accurate as of today. The CTO of DataSift, a company with a very cool social-media-data-acquiring tool, claimed a couple of years ago that around 70% accuracy was about the peak possible. The CEO of the aforementioned Idibon claims about 80% is possible today.

No-one is claiming anywhere near 100%, especially on determinations as subtle as toxicity and its chosen opposite, supportiveness. The learning process was therefore a mix of machine classification and human involvement, with the Idibon sentiment-analysis software highlighting, via the Reddit API, the subreddits most likely to be extreme, and humans classifying a subset of the posts into those categories.

But what is a toxic community? It’s not simply a place with a lot of toxic comments (although that’s probably not a bad proxy): it’s a community where such nastiness is approved of or egged on, rather than ignored, frowned upon or punished. Here Reddit provides a simple mechanism to indicate this, as each user can upvote (approve of) or downvote (disapprove of) a post.

The final formula they used to judge the subreddits is given in their blog post.
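I won’t reproduce their exact formula here, but a score in the same spirit – this is purely my own illustrative guess, not Idibon’s formula – might penalise communities whose members upvote toxic comments and reward those whose members upvote supportive ones:

```python
# Purely illustrative community "toxicity" score, NOT Idibon's actual
# formula. A toxic comment counts against a community when it attracts
# net upvotes (i.e. is approved of); an upvoted supportive comment
# counts in the community's favour.
def community_score(comments):
    """comments: list of (is_toxic, is_supportive, net_votes) tuples."""
    toxic_approved = sum(1 for toxic, _, votes in comments
                         if toxic and votes > 0)
    supportive_approved = sum(1 for _, supp, votes in comments
                              if supp and votes > 0)
    return (toxic_approved - supportive_approved) / len(comments)

nasty = [(True, False, 5), (True, False, 2), (False, True, -1), (False, False, 0)]
kind = [(False, True, 9), (False, True, 3), (True, False, -4), (False, False, 1)]
print(community_score(nasty), community_score(kind))
```

The key design point, matching the paragraph above, is that the votes matter as much as the comments: a subreddit that downvotes its trolls scores better than one that cheers them on.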

The full results of their analysis are kindly available for interactive visualisation, raw data download and so on here.

But in case anyone is in need of a quick offending, here were the top 5 by algorithmic toxicity. It may not be advisable to visit them on a work computer.

Rank of bigotry | Subreddit name | Official description
1 | TheRedPill | Discussion of sexual strategy in a culture increasingly lacking a positive identity for men.
2 | Opieandanthony | The Opie and Anthony Show
3 | Atheism | The web’s largest atheist forum. All topics related to atheism, agnosticism and secular living are welcome.
4 | Sex | r/sex is for civil discussions about all facets of sexuality and sexual relationships. It is a sex-positive community and a safe space for people of all genders and orientations.
5 | Justneckbeardthings | A subreddit for those who adorn their necks with proud man fur. “Neckbeard: A man who is socially inept and physically unappealing, especially one who has an obsessive interest in computing” – Oxford Dictionary

[Edited to correct Ben Bell’s name and the column title of the table – my apologies!]