More data is not always better data

Like a lot of data-fans, I have something of a tendency to “collect” data just in case it will become useful one day. Vendors are feeding that addiction with constant talk of swimming through blissful “data lakes” and related tools, notably Hadoop and its brethren.

Furthermore, as the production of data grows exponentially, the cost of storing it moves incrementally closer to zero. As an example, Amazon will store whatever you want for $0.007 per GB in its Glacier product, if you don’t need to retrieve it very often or very fast.

That is to say, the amount of data that, within my relatively short memory, would have needed 728 high-density floppy disks to save can now be stored for about a British ha’penny. This is a coin so irrelevantly valueless in the grand scheme of buying things these days that it was withdrawn from circulation over 30 years ago, long before my memory begins.

But just because we can store all these 1s and 0s, should we? Below I have started to consider some downsides of data hoarding.

The too-obvious risks: large amounts of data expose you to more hacking and privacy risks

There are two huge risks, so obvious that I will pretty much skip over them. There is the risk of a hack, with the potential for some malevolent being to end up with your personal details. Just last week we heard the news that over 150,000 people’s personal details were accessed illegitimately from the computer systems of a mobile phone provider, including nearly 16,000 bank account details. A couple of years ago, Target famously had up to 70 million of its customers’ details stolen. If this data hadn’t been stored, the hack obviously couldn’t have taken place.

Often related to such behaviour, there’s also the issue of accidental privacy breaches. These might be simple mistakes – like the time the UK government accidentally lost a couple of CDs containing personal data relating to every child in the UK – but if the data gets into the wrong hands they may have consequences not dissimilar to those of a hack.

Less dramatically, even in that most curated of safe software gardens, Apple’s app store, 256 apps were removed last month as it was found they were collecting and transmitting personal data that they should not have been.

These examples are not meant to imply there was something innately wrong in storing these particular pieces of data. Maybe Target needs to store these customer details to run its operations in the way it finds most productive, and the Government probably needs to store information on children (although perhaps not in the sorting offices of Royal Mail). However, every time data is stored that doesn’t yet have a use, one should bear in mind there is a non-zero risk it may leak.

Hacking and privacy breaches are easy and obvious concepts though. More interesting here are the other downsides of storing too much data that do not require the efforts of an evildoer to produce adverse consequences. What else might we consider?

Below I list a few. I’m sure I’ve missed many more. Important to consider in some of these is that the data itself may not be generating the majority of the risk, but rather it enables something negative which is later done with it. Methodological issues though are something even the most data-addicted analyst can address in their work.

Large amounts of data may encourage bad analysis

A lot of data analysis is targeted at understanding associations or causes of a target variable based on a set of input variables. Surely the more input variables we have, the better models we can produce and the better our model will reflect the world?

Well, it depends on how you measure it. Mathematically, yes, more variables do tend to lead to a model with a better fit to the data that was used to train it.

And guess what, we now have huge lakes full of crazy amounts of varied data and computer algorithms that can process it. If you want to predict your future sales, why not sack your data scientists and just have your supercomputer run your past sales data against every one of these billions of available variables until it deduces a model that predicts it perfectly?

This doesn’t work for a number of reasons, not least because for generating actual useful insights to support optimal decision making, using too many variables tends to lead to the curse of overfitting.

This is where you involve so many variables as predictors that your model is too specific to the precise historical data you trained it on, and therefore misleads you as to the real drivers behind your target predicted variables.

Wikipedia has a fun example:

As a simple example, consider a database of retail purchases that includes the item bought, the purchaser, and the date and time of purchase. It’s easy to construct a model that will fit the training set perfectly by using the date and time of purchase to predict the other attributes; but this model will not generalize at all to new data, because those past times will never occur again.

In other words, if you give me your past sales data at a customer/transaction level, then I can build you a model that will perfectly “predict” that past data. It would look something like:


Wow, look at my R-squared of 1. This is an unbeatable model. Hooray!

At least until we actually want to use it to predict what will happen tomorrow when a new customer comes with a new name and a new date, at which point…catastrophe.
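To make the silliness concrete, here is a minimal sketch of such a "model" in Python. The customer names, dates and amounts are all made up for illustration, and `predict` and `r_squared` are hypothetical helper names rather than anything from a real library. The model simply memorises every past transaction, so it scores perfectly on history and has nothing to say about tomorrow.

```python
# Hypothetical past sales: (customer name, purchase date) -> amount spent.
past_sales = {
    ("Jane Doe", "2015-10-01"): 25.00,
    ("John Smith", "2015-10-02"): 13.50,
    ("Ann Other", "2015-10-02"): 99.99,
}

def predict(customer, date):
    """Return the memorised amount, or None for anything never seen before."""
    return past_sales.get((customer, date))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = list(past_sales.values())
fitted = [predict(name, date) for name, date in past_sales]

training_r2 = r_squared(actual, fitted)            # a "perfect" fit to the past
tomorrow = predict("New Customer", "2015-11-01")   # no prediction at all

print("R-squared on past data:", training_r2)
print("Prediction for a new customer:", tomorrow)
```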

Although it’s a silly example, the point is that the problem came from the fact we exposed the model to too much data. Is it really relevant that someone’s name is Jane Doe? Did we need to store that for the purposes of the model? No – it actually made the resulting analysis worse.

Simplifying it to a univariate situation, the “correlation doesn’t imply causation” trope remains indefinitely true.

First, there’s the issue of how we measure correlation. A classic for linear models is Pearson’s ‘r’.

This gives you a single value between -1 and 1, where 1 is perfect positive correlation, 0 is no correlation and -1 is perfect negative correlation. You will often see results presented as “our analysis shows that X and Y are correlated with r of 0.8”. Sounds good, but what does this tell us?

Anscombe’s Quartet, possibly my favourite ever set of charts (doesn’t everyone have one?), shows us that it doesn’t tell us all that much.

All these charts have an “r” of around 0.8, and they all have the same mathematical linear regression trendline.

Anscombe's quartet

(thanks Wikipedia)

But do you believe that the linear correlation is appropriate for all four cases? Is the blue line a fair reflection of the reality of the trend in all datasets?

Anscombe’s Quartet is not actually about using too much data, but rather relying on statistical summarisation too much. It’s an example of where data visualisation is key to showing validity of a potentially computer-generated equation.
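If you want to check the summary statistic for yourself, the sketch below computes Pearson’s r for each of the four datasets. The numbers are Anscombe’s published values as commonly tabulated, and `pearson_r` is a small hand-rolled helper of my own rather than a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in r_values.items():
    print(f"Dataset {name}: r = {r:.3f}")
```

Four near-identical numbers come out, which is exactly why the charts matter.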

But visualisation alone isn’t sufficient either, when we are using too much data.

Let’s imagine we’re a public service that needs to come up with a way to predict deaths by drowning. So we throw all the data we could possibly obtain at it and see what sticks.

Here’s one positive result from the fabulous Spurious Correlations site.

Spurious correlations

OMG! The data shows Nicolas Cage causes drownings! Ban Nicolas Cage!

A quick dose of common sense suggests that Nicolas Cage films are probably not all versions of the videotape seen in The Ring, which kills you shortly after viewing it.

Nicolas Cage films and drowning statistics may mathematically fit, but, if we are taking our work seriously, there was little expected value in storing and using this data for this purpose. Collecting and analysing too much data led us to a (probably 🙂 ) incorrect theory.

Part of the issue is that predictive analytics is not even as simple as looking for a needle in a haystack. Instead we’re looking for a particular bit of hay in a haystack, which may look the same as any other bit of hay.

Most variables are represented as a sequence of numbers and, whether by coincidence or cross correlation, one meaningless sequence of numbers might look very like one very meaningful sequence of numbers in a particular sample being tested.
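A rough way to see how easily this happens is to simulate it. The sketch below invents a short “target” series of pure noise, then searches 1,000 equally meaningless candidate series for the one that correlates with it best. Everything here is an assumption for illustration: the series lengths, the number of candidates, the seed and the helper names.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

rng = random.Random(42)  # fixed seed so the run is repeatable
YEARS = 10               # a short series, like ten annual figures
NUM_CANDIDATES = 1000    # the "lake" of unrelated variables

# The target and every candidate are independent random noise.
target = [rng.gauss(0, 1) for _ in range(YEARS)]
candidates = [[rng.gauss(0, 1) for _ in range(YEARS)] for _ in range(NUM_CANDIDATES)]

# Keep the strongest correlation found, positive or negative.
best_r = max(abs(pearson_r(series, target)) for series in candidates)
print(f"Strongest correlation found among pure noise: {best_r:.2f}")
```

None of these series has anything to do with the target, yet with enough of them to choose from, the best match looks impressively strong.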

It doesn’t help that there are several legitimate statistical tests that arguably lose an aspect of their usefulness in a world where we have big data.

Back in the day, perhaps we could test 20 people’s reaction to a photo by getting them into a lab and paying them for their opinion. We would use inferential statistics to work out whether their opinions were meaningful enough to be extrapolated to a larger population – what can we infer from our 20-person results?

Now we can in theory test 1 billion people’s reaction to a photo by having Facebook perform an experiment (although that might not go down too well with some people). All other things being equal, we can infer a whole lot more from testing 1 billion people than testing 20.

There are therefore many hypothesis tests that are designed to check whether what looks like a difference between two or more groups of “things” (for instance, the average score out of 10 that women give a photo versus the average score men give) is indeed a real difference or just down to random chance. Classics include Student’s t-test, ANOVA, Mann-Whitney and many more.

The output of these often gives a measure of how likely it is that a result like the one seen could have arisen from random fluctuations or noise alone. After all, most variables measured in real life have some natural variation – just because a particular woman is taller than a particular man we should not infer that all women are taller than all men. It could just be that you didn’t test enough women and men to get an idea of the reality of the situation given the natural variance in human height.

This output is often expressed as a “p value”: the probability, written as a decimal, of seeing a difference between groups at least as large as the one you observed if there were in truth no difference at all and only chance was at work.

P = 0.05 is a common benchmark. A result below it is conventionally called “statistically significant”: if there were really no effect, you would expect to see a difference this large no more than 5% of the time. Note that this is not quite the same as saying there is a 95% probability that your hypothesis is true, although it is often loosely read that way.

Sounds good, but there are at least two issues to be wary of now that you may be running these tests over vast numbers of rows or columns of data.

Hypothesis tests over a lot of rows (subjects)

Looking at the formula for these tests, you will see that the actual difference between groups necessary to meet that benchmark for “this is real” gets lower as your sample size gets bigger. Perhaps the test will tell you that a difference of 1 IQ point is significant when you’ve tested a hypothesis over a million subjects in your data, whereas it wouldn’t have done so if you had only tested it over 100.
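As a rough illustration, the sketch below runs the same 1-point difference through a simple two-sample z-test at both sample sizes. It assumes a known standard deviation of 15 in both groups (the conventional IQ scaling) and equal group sizes; a real analysis would more likely use a t-test, and `two_sided_p` is a hypothetical helper name.

```python
from math import erfc, sqrt

def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test with known, equal sd."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal variable, P(|Z| > z) = erfc(z / sqrt(2)).
    return erfc(z / sqrt(2))

p_small = two_sided_p(mean_diff=1, sd=15, n_per_group=100)
p_large = two_sided_p(mean_diff=1, sd=15, n_per_group=1_000_000)

print(f"100 subjects per group:       p = {p_small:.3f}")
print(f"1,000,000 subjects per group: p = {p_large:.3g}")
```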

So is the test wrong in either case? Should we artificially limit how many subjects we put into a test? No, of course not (unless there’s another good reason to do so), but some common sense is needed when interpreting it in a practical sense.

In medical circles, as well as statistical significance, there’s a concept of clinical significance. OK, we might have established that things that do X are more likely to do Y, but does that matter? Is that 1 IQ point difference actually something we care about or not? If your advert produces a £0.01 increase in average basket value, however “significant”, do you care?

Maybe you do, if you have a billion baskets a day. Maybe you don’t if you have 10. These are not questions you can answer in the abstract – but should be considered on a case by case basis by a domain expert.

Just because you had enough data available to detect some sort of effect does not mean that you should rush to your boss as soon as you see a p=0.05 result.

Hypothesis tests over a lot of columns (variables)

We saw above that a common standard is that we can claim a result is significant when, if there were no real effect, chance alone would produce it no more than 5% of the time.

To switch this around: we are happy to claim something is real even though, when nothing is actually going on, there’s a 5% probability we will be fooled by chance.

That seems fairly safe as a one-off, but imagine we’re testing 20 variables in this experiment, and apply the same rule to say we’re satisfied if any of them meet this criterion.

There’s a 5% chance you come back with the wrong answer on your first variable.
Then there’s a 5% chance you come back with the wrong answer on your second variable.
…and so on until there’s a 5% chance you come back with the wrong answer on your twentieth variable.

One can calculate that after 20 tests there is a 64% chance of a variable being shown to be significant, even if we know in advance for sure that none of them are.
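The calculation is short enough to sketch. Assuming the tests are independent of each other and that none of the variables has a real effect, the chance of at least one false alarm is one minus the chance that every test stays quiet. The function name is my own.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    crosses the alpha threshold when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests

for num_tests in (1, 20, 100):
    chance = chance_of_any_false_positive(num_tests)
    print(f"{num_tests:>3} tests: {chance:.0%} chance of at least one false positive")
```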

Imagine if we had gone back to the idea of throwing a massive datalake’s worth of variables against two groups to try and find out, out of “everything we know about”, what explains the differences in these groups? A computer could quite happily throw a million variables against the wall in such an exercise.

If each test is allowed to produce a false positive 5% of the time, and you have a million such tests, one can quickly see that you will (almost) certainly get a bunch of results (around 50,000, if none of the variables truly matter) that, had they been conducted as single, well-constructed hypothesis tests, might have been considered statistically significant. In this scenario, though, they are clearly false indicators: predictable in existence, but random in placement.
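A small simulation makes the same point on a more modest scale. In this sketch two groups are drawn from exactly the same distribution, so any “significant” difference is a false positive by construction. The group size, number of variables and seed are arbitrary choices of mine, and the test used is a simple z-test with a known standard deviation of 1.

```python
import random
from math import erfc, sqrt

rng = random.Random(0)   # fixed seed so the run is repeatable
NUM_VARIABLES = 2000     # a scaled-down "data lake"
GROUP_SIZE = 30
ALPHA = 0.05

def null_p_value():
    """p-value comparing two groups drawn from the same distribution."""
    group_a = [rng.gauss(0, 1) for _ in range(GROUP_SIZE)]
    group_b = [rng.gauss(0, 1) for _ in range(GROUP_SIZE)]
    mean_diff = sum(group_a) / GROUP_SIZE - sum(group_b) / GROUP_SIZE
    z = abs(mean_diff) / sqrt(2 / GROUP_SIZE)  # standard error with sd = 1
    return erfc(z / sqrt(2))                   # two-sided p-value

false_positives = sum(null_p_value() < ALPHA for _ in range(NUM_VARIABLES))
print(f"{false_positives} of {NUM_VARIABLES} meaningless variables look significant")
```

Roughly 5% of the variables should come out as “significant”, and a different seed would flag a different set of them.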

This sort of behaviour is what people refer to as “p hacking”, and is seemingly somewhat rife within academic literature too where journals prefer to publish positive results (“X did cause Y”) rather than negative (“X did not cause Y”), even though both are often equally useful insights.

An article in Nature reports on a real-world example of this

…he..demonstrated that creative p-hacking, carried out over one “beer-fueled” weekend, could be used to ‘prove’ that eating chocolate leads to weight loss, reduced cholesterol levels and improved well-being. They gathered 18 different measurements — including weight, blood protein levels and sleep quality — on 15 people, a handful of whom had eaten some extra chocolate for a few weeks. With that many comparisons, the odds were better than 50–50 that at least one of them would look statistically significant just by chance. As it turns out, three of them did — and the team cherry-picked only those to report.

(Unless of course chocolate does lead to weight loss, which would be pretty cool.)

The same article also refers to a similar phenomenon as “Texas sharpshooting”, being based on a similarity to “an inept marksman who fires a random pattern of bullets at the side of a barn, draws a target around the biggest clump of bullet holes, and points proudly at his success.”

It’s quite possible to p-hack in a world where big data doesn’t exist. However, the relevance here is that one needs to avoid the temptation of collecting, storing and throwing a whole bunch of “miscellaneous” data at a problem when looking for correlations, models and so on, and then reporting that you found some genuine insight whenever a certain statistical test happens to reach a certain number.

There are explicit procedures and statistical tests designed to alert you when you’re at risk of this, also coming from the field of statistical inference. These fall under the category of controlling the “familywise error rate” of a set of hypotheses, with perhaps the most famous being the Bonferroni correction.
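The Bonferroni correction itself is very simple: divide your significance threshold by the number of tests you ran, and only accept results that beat the stricter figure. Here is a minimal sketch; the 18 p-values are invented for illustration (18 to echo the chocolate study’s number of measurements, not its actual results).

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant only if it beats alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical p-values from 18 separate measurements.
p_values = [0.04, 0.03, 0.20, 0.001] + [0.5] * 14

naive = [p < 0.05 for p in p_values]          # each test judged in isolation
corrected = bonferroni_significant(p_values)  # judged as a family of 18

print("Significant without correction:", sum(naive))
print("Significant with Bonferroni:   ", sum(corrected))
```

The correction is deliberately conservative, so it can also throw away some real effects; that is the price of keeping the overall false alarm rate down.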

Going back to the scientific method is also critical, and replication in particular. It’s good practice when building a statistical model to ensure you have some data to test it on that has not been anywhere near the model-building process itself. If you have enough of this and believe you’ve found that people born on a rainy day are more likely to buy high value items, then you can at least test this one particular theory specifically, instead of testing an arbitrary number of potential theories that happen to crop up due to spurious relationships within your training dataset.
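Setting that data aside can be as simple as the sketch below, which shuffles the records once and locks a fraction away before any exploring starts. The function name, the 20% fraction and the customer IDs are all assumptions for illustration.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle rows once and return (exploration set, untouched holdout set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    num_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - num_holdout
    return shuffled[:cut], shuffled[cut:]

customers = list(range(1000))  # hypothetical customer IDs
explore, holdout = holdout_split(customers)

# Hunt for patterns in `explore` only; test the one theory you settle on
# against `holdout`, exactly once.
print(len(explore), "rows to explore,", len(holdout), "rows held back")
```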

Large amounts of data may reinforce the world’s injustices

A previous post on this site dealt with this issue in more detail, so we’ll skim it here. But suffice to say that there are common methodological flaws in data collection, processing, statistical models and the resulting “data driven” execution that result in suboptimal outcomes when measured in terms of fairness or justice by many reasonable people.

Two highlights to note are that:

Machine learning models…learn

Many statistical models are specifically designed to learn from historical reality. My past post had an example where a computer was fed data on historical hiring decisions – which resulted in it “innocently” learning to discriminate against women and people with foreign-sounding names.

Imagine if gender data hadn’t been stored for the applicants though – after all, should it really be a factor influencing applicant success in this way? No; so why was it collected for this task? Perhaps there was an important reason – but if not, then adding this data effectively provided the model with the tools to recreate gender discrimination.

The “foreign sounding names” point provides a warning though – perhaps the computer was not explicitly fed with ethnicity; but it was supplied with data that effectively proxied for it. Again, the computer has just learnt to do what humans did in the past (and indeed the present, which is why “name-blind recruitment” is a thing).

Implementing data driven models requires…data

“Data driven” decisions can only be made where data exists. Again, in my previous post, we saw that using “objective” data from phone apps or vehicles equipped to automatically transmit the locations of potholes that needed fixing in fact produced a bias towards focussing repair resources on affluent areas – because poor areas were less likely to have the same number of people with the resources and interest in fiddling around with the expensive Land Rovers and fancy phones that collected this information.

This phenomenon is something like a tech/data version of the “WEIRD” issue that has been noted in psychology research – where a hugely disproportionate amount of human research has, for reasons other than deliberate malice, inadvertently concentrated on understanding Western, Educated people from Industrialised, Rich and Democratic countries – often American college students, who are probably not entirely representative of the entirety of humanity.

It could be argued that this one is a risk associated with having too little data rather than too much. But, given it is impossible to obtain data on literally every aspect of every entity in the world, these sorts of examples should also be considered cases where using partial data for a variable risks producing worse results than using no data for that variable at all.

One should be wary of deciding that you might as well collect the data you happen to have access to as a method of creating analysis on something that will be extrapolated to people outside of your cohort’s parameters. Perhaps you will spend all that time and effort collecting data just because you can, just to end up with a model that produces worse real-world outcomes than not having that data at all.

Large amounts of data may destroy the planet

Dramatic perhaps, but for those of us who identify with the overwhelming scientific consensus that climate change is real, dangerous and affected by human activities, there is a mechanism here.

Storing lots of data requires lots of machines. Powering lots of machines (and especially their required cooling equipment) needs lots of electricity. Lots of electricity means lots of energy generation which, given that it is still overwhelmingly done in environmentally unfriendly ways even in the developed world, tends to mean lots of pollution and other such factors that could eventually contribute towards a potential planetary disaster.

Back in 2013, The Register reported on a study that suggested that “IT” was responsible for about 10% of electricity usage worldwide. These sorts of measures are rather hard to do accurately in aggregate, and this figure includes the energy needed both to create the device and manage the distribution of information to it.

On the pure data side, the US Natural Resources Defense Council reported that, in 2013, data centres in the US alone used an estimated 91 billion kilowatt-hours of energy. They then went on to project some figures:

Data center electricity consumption is projected to increase to roughly 140 billion kilowatt-hours annually by 2020, the equivalent annual output of 50 power plants, costing American businesses $13 billion annually in electricity bills and emitting nearly 100 million metric tons of carbon pollution per year.

To be fair, some of the larger firms are trying to remedy this, whether for environmental purposes or cost cutting.

Facebook have a few famous examples, including building a data centre that is powered by hydroelectricity, situated in a chilly sub-arctic area of Sweden which provides a big bonus of free cooling. This was expected to save 70% of the “normal” amount of energy for cooling such a place.

More recently, their engineers have been developing methods to keep the vast amount of pointless photos that users uploaded years ago (and never looked at again) largely offline, on machines that are powered down or even on Blu-ray discs.

Given that a lot of Facebook’s data isn’t being viewed by anyone at any single point in time, they can use people’s behavioural data in order to predict which other types of data might be needed live, leaving the rest turned off. It might sound a bit like Googling Google, but as Ars Technica reports:

“We have a system that allows 80-90 percent of disks turned off,” Weihl said. …you can predict when photos are going to be needed—like when a user is scrolling through photos chronologically, you can see you’ll need to load up images soon. You could make a decision to turn off not just the backups, but all the copies of older stuff, and keep only the primaries of recent stuff spinning.”

It’s not only the big players to consider though. The NRDC noted that the majority of data centre power needs were actually for “small, medium, and large corporate data centers as well as in the multi-tenant data centers to which a growing number of companies outsource their data center needs”, which on the whole are substantially less efficient than Facebook’s arctic zone.

So, every time you hoard data you don’t need, whether in big corporate Hadoops or Facebook photo galleries, you’re probably contributing towards unnecessary environmental pollution.

Optimum solution? Don’t save it. More realistic solution? Some interested organisations have done evaluations and produced consumer-friendly guides, such as the Greenpeace “Click Clean Scorecard” to suggest which storage service you might like to use if the environment is a key concern for you.

As a spoiler, Apple and Facebook did very well in the latest edition. Amazon Web Services did not.

Final thoughts

I’m a data-believer; I drink the Kool-Aid that claims that data can save the world, or at least, if not save it directly, then help us make decisions that produce better results. And in general, having more information is better than having less – as long as it is usable and used correctly, or at least there’s some conceptual reason to imagine one day it might be. But it’s an interesting exercise to play Devil’s Advocate and imagine why more data is not always a good thing.

Some of the issues above are less about the existence of the data, and more about the way that having access to it can tempt analysts or their managers into bad practice.

Other issues surround the size of the data, and the fact that it’s often simply not necessary to use as much data as you can. An article on Wired suggests one should add “viability” and “value” to the 3 Vs of big data (velocity, volume and variety, for anyone who hasn’t lived through the painful cloud of buzzwords in recentish times).

An Information Age writer asks:

What’s the marginal impact on a predictive model’s accuracy if it runs on five million rows versus 10 billion rows?

The answer is often “very little”, other than it will:

  • use more resources (time, money, the planet) to gather and store,
  • take far more time and CPU power to process; and
  • risk allowing an inordinate amount of overfitting, biased or otherwise poor models if best practice is not followed.
