Books I read in 2016

Reading is one of the favoured hobbies in the DabblingWithData household. In 2016 my beloved fiance invited me to participate in the Goodreads Reading Challenge. It’s simple enough – you set a target and then see if you can read that many books.

The challenge does have its detractors; you can see that an obsession with it will perversely incentivise reading “Spot the Dog” over “Lord of the Rings”. But if you participate in good spirits, then you end up building a fun log of your reading which, if nothing else, gives you enough data that you’ll remember at least the titles of what you read in years hence.

I don’t quite recall where the figure came from, but I had my 2016 challenge set at 50 books. Fifty, you might say, that’s nearly one a week! Surely not possible – or so I thought. (I note however that my chief competitor, following a successful year, has set this year’s target to 100, so apparently it’s very possible for some people.)

Anyway, Goodreads has both a CSV export feature of the books you log as having read in the competition, and also an API. I therefore thought I’d have a little explore of what I managed to read. Who knows, perhaps it’ll help improve my 2017 score!

Please click through for slightly more interactive versions of any chart, or follow this link directly. Most data is taken directly from Goodreads, with a little editing by hand.

[Chart: How much did I read]

Oh no, I missed my target 😦 Yes, fifty books proved too challenging for me in 2016 – although I got 80% of the way there, which I don’t think is too terrible. My 2017 target remains at fifty.

The cumulative chart shows a nice boost towards the end of August, which was summer holiday time for me. This has led me to conclude the following actionable step: have more holidays.

I was happy to see that I hadn’t subconsciously tried to cheat too much by reading only short books. From the nearly 14k page-equivalents I ploughed through, the single most voluminous book was Anathem. Anathem is a mix of sci-fi and philosophy, full of slightly made-up words just to slow you down further – an actual human:alien glossary is generously included in the back of the book.

The shortest was the Ladybird Book of the Meeting. This was essential reading for work purposes of course, and re-taught me eternal truths such as “Meetings are important because they give everyone a chance to talk about work. Which is easier than doing it”.

Most of my books were in the 200-400 page range – although of course different books make very different usages of a “page”.

So what did I read about?


Science fiction is #1 by book volume. I have an affinity for most things that have been deemed geeky through history (and perhaps you do too, if you got this far in!), so this isn’t all that surprising.

Philosophy at #2 is a relatively new habit, at least as a concerted effort. I felt that I’d got into the habit of concentrating too much on data (heresy I know), technology and related subjects in previous years’ reading habits – so thought I’d broaden my horizons a bit by looking into, well, what Google tells me is merely the study of “the fundamental nature of knowledge, reality, and existence”. It’s very interesting, I promise. Although it can be pretty slow to read as every other sentence one does risk ending up staring at the ceiling wondering whether the universe exists, and other such critical issues. Joking aside, the study of epistemology, reality and so on might not be a bad idea for analysty types.

Lower down we’ve got the cheap thriller and detective novels that are somewhat more relaxing, not requiring either a glossary or a headache tablet.

I was a little surprised at what a low proportion of my books were read in eBook format. For most – not all – books, I think eReaders give a much superior reading experience to ye olde paper. This I’m aware is a controversial minority opinion but I’ll stick to it and point you towards a recent rant on the Hello Internet podcast to explain why.


So I’d have guessed an 80-90% eBook rate – but a fair number of paper books actually slipped in. Typically I suspect these are ones I borrowed, or ones that aren’t available in eBook formats. Some of Asimov’s books, of which I read a few this year, for instance are usually not available on Kindle.

On which subject, authors. Most included authors only fed my book habit once last year, although the aforementioned Asimov got his hooks into me. This was somewhat aided by the discovery of a cluster of his less well-known books fortuitously being available for 50p each at a charity sale. But if any readers are interested in predictive analytics and haven’t read the Foundation Trilogy, I’d fully recommend even a full price copy for an insight into what the world might have to cope with if your confusion matrix ever showed perfection in all domains.

Sam Harris was the second most read. That fits in with the philosophy theme. He’s also one of the rare people who can at times express opinions that intuitively I do not agree with at all, but does it in a way such that the train of thought that led him to his conclusions is apparent and often quite reasonable. He is, I’m aware, a controversial character on most sides of any political spectrum for one reason or another.

Back to format – I started dabbling with audio books, although at first did not get on so well with them; there’s a certain amount of concentration needed which comes easier to me when visual-reading than audio-reading. But I’m trying again this year, and it’s going better – practice makes perfect?

The “eBook /Audio” category refers to a couple of lecture series from the Great Courses which give you a set of half hour lectures to listen to, and an accompanying book to follow along with. These are not free but they cover a much wider range of topics than the average online MOOC seems to (plus you don’t feel bad about not doing assignments – there are none).

Lastly, the Goodreads rating. Do I read books that other people think are great choices? Well, without knowing the background distribution of ratings, and taking into account the number of reviews and from whom, it’s hard to do much except assume a relative ranking when the sample gets large enough.

It does look like my books are on the positive side of the 5-point scale, although definitely not amongst Goodreads’ most popular. Right now, that list starts with The Hunger Games, which I have read and enjoyed, but it wasn’t in 2016. Looking down the global popularity list, I do see quite a few I’ve had a go at in the past, but at first sight almost none that I regret passing over in favour of my actual choices this year!

For the really interested readers out there, you can see the full list of my books and links to the relevant Goodreads pages on the last tab of the viz.


When is it safe to stop watching the match?

Despite the Harvard Business Review‘s insistence that data analyst is the sexiest job of the 21st century, ask a non-quant about popular references to data analysis and you are quite likely to hear some reference to Moneyball (be that book or film). Spoiler alert: “sabermetric” data analysis enabled a baseball team with less money to beat another one that had a lot more money.

Very cool, except – in possibly the most inflammatory statement likely to make it onto this blog – in general watching team sport matches at length is pretty pointless.

Evidence? Clauset et al. have contributed to the field in their paper “Safe leads and lead changes in competitive team sports”, published recently in the journal Physical Review E.

Within it, they attempt to use data to model and validate how the lead changes between teams playing certain sports. For instance, team A might score the first point in a match, but – specific-sport-allowing – team B might well then score 2 points and seize the lead. The usual rule of course is whoever happens to have the lead after a set amount of time is deemed the winner.

Although they dabble quite successfully in others, the sport they model most accurately is basketball. Their rationale for starting here is that basketball has a high rate of points scoring, with NBA statistics showing an average of 93.6 baskets with an average value of 2.07 points per basket.

Modelling frequent events accurately is almost always easier than modelling infrequent events, so it’s clear why they picked basketball over UK football for instance, where FiveThirtyEight reports that the most common score found in almost 200,000 English football games was a thrilling 1:0. This occurred in about 16% of the matches. In fact not far off 10% of games ended with no-one scoring and no-one winning at all, just to make it sound even more exciting.

Anyway, that aside, how did Clauset’s team model the changes in lead of basketball so accurately that it significantly beat previous heuristics? Advanced logistic neural network forest tree linear super-regressions? Nope, they used a random walk.

For those unfamiliar with random walk models, it’s quite easy to understand at least at the simplest level.

You can imagine a random walk in physical terms. Consider a situation where you’re standing on a platform and can walk either forwards or backwards. Flip a coin – heads you walk forwards, tails you walk backwards. Repeat until 48 minutes have elapsed and consider that your result.

Sounds fantastically trivial, right? What in the uber-complexities of reality could really be modelled by anything derived from such a basic process? Oh, nothing much, just simple things like the stock market and molecular movements amongst others.

And sports, apparently.
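For the curious, the coin-flipping walk described above can be sketched in a few lines of Python. This is only an illustration of the simplest version: the step count of 100 and the seed are arbitrary choices of mine, one step is taken per flip rather than per unit of time, and it is not the paper's actual scoring model.

```python
import random

def coin_flip_walk(steps, rng):
    """Return every position visited by a simple +1/-1 coin-flip walk."""
    position = 0
    path = [position]
    for _ in range(steps):
        # Heads: one step forwards. Tails: one step backwards.
        position += 1 if rng.random() < 0.5 else -1
        path.append(position)
    return path

rng = random.Random(2015)  # fixed seed (arbitrary) so a run can be repeated
path = coin_flip_walk(steps=100, rng=rng)  # 100 flips is an arbitrary length

print("final position:", path[-1])
print("furthest forward:", max(path), "furthest back:", min(path))
```

Swap "position" for "score difference between the two teams" and you have the basic intuition behind treating a match as a random walk.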

The team concludes:

A model based on random walks provides a remarkably good description for the dynamics of scoring in competitive team sports.

In fact the same set of laws can determine many aspects of having the lead in a game.

…we found that the celebrated arcsine law of Eq. (1) closely describes the distribution of times for: (i) one team is leading …,
(ii) the last lead change in a game …
and (iii) when the maximal lead in the game occurs…

The model even covers the empirical fact that if something exciting is going to happen (an “extremal value”) then it tends to be near the very start or the very end of the game.
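You can get a feel for this start-or-end clustering with a quick simulation. The sketch below is my own rough stand-in, not the authors' method: it treats each game as a fair coin-flip walk of 200 steps (an arbitrary number), and uses the last moment the score difference was zero as a proxy for the last lead change.

```python
import random

def last_tie_fraction(steps, rng):
    """Fraction of the game elapsed when the score difference was last zero."""
    diff = 0
    last_tie = 0
    for step in range(1, steps + 1):
        diff += 1 if rng.random() < 0.5 else -1
        if diff == 0:
            last_tie = step
    return last_tie / steps

rng = random.Random(91)   # arbitrary fixed seed
games = 5000              # number of simulated games (assumption)
steps = 200               # scoring events per game (assumption)
fractions = [last_tie_fraction(steps, rng) for _ in range(games)]

# Compare two windows of equal total width: the first and last 10% of the
# game against the middle 20%.
near_ends = sum(f < 0.1 or f >= 0.9 for f in fractions) / games
middle = sum(0.4 <= f < 0.6 for f in fractions) / games

print("share of games with last tie near the start or end:", near_ends)
print("share of games with last tie in the middle:", middle)
```

Both windows cover 20% of the game, yet in the continuous limit the arcsine law puts roughly 41% of last lead changes in the first and last tenths against roughly 13% in the middle fifth. The simulated shares should land in that neighbourhood.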

Lest it be said that I am unfairly representing the model due to my personal views of the merits of long-term sport viewing, towards the end of the article the authors similarly comment:

Cynically, our results suggest that one should watch only the first few and last few minutes of a professional basketball game; the rest of the game is as predictable as watching repeated coin tossings.

And I don’t think they mean that in a positive way!

For the full formulae, validation and so on, see the original paper.

But being in the middle of an arena-crowd watching said sport is probably not an ideal time to whip out the scientific calculator to determine if the lead will change and when – so there is a handy rule of thumb one can use to determine if the match is effectively over, as Slate reports.

it can be expressed as a rule of thumb for determining what the lead and remaining time have to be for a team to have a 90 percent chance at maintaining that lead:

L = .4602√t

, where L is the lead and t is the number of seconds remaining.*

As even the most ardent fan is unlikely to think in terms of seconds remaining, the below chart will tell you when it’s safe to make your excuses and leave the NBA stadium, assuming a 90% confidence level is within your tolerance.

[Chart: Lead needed to predict win]

Assuming a standard 48 minute basketball match, locate the number of minutes that have elapsed already on the x axis, and if the current winning team is leading by at least the y-axis number of points then they are at least 90% sure to win overall. For instance, if you’ve watched 40 minutes of play, and your team is ahead by around 10 points then there’s really not much point in watching it play out – go flip some real coins at the bartender whilst there’s not a queue.
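If you would rather compute the threshold than read it off a chart, the rule of thumb translates directly into code. This is a minimal sketch assuming the 48 minute game and the 0.4602 coefficient quoted above; the function name and the sample minutes are my own.

```python
import math

GAME_MINUTES = 48      # standard NBA game length, as assumed in the text
COEFFICIENT = 0.4602   # from the quoted rule of thumb (90% level)

def safe_lead(minutes_elapsed):
    """Lead in points needed for roughly a 90% chance of holding on,
    using L = 0.4602 * sqrt(t) with t in seconds remaining."""
    seconds_remaining = (GAME_MINUTES - minutes_elapsed) * 60
    if seconds_remaining < 0:
        raise ValueError("minutes_elapsed exceeds the length of the game")
    return COEFFICIENT * math.sqrt(seconds_remaining)

# A few sample points in the game (chosen arbitrarily for illustration).
for minutes in (12, 24, 36, 40, 46):
    print(f"{minutes} min played: lead of {safe_lead(minutes):.1f} points needed")
```

At 40 minutes played there are 480 seconds left, so the required lead is 0.4602 × √480, which is about 10.1 points and matches the "around 10 points" example above.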

(Journal reference: Phys. Rev. E 91, 062815 (2015))

Every death in the Game of Thrones – a visualisation

[Image: Bronn is the #1 killer]

The Washington Post yesterday published a nice visualisation of the many, many deaths in Game of Thrones – apparently there have been 456 such violent extravaganzas.

Coded by season, allegiance, importance of character, method of death and other such metadata it gives a nice refresh of the important parts of the storyline. Find out which location was deadliest, which character has the most kills and other such fascinating and vital facts.

One has to love the understated nature of the associated data. They record the death of Oberyn as being “Method category: Hands” which, whilst undoubtedly accurate, does not entirely set the scene for the horror-fest better captured by Time magazine’s description of it as “his head popped like a grape”.

It certainly made me pull a face not dissimilar to the expression of the unfortunate bystander below.

[Image: Reaction to Oberyn's death]

Of course the scene is on YouTube if you really must re-view.