Books I read in 2017

Long-term readers (hi!) may recall my failure to achieve my target of reading 50 books in 2016. I had joined the 2016 Goodreads reading challenge, logged my reading activity, and hence had access to the data needed to track my progress at the end of the year. It turns out that 41 books is less than 50.

Being a glutton for punishment, I signed up again in 2017, with the same cognitively terrifying 50-book target – basically one a week, although I cannot allow myself to think of it that way. It is now 2018, so it’s time to review how I did.

Goodreads allows you to log which books you are reading and when you finished them. The finish date is what counts for the challenge. Nefarious readers may spot a few potential exploits here, especially if competing for just a single year. However, I tried to play the game in good faith (but did I actually do so? Perhaps the data will reveal!).

As you go through the year, Goodreads will update you on how you are doing with your challenge. Or, for us nerd types, you can download a much more detailed and useful CSV. There’s also the Goodreads API to explore, if that floats your boat.

Similarly to last year, I went with the CSV. I did have to hand-edit it a little, both to fill in some data that is absent from the Goodreads dataset, and to add a couple of extra fields I wanted to track that Goodreads doesn’t natively support. I then popped the CSV into a Tableau dashboard, which you can explore interactively by clicking here.
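If you’d rather wrangle the export in R than Tableau, loading it is a one-liner. A minimal sketch – the filename below is what Goodreads usually calls its export, but check your own download, and any hand-added columns would simply appear as extra fields:

library(readr)
# Read the Goodreads export; readr will guess sensible column types
books <- read_csv("goodreads_library_export.csv")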

Results time!

How much did I read?

Joyful times! In 2017 I met, and even exceeded, my target: 55 books read.

In comparison to my 2016 results, I got ahead right from the start of the year, and widened the gap notably in Q2. You can see a boost similar to 2016’s around the time of the summer holidays, weeks 33–35ish. Not working is clearly good for one’s reading obligations.

What were the characteristics of the books I read?

Although page count is a pretty vague and manipulable measure – different books have different physical sizes, font sizes, spacing and editions – it is one of the few measures for which data is easily available, so we’ll go with it. For eBooks or audio books (more on this later) without set “pages”, I used the page count of the respective paper version. I fully acknowledge that the rigour of this analysis falls under “fun” rather than “science”.

So the first revelation is that this year’s average length per book read was 300 pages, a roughly 10% decrease from last year’s average. Hmm. Obviously, all else being equal, a target of 50 books is easier to meet if you read shorter books! Of course, length doesn’t always reflect complexity, or anything else that influences how long a book takes to complete.

I hadn’t deliberately picked short books – in fact, being aware of this incentive, I had consciously tried to avoid doing so, and to concentrate on reading what I wanted to read, not just what boosts the stats. However, even outside of this challenge, I (most likely?) only have a certain number of years to live, and hence do feel a natural bias towards shorter books if everything else about them were perfectly equal. Why plough through 500 pages if you can get the same level of insight on a topic in 150?

The reassuring news is that, despite the shorter average book length, I did read around 20% more pages in total (55 books × ~300 pages ≈ 16,500, versus 41 × ~330 ≈ 13,600 last year). This suggests I probably have upped the abstract “quantity” of reading, rather than just inflated the book count by picking short books. By some measures, there was also a little less variation in page count between books this year than last.

In the distribution charts, you can see a spike of books around 150 pages long this year that didn’t show up last year. I didn’t notice a common theme among these books, but a relatively high proportion of them were audio books.

Although I am an avid podcast listener, I am not a huge fan of audio books as a rule. I love the idea as a method of acquiring knowledge whilst doing endless chores or other semi-mindless activities, and would encourage anyone else with an interest in entering book contents into their brain to give them a whirl. But, for me, in practice I struggle to focus on them in any multi-tasking scenario, so I end up hitting rewind a whole lot. And if I am in a situation where I can dedicate full concentration to informational intake, I’d rather use my eyes than my ears. For one, it’s so much faster, which is an important consideration when one has a book target! With all that, it’s perhaps not surprising that audio books are over-represented at the lower page counts for me. I know my limits.

I have heard tell that some people consider audio books invalid for the book challenge. In defence, I offer up that Goodreads doesn’t seem to feel this way in their blog post on the 2018 challenge. Besides, this isn’t the Olympics – at least no-one has sent me a gold medal yet – so everyone can make their own personal choice. For me, if it’s a method of getting a book’s contents into my brain, I’ll happily take it. I just know I have to be very discriminating in selecting audio books I can be sure I will be able to focus on. Even I would personally regard it as cheating to log a book that happened to be audio-streaming in the background whilst I was asleep. If you don’t know what the book was about, you can’t count it.

So, what did I read about?

What did I read?

Book topics are not always easy to categorise. The categories I used here are mainly the same as last year’s, based entirely on my two-second opinion rather than any comprehensive Dewey Decimal-like system. This means some subjectivity was inevitable. Is a book on political philosophy regarded as politics or philosophy? Rather than spend too much time fretting about classification, I just made a call one way or the other. Refer to the above comment re: fun vs science.

The main change I noted was indeed a move away from purely philosophical entries towards those with a political tone. A new category also entered this year: “health”. I developed an interest in improving one’s mental well-being via mindfulness- and meditation-type subjects, which led me to read a couple of books on those topics, as well as on sleep, all of which I classified as health.

Despite my continuing subjective impression that I read the large majority of books in eBook form, I actually moved even further away from that being true this year: slightly under half were in that format. The decrease has largely been taken up by the aforementioned audio books, of which I apparently read (listened to?) 10 this year. As with last year, 2 of the audio entries were actually “Great Courses”, which are more like a sequence of university-style lectures, with an accompanying book containing notes and summaries.

My books have also been slightly less popular with the general Goodreads-rating audience this year, although not dramatically so.

Now, back to the subject of reading shorter books in order to make it easier to hit my target: the sheer sense of relief I felt when I finished book #50 – and hence could go wild with relaxed, long and slow reading – made me wonder whether I had managed to beat that bias or not. Might the length of the books I selected have risen as I got nearer to my target, even though this was not my intention?

Below, the top chart shows the average page count of the books completed each month, year on year.

[Chart: book length over time, 2016 vs 2017]

The 2016 data risks producing somewhat invalid conclusions, especially if interpreted without reference to the bottom “count of books” chart, mainly because of September 2016, a month in which I read a single book that happened to be over 1,000 pages long.

I also hadn’t actually decided to participate in the book challenge at the start of 2016. I was logging my books, but just for fun (imagine that!). I don’t remember quite when it was suggested I should explicitly join the challenge, but before that point it’s less likely I felt pressure to read faster or shorter.

Let’s look then only at 2017:

[Chart: book length over time, 2017 only]

Sidenote: What happened in July?! I only read one book, and it wasn’t especially long. I can only assume Sally Scholz’s intro to feminism must have been particularly thought-provoking.

For reference, I hit book #50 in November this year. There does seem to be some suggestion in the data that I did indeed read longer books as time went on, despite my mental disavowal of doing so.

Stats geeks might like to know that the line of best fit shown in the top chart above could be argued to explain around 30% of the variation in book length over time, with each month adding an estimated extra 14 pages above a base of 211 pages. It should be stated that I didn’t spend too long considering the best model or fact-checking the relevant assumptions for this dataset. Instead, I just pressed “insert trend line” in Tableau and let it decide :).
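For the curious, the equivalent in R would be a simple linear regression. A minimal sketch, assuming a hypothetical data frame books_2017 with one row per completed book, a month number (1–12) and a pages column – the actual fit was done with Tableau’s trend line feature, not this code:

# Names here are hypothetical; substitute whatever your prepared data uses
fit <- lm(pages ~ month, data = books_2017)
summary(fit)  # reported fit: intercept ~211 pages, slope ~14 pages/month, R-squared ~0.3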

I’m afraid the regression should not be considered traditionally statistically significant at the 0.05 level though, having a p-value of – wait for it – 0.06. Fortunately for my intention to publish the above in Nature :), I think people are increasingly aware of the silliness of context-free hardline p-value criteria and/or publication bias.

Nonetheless, as I participate in the 2018 challenge – now at 52 books, properly one a week – I shall be conscious of this trend and double up my efforts to keep reading based on quality rather than length. Of course, I remain very open to the idea – some might say hopeful! – that one sign of a quality author is that they can convey their material concisely. You generous readers of my ramblings may detect some hypocrisy here.

For any really interested readers out there, you can once more see the full list of the books I read, plus links to the relevant Goodreads description pages, on the last tab of the interactive viz.


My favourite R package for: summarising data

Hot on the heels of delving into the world of R frequency table tools, it’s now time to expand the scope and think about data summary functions in general. One of the first steps analysts should perform when working with a new dataset is to review its contents and shape.

How many records are there? What fields exist? Of which type? Is there missing data? Is the data in a reasonable range? What sort of distribution does it have? Whilst I am a huge fan of data exploration via visualisation, running a summary statistical function over the whole dataset is a great first step to understanding what you have, and whether it’s valid and/or useful.

So, in the usual format, what would I like my data summarisation tool to do in an ideal world? You may note some copy and paste from my previous post. I like consistency 🙂

  1. Provide a count of how many observations (records) there are.
  2. Show the number, names and types of the fields.
  3. Be able to provide info on as many types of fields as possible (numeric, categorical, character, etc.).
  4. Produce appropriate summary stats depending on the data type. For example, if you have a continuous numeric field, you might want to know the mean. But a “mean” of an unordered categorical field makes no sense.
  5. Deal with missing data transparently. It is often important to know how many of your observations are missing. Other times, you might only care about the statistics derived from those which are not missing.
  6. For numeric data, produce at least the following summary stats, and not too many more esoteric ones cluttering up the screen. (Of course, what I regard as esoteric may be very different from what you would.)
    1. Mean
    2. Median
    3. Range
    4. Some measure of variability, probably standard deviation.
    5. Optionally, some key percentiles
    6. Also optionally, some measures of skew, kurtosis etc.
  7. For categorical data, produce at least these types of summary stats:
    1. Count of distinct categories
    2. A list of the categories – perhaps restricted to the most popular if there are a high number.
    3. Some indication as to the distribution – e.g. does the most popular category contain 10% or 99% of the data?
  8. Be able to summarise a single field or all the fields in a particular dataframe at once, depending on user preference.
  9. Ideally, optionally be able to summarise by group, where group is typically some categorical variable. For example, maybe I want to see a summary of the mean average score in a test, split by whether the test taker was male or female.
  10. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  11. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or similar, for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance in the R console, without having to use further coding.
  12. Work with “kable” from the knitr package, or similar table output tools. I often use R Markdown and would like the ability to show the summary statistics output in a reasonably presentable manner.
  13. Have a sensible set of defaults (aka facilitate my laziness).

What’s in base R?

The obvious place to look is the “summary” command.

This is the output, when run on a very simple data file consisting of two categorical (“type”, “category”) and two numeric (“score”, “rating”) fields. Both type and score have some missing data; the others do not. Rating has both one particularly high and one particularly low outlier.
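If you’d like to follow along, here’s a hypothetical reconstruction of a similar data frame – the original file isn’t shared, so the field names match the post but the values are invented:

# Invented data: 5 types (A-E), 3 categories (X-Z), with missingness and outliers
set.seed(42)
data <- data.frame(
  type     = sample(c("A", "B", "C", "D", "E", NA), 100, replace = TRUE),  # some missing
  category = sample(c("X", "Y", "Z"), 100, replace = TRUE),                # complete
  score    = c(rnorm(90, mean = 50, sd = 10), rep(NA, 10)),                # some missing
  rating   = c(rnorm(98, mean = 5, sd = 1), -20, 40),                      # one low, one high outlier
  stringsAsFactors = TRUE  # keep the text fields as factors (the default changed in R 4.0)
)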

summary(data)

[Screenshot: summary() output]

This isn’t too terrible at all.

It clearly shows we have 4 fields, and it has determined that type and category are categorical, hence displaying the distribution of counts per category. It works out that score and rating are numerical, so gives a different, sensible, summary.

It highlights which fields have missing data. But it doesn’t show the overall count of records, although you could manually work it out by summing up the counts in the categorical variables (but why would you want to?). There’s no standard deviation. And whilst it’s OK to read interactively, it is definitely not “tidy”, pipeable or kable-compatible.

Just as with many other commands, analysing by groups can be done with the base R “by” command. But the output is “vertical”, making it hard to compare the same stats between groups at a glance, especially if there are a large number of categories. Determining the difference in means between category X and category Z below would be a lot easier if they were visually closer together – especially if you had many more than 3 categories.

by(data, data$category, summary)

[Screenshot: summary() run per group via by()]

So, can we improve on that effort by using libraries that are not automatically installed as part of base R? I tested 5 options. Inevitably, there are many more possibilities, so please feel free to write in if you think I missed an even better one.

  • describe, from the Hmisc package
  • stat.desc from pastecs
  • describe from psych
  • skim from skimr
  • descr and dfSummary from summarytools

Was there a winner from the point of view of fitting nicely to my personal preferences? I think so, although the choice may depend on your specific use-case.

For readability, compatibility with the tidyverse, and ability to use the resulting statistics downstream, I really like the skimr feature set. It also facilitates group comparisons better than most. This is my new favourite.

If you prefer to prioritise the visual quality of the output, at the expense of processing time and flexibility, dfSummary from summarytools is definitely worth a look. It’s a very pleasing way to see a summary of an entire dataset.
Update: thanks to Dominic, the package author, who left a comment below after fixing the processing-time issue very quickly in version 0.8.2.

If you don’t enjoy either of those, you are probably fussy :). But for reference, Hmisc’s describe was my most-used variant before conducting this exploration.

describe, from the Hmisc package

Documentation

library(Hmisc)
Hmisc::describe(data)

[Screenshot: Hmisc::describe() output]

This clearly provides the count of variables and observations. It works well with both categorical and numerical data, giving appropriate summaries in each case, even adapting its output to take into account, for instance, how many categories exist in a given field. It shows how much data is missing, if any.

For numeric data, instead of giving the range as such, it shows the highest and lowest 5 entries. I actually like that a lot. It helps to show at a glance whether you have one weird outlier (e.g. a totals row that got accidentally included in the dataframe) or whether there are several values many standard deviations away from the mean. On the subject of deviations, there’s no specific variance or standard deviation value shown – although you can infer much about the distribution from the several percentiles it shows by default.

The output is nicely formatted and spacious for reading interactively, but isn’t tidy or kableable.

There’s no specific summary-by-group function, although again you can pass this function into the by() command to have it run once per group, e.g. by(data, data$type, Hmisc::describe)

The output from that, however, is very “long” and naturally ordered by group rather than by variable, making it quite challenging to compare the same stat between different groups at a glance.

stat.desc, from the pastecs package

Documentation

library(pastecs)
stat.desc(data)

[Screenshot: stat.desc() output]

The first thing to notice is that this only handles numeric variables, producing NA for the fields that are categorical. It does provide all the key stats and missingness info you would usually want for the numeric fields though, and it is great to see measures of uncertainty like confidence intervals and standard errors available. With other parameters you can also apply tests of normality.
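For instance, per the pastecs documentation, the norm parameter switches on the normality-related statistics – a quick sketch (see ?stat.desc for the full set of options):

stat.desc(data, norm = TRUE)  # adds skewness, kurtosis and Shapiro-Wilk test results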

It works well with kable. The output is fairly manipulable in terms of being tidy, although the measures show up as row labels as opposed to a true field. You get one column per variable, which may or may not be what you want if passing onwards for further analysis.
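If you want those row labels as a true field for downstream analysis, one possible post-processing step is a sketch like the following, assuming the tibble package is available:

library(tibble)
# Promote the statistic names from rownames to a proper column
tidy_desc <- rownames_to_column(as.data.frame(stat.desc(data)), "statistic")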

There’s no inbuilt group comparison function, although of course the by() function works with it, producing a list containing one copy of the above style of table for each group – again, great if you want to see a summary of a particular group, less great if you want to compare the same statistic across groups.

describe and describeBy, from the psych package

Documentation

library(psych)
psych::describe(data)

[Screenshot: psych::describe() output]

OK, this is different! It has included all the numeric and categorical fields in its output, but the categorical fields show up, somewhat surprisingly if you’re new to the package, with the summary stats you’d normally associate with numeric fields. This is because the default behaviour is to recode categories as numbers, as described in the documentation:

…variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name…Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected. For instance, if education is coded “high school”, “some college” , “finished college”, then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).

As the docs indicate, this can be risky! It is certainly risky if you are not expecting it :). I don’t generally have use-cases where I want this to happen automatically, but if you did, and you were very careful how you named your categories, it could be handy for you.

For the genuinely numeric data, though, you get most of the key statistics and a few nice extras. It does not, however, indicate where data is missing.

The output works with kable, and is pretty tidy, outside of the common issue of using rownames to represent the variable the statistics are summarising, if we are being pernickety.

This command does have a specific summary-by-group variation, describeBy. Here’s how we’d use it if we want the stats for each “type” in my dataset, A – E.

psych::describeBy(data, data$type)

[Screenshot: psych::describeBy() output]

Everything you need is there, subject to the limitations of the basic describe(). It’s much more compact than using the by() command on some of the other summary tools, but it’s still not super easy to compare the same stat across groups visually.  It also does not work with kable and is not tidy.

The “mat” parameter does allow you to produce a matrix output of the above.

psych::describeBy(data, data$type, mat = TRUE)

[Screenshot: psych::describeBy() matrix output]

This is visually less pleasant, but it does enable you to produce a potentially useful dataframe, which you could tidy up or use to produce group comparisons downstream, if you don’t mind a little bit of post-processing.
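For example, here’s a sketch of one such bit of post-processing, pulling a single variable’s statistics out for cross-group comparison (the “score”-prefixed rownames reflect how describeBy labelled my output; check yours):

psych_mat <- psych::describeBy(data, data$type, mat = TRUE)
# Keep only the rows describing "score" and compare the key stats across groups
psych_mat[grep("^score", rownames(psych_mat)), c("group1", "n", "mean", "sd")]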

skim, from the skimr package

Documentation

library(skimr)
skim(data)

[Screenshot: skim() output]

At the top skim clearly summarises the record and variable count. It is adept at handling both categorical and numeric data. For readability, I like the way it separates them into different sections dependent on data type, which makes for quick interpretation given that different summary stats are relevant for different data types.

It reports missing data clearly, and has all the most common summary stats I like.

Sidenote: see the update below. The issue mentioned in this section is no longer a problem as of skimr 1.0.1, although the skim_with function may still be of interest.

There is what appears to be a strange sequence of unicode-esque characters like <U+2587> shown at the bottom of the output. In reality, these are intended to be a graphical visualisation of distributions using sparklines, hence the column name “hist”, referring to histograms. This is a fantastic idea, especially seen in-line with the other stats in the table. Unfortunately, they do not by default display properly in the Windows environment, which is why I see the U+ characters instead.

The skimr documentation details how this is actually a problem with underlying R code rather than this library, which is unfortunate, as I suspect it means there cannot be a quick fix. There is a workaround involving changing one’s locale, although I have not tried it, and probably won’t before establishing whether there would be any side effects in doing so.

In the meantime, if the nonsense-looking U+ characters bother you, you can turn off the column that displays them by changing the default summary that skim uses per data type. There’s a skim_with function that you can use to add your own summary stats to the display, but it also works to remove existing ones. For example, to remove the “hist” column:

skim_with(integer = list(hist = NULL))
skim(data)

[Screenshot: skim() output without the hist column]

Now we don’t see the messy unicode characters, and we won’t for the rest of our skimming session.
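The same mechanism works for adding statistics, too. A sketch using skimr 1.x’s list-based API (check the documentation for your installed version):

skim_with(numeric = list(iqr = IQR))  # add an interquartile-range column for numeric fields
skim(data)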

UPDATE 2018-01-22: the geniuses who designed skimr actually did find a way to make the sparklines appear in Windows after all! Just update your skimr version to 1.0.1 and you’re back in graphical business, as the rightmost column of the integer variables below demonstrates.

[Screenshot: skim() output with sparkline histograms, skimr 1.0.1]

The output works well with kable. Happily, it also respects the group_by function from dplyr, which means you can produce summaries by group. For example:

group_by(data, category) %>%
 skim()

[Screenshot: skim() output grouped by category]

Whilst the output is still arranged by the grouping variable before the summary variable, making it slightly inconvenient to visually compare categories, this seems to be the nicest “at a glance” way yet to perform that operation without further manipulation.

But if you are OK with a little further manipulation, life becomes surprisingly easy! Although the output above does not look tidy or particularly manipulable, behind the scenes it does create a tidy dataframe-esque representation of each combination of variable and statistic. Here’s the top of what that looks like by default:

mydata <- group_by(data, category) %>%
 skim()

head(mydata, 10)

[Screenshot: the first rows of the tidy dataframe behind skim()]

It’s not super-readable to the human eye at a glance – but you might be able to tell that it has produced a “long” table that contains one row for every combination of group, variable and summary stat that was shown horizontally in the interactive console display. This means you can use standard methods of dataframe manipulation to programmatically post-process your summary.

For example, sticking to the tidyverse, let’s graphically compare the mean, median and standard deviation of the “score” variable, comparing the results between each value of the 3 “categories” we have in the data.

mydata %>%
 filter(variable == "score", stat %in% c("mean", "median", "sd")) %>%
 ggplot(aes(x = category, y = value)) +
 facet_grid(stat ~ ., scales = "free") +
 geom_col()

[Chart: faceted ggplot comparing the mean, median and sd of score across categories]

descr and dfSummary, from the summarytools package

Documentation

Let’s start with descr.

library(summarytools)
summarytools::descr(data)

[Screenshot: summarytools::descr() output]

The first thing I note is that this is another of the summary functions that (deliberately) only works with numerical data. Here, though, a useful red warning listing the columns that have been ignored is shown at the top. You also get a record count, and a nice selection of standard summary stats for the numeric variables, including information on missing data (for instance, Pct.Valid is the proportion of data which isn’t missing).

kable does not work here, although you can recast to a dataframe and later kable that, i.e.

kable(as.data.frame(summarytools::descr(data)))

The data comes out relatively tidy, although it does use rownames to represent the summary stat.

mydata <- summarytools::descr(data)
View(mydata)

[Screenshot: descr() results viewed as a dataframe]

There is also a transpose option if you prefer to arrange your variables as rows and your summary stats as columns.

summarytools::descr(data, transpose = TRUE)

[Screenshot: descr() transposed output]

There is no special functionality for group comparisons, although by() works, with the standard limitations.

The summarytools package also includes a fancier, more comprehensive, summarising function called dfSummary, intended to summarise a whole dataframe – which is often exactly what I want to do with this type of summarisation.

dfSummary(data)

[Screenshot: dfSummary() console output]

This function can deal with both categorical and numeric variables and provides a pretty output in the console with all of the most used summary stats, info on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as beautiful as the sparklines that the skimr function tries to show, but have the advantage that they work right away on Windows machines.

On the downside, the function seems very slow to perform its calculations at the moment. Even though I’m using a relatively tiny dataset, I had to wait an annoyingly long time for the command to complete – perhaps 1–2 minutes, versus other summary functions which complete almost instantly. This may be worth it to you for the clarity of output it produces, and if you are careful to run it once with all the variables and options you are interested in – but it can be quite frustrating during interactive exploratory analysis, where you might have reason to run it several times.

Update 2018-02-10: the processing-time issues should be fixed in version 0.8.2. Thanks very much to Dominic, the package author, for leaving a comment below and performing such a quick fix!

There is no special grouping feature.

Whilst it does work with kable, it doesn’t make for nice output. But don’t despair, there’s a good reason for that. The function has built-in capabilities to output directly into markdown or HTML.

This goes way beyond dumping a load of text into HTML format – instead giving you rather beautiful output like that shown below. This would be perfectly acceptable for sharing with other users, and less headache-inducing than other representations to stare at when trying to understand your dataset. Again though, it does take a surprisingly long time to generate.
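A sketch of how to invoke that, based on my reading of the summarytools documentation (the available output methods may vary by version):

view(dfSummary(data))                      # render to HTML in the RStudio viewer or browser
print(dfSummary(data), method = "render")  # for embedding in R Markdown documents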

[Screenshot: dfSummary() HTML output]