Books I read in 2017

Long-term readers (hi!) may recall my failure to achieve my target of reading 50 books in 2016. I had joined the 2016 Goodreads reading challenge and logged my reading activity, and hence had access to the data needed to track my progress at the end of the year. It turns out that 41 books is less than 50.

Being a glutton for punishment, I signed up again in 2017, with the same cognitively terrifying 50 book target – basically one a week, although I cannot allow myself to think that way. It is now 2018, so time to review how I did.

Goodreads allows you to log which books you are reading and when you finished them. The finish date is what counts for the challenge. Nefarious readers may spot a few potential exploits here, especially if you’re only competing for a single year. However, I tried to play the game in good faith (but did I actually do so? Perhaps the data will reveal!).

As you go through the year, Goodreads will update you on how you are doing with your challenge. Or, for us nerd types, you can download a much more detailed and useful CSV. There’s also the Goodreads API to explore, if that floats your boat.

Similarly to last year, I went with the CSV. I did have to hand-edit it a little, both to fill in some data that appears to be absent from the Goodreads dataset, and to add a couple of extra fields I wanted to track that Goodreads doesn’t natively support. I then popped the CSV into a Tableau dashboard, which you can explore interactively by clicking here.

Results time!

How much did I read?

Joyful times! In 2017 I got to, and even exceeded, my target! 55 books read.

In comparison to my 2016 results, I got ahead right from the start of the year, and widened the gap notably in Q2. You can see a similar boost to that witnessed in 2016 around the time of the summer holidays, weeks 33-35ish. Not working is clearly good for one’s reading obligations.

What were the characteristics of the books I read?

Although page count is a pretty vague and manipulable measure – different books have different physical sizes, font sizes, spacing, editions – it is one of the few measures where data is easily available, so we’ll go with that. In the case of eBooks or audio books (more on this later) without set “pages”, I used the page count of the respective paper version. I fully acknowledge that the rigour of this analysis falls under “fun” rather than “science”.

So the first revelation is that this year’s average pages per read book was 300, a roughly 10% decrease from last year’s average book. Hmm. Obviously, if everything else remains the same, the target of 50 books is easier to meet if you read shorter books! Of course, size doesn’t always reflect complexity, or any of the other factors that influence how long a book takes to complete.

I hadn’t deliberately picked short books – in fact, being aware of this incentive, I had consciously tried to avoid doing so, concentrating on reading what I wanted to read rather than what boosts the stats. However, even outside of this challenge, I (most likely?) only have a certain number of years to live, and hence feel a natural bias towards shorter books when everything else about them is equal. Why plough through 500 pages if you can get the same level of insight about a topic in 150?

The reassuring news is that, despite the shorter average length of book, I did read 20% more pages in total. This suggests I probably have upped the abstract “quantity” of reading, rather than just inflated the book count by picking short books. There was also a little less variation in page count between books this year than last by some measures.

In the distribution charts, you can see a spike of books at around 150 pages long this year which didn’t show up last year. I didn’t note a common theme in these books, but a relatively high proportion of them were audio books.

Although I am an avid podcast listener, I am not a huge fan of audio books as a rule. I love the idea as a method to acquire knowledge whilst doing endless chores or other semi-mindless activities. I would encourage anyone else with an interest in getting book contents into their brain to give them a whirl. But, for me, in practice I struggle to focus on them in any multi-tasking scenario, so end up hitting rewind a whole lot. And if I am in a situation where I can dedicate full concentration to informational intake, I’d rather use my eyes than my ears. For one, it’s so much faster, which is an important consideration when one has a book target! With all that, it’s perhaps not surprising that audio books are over-represented at the lower page counts for me. I know my limits.

I have heard tell that some people may consider audio books as invalid for the book challenge. In defence, I offer up that Goodreads doesn’t seem to feel this way in their blog post on the 2018 challenge. Besides, this isn’t the Olympics – at least no-one has sent me a gold medal yet – so everyone can make their own personal choice. For me, if it’s a method to get a book’s contents into my brain, I’ll happily take it. I just know I have to be very discriminating with regards to selecting audio books I can be sure I will be able to focus on. Even I would personally regard it as cheating to log a book that happened to be audio-streaming in the background when I was asleep. If you don’t know what the book was about, you can’t count it.

So, what did I read about?

What did I read?

Book topics are not always easy to categorise. The categories I used here are mainly the same as last year, based entirely on my 2-second opinion rather than any comprehensive Dewey Decimal-like system, so a degree of subjectivity was unavoidable. Is a book on political philosophy regarded as politics or philosophy? Rather than spend too much time fretting about classification, I just made a call one way or the other. Refer to the above comment re fun vs science.

The main change I noted was indeed a move away from purely philosophical entries towards those with a political tone. There was also a new category entrant this year: “health”. I developed an interest in improving mental well-being via mindfulness and meditation, which led me to read a couple of books on those subjects, as well as on sleep, all of which I have classified as health.

Despite me continuing to subjectively feel that I read the large majority of books in eBook form, I actually moved even further away from that being true this year. Slightly under half were in that form. That decrease has largely been taken up by the aforementioned audio books, of which I apparently read (listened to?) 10 this year. Similarly to last year, 2 of the audio entries were actually “Great Courses”, which are more like a sequence of university-style lectures, with an accompanying book containing notes and summaries.

My books have also been slightly less popular with the general Goodreads-rating audience this year, although not dramatically so.

Now, back to the subject of reading shorter books to make it easier to hit my target: the sheer sense of relief I felt on finishing book #50 – at which point I could go wild with relaxed, long, slow reading – made me wonder whether I had really beaten that bias. Had the length of the books I selected crept up as I got nearer to my target, even though that was never my intention?

Below, the top chart shows the average page count of the books completed each month, year on year.

Book length over time


The 2016 data risks producing somewhat invalid conclusions, especially if interpreted without reference to the bottom “count of books” chart, mainly because of September 2016, a month in which I read a single book that happened to be over 1,000 pages long.

I also hadn’t actually decided to participate in the book challenge at the start of 2016. I was logging my books, but just for fun (imagine that!). I don’t remember quite when it was suggested I should explicitly join the challenge, but before that point I’m less likely to have felt pressure to read faster or shorter.

Let’s look then only at 2017:

Book length over time 2

Sidenote: What happened in July?! I only read one book, and it wasn’t especially long. I can only assume Sally Scholz’s intro to feminism must have been particularly thought-provoking.

For reference, I hit book #50 in November this year. There does seem to be some suggestion in the data that I did indeed read longer books as time went on, despite my mental disavowal of doing so.

Stats geeks might like to know that the line of best fit shown in the top chart above could be argued to explain roughly 30% of the variation in book length over time, with each month adding an estimated extra 14 pages on top of a base of 211 pages. It should be stated that I didn’t spend long considering the best model or checking the relevant assumptions for this dataset; I just pressed “insert trend line” in Tableau and let it decide :).

I’m afraid the regression should not be considered traditionally statistically significant at the 0.05 level though, having a p-value of – wait for it – 0.06. Fortunately for my intention to publish the above in Nature :), I think people are increasingly aware of the silliness of hardline, context-free p-value thresholds and/or publication bias.
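For anyone who wants to replicate that kind of trend line outside of Tableau, the equivalent in R is a simple linear regression. A minimal sketch – the dataframe and column names here are invented, since the underlying data lives in my Tableau workbook:

# books_2017 is assumed to hold one row per month, with the average page count
# of the books completed that month: month_number (1-12) and avg_pages
fit <- lm(avg_pages ~ month_number, data = books_2017)

# summary() reports the intercept (~211 pages), the slope (~14 extra pages per month),
# the R-squared (~0.3) and the p-value (~0.06) quoted above
summary(fit)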

Nonetheless, as I participate in the 2018 challenge – now at 52 books, properly one a week – I shall be conscious of this trend and double up my efforts to keep reading based on quality rather than length. Of course, I remain very open – some might say hopeful! – that one sign of a quality author is that they can convey their material concisely. You generous readers of my ramblings may detect some hypocrisy here.

For any really interested readers out there, you can once more see the full list of the books I read, plus links to the relevant Goodreads description pages, on the last tab of the interactive viz.


My favourite R package for: summarising data

Hot on the heels of delving into the world of R frequency table tools, it’s now time to expand the scope and think about data summary functions in general. One of the first steps analysts should perform when working with a new dataset is to review its contents and shape.

How many records are there? What fields exist? Of which type? Is there missing data? Is the data in a reasonable range? What sort of distribution does it have? Whilst I am a huge fan of data exploration via visualisation, running a summary statistical function over the whole dataset is a great first step to understanding what you have, and whether it’s valid and/or useful.

So, in the usual format, what would I like my data summarisation tool to do in an ideal world? You may note some copy and paste from my previous post. I like consistency 🙂

  1. Provide a count of how many observations (records) there are.
  2. Show the number, names and types of the fields.
  3. Be able to provide info on as many types of fields as possible (numeric, categorical, character, etc.).
  4. Produce appropriate summary stats depending on the data type. For example, if you have a continuous numeric field, you might want to know the mean. But a “mean” of an unordered categorical field makes no sense.
  5. Deal with missing data transparently. It is often important to know how many of your observations are missing. Other times, you might only care about the statistics derived from those which are not missing.
  6. For numeric data, produce at least the following summary stats, without too many more esoteric ones cluttering up the screen (of course, what I regard as esoteric may be very different to what you would):
    1. Mean
    2. Median
    3. Range
    4. Some measure of variability, probably standard deviation.
    5. Optionally, some key percentiles
    6. Also optionally, some measures of skew, kurtosis etc.
  7. For categorical data, produce at least these types of summary stats:
    1. Count of distinct categories
    2. A list of the categories – perhaps restricted to the most popular if there are a high number.
    3. Some indication as to the distribution – e.g. does the most popular category contain 10% or 99% of the data?
  8. Be able to summarise a single field or all the fields in a particular dataframe at once, depending on user preference.
  9. Ideally, optionally be able to summarise by group, where group is typically some categorical variable. For example, maybe I want to see a summary of the mean average score in a test, split by whether the test taker was male or female.
  10. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  11. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or similar, for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance in the R console, without having to use further coding.
  12. Work with “kable” from the Knitr package, or similar table output tools. I often use R markdown and would like the ability to show the summary statistics output in reasonably presentable manner.
  13. Have a sensible set of defaults (aka facilitate my laziness).

What’s in base R?

The obvious place to look is the “summary” command.

This is the output, when run on a very simple data file consisting of two categorical (“type”, “category”) and two numeric (“score”, “rating”) fields. Both type and score have some missing data. The others do not. Rating has both one particularly high and one particularly low outlier.
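The data file itself isn’t attached to this post, but if you want to follow along, a stand-in with the same shape is easy to fabricate – the column names below come from the description above, and the values are invented:

set.seed(42)
n <- 100

data <- data.frame(
  type     = sample(c("A", "B", "C", "D", "E", NA), n, replace = TRUE),  # categorical, some missing
  category = sample(c("X", "Y", "Z"), n, replace = TRUE),                # categorical, complete
  score    = ifelse(runif(n) < 0.1, NA, rnorm(n, mean = 50, sd = 10)),   # numeric, some missing
  rating   = rnorm(n, mean = 3, sd = 0.5),                               # numeric, complete
  stringsAsFactors = TRUE
)

# give rating one very high and one very low outlier
data$rating[1] <- 30
data$rating[2] <- -25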

summary(data)

basesummary

This isn’t too terrible at all.

It clearly shows we have 4 fields, and it has determined that type and category are categorical, hence displaying the distribution of counts per category. It works out that score and rating are numerical, so gives a different, sensible, summary.

It highlights which fields have missing data. But it doesn’t show the overall count of records, although you could manually work it out by summing up the counts in the categorical variables (but why would you want to?). There’s no standard deviation. And whilst it’s OK to read interactively, it is definitely not “tidy”, pipeable or kable-compatible.

Just as with many other commands, analysing by groups could be done with the base R “by” command. But the output is “vertical”, making it hard to compare the same stats between groups at a glance, especially if there are a large number of categories. To determine the difference in means between category X and category Z in the below would be a lot easier if they were visually closer together. Especially if you had many more than 3 categories.

by(data, data$category, summary)

bybasesummary

So, can we improve on that effort by using libraries that are not automatically installed as part of base R? I tested 5 options. Inevitably, there are many more possibilities, so please feel free to write in if you think I missed an even better one.

  • describe, from the Hmisc package
  • stat.desc from pastecs
  • describe from psych
  • skim from skimr
  • descr and dfSummary from summarytools

Was there a winner from the point of view of fitting nicely to my personal preferences? I think so, although the choice may depend on your specific use-case.

For readability, compatibility with the tidyverse, and ability to use the resulting statistics downstream, I really like the skimr feature set. It also facilitates group comparisons better than most. This is my new favourite.

If you prefer to prioritise the visual quality of the output, at the expense of processing time and flexibility, dfSummary from summarytools is definitely worth a look. It’s a very pleasing way to see a summary of an entire dataset.
Update: thanks to Dominic, who left a comment below after very quickly fixing the processing time issue in version 0.8.2.

If you don’t enjoy either of those, you are probably fussy :). But for reference, Hmisc’s describe was my most-used variant before conducting this exploration.

describe, from the Hmisc package

Documentation

library(Hmisc)
Hmisc::describe(data)

hmiscdescribe

This clearly provides the count of variables and observations. It works well with both categorical and numerical data, giving appropriate summaries in each case, even adapting its output to take into account for instance how many categories exist in a given field. It shows how much data is missing, if any.

For numeric data, instead of giving the range as such, it shows the highest and lowest 5 entries. I actually like that a lot. It helps to show at a glance whether you have one weird outlier (e.g. a totals row that got accidentally included in the dataframe) or whether there are several values many standard deviations away from the mean. On the subject of deviations, there’s no specific variance or standard deviation value shown – although you can infer much about the distribution from the several percentiles it shows by default.

The output is nicely formatted and spacious for reading interactively, but isn’t tidy or kableable.

There’s no specific summary by group function although again you can pass this function into the by() command to have it run once per group, i.e. by(data, data$type, Hmisc::describe)

The output from that, however, is very “long” and naturally ordered by group rather than by variable, making comparisons of the same stat between different groups quite challenging at a glance.

stat.desc, from the pastecs package

Documentation

library(pastecs)
stat.desc(data)

statdesc

The first thing to notice is that this only handles numeric variables, producing NA for the fields that are categorical. It does provide all the key stats and missingness info you would usually want for the numeric fields though, and it is great to see measures of uncertainty like confidence intervals and standard errors available. With other parameters you can also apply tests of normality.
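For example – going from my reading of the pastecs documentation, so double-check the argument names against your installed version – the norm parameter adds skewness, kurtosis and a Shapiro-Wilk normality test to the output:

library(pastecs)

# append the normality-related statistics to the default descriptive ones,
# rounding for a more readable console display
round(stat.desc(data, norm = TRUE), 2)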

It works well with kable. The output is fairly manipulable in terms of being tidy, although the measures show up as row labels as opposed to a true field. You get one column per variable, which may or may not be what you want if passing onwards for further analysis.

There’s no inbuilt group comparison function, although of course the by() function works with it, producing a list containing one copy of the above style of table for each group – again, great if you want to see a summary of a particular group, less great if you want to compare the same statistic across groups.

describe and describeBy, from the psych package

Documentation

library(psych)
psych::describe(data)

psychdescribe

OK, this is different! It has included all the numeric and categorical fields in its output, but the categorical fields show up, somewhat surprisingly if you’re new to the package, with the summary stats you’d normally associate with numeric fields. This is because the default behaviour is to recode categories as numbers, as described in the documentation:

…variables that are categorical or logical are converted to numeric and then described. These variables are marked with an * in the row name…Note that in the case of categories or factors, the numerical ordering is not necessarily the one expected. For instance, if education is coded “high school”, “some college” , “finished college”, then the default coding will lead to these as values of 2, 3, 1. Thus, statistics for those variables marked with * should be interpreted cautiously (if at all).

As the docs indicate, this can be risky! It is certainly risky if you are not expecting it :). I don’t generally have use-cases where I want this to happen automatically, but if you did, and you were very careful how you named your categories, it could be handy for you.

For the genuinely numeric data though, you get most of the key statistics and a few nice extras. It does not indicate where data is missing though.

The output works with kable, and is pretty tidy, outside of the common issue of using rownames to represent the variable the statistics are summarising, if we are being pernickety.

This command does have a specific summary-by-group variation, describeBy. Here’s how we’d use it if we want the stats for each “type” in my dataset, A – E.

psych::describeBy(data, data$type)

psychdescribeby

Everything you need is there, subject to the limitations of the basic describe(). It’s much more compact than using the by() command on some of the other summary tools, but it’s still not super easy to compare the same stat across groups visually.  It also does not work with kable and is not tidy.

The “mat” parameter does allow you to produce a matrix output of the above.

psych::describeBy(data, data$type, mat = TRUE)

psychdescribebymat

This is visually less pleasant, but it does enable you to produce a potentially useful dataframe, which you could tidy up or use to produce group comparisons downstream, if you don’t mind a little bit of post-processing.

skim, from the skimr package

Documentation

library(skimr)
skim(data)

skim

At the top skim clearly summarises the record and variable count. It is adept at handling both categorical and numeric data. For readability, I like the way it separates them into different sections dependent on data type, which makes for quick interpretation given that different summary stats are relevant for different data types.

It reports missing data clearly, and has all the most common summary stats I like.

Sidenote: the issue described in the next couple of paragraphs is no longer a problem as of skimr 1.0.1 (see the update further down), although the skim_with function may still be of interest.

There is what appears to be a strange sequence of unicode-esque characters like <U+2587> shown at the bottom of the output. In reality, these are intended to be a graphical visualisation of distributions using sparklines, hence the column name “hist”, referring to histograms. This is a fantastic idea, especially to see in-line with the other stats in the table. Unfortunately, they do not by default display properly in the Windows environment which is why I see the U+ characters instead.

The skimr documentation details how this is actually a problem with underlying R code rather than this library, which is unfortunate as I suspect this means there cannot be a quick fix. There is a workaround involving changing one’s locale, although I have not tried this, and probably won’t until I’ve established whether there would be any side effects in doing so.

In the meantime, if the nonsense-looking U+ characters bother you, you can turn off the column that displays them by changing the default summary that skim uses per data type. There’s a skim_with function that you can use to add your own summary stats into the display, but it also works to remove existing ones. For example, to remove the “hist” column:

skim_with(integer = list(hist = NULL))
skim(data)

skimnohist

Now we don’t see the messy unicode characters, and we won’t for the rest of our skimming session.

UPDATE 2018-01-22: the geniuses who designed skimr actually did find a way to make the sparklines appear in Windows after all! Just update your skimr version to version 1.0.1 and you’re back in graphical business, as the rightmost column of the integer variables below demonstrates.

skim2


The output works well with kable. Happily, it also respects the group_by function from dplyr, which means you can produce summaries by group. For example:

group_by(data, category) %>%
 skim()

groupbyskim

Whilst the output is still arranged by the grouping variable before the summary variable, making it slightly inconvenient to visually compare categories, this seems to be the nicest “at a glance” way yet to perform that operation without further manipulation.

But if you are OK with a little further manipulation, life becomes surprisingly easy! Although the output above does not look tidy or particularly manipulable, behind the scenes it does create a tidy dataframe-esque representation of each combination of variable and statistic. Here’s the top of what that looks like by default:

mydata <- group_by(data, category) %>%
 skim()

head(mydata, 10)

skimdf

It’s not super-readable to the human eye at a glimpse – but you might be able to tell that it has produced a “long” table that contains one row for every combination of group, variable and summary stat that was shown horizontally in the interactive console display. This means you can use standard methods of dataframe manipulation to programmatically post-process your summary.

For example, sticking to the tidyverse, let’s graphically compare the mean, median and standard deviation of the “score” variable, comparing the results between each value of the 3 “categories” we have in the data.

mydata %>%
 filter(variable == "score", stat %in% c("mean", "median", "sd")) %>%
 ggplot(aes(x = category, y = value)) +
 facet_grid(stat ~ ., scales = "free") +
 geom_col()

skimggplot

descr and dfSummary, from the summarytools package

Documentation

Let’s start with descr.

library(summarytools)
summarytools::descr(data)

descrsummarytools

The first thing I note is that this is another one of the summary functions that (deliberately) only works with numerical data. Here though, a useful red warning showing which columns have thus been ignored is shown at the top. You also get a record count, and a nice selection of standard summary stats for the numeric variables, including information on missing data (for instance Pct.Valid is the proportion of data which isn’t missing).

kable does not work here, although you can recast to a dataframe and later kable that, i.e.

kable(as.data.frame(summarytools::descr(data)))

The data comes out relatively tidy although it does use rownames to represent the summary stat.

mydata <- summarytools::descr(data)
View(mydata)

viewmydata

There is also a transpose option if you prefer to arrange your variables by row and summary stats as columns.

summarytools::descr(data, transpose = TRUE)

descrtranspose

There is no special functionality for group comparisons, although by() works, with the standard limitations.

The summarytools package also includes a fancier, more comprehensive, summarising function called dfSummary, intended to summarise a whole dataframe – which is often exactly what I want to do with this type of summarisation.

dfSummary(data)

summarytoolsdf

This function can deal with both categorical and numeric variables and provides a pretty output in the console with all of the most used summary stats, info on sample sizes and missingness. There’s even a “text graph” intended to show distributions. These graphs are not as beautiful as the sparklines that the skimr function tries to show, but have the advantage that they work right away on Windows machines.

On the downside, the function seems very slow to perform its calculations at the moment. Even though I’m using a relatively tiny dataset, I had to wait an annoyingly large amount of time for the command to complete – perhaps 1-2 minutes, vs other summary functions which complete almost instantly. This may be worth it to you for the clarity of output it produces, and if you are careful to run it once with all the variables and options you are interested in – but it can be quite frustrating when engaged in interactive exploratory analysis where you might have reason to run it several times.

Update 2018-02-10: the processing time issues should be fixed in version 0.8.2. Thanks very much to Dominic, the package author, for leaving a comment below and performing such a quick fix!

There is no special grouping feature.

Whilst it does work with kable, it doesn’t make for nice output. But don’t despair, there’s a good reason for that. The function has built-in capabilities to output directly into markdown or HTML.

This goes way beyond dumping a load of text into HTML format – instead giving you rather beautiful output like that shown below. This would be perfectly acceptable for sharing with other users, and less headache-inducing than other representations if you’re staring at it to gain an understanding of your dataset. Again though, it does take a surprisingly long time to generate.

summarytoolsdfhtml.PNG
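For reference, getting that rendered output doesn’t go through kable at all – it comes from summarytools’ own print and view methods. A sketch based on my reading of the docs at the time, so the argument names may differ between versions:

library(summarytools)

# open the HTML version of the summary in the RStudio viewer or a browser
view(dfSummary(data))

# inside an R Markdown document, the print method can render HTML directly
# (use a chunk option of results = 'asis')
print(dfSummary(data), method = "render")

Again, it takes a while to run, but the result is genuinely shareable.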

My favourite R package for: frequency tables

Back for the next part of the “which of the infinite ways of doing a certain task in R do I most like today?” series. This time, what aspect of analysis could be more fascinating to focus on than: frequency tables?

OK, most topics might actually be more fascinating. Especially when my definition of frequency tables here will restrict itself to 1-dimensional variations, which in theory a primary school kid could calculate manually, given time. But they are such a common tool, used for all sorts of data validation and exploratory data analysis jobs, that finding a nice implementation might prove to be a time- and sanity-saving investment over a lifetime of counting how many things are of which type.

Here’s the top of an example dataset. Imagine a “tidy” dataset, such that each row is one observation. I would like to know how many observations (e.g. people) are of which type (e.g. demographic – here a category between A and E inclusive).

Type  Person ID
E     1
E     2
B     3
B     4
B     5
B     6
C     7

I want to be able to say things like: “4 of my records are of type E”, or “10% of my records are of type A”. The dataset I will use in my below example is similar to the above table, only with more records, including some with a blank (missing) type.
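Again, the exact file isn’t attached, but if you want to run the examples below yourself, a stand-in is easy to generate – everything here (the length, the proportions and the PersonID column name) is invented:

set.seed(123)
n <- 200

data <- data.frame(
  Type     = sample(c("A", "B", "C", "D", "E", NA), n, replace = TRUE,
                    prob = c(0.1, 0.2, 0.15, 0.1, 0.35, 0.1)),  # type E most common, some missing
  PersonID = 1:n,
  stringsAsFactors = TRUE
)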

What would I like my 1-dimensional frequency table tool to do in an ideal world?

  1. Provide a count of how many observations are in which category.
  2. Show the percentage or proportion of total observations that each category represents.
  3. Be able to sort by count, so I see the most popular options at the top – but only when I want to, as sometimes the order of data is meaningful for other reasons.
  4. Show a cumulative %, sorted by count, so I can see quickly that, for example, the top 3 options make up 80% of the data – useful for some swift Pareto analysis and the like.
  5. Deal with missing data transparently. It is often important to know how many of your observations are “missing”. Other times, you might only care about the statistics derived from those which are not missing.
  6. If an external library, then be on CRAN or some other well supported network so I can be reasonably confident the library won’t vanish, given how often I want to use it.
  7. Output data in a “tidy” but human-readable format. Being a big fan of the tidyverse, it’d be great if I could pipe the results directly into ggplot, dplyr, or whatever for some quick plots and manipulations. Other times, if working interactively, I’d like to be able to see the key results at a glance, without having to use further coding.
  8. Work with “kable” from the Knitr package, or similar table output tools. I often use R markdown and would like the ability to show the frequency table output in reasonably presentable manner.
  9. Have a sensible set of defaults (aka facilitate my laziness).


So what options come by default with base R?

Most famously, perhaps the “table” command.

table(data$Type)

table

A super simple way to count up the number of records by type. But it doesn’t show percentages or any sort of cumulation. By default it hasn’t highlighted that there are some records with missing data. It does have a useNA parameter that will show that though if desired.

table(data$Type, useNA = "ifany")

tablena

The output also isn’t tidy and doesn’t work well with Knitr.

The table command can be wrapped in the prop.table command to show proportions.

prop.table(table(data$Type))

proptable

But you’d need to run both commands to understand the count and percentages, and the latter inherits many of the limitations from the former.
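You can of course glue the two together yourself in base R – something like the below – but it already feels like more work than the task deserves:

counts <- table(data$Type, useNA = "ifany")

# one small matrix with both the counts and the percentages
cbind(
  n       = counts,
  percent = round(100 * prop.table(counts), 1)
)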

So what’s available outside of base R? I tested 5 options, although there are, of course, countless more. In no particular order:

  • tabyl, from the janitor package
  • tab1, from epiDisplay
  • freq, from summarytools
  • CrossTable, from gmodels
  • freq, from questionr

Because I am fussy, I managed to find some slight personal niggle with all of them, so it’s hard to pick an overall personal winner for all circumstances. Several came very close. I would recommend looking at any of the janitor, summarytools and questionr package functions outlined below if you have similar requirements and tastes to me.

tabyl, from the janitor package

Documentation

library(janitor)
tabyl(data$Type, sort = TRUE)

janitor

This is a pretty good start! By default, it shows counts, percents, and percent of non-missing data. It can optionally sort in order of frequency. The output is tidy, and works with kable just fine. The only thing missing really is a cumulative percentage option. But it’s a great improvement over base table.
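And if you do want the cumulative percentage, it’s only one extra dplyr line on top of tabyl’s tidy output (the column names below are the ones tabyl produced for me – check them against your version):

library(janitor)
library(dplyr)

tabyl(data$Type, sort = TRUE) %>%
  mutate(cum_percent = cumsum(percent))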

I do find myself constantly misspelling “tabyl” as “taybl” though, which is annoying, but not really something I can criticise anyone else for.

tab1, from the epiDisplay package

Documentation

library(epiDisplay)
tab1(data$Type, sort.group = "decreasing", cum.percent = TRUE)

epidisplayepidisplay2

This one is pretty fully featured. It even (optionally) generates a visual frequency chart output as you can see above. It shows the frequencies, proportions and cumulative proportions both with and without missing data. It can sort in order of frequency, and has a totals row so you know how many observations you have all in.

However it isn’t very tidy by default, and doesn’t work with knitr. I also don’t really like the column names it assigns, although one can certainly claim that’s pure personal preference.

A greater issue may be that the cumulative columns don’t seem to work as I would expect when the table is sorted, as in the above example. The first entry in the table is “E”, because that’s the largest category. However, it isn’t 100% of the non-missing dataset, as you might infer from the fifth numerical column. In reality it’s 31.7%, per column 4.

As far as I can tell, the function is working out the cumulative frequencies before sorting the table – so as category E is the last category in the data file it has calculated that by the time you reach the end of category E you have 100% of the non-missing data in hand. I can’t envisage a situation where you would want this behaviour, but I’m open to correction if anyone can.

freq, from the summarytools package

Documentation

library(summarytools)
summarytools::freq(data$Type, order = "freq")

summarytools

This looks pretty great. Has all the variations of counts, percents and missing-data output I want – here you can interpret the “% valid” column as “% of all non-missing”. Very readable in the console, and works well with Knitr. In fact it has some further nice formatting options that I wasn’t particularly looking for.

It is pretty much tidy, although it has a minor niggle in that the output always includes the total row. It’s often important to know your totals, but if you’re piping it to other tools or charts, you may have to use another command to filter that row out each time, as there doesn’t seem to be an obvious way to prevent it being included with the rest of the dataset when running it directly.
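Filtering the total out isn’t hard – a rough sketch below, noting that the row label was literally “Total” in the version I used, so check yours:

library(summarytools)

freqs <- summarytools::freq(data$Type, order = "freq")

# convert to a plain dataframe and drop the Total row before passing it on
freqs_df <- as.data.frame(freqs)
freqs_df <- freqs_df[rownames(freqs_df) != "Total", ]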

CrossTable, from the gmodels library

Documentation

library(gmodels)
CrossTable(data$Type)

crosstable

Here the results are displayed in a horizontal format, a bit like the base “table”. Here though, the proportions are clearly shown, albeit not with a cumulative version. It doesn’t highlight that there are missing values, and isn’t “tidy”. You can get it to display a vertical version (add the parameter max.width = 1), which is visually distinctive, but untidy in the usual R tidyverse sense.

It’s not a great tool for my particular requirements here, but most likely this is because, as you may guess from the command name, it’s not particularly designed for 1-way frequency tables. If you are crosstabulating multiple dimensions it may provide a powerful and visually accessible way to see counts, proportions and even run hypothesis tests.

freq, from the questionr package

Documentation

library(questionr)
questionr::freq(data$Type, cum = TRUE, sort = "dec", total = TRUE)

questionr

Counts, percentages, cumulative percentages, missing values data, yes, all here! The table can optionally be sorted in descending frequency, and works well with kable.

It is mostly tidy, but also has an annoyance in that the category values themselves (A–E) are row labels rather than a standalone column. This means you may have to pop them into a new column for best use in any downstream tidy tools. That’s easy enough with e.g. dplyr’s add_rownames command, but it is another processing step to remember, which is not a huge selling point.
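For completeness, that extra step might look something like the below (add_rownames has since been superseded by tibble’s rownames_to_column, which does the same job):

library(questionr)
library(dplyr)

# turn the category row labels into a proper column for downstream tidy tools
questionr::freq(data$Type, cum = TRUE, sort = "dec") %>%
  as.data.frame() %>%
  tibble::rownames_to_column(var = "Type")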

There is a total row at the bottom, but it’s optional, so just don’t use the “total” parameter if you plan to pass the data onwards in a way where you don’t want to risk double-counting your totals. There’s an “exclude” parameter if you want to remove any particular categories from analysis before performing the calculations as well as a couple of extra formatting options that might be handy.

My favourite R package for: correlation

R is a wonderful, flexible, if somewhat arcane tool for analytics of all kinds. Part of its power, yet also its ability to bewilder, comes from the fact that there are so many ways of doing the same, or similar, things. Many of these ways are instantly available thanks to many heroes of the R world creating and distributing free libraries to refine existing functionality and add new abilities.

Looking at a list from one of the most popular sources for these packages, CRAN, shows that its collection gets new entries several times a day, and there are 11,407 of them at the time of writing.

With that intimidating stat in mind, I will keep a few notes on this blog as to my current favourite base or package-based methods for some common analytics tasks. Of course these may change regularly, based on new releases or my personal whims. But for now, let’s tackle correlations. Here I mean simple statistical correlations between 2 sets of data, the most famous one of which is likely the Pearson correlation coefficient, aka Pearson’s R.

What would I like to see in my ideal version of a correlation calculator? Here’s a few of my personal preferences in no particular order.

  1. Can deal with multiple correlation tests at once. For example, maybe I have 5 variables and I’d like to see the correlation of each one of them with each of the other 4 variables.
  2. Visualises the results nicely, for example in a highlighted correlation matrix. Default R often produces informative but somewhat uninspiring text output. I have got spoiled with the luxury of data visualisation tools so after a heavy day’s analysis I prefer to take advantage of the many ways dataviz can make analytic output easier to decipher for humans.
  3. If the output is indeed a dataviz, I have a slight preference for it to use ggplot charting all other things being equal. Ending up with a proper ggplot object is nice both in terms of the default visual settings vs some other forms of R chart, and also that you can then in theory use ggplot methods to adapt or add to it.
  4. Can produce p values, confidence intervals, or some other way of suggesting whether any correlations found are statistically significant or not.
  5. Can handle at least a couple of types of correlation calculations, the most common of which are probably Pearson correlation coefficient and Spearman’s rank correlation coefficient.

Default R has a couple of correlation commands built into it. The most common is probably “cor“. Here’s an example of what it produces, using a test dataset named test_data with 5 variables, named a, b, c, d and e (one per column).
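There’s nothing special about test_data; if you want to experiment, something like this generates five loosely related columns to play with (the relationships are entirely made up):

set.seed(2017)
n <- 100

test_data <- data.frame(a = rnorm(n))
test_data$b <- 0.7 * test_data$a + rnorm(n, sd = 0.5)   # fairly strong positive relationship with a
test_data$c <- -0.4 * test_data$a + rnorm(n)            # weaker negative relationship with a
test_data$d <- rnorm(n)                                 # pure noise
test_data$e <- 0.5 * test_data$b + rnorm(n)             # related to b, and hence indirectly to a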

cor(test_data)

cor

So, it does multiple tests at once, and can handle Pearson, Spearman and Kendall correlation calculations, via changing the “method” parameter (which defaults to Pearson if you don’t specify, as in my example). But it doesn’t show the statistical significance of the correlations, and a potentially large text table of 8-place decimal numbers is not the sort of visualised output that would help me interpret the results in a particularly efficient way.

A second relevant default R command is “cor.test“. This one only allows you to make a single correlation comparison, for example between variable a and variable b.

cor.test(test_data$a, test_data$b)

cortest

So here we see it does return both a p value and a confidence interval to help us judge the significance of the correlation size noted. You can change the alternative hypothesis and confidence interval range via parameters. It can also do the same 3 types of correlation that “cor” supports. But, as noted, it can only compare two variables at once without further commands. And the output is again a bunch of text. That is really OK here, as you are focusing only on one comparison. But it’s going to be pretty tedious to run and decipher if you want to compare each one of a few variables against each of the others.

So, is there a package solution that makes me happy? As you might guess, yes, there’s actually a few contenders. But my current favourite is probably “ggcorrplot“. The manual is here, and there’s a handy usage guide here.

Suffice to say:

  1. It allows you to compare several variables against each other at the same time.
  2. It visualises the variables in a colour-coded correlation matrix
  3. The visualisation is a ggplot
  4. It can produce p values, using the accompanying function cor_pmat(), which can then be shown on the visualisation in various ways.
  5. It uses the results from the built in cor() function, so can handle the same 3 types of correlation.

There’s a bunch of options to select from, but here’s the default output

# load the ggcorrplot package
library(ggcorrplot)

# calculate correlations using cor and store the results
corrdata <- cor(test_data)

# use the package's cor_pmat function to calculate p-values for the correlations
p.mat <- cor_pmat(test_data)

# produce a nice highlighted correlation matrix
ggcorrplot(corrdata, title = "Correlation matrix for test data")

The results look like:

ggcordefault

You can see it produces a correlation matrix, colour coded as to the direction and strength of the correlations. It doesn’t show anything about the statistical significance. Kind of pretty for an overview glance, but it could be rather more informative.

I much prefer to use a couple of options that show the actual correlation values and the significance; the ones I most commonly use probably being this set.

ggcorrplot(corrdata, title = "Correlation matrix for test data", lab=TRUE, p.mat = p.mat, sig.level = .05)

ggcorrplot

Here, the correlation coefficients are superimposed on the grid, so you can check immediately the strength of the correlation rather than try and compare to the colour scale.

You can also see that some of the cells are crossed out (for example the correlation between variable c and variable e in the above). This means that the correlation detected is not considered to be significant at the 0.05 level. That level can be changed, or the insignificant correlations hidden entirely if you prefer not to be distracted by them in the first place.
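Hiding them completely is just another parameter – insig, in the version of ggcorrplot I was using:

# leave non-significant cells blank instead of crossing them out
ggcorrplot(corrdata,
           title = "Correlation matrix for test data",
           lab = TRUE,
           p.mat = p.mat,
           sig.level = .05,
           insig = "blank")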

Transactions by active subscribers formulae in Tableau

This blog returns from the dead (dormant?) with a quick note-to-self on how to do something that sounds simple but proved slightly complicated in practice, using Tableau.

Here’s a scenario, although many others would fit the same pattern. Imagine you have a business that is subscription based, where people can subscribe and cancel whenever they wish. Whilst subscribed, your customer can buy products from you anytime they want.  They can’t buy products if not subscribed.

What you want to know is, for a given cohort of subscribers, how many of those people who are still subscribed purchased a product within their first, second…nth month?

So we want to be able to say things like: “out of the X people that started a subscription last year, Y% of those who were still subscribed for at least Z months bought a product in their Zth month”.

It’s the “who were still subscribed” part that made this a little tricky, at least with the datasource I was dealing with.

Here’s a trivially small example of what I had – a file that has 1 row per sale per customer.

Subscriber ID  Length of subscription  Month of subscription  Transaction type
1              5                       1                      Sale
1              5                       2                      Sale
1              5                       3                      Sale
2              7                       1                      Sale
2              7                       6                      Sale
3              1                       1                      Sale
4              8                       1                      Sale
4              8                       2                      Sale
4              8                       4                      Sale
4              8                       5                      Sale
5                                      1                      Sale
5                                      2                      Sale
5                                      3                      Sale
5                                      8                      Sale
5                                      9                      Sale

For simplicity, let’s assume every customer has at least one sale. The columns tell you:

  • the ID number of the subscriber
  • the length of the subscription from start to finish, in months. If the length is blank then it means it’s still active today so we don’t know how long it will end up being.
  • the month number of the product sale
  • a transaction type, which for our purposes is always “sale”

Example interpretation: subscriber ID 1 had a subscription that lasted 5 months. They purchased a product in month 1, 2 and 3 (but not 4 or 5).

It’d be easy to know that you had 5 people at the start (count distinct subscriber ID), and that you had 2 transactions in month 3 (count distinct subscriber ID where month of subscription = 3). But how many of those 5 people were still subscribed at that point?

Because this example is so small, you can easily do that by eye. You can see in the data table that we had one subscription, ID 3, that only had a subscription length of 1 month. Everyone else stayed longer than 3 months – so there were 4 subscriptions left at month 3.

The figure we want to know is what proportion of the active subscribers at month 3 bought a product. The correct answer is the number of subscriptions making a product purchase at month 3 divided by the number of subscriptions still active at month 3. Here, that’s 2 / 4 = 50%.

So how do we get that in Tableau, with millions of rows of data? As you might guess, one method involves the slightly-dreaded “table calculations“. Layout is usually important with table calculations. Here’s one way that works. We’ll build it up step by step, although you can of course combine many of these steps into one big fancy formula if you insist.

Firstly, I modified the data source (by unioning) so that when a subscription was cancelled it generated a “cancelled subscription” transaction.  That looked something like this after it was done.

Subscriber ID  Length of subscription  Month of subscription  Transaction type
1              5                       1                      Sale
1              5                       2                      Sale
1              5                       3                      Sale
1              5                       5                      Cancelled subscription
2              7                       1                      Sale
2              7                       6                      Sale
2              7                       7                      Cancelled subscription
3              1                       1                      Sale
3              1                       1                      Cancelled subscription
4              8                       1                      Sale
4              8                       2                      Sale
4              8                       4                      Sale
4              8                       5                      Sale
4              8                       8                      Cancelled subscription
5                                      1                      Sale
5                                      2                      Sale
5                                      3                      Sale
5                                      8                      Sale
5                                      9                      Sale

Note there are the original sales transactions and now a new “cancel” row for every subscription that was cancelled. In these transactions the “month of subscription” is set to the actual month the subscription was cancelled, which we know from the field “Length of subscription”.

Here are the formulae we’ll need to work out, for any given month, how many people were still active, and how many of those bought something:

  • The total count of subscribers in the cohort:
    Count of distinct subscribers in cohort

    { FIXED : [Count of distinct subscribers]}
  • The number of subscribers who cancelled in the given month:
    Count of distinct subscribers cancelling

     COUNTD(
     IF [Transaction type] = "Cancelled subscription"
     THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Derived from those two figures, the number of subscribers who are still active at the given month:
    Count of subscribers still active

    AVG([Count of distinct subscribers in cohort]) - (RUNNING_SUM([Count of distinct subscribers cancelling])- [Count of distinct subscribers cancelling])
  • The number of subscribers who made a purchase in the given month:
    Count of distinct subscribers making purchase

     COUNTD(
     IF [Transaction type] = "Sale" THEN [Subscriber ID]
     ELSE NULL
     END
     )
  • Finally, derived from the last two results, the proportion of subscribers who made a purchase as a percentage of those who are still active
    Proportion of distinct active subscribers making purchase

    [Count of distinct subscribers making purchase] / [Count of subscribers still active]

Let’s check if the logic worked by building a simple text table. Lay months on rows, and the above formulae as columns.

Table proof

That seems to match expectations. We’re certainly seeing the 50% of actives making a purchase in month 3 that was manually calculated above.

Plot a line chart with month of subscription on columns and proportion of distinct active subscribers making purchase on rows, and there we have the classic rebased propensity to purchase curve.

Propensity curve

(although this data being very small and very fake makes the curve look very peculiar!)

Note that we first experimented with this back in ye olde days of Tableau, before the incredible Level Of Detail calculations were available. I have found many cases where it’s worth re-visiting past table calculation work and considering if LoD expressions would work better, and this may well be one of them.


The Datasaurus: a monstrous Anscombe for the 21st century

Most people trained in the ways of data visualisation will be very familiar with Anscombe’s Quartet. For the uninitiated, it’s a set of 4 fairly simple looking X-Y scatterplots that look like this.

Anscombe's Quartet

What’s so great about those then? Well, the reason data vizzers get excited starts to become clear when you realise that the dotted grey lines I have superimposed on each small chart are in fact the mean average of X and Y in each case. And they’re basically the same for each chart.

The identikit summary stats go beyond mere averages. In fact, the variance of both X and Y (and hence the standard deviation) is also pretty much the same in every chart. As is the correlation coefficient of X and Y, and the regression line that would be the line of best fit were you to generate a linear model based on each of those 4 datasets.

The point is to show the true power of data visualisation. There are a bunch of clever-sounding summary stats (r-squared is a good one) that some nefarious statisticians might like to baffle the unaware with – but they are oftentimes so summarised that they can lead you to an entirely misleading perception, especially if you are not also an adept statistician.

For example, if someone tells you that their fancy predictive model demonstrates that the relationship between x and y can be expressed as “y = 3 + 0.5x” then you have no way of knowing whether the dataset the model was trained on was that from Anscombe 1, for which it’s possible that it may be a good model, or Anscombe 2, for which it is not, or Anscombe 3 and 4, where the outliers make that model sub-par in reality, to the point where a school child issued with a sheet of graph paper could probably make a better one.

Yes analytics end-users, demand pictures! OK, there are so many possible summary stats out there that someone expert in collating and mentally visualising the implication of a combination of a hand-picked collection of 30 decimal numbers could perhaps have a decent idea of the distribution of a given set of data – but, unless that’s a skill you already have (clue: if the word “kurtosis” isn’t intuitive to you, you don’t, and it’s nothing to be ashamed of), then why spend years learning to mentally visualise such things, when you could just go ahead and actually visualise it?

But anyway, the quartet was originally created by Mr Anscombe in 1973. Now, a few decades later, it’s time for an even more exciting scatterplot collection, courtesy of Justin Matejka and George Fitzmaurice, taken from their paper “Same Stats, Different Graphs“.

They’ve taken the time to create the Datasaurus Dozen. Here they are:

Datasaurus Dozen.png

What what? A star dataset has the same summary statistics as a bunch of lines, an X, a circle or a bunch of other patterns that look a bit like a migraine is coming on?

Yes indeed. Again, these 12 charts all have the same (well, extremely similar) X & Y means, the same X & Y standard deviations and variances, and also the same X & Y linear correlations.

12 charts are obviously more dramatic than 4, and the Datasaurus dozen certainly has a bunch of prettier shapes, but why did they call it Datasaurus? Purely click-bait? Actually no (well, maybe, but there is a valid reason as well!).

Because the 13th of the dozen (a baker’s dozen?) is the chart illustrated below. Please note that if you found Jurassic Park to be unbearably terrifying you should probably close your eyes immediately.

Datasaurus Main

Raa! And yes, this fearsome vision from the distant past also has an X mean of 54.26, a Y mean of 47.83, an X standard deviation of 16.76, a Y standard deviation of 26.93 and a correlation coefficient of -0.06, just like his twelve siblings above.

If it’s hard to believe, or you just want to play a bit, then the individual datapoints that I put into Tableau to generate the above set of charts are available in this Google sheet – or a basic interactive viz version can be found on Tableau Public here.
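Alternatively, if you’d rather check the claim in R, the same data has been wrapped up in the datasauRus package on CRAN (I’m assuming it matches the Google sheet exactly, but it comes from the same paper):

library(datasauRus)
library(dplyr)

# recompute the headline summary stats for each of the 13 datasets
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    corr   = cor(x, y)
  )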

Full credit is due to Wikipedia for the Anscombe dataset and Autodesk Research for the Datasaurus data.

data.world: the place to go for your open data needs?

Somewhere in my outrageously long list of data-related links to check out I found “data.world“. Not only is that a nice URL, it also contains a worthy service that I can imagine being genuinely useful in future, if it takes off like it should. At first glance, it’s a platform for hosting data – seemingly biased towards the “open” variant of data although I see they also offer to host private data too – with some social bits and pieces overlaid.

What’s interesting about this particular portal to me over and above a bunch of other sites with a mass of open data available on them is:

  1. Anyone can upload any conventional dataset (well, that they are legally allowed to do) – so right now it contains anything from World Bank GDP info through to a list of medieval battles, and much more besides. Therefore it presumably seeks to be a host for all the world’s useful data, rather than that of a certain topic or producer. Caveat user etc. presumably applies, but the vision is nice.
  2. You can actually do things with the data on the site itself. For example, you can join one set of data to another hosted on the site, even if it’s from a totally different project from a totally different author, directly on the site. You can run queries or see simple visualisations.
  3. It’s very easy to get the data out, and hence use it in other tools should you want to do more complicated stuff later on.
  4. It’ll also host data documentation and sample queries (for example, SQL that you can run live) to provide contextual information and shortcuts for analysts who need to use data that they might not be intimately familiar with.
  5. There’s a social side. It also allows chat conversations between authors, users and collaborators. You can see what’s already been asked or answered about each dataset. You can “follow” individual people, or curated collection of subject-based datasets.

So let’s test a few features out with a simple example.

The overriding concept is that of a dataset. A dataset is more than a table; it can include many tables and a bunch of other sorts of non-data files that aid with the use of the data itself – for instance documentation, notebooks or images. Each user can create datasets, name and describe them appropriately, and decide whether they should be public or private.

Here’s one I prepared earlier (with a month of my Fitbit step count data in as it happens).

Capture

You can make your dataset open or private, attribute a license to be explicit about its re-use, and add tags to aid discovery. You can even add data via a URL, and later refresh if the contents of that URL changes.

As you can see, after import it shows a preview of the data it read in at the bottom of the screen.  If there were multiple files, you’d be able to filter or sort them to find the one you want.

If you hit the little “i” icon next to any field name, you get a quick summary visualisation and data description, dependent on data type. This is very useful to get a quick overview of what your field contains, and if it was read in correctly. In my view, this sort of thing should be a standard feature in most analytical tools (it already is in some).

[Screenshot: the per-field summary shown when clicking the “i” icon]
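For comparison, that per-field overview is roughly what you’d otherwise have to cobble together yourself after downloading the data. Here’s a quick pandas sketch of the do-it-yourself equivalent, assuming a Fitbit-style CSV with date and steps columns (the file and column names are my assumptions):

```python
import pandas as pd

# DIY equivalent of the per-field summary, assuming a CSV with 'date' and
# 'steps' columns (file and column names are assumptions).
df = pd.read_csv("fitbit_december.csv", parse_dates=["date"])

print(df.dtypes)        # were the fields read in as the expected types?
print(df.describe())    # quick distribution summary for the numeric fields
print(df["date"].min(), "to", df["date"].max())   # range check for the date field
```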

I believe tags, field names and descriptions are searchable – so if you do a nice job with those then it’ll help people find what you’re sharing.

A number of other common actions become available once you’ve uploaded or discovered a data table of interest.

One is to “explore” the data. This expands the data table to take up most of the screen, enabling easier sorting, filtering and so on. More interestingly, you can open a chart view where you can make basic charts to understand your data in more detail.

Now, this isn’t going to replace your dedicated visualisation tool – it has only the most basic of customisations available at the moment – but it handles simple exploration requirements in a way that is substantially less time consuming than downloading and importing your data into another tool.

It even suggests some charts you might like to make, allowing 1-click creation. On my data, for example, it offered to make me a chart of “Count of records by Date” or “Count of records by Steps”. It seems to take note of the data types, for instance defaulting to a line chart for the count by date, and a histogram for the count by steps.

Here’s the sort of output the 1-click option gives you:

[Screenshot: the 1-click suggested chart of my step counts]

OK, that’s not a chart you’re going to send to Nature right away, but it does quickly show the range of my data, lets me check for impossible outliers, and gives some quick insights into the distribution. Apparently I commonly do between about 5000 and 7500 steps…and I don’t make the default Fitbit 10k steps target very often. Oops.

These charts can then immediately be downloaded or shared as png or pdf, with automatically generated URLs like https://data.world/api/chart/export/d601307c3e790e5d05aa17773f81bd6446cdd148941b89b243d9b78c866ccc3b.png

Here I would quite like a 1-click feature to save & publish any chart that was particularly interesting with the dataset itself – but I understand why that’s probably not a priority unless the charting aspect becomes more of a dedicated visualisation feature rather than a quick explore mechanism.

For now, you could always export the graphic and include it as, for example, an image file in the dataset. Here, for example, is a dataset where the author has taken the time to provide a great description, with some findings and related charts, alongside the set of tables they uploaded.

One type of artefact you can save online with the dataset is a query. Yes, you can query your file live onsite, with either (a variant of) SQL or SPARQL. Most people are probably more familiar with SQL, so let’s start there.

Starting a new query will give you a basic SELECT * LIMIT query, but you’re free to use many (but not all) standard SQL features to change up your dataset into a view that’s useful to you.

Let’s see, did I ever make my 10k step goal in December? If so, on which days?

[Screenshot: the SQL query and its results]

Apparently I did, on a whopping four days, the dates of which are outlined above. I guess I had a busy Christmas Eve.

These results then behave just like a data table, so they can then be exported, linked to or visualised as a chart.

Once you’re happy with your query, if you think it’s useful for the future you can save it, or if it might aid other people, then you can publish it. A published query remains with the dataset, so next time someone comes to look at the dataset, they’ll see a list of the queries saved which they can re-use or adapt for their own needs. No more need for hundreds of people to transform a common dataset in exactly the same way again and again!

Interestingly, you can directly query between different datasets in the same query, irrespective of data table, data set, or author. Specifying the schemas feels a little fiddly at the moment, but it’s perfectly doable once you understand the system (although there’s no doubt room for future UI improvement here).

Imagine for instance that, for no conceivable reason, I was curious as to which celebrities sadly died on the days I met my 10k steps goal. Using a combination of my dataset and the 2016 celebrity deaths list from popculture, I can query like this:

[Screenshot: the cross-dataset join query and its results]

…only to learn the sad news that a giant panda called Pan Pan expired during one of my goal-meeting days.

Of course, these query results can be published, shared, saved, explored and so on just like we saw previously.

Now, that’s a silly example, but the idea of not only being able to download open data, but also having subject matter experts combine and publish useful data models as a one-time effort for data consumers to reuse in future, is an attractive one. Together with the ability to upload documentation, images or even analytical notebooks, you may see how this could become an invaluable resource of data and experience – even within a single organisation, let alone as a global repository of open data.

Of course, as with most aggregation or social sites, there’s a network effect: how useful this site ends up being depends on factors such as how many people make active use of it, how much data is uploaded to it and what the quality of the data is.

If one day it grew to the point where it was the default place to look for public data, without becoming a nightmare to find those snippets of data gold amongst its prospectively huge collection, it would potentially be an incredibly useful service.

The “nightmare to find” aspect is not a trivial point – there are already several open data portals (for instance government based ones) which offer a whole load of nice datasets, but often it is hard to find the exact content of data at the granularity that you’re after even when you know it exists – and these are on sites that are often quite domain-limited which in some ways makes the job easier. At data.world there is already a global search (which includes the ability to search specifically on recency, table name or field name if you wish), tags and curated collections which I think shows the site takes the issue seriously.

For analyst confidence, some way of understanding data quality would also be useful. The previews of field types and contents are already useful here. Social features, to try and surface a concept similar to “institutional knowledge”, might also be overlaid. There’s already a basic “like” facility. Of course this can be a challenging issue for any data catalogue that, almost by definition, needs to offer upload access to all.

For browser-haters, it isn’t necessary to use the site directly in order to make use of its contents. There’s already an API which gives you the ability to programmatically upload, query and download data. This opens up some interesting future possibilities. Perhaps, if data.world does indeed become a top place to look for the data of the world, your analytics software of choice might in future include a feature such that you can effectively search a global data catalogue from the comfort of your chart-making screen, with a 1-click import once you’ve found your goal. ETL / metadata tools could provide an easy way to publish the results of your manipulations, and so on.
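As a sketch of what programmatic access can look like already, here’s the sort of thing the official datadotworld Python package supports. The dataset key, table and column names below are placeholders based on my step-count example, and you’d need an API token configured first, so treat this as illustrative rather than copy-paste ready.

```python
import datadotworld as dw

# Placeholder dataset key of the form 'owner/dataset-id' - not a real dataset.
DATASET = "myaccount/fitbit-steps-december"

# Run the same sort of SQL as in the web query editor; the table and column
# names are assumptions based on my own upload.
results = dw.query(
    DATASET,
    "SELECT date, steps FROM fitbit_december WHERE steps >= 10000 ORDER BY date",
)
print(results.dataframe)   # results come back as a pandas DataFrame

# Or pull the whole dataset down locally for use in other tools.
local = dw.load_dataset(DATASET)
print(list(local.dataframes))   # one entry per table in the dataset
```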

The site is only in preview mode at present, so it’s not something to stake your life on. But I really like the concept, and the execution so far is way beyond some other efforts I’ve seen in the past. If I find I’ve created a public dataset I’d like to share, I would certainly feel happy to distribute it and all supporting documents and queries via this platform. So best of luck to data.world in the, let’s say, “ambitious” mission of bringing together the world’s freely available – yet often sadly undiscoverable – data in a way that encourages people to actually make valuable use of it.

Lessons from what happened before Snow’s famous cholera map changed the world

Anyone who studies any amount of the history of, or the best practice for, data visualisation will almost certainly come across a handful of “classic” vizzes. These specific transformations of data-into-diagram have stuck with us through the mists of time in order to become examples that teachers, authors, conference speakers and the like repeatedly pick to illustrate certain key points about the power of dataviz.

A classic when it comes to geospatial analysis is John Snow’s “Cholera map”. Back in the 1850s, it was noted that some areas of the country had a lot more people dying from cholera than other places. At the time, cholera’s transmission mechanism was unknown, so no-one really knew why. And if you don’t know why something’s happening, it’s usually hard to take action against it.

Snow’s map took data that had been gathered about people who had died of cholera, and overlaid the locations where these people resided against a street map of a particularly badly affected part of London. He then added a further data layer denoting the local water supplies.

[Image: John Snow’s cholera map]

(High-resolution versions available here).

By adding the geospatial element to the visualisation, geographic clusters showed up that provided evidence to suggest that use of a specific local drinking-water source, the now-famous Broad Street public well, was the key common factor for sufferers of this local peak of cholera infection.

Whilst at the time scientists hadn’t yet proven a mechanism for contagion, it turned out later that the well was indeed contaminated, in this case with cholera-infected nappies. When locals pumped water from it to drink, many therefore tragically succumbed to the disease.

Even without understanding the biological process driving the outbreak – nobody knew about germs back then – seeing this data-driven evidence caused the authorities to remove the Broad Street pump handle, people could no longer drink the contaminated water, and lives were saved. It’s an example of how data visualisation can open one’s eyes to otherwise hidden knowledge, in this case with life-or-death consequences.

But what one hears a little less about perhaps is that this wasn’t the first data-driven analysis to confront the same problem. Any real-world practising data analyst might be unsurprised to hear that there’s a bit more to the story than a swift sequence of problem identification -> data gathering -> analysis determining the root cause ->  action being taken.

Snow wasn’t working in a bubble. Another gentleman, by the name of William Farr, whilst working at the General Register Office, had set up a system that recorded people’s deaths along with their cause. This input seems to have been a key enabler of Snow’s analysis.

Lesson 1: sharing data is a Very Good Thing. This is why the open data movement is so important, amongst other reasons. What if Snow hadn’t been able to examine Farr’s dataset – could lives have been lost? How would the field of epidemiology have developed without data sharing?

In most cases, no single person can reasonably be expected to both be the original source of all the data they need and then go on to analyse it optimally. “Gathering data” does not even necessarily involve the same set of skills as “analysing data” does – although of course a good data practitioner should usually understand some of the theory of both.

As it happens, William Farr had gone beyond collecting the data. Being of a statistical bent, he had actually already used the same dataset himself to analytically tackle the same question – why are there relatively more cholera deaths in some places than others? He’d actually already found what appeared to be an answer. It later turned out that his conclusion wasn’t correct – but it certainly wasn’t obvious at the time. In fact, it likely seemed more intuitively correct than Snow’s theory back then.

Lesson 2: here, then, is a real-life example of the value of analytical iteration. Just because one person has looked at a given dataset doesn’t mean that it’s worthless to have someone else re-analyse it – even if the former analyst has established a conclusion. This is especially important when the stakes are high, and the answer in hand hasn’t been “proven” by virtue of any resulting action confirming the mechanism. We can be pleased that Snow didn’t just think “oh, someone’s already looked at it” and move on to some shiny new activity.

So what was Farr’s original conclusion? Farr had analysed his dataset, again in a geospatial context, and seen a compelling association between the elevation of a piece of land and the number of cholera deaths suffered by people who live on it. In this case, when the land was lower (vs sea level for example) then cholera deaths seemed to increase.

In June 1852, Farr published a paper entitled “Influence of Elevation on the Fatality of Cholera”. It included this table:

[Image: table from Farr’s 1852 paper showing cholera mortality by elevation]

The relationship seems quite clear: the cholera death rate per 10k persons goes up dramatically as the elevation of the land goes down.

Here’s the same data, this time visualised in the form of a line chart, from a 1961 keynote address on “the epidemiology of airborne infection”, published in Bacteriology Reviews. Note the “observed mortality” line.

[Image: line chart of Farr’s observed cholera mortality by elevation, from the 1961 address]

Based on that data, his elevation theory seems a plausible candidate, right?

You might notice that the re-vizzed chart also contains a line concerning the calculated death rate according to “miasma theory”, which seems to have an outcome very similar on this metric to the actual cholera death rate. Miasma was a leading theory of disease-spread back in the nineteenth century, with a pedigree encompassing many centuries. As the London Science Museum tells us:

In miasma theory, diseases were caused by the presence in the air of a miasma, a poisonous vapour in which were suspended particles of decaying matter that was characterised by its foul smell.

This theory was later replaced with the knowledge of germs, but at the time miasma theory was a strong contender for explaining the distribution of disease. This was probably helped by the fact that some of the actions one might take to reduce “miasma” would evidently overlap with those for dealing with germs.

After analysing associations between cholera and multiple geo-variables (crowding, wealth, poor-rate and more), Farr’s paper selects the miasma explanation as the most important one, in a style that seems  quite poetic these days:

From an eminence, on summer evenings, when the sun has set, exhalations are often seen rising at the bottoms of valleys, over rivers, wet meadows, or low streets; the thickness of the fog diminishing and disappearing in upper air. The evaporation is most abundant in the day; but so long as the temperature of the air is high, it sustains the vapour in an invisible body, which is, according to common observation, less noxious while penetrated by sunlight and heat, than when the watery vapour has lost its elasticity, and floats about surcharged with organic compounds, in the chill and darkness of night.

The amount of organic matter, then, in the atmosphere we breathe, and in the waters, will differ at different elevations; and the law which regulates its distribution will bear some resemblance to the law regulating the mortality from cholera at the various elevations.

As was discovered later, miasma theory wasn’t correct, and it certainly didn’t offer the optimum answer to addressing the cluster of cholera cases Snow examined. But there was nothing impossible or idiotic about Farr’s work. He (as far as I can see at a glance) gathered accurate enough data and analysed it in a reasonable way. He was testing a hypothesis that was based on the common sense of the time he was working in, and found a relationship that does, descriptively, exist.

Lesson 3: correlation is not causation (I bet you’ve never heard that before 🙂 ). Obligatory link to the wonderful Spurious Correlations site.

Lesson 4: just because an analysis seems to support a widely held theory, it doesn’t mean that the theory must be true.

It’s very easy to lay down tools once we seem to have shown that what we have observed is explained by a common theory. Here, though, we can think of Karl Popper’s view of scientific knowledge being derived via falsification. If there are multiple competing theories in play, then we shouldn’t assume with certainty that the dominant one is correct until we have come up with a way of proving the case either way. Sometimes, it’s a worthwhile exercise to try to disprove your own findings.

Lesson 5: the most obvious interpretation of the same dataset may vary depending on temporal or other context.

If I were to ask a current-day analyst (who was unfamiliar with the case) to take a look at Farr’s data and provide a view on the explanation for the differences in cholera death rates, then it’s quite possible they’d note the elevation link. I would hope so. But it’s unlikely that, even if they used precisely the same analytical approach, they would suggest that miasma theory is the answer. Whilst I’m hesitant to claim there’s anything that no-one believes, for the most part analysts will probably place an extremely low weight on discredited scientific theories from a couple of centuries ago when it comes to explaining what data shows.

This is more than an idealistic principle – parallels, albeit usually with less at stake, can happen in day-to-day business analysis. Preexisting knowledge changes over time, and differs between groups. Who hasn’t seen (or been) the poor analyst who revealed a deep, even dramatic, insight into business performance predicated on data which later turned out to have been affected by something else entirely?

For my part, I would suggest learning what’s normal, and applying double-scepticism (but not total disregard!) when you see something that isn’t. This is where domain knowledge is critical to add value to your technical analytical skills. Honestly, it’s more likely that some ETL process messed up your data warehouse, or your store manager is misreporting data, than that, overnight, 100% of the public stopped buying anything at all from your previously highly successful store.

Again, here is an argument for sharing one’s data, holding discussions with people outside of your immediate peer group, and re-analysing data later in time if the context has substantively changed. Back in the deep depths of computer dataviz history (i.e. the year 2007), IBM launched a data visualisation platform called “Many Eyes”. I was never an avid user, but the concept and name rather enthralled me.

Many Eyes aims to democratize visualization by providing a forum for any users of the site to explore, discuss, and collaborate on visual content…

Sadly, I’m afraid it’s now closed. But other avenues of course exist.

In the data-explanation world, there’s another driving force of change – the development of new technologies for inferring meaning from datapoints. I use “technology” here in the widest possible sense, meaning not necessarily a new version of your favourite dataviz software or a faster computer (not that those don’t help), but also the development of new algorithms, new mathematical processes, new statistical models, new methods of communication, modes of thought and so on.

One statistical model, commonplace in predictive analysis today, is logistic regression. This technique was developed in the 1950s, so was obviously unavailable as a tool for Farr to use a hundred years beforehand. However, in 2004, Bingham et al. published a paper that re-analysed Farr’s data, but this time using logistic regression. Now, even here they still find a notable relationship between elevation and the cholera death rate, reinforcing the idea that Farr’s work was meaningful – but nonetheless conclude that:

Modern logistic regression that makes best use of all the data, however, shows that three variables are independently associated with mortality from cholera. On the basis of the size of effect, it is suggested that water supply most strongly invited further consideration.

Lesson 6: reanalysing data using new “technology” may lead to new or better insights (as long as the new technology is itself more meritorious in some way than the preexisting technology, which is not always the case!).
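For anyone who hasn’t met logistic regression on grouped data before, here’s a purely illustrative Python sketch of the general technique. The numbers are invented, and this is in no way a reconstruction of Farr’s data or of the Bingham et al. analysis – it just shows the shape of the approach.

```python
import pandas as pd
import statsmodels.api as sm

# Invented district-level data, purely to illustrate the technique.
df = pd.DataFrame({
    "deaths":     [120, 80, 45, 20, 10],
    "population": [20000, 25000, 30000, 28000, 26000],
    "elevation":  [5, 20, 40, 80, 150],   # e.g. feet above the river
    "poor_water": [1, 1, 0, 0, 0],        # 1 = suspect water supply
})

# Logistic regression on grouped counts: deaths out of population,
# modelled as a function of elevation and water supply.
endog = pd.DataFrame({
    "deaths": df["deaths"],
    "survivors": df["population"] - df["deaths"],
})
exog = sm.add_constant(df[["elevation", "poor_water"]])
model = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(model.summary())
```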

But anyway, even without such modern-day developments, Snow’s analysis was conducted, and provided evidence that a particular water supply was causing a concentration of cholera cases in a particular district of London. He immediately got the authorities to remove the handle of the contaminated pump, hence preventing its use, and hundreds of people were immediately saved from drinking its foul water and dying.

That’s the story, right? Well, the key events themselves seem to be true, and it remains a great example of that all-too-rare phenomenon of data analysis leading to direct action. But it overlooks the point that, by the time the pump was disabled, the local cholera epidemic had already largely subsided.

The International Journal of Epidemiology published a commentary regarding the Broad Street pump in 2002, which included a chart using data taken from Whitehead’s “Remarks on the outbreak of cholera in Broad Street, Golden Square, London, in 1854” paper, which was published in 1867. The chart shows, quite vividly, that by the date that the handle of the pump was removed, the local cholera epidemic that it drove was likely largely over.

[Image: chart of the Broad Street outbreak’s daily cholera deaths, with the date of the pump-handle removal marked]

As Whitehead wrote:

It is commonly supposed, and sometimes asserted even at meetings of Medical Societies, that the Broad Street outbreak of cholera in 1854 was arrested in mid-career by the closing of the pump in that street. That this is a mistake is sufficiently shown by the following table, which, though incomplete, proves that the outbreak had already reached its climax, and had been steadily on the decline for several days before the pump-handle was removed

Lesson 7: timely analysis is often vital – but if it was genuinely important to analyse urgently, then it’s likely important to take action on the findings equally as fast.

It seems plausible that if the handle had been removed a few days earlier, many more lives could have been saved. This was particularly difficult in this case, as Snow had the unenviable task of persuading the authorities to take action based on a theory that ran counter to the prevailing medical wisdom of the time. At least any modern-day analysts can take some solace in the knowledge that even our most highly regarded dataviz heroes had some frustration in persuading decision makers to actually act on their findings.

This is not at all to reduce Snow’s impact on the world. His work clearly provided evidence that helped lead to germ theory, which we now hold to be the explanatory factor in cases like these. The implications of this are obviously huge. We save lives based on that knowledge.

Even in the short term, the removal of the handle, whilst too late for much of the initial outbreak, may well have prevented a deadly new outbreak. Whitehead happily acknowledged this in his article.

Here I must not omit to mention that if the removal of the pump-handle had nothing to do with checking the outbreak which had already run its course, it had probably everything to do with preventing a new outbreak; for the father of the infant, who slept in the same kitchen, was attacked with cholera on the very day (Sept. 8th) on which the pump-handle was removed. There can be no doubt that his discharges found their way into the cesspool, and thence into the well. But, thanks to Dr. Snow, the handle was then gone.

Lesson 8: even if it looks like your analysis was ignored until it was too late to solve the immediate problem, don’t be too disheartened –  it may well contribute towards great things in the future.

Books I read in 2016

Reading is one of the favoured hobbies in the DabblingWithData household. In 2016 my beloved fiance invited me to participate in the Goodreads Reading Challenge. It’s simple enough – you set a target and then see if you can read that many books.

The challenge does have its detractors; you can see that an obsession with it will perversely incentivise reading “Spot the Dog” over “Lord of the Rings“. But if you participate in good spirits, then you end up building a fun log of your reading which, if nothing else, gives you enough data that you’ll remember at least the titles of what you read in years hence.

I don’t quite recall where the figure came from, but I had my 2016 challenge set at 50 books. Fifty, you might say, that’s nearly one a week! Surely not possible – or so I thought. I note, however, that my chief competitor, following a successful year, has set this year’s target to 100, so apparently it’s very possible for some people.

Anyway, Goodreads has both a CSV export of the books you log as having read in the competition, and also an API. I therefore thought I’d have a little explore of what I managed to read. Who knows, perhaps it’ll help improve my 2017 score!
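For anyone who prefers code to Tableau, the same sort of exploration is easy to start in pandas. A rough sketch, assuming the export’s column names match mine (“Date Read”, “Number of Pages” – treat these as assumptions):

```python
import pandas as pd

# The Goodreads library export; column names are as they appeared in my
# download ("Date Read", "Number of Pages") and may need adjusting.
books = pd.read_csv("goodreads_library_export.csv", parse_dates=["Date Read"])

read_2016 = books[books["Date Read"].dt.year == 2016]
print(len(read_2016), "books finished in 2016")
print(int(read_2016["Number of Pages"].sum()), "pages in total")
print(read_2016["Number of Pages"].describe())   # shortest, longest, typical length
```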

Please click through for slightly more interactive versions of any chart, or follow this link directly. Most data is taken directly from Goodreads, with a little editing by hand.

[Image: “How much did I read” – progress and cumulative reading charts]

Oh no, I missed my target 😦 Yes, fifty books proved too challenging for me in 2016 – although I got 80% of the way there, which I don’t think is too terrible. My 2017 target remains at fifty.

The cumulative chart shows a nice boost towards the end of August, which was summer holiday time for me. This has led me to conclude the following actionable step: have more holidays.

I was happy to see that I hadn’t subconsciously tried to cheat too much by reading only short books. From the nearly 14k page-equivalents I ploughed through, the single most voluminous book was Anathem. Anathem is a mix of sci-fi and philosophy, full of slightly made-up words just to slow you down further – an actual human:alien glossary is generously included in the back of the book.

The shortest was the Ladybird Book of the Meeting. This was essential reading for work purposes of course, and re-taught me eternal truths such as “Meetings are important because they give everyone a chance to talk about work. Which is easier than doing it”.

Most of my books were in the 200–400 page range – although of course different books make very different usages of a “page”.

So what did I read about?

[Image: “What did I read” – breakdown of books by category]

Science fiction is #1 by book volume. I have an affinity for most things that have been deemed geeky through history (and perhaps you do too, if you got this far in!), so this isn’t all that surprising.

Philosophy at #2 is a relatively new habit, at least as a concerted effort. I felt that I’d got into the habit of concentrating too much on data (heresy I know), technology and related subjects in previous years’ reading – so thought I’d broaden my horizons a bit by looking into, well, what Google tells me is merely the study of “the fundamental nature of knowledge, reality, and existence”. It’s very interesting, I promise. Although it can be pretty slow to read, as every other sentence risks leaving you staring at the ceiling wondering whether the universe exists, and other such critical issues. Joking aside, the study of epistemology, reality and so on might not be a bad idea for analysty types.

Lower down we’ve got the cheap thriller and detective novels that are somewhat more relaxing, not requiring either a glossary or a headache tablet.

I was a little surprised at what a low proportion of my books were read in eBook format. For most – not all – books, I think eReaders give a much superior reading experience to ye olde paper. This, I’m aware, is a controversial minority opinion, but I’ll stick to it and point you towards a recent rant on the Hello Internet podcast to explain why.


So I’d have guessed a 80-90% eBook rate – but a fair number of paper books actually slipped in. Typically I suspect these are ones I borrowed, or ones that aren’t available in eBook formats. Some of Asimov’s books, of which I read a few this year, for instance are usually not available on Kindle.

On which subject, authors. Most included authors only fed my book habit once last year, although the afore-mentioned Asimov got his hooks into me. This was somewhat aided by the discovery of a cluster of his less well-known books fortuitously being available for 50p each at a charity sale. But if any readers are interested in predictive analytics and haven’t read the Foundation Trilogy, I’d fully recommend even a full price copy for an insight into what the world might have to cope with if your confusion matrix ever showed perfection in all domains.

Sam Harris was the second most read. That fits in with the philosophy theme. He’s also one of the rare people who can at times express opinions that intuitively I do not agree with at all, but does it in a way such that the train of thought that led him to his conclusions is apparent and often quite reasonable. He is, I’m aware, a controversial character on most sides of any political spectrum for one reason or another.

Back to format – I started dabbling with audio books, although at first did not get on so well with them; there’s a certain amount of concentration needed which comes easier to me when visual-reading than audio-reading. But I’m trying again this year, and it’s going better – practice makes perfect?

The “eBook / Audio” category refers to a couple of lecture series from the Great Courses, which give you a set of half-hour lectures to listen to and an accompanying book to follow along with. These are not free, but they cover a much wider range of topics than the average online MOOC seems to (plus you don’t feel bad about not doing assignments – there are none).

Lastly, the GoodReads rating. Do I read books that other people think are great choices? Well, without knowing the background distribution of ratings, and taking into account the number of reviews and from whom, it’s hard to do much except assume a relative ranking when the sample gets large enough.

It does look like my books are on the positive side of the 5-point scale, although definitely not amongst Goodreads’ most popular. Right now, that list starts with The Hunger Games, which I have read and enjoyed, but not in 2016. Looking down the global popularity list, I do see quite a few I’ve had a go at in the past, but, at first sight, almost none that I regret passing over in favour of my actual choices this year!

For the really interested readers out there, you can see the full list of my books and links to the relevant Goodreads pages on the last tab of the viz.

5 Power BI features that might make Tableau users a little jealous

New year, new blog post, new tool version to play with! It’s clear that the field of data-related stuff progresses extremely rapidly at present, and hence it behoves those of us of an analyst bent to, now and then, go explore tools that we don’t use day-to-day. We may already have our favourites in each category, but, unless we’ve done a recent review, it’s quite possible the lesser-loved packages have developed a whole new bunch of goodies since the last checkup.

With that in mind, I’ve taken a look at the latest version of Microsoft Power BI. It’s billed in this manner by its creators:

Power BI transforms your company’s data into rich visuals for you to collect and organize so you can focus on what matters to you.

It’s therefore an obvious competitor for software like Tableau, Qlikview, chart.io, and many others, and largely can replace Microsoft’s previous PowerView offering, which was accessed directly via Excel. In a similar way to the Tableau suite, there’s a Power BI desktop package that analysts install locally on their computer primarily to manipulate data and construct visuals, and a web-based Power BI service that allows for publication and distribution of the resulting file. Actually the online service is pretty powerful in terms of allowing you to create reports and dashboards via the web, and includes a few other nifty features designed to improve the usability of this software genre – so even some analysts might get a lot out of the web-based version alone.

A lot of Power BI is actually free of charge to use, although there is an enhanced “Pro” edition at around US$10 a month, replete with plenty of more enterprisey features as you can see on their comparison chart. If you’re working somewhere with an Office 365 subscription, you might find you already have access to Power BI, even if you didn’t know about it. So, there’s not much to stop you having a play with it if you’re even remotely interested.

Anyhow, this post is not to review Power BI overall, but rather to point out 5 features that stood out to me as not being present in my current dataviz software of choice, Tableau. These therefore aren’t necessarily the general “5 best features of Power BI” – both Tableau and Power BI can create a pretty line chart, so it’s not really worth pointing that out in this context. My choices should then really be considered from the context of someone already deeply familiar with what Tableau or other competitors already offer.

Also note that software packages aren’t supposed to be feature-identical; many programs aimed at solving the same sort of problems may be completely different in their philosophy of design. Adding some features necessitates a cost in terms of whether other features can be supported. This then is not a request to Tableau and competitors to copy these features. But I do vehemently think it’s useful for day-to-day data practitioners to remain aware of what software features are out there in the wild today, just in case it gives you a better option to solve a particular problem you encounter one day.

As a spoiler: for what it’s worth, my dive into Power BI hasn’t resulted in me throwing my lovely copy of Tableau away, not a chance; you can pry that from my cold dead hands etc. There’s a certain fluidity in Tableau, especially when used for adhoc analysis, that I’ve not yet encountered in its more obvious competitors, which seems very conducive to digging for insights.

But it has led me to believe that the Microsoft offering has improved substantially since the time years ago when I used to battle against v1 PowerPivot (which itself was great for some specific data manipulation activities…but eventually I got tired of the out-of-memory errors!). And, especially due to the way it’s licensed – to be blunt, far cheaper than Tableau for some configurations – it’ll remain in my mind when considering tools for future projects.

So, in no particular order, here’s some bits and pieces that piqued my curiosity:

1: Focus mode

Let’s start with a simple one. Dashboards typically contain several charts or tables that are designed to provide insight upon a given topic. Ideally the combination of content that makes up a dashboard should usually fit on a single screen, and an overall impression of “is it good or bad news?” should be available at a glance.

In designing dashboards, especially those that are useful for multiple audiences, there’s often therefore a tension between providing enough visualisations such that every user has the information they need, vs making the screen so cluttered or hard to navigate through that no user enjoys the experience of trying to decipher 1-inch square charts whatsoever.

For cases where a particular chart on a dashboard is of interest to a user, Power BI has a “focus” mode that allows the observer to zoom in and interact with that single chart on a dashboard or report on a near-fullscreen basis, without requiring any extra development work on the part of the analyst.

It’s a simple enough concept – the user just clicks a button on whichever visualisation they’re interested in, and it zooms in to fill up most of the screen until they click out of it. It keeps its original interactivity, plus displays some extra meta-information that might be useful (last refresh time etc.). But the main point is it becomes big enough to potentially help generate deeper insights for a particularly interested end user in a way that a little 1 inch square chart shoved at the bottom of a dashboard might struggle to do, even if the 1 inch version is more appropriate for the average dashboard viewer.

If that description isn’t clear, then it’s probably better seen in video form. For example:


2: Data driven alerts

Regular readers might have established that I’m a big fan of alerting, when it comes to trying to promote data driven decision making. I’m fairly convinced that many dashboards come with a form of “engagement decay”, where the stakeholder is initially obsessively excited with their ability to access data. But as time goes on they get quite bored of checking to see if everything’s OK – especially if everything usually is OK – and hence stop taking the time to consult a potentially valuable source of decision making.

So, for these types of busy execs, and anyone else wanting to optimise productivity, I like alerts. Just have the dashboard send some sort of notification whenever there’s actually something “interesting” to see.

Sure enough, Power BI has the capacity to alert the user upon certain KPI events, via its own web-based notification centre or, more usefully, email or phone app.

[Screenshot: setting up a data-driven alert in Power BI]

The implementation is pretty simple and somewhat restrictive at the moment. Alerts can only be set up on “numeric tiles featuring cards, KPIs, and gauges”, the alert triggers are basic above X or below X type affairs, and you’re restricted to being alerted once an hour or once a day. So there’s a lot of potential room for development – I’d like to see statistical triggers for instance – “alert me if something unusual happens”.
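To show what I mean by a statistical trigger, here’s a tiny illustrative Python sketch of the sort of rule I have in mind – the numbers are made up, and this is emphatically not a current Power BI feature:

```python
import pandas as pd

# Made-up daily sales figures; the last value is the one checked on refresh.
history = pd.Series([10200, 9800, 10500, 9900, 10100, 6400])

latest = history.iloc[-1]
baseline = history.iloc[:-1]
z_score = (latest - baseline.mean()) / baseline.std()

# Alert when the latest value sits unusually far from the recent norm,
# rather than relying on a fixed "below X" threshold.
if abs(z_score) > 2:
    print(f"Alert: latest value {latest} is {z_score:.1f} standard deviations from normal")
```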

The good news for Tableau users is that Tableau has promised a similar feature will be coming to their software in the future (and to some extent an analyst can create similar functionality even now with the “don’t send email if view is empty” option recently added). But if you want a nice simple “send me an email whenever my sales drop below £10,000” feature that non-analytical folks can easily use, then Power BI can do that right now.

3: Custom visualisations

All mainstream dataviz products should be able to squeeze out the tried-and-tested basic varieties of visuals: line chart, bar chart, scatterplot et al. And >= 90% of the time this is enough – in fact, usually the best approach for clarity. But sometimes, for better or worse, that’s not sufficient for certain use-cases. You can see this tension surfacing within the Tableau community where, despite the large number of proven chart types it can handle, there is an even larger number of blogs, reference documents and so on explaining what form one has to coerce their data into in order to simulate more esoteric visualisation types within software that has not been natively designed to produce them.

A couple of common examples in recent times would include Sankey charts or hexagonal binning. Yes, you can construct these types of viz in Tableau and other competing products – but it requires a bit of workaroundy pre-work, and entirely interrupts the naturalistic method of exploring data that these tools seek to provide. For example, an average user wishing to construct a Sankey chart in Tableau may want to search out and thoroughly read one or many of a profusion of useful posts, including those here, here, here, and here, and several more places throughout the wilds of the web.

It’s very cool that these resources exist – but imagine if instead of having to rely on researching and recreating clever people’s ingenious workarounds, an expert could just provide a one-click solution to your problem. Or you could share your genius more directly with your peers.

Power BI presents an API where an advanced user can create their own visualisation types. These then integrate within Power BI’s toolbox, as though Microsoft had provided them in the base package. Hence data vizzers of all skill levels can use that type of visual without the need for any programming or mathematical workarounds. It should be noted that the procedure for creating these does require learning a superset of JavaScript called TypeScript, which would certainly not be expected of most Power BI audiences.

But this barrier is alleviated via the existence of a public gallery of these visualisations that Microsoft maintains, which allows generous developers to share their creations world-wide. A Power BI user wouldn’t have to think about the mathematical properties underlying a Sankey plot – they could just download a Sankey chart type add-in such as this one.

[Screenshot: a Sankey chart custom visual in Power BI]

Now, this open access does introduce some risks of course. Thanks to Spiderman, we all know what great power comes with. And even on the public custom visuals gallery, you’ll see some entries that, well, let’s say Stephen Few might object to.

[Screenshot: a pyramid-style custom visual from the gallery]

Bonus feature: you can also display native R graphics in your Power BI dashboard, with some limitations.

4: “Pin anything to a dashboard” for non-analyst end users

To understand this one, you need to know something about the Power BI object types. Simply that a “report” is made out of a “dataset”, and a “dashboard” is usually, but not exclusively, made out of components of reports*. A dataviz expert can publish any combination of those (or even publish a mixed set of them as a content pack, which any interested users can download to use with a few clicks – another potentially nifty idea!).

(* Tableau users – you can then think of a report as a worksheet, but a worksheet that can support multiple vizzes with arbitrary placement.)

Reports are what they sound like: the electronic equivalent of a notebook, with between zero and many data visualisations on each page concerning a particular topic. Note though an important limitation of being restricted to a single datasource per report. In Power BI you create reports with the simple drag and drop of charting components and configurations, after selecting the appropriate datasource. Charts stick around, in interactive form, wherever you drag them to, almost as though you were making a PowerPoint slide. No “containers” needed, Tableau fans 🙂

Dashboards however have a more fixed format, always appearing as though they were a set of tiles, each with a different item in. There’s no restriction on data sources, but some restrictions on functionality, such as no cross-filtering between independent tiles. A dashboard tile can be any viz from any report, a whole report itself (which can then cross-filter within the scope of the report) or some miscellaneous other stuff including “live” Excel workbooks, static images, and even answers to natural language questions you may have asked in the fancy Q&A functionality (“what were our sales last month?”).

So, what’s this about non-analysts? Well, a difference between Power BI dashboards and those from some other tools is that even people considered as being solely viz consumers can legitimately create their own dashboards. A non-analytical end-user can choose to pin any individual chart from any individual report (or the other types of items listed above) to a new dashboard, and hence create a smorgasbord showing exactly the parts of each report / pre-made dashboard they are actually interested in, all on one page. After all, the individual viz consumer is by definition best placed to know what’s most important to them.

Here’s what that looks like in reality:

This is perhaps one approach to solving the problem that, often in reality, the analyst is designing a dashboard for a multi-person audience, within which each individual has slightly different needs. Each user might be interested in a different 3 of the 5 charts in your dashboard. Here, each user could then choose to pin their favourite 3 to their own start-up page, or any other dashboard they have control over, together with their favourite data table from another report and most loved Excel workbook, if they insist.

How this actually plays out in practice with novice users would be interesting to see. I think a certain type of non-analyst power user would find this pretty useful, and it’s a more realistic concept of “even non-analysts can make dashboards with no training” than a lot of these types of tools foolishly promise.

5: More powerful data manipulation tools

This one is more for advanced users. Power BI lets you manipulate the data (you might even say business-user “ETL”) before you start employing it in your visualisations. Most dashboarding tools likely let you do this to some extent – Tableau recently improved its ability to union data for instance, together with some cleaning features, and it’s had joining and blending for a while. You can also write VizQL formulae to produce calculations at the time of connecting to data.

Power BI’s query editor seems to be more powerful than many, with a couple of particular nice features.

Firstly, it uses a language called ‘M’ which is specifically designed with data mashups in mind. Once you’ve obtained your data with the query editor, you can then go on to use the DAX language (designed for data analysis, and whose CALCULATE() function has a soft spot in my heart from previous projects) throughout Power BI in terms of working on data you already have access to.

The query editor is fully web-data enabled; even scraping data right off appropriately formatted web pages without any scripting work at all. Here’s the Microsoft team grabbing and applying a few transforms to IMDB data.

One query-editor feature I particularly like somewhat addresses a disadvantage that these user-friendly manipulation tools have vs scripting languages like R: reproducibility.

In Power BI, as you go through and apply countless modifications to your incoming dataset, a list of “applied steps” appears to the side of your data pane. Here’s an example from the getting started guide.

[Screenshot: the “applied steps” list in the query editor]

It’s a chronological list of everything you’ve done to manipulate the data, and you also have the ability to go back and delete or edit the steps as you please. No more wondering “how on earth did I get the data into this format?” after an hour of fiddling around transforming data.

There’s plenty of built-in options for cleaning up mucky data, including unpivoting, reordering, replacing values and a fill-down type operation that fills down data until it next sees a value in the same column – which handles those annoying Excel sheets where each group of rows only has its name filled in on the top row. Unioning and joining are of course very possible, and you’ll have access to a relationships diagram view, for anyone who fancies having a look at, or modifying, how tables relate to each other.
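As an analogy for readers who think in code rather than in Power BI’s M language, the fill-down idea looks like this in pandas (the same concept, not Power BI syntax):

```python
import pandas as pd

# The "fill down" concept: group labels only appear on the first row of each
# block, as in many hand-made Excel sheets.
df = pd.DataFrame({
    "region": ["North", None, None, "South", None],
    "sales":  [100, 120, 90, 200, 150],
})

df["region"] = df["region"].ffill()   # copy each label down until the next one appears
print(df)
```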

Analysts are not limited to connecting to existing data either. Non-DBA types can create new tables directly in Power BI and type or paste data directly into them if they wish (although I’d be wary of over-using this feature…be sure to future-proof your work!). You can also upload your standard Excel workbooks directly to the service, for web Power BI to access their underlying data.

If Power BI already has the data tables you want, but they’re formatted suboptimally or are over-granular, then you can use DAX to create calculated tables, whereby you use the contents of other imported tables to build your own in-memory virtual table. This might allow you to, for instance, reduce your use of intermediate database temporary tables for some operations – perhaps performing a one-time aggregation before analysing.
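Again purely as an analogy – pandas rather than DAX – the idea of building an aggregated in-memory table from a more granular one looks something like this:

```python
import pandas as pd

# A granular table, aggregated up into a new in-memory table; analogous in
# spirit to a DAX calculated table, though expressed here in pandas.
daily = pd.DataFrame({
    "date":  pd.to_datetime(["2017-01-01", "2017-01-01", "2017-01-02"]),
    "steps": [4000, 6000, 12000],
})

per_day = daily.groupby("date", as_index=False)["steps"].sum()
print(per_day)
```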