Behind the scenes of the FiveThirtyEight UK general election forecasting model

Here in the UK we’re about to go to the polls to elect some sort of government in just a few weeks. Nate Silver’s FiveThirtyEight team are naturally on the case, providing their famously accurate election forecasts. They were kind enough to explain the methodology being used in this blog post by Ben Lauderdale.

Go there and read it in full for the clear and interesting explanation, but in super-quick summary: it starts with their famed method of analysing poll results over time and adjusting for the historic bias each poll has shown versus reality, both in terms of the polling source and the time left before the election.

What the average poll says now is not the best guess of what will happen in the subsequent election…We can estimate how the relative weight to put on the polls changes as elections approach and use that in our forecast.

But it soon becomes more complex. In their view, because parties with a low national vote share but high regional variance have an increasing influence on the result, applying a uniform swing to the whole country based on national polls doesn’t work.

However, constituency-level polls are not frequent or numerous enough to include in the above. They did manage to get some, but as those are relatively sparse they are developing a model around them.

We use a multilevel regression model to describe how vote intention at the constituency level depends on a variety of factors, including region, incumbency, constituency demographics and results in the last election. We then reconcile the constituency-level vote intentions we get from this data with the national-level forecast that we constructed using the national polls, by applying a swing model that we built from the historical record of constituency vote share swings from election to election.
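To make that idea a little more concrete, here is a minimal sketch of a multilevel (mixed-effects) regression in Python using statsmodels. To be clear, this is not FiveThirtyEight’s actual model – the file and column names are invented – it just illustrates the general structure described above: a regional grouping effect plus constituency-level predictors.

# Minimal illustrative sketch, NOT FiveThirtyEight's model. File/column names are invented.
# Vote intention modelled with a varying intercept per region plus fixed effects for
# incumbency and the party's share at the last election.
import pandas as pd
import statsmodels.formula.api as smf

constituencies = pd.read_csv("constituency_polls.csv")   # hypothetical input file

model = smf.mixedlm(
    "vote_intention ~ incumbent + last_election_share",  # fixed effects
    data=constituencies,
    groups=constituencies["region"],                      # random intercept by region
)
result = model.fit()
print(result.summary())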

I’m looking forward very much to seeing how it goes, even if I’m not greatly keen on the result they predict today! Follow their predictions here.

Their full description of their model includes a lesson on the importance of phrasing survey questions. Apparently people do not answer “If there was a general election tomorrow, which party would you vote for?” in the same way as “Thinking specifically about your own parliamentary constituency at the next general election and the candidates who are likely to stand for election to Westminster there, which party’s candidate do you think you will vote for in your own constituency?”.


The most toxic place on Reddit

Reddit, the “front page of the internet” – and a network I hardly ever dare enter for fear of being sucked into reading hundreds of comments for hours on highly pointless yet entertaining things – has had its share of controversies over the years.

The site is structurally divided into “subreddits”, which one can think of as simple, quite old-school forums where anyone can leave links and comments, and anyone else can upvote or downvote them according to whether they approve or not.

Reddit users were themselves busily engaged in a chat regarding “which popular subreddit has a really toxic community” when Ben Bell of Idibon (a company big into text analysis) decided to tackle the same question with a touch of data science.

But what is “toxic”? Here’s their definition.

Ad hominem attack: a comment that directly attacks another Redditor (e.g. “your mother was a hamster and your father smelt of elderberries”) or otherwise shows contempt/disagrees in a completely non-constructive manner (e.g. “GASP are they trying CENSOR your FREE SPEECH??? I weep for you /s”)

Overt bigotry:  the use of bigoted (racist/sexist/homophobic etc.) language, whether targeting any particular individual or more generally, which would make members of the referenced group feel highly uncomfortable

Now, text sentiment analysis is far from perfect as of today. The CTO of Datasift, a company with a very cool social-media-data-acquiring tool, was claiming a couple of years ago that around 70% accuracy was about the peak possible. The CEO of the afore-mentioned Idibon claimed about 80% is possible today.

No-one is claiming anywhere near 100%, especially on determinations as subtle as toxicity and its chosen opposite, supportiveness. The process was therefore a mix of machine learning and human involvement, with the Idibon sentiment analysis software highlighting, via the Reddit API, the subreddits most likely to be extreme, and humans classifying a subset of the posts into those categories.

But what is a toxic community? It’s not simply a place with a lot of toxic comments (although that’s probably not a bad proxy). It’s a community where such nastiness is approved of or egged on, rather than ignored, frowned upon or punished. Here Reddit provides a simple mechanism to indicate this, as each user can upvote (approve of) or downvote (disapprove of) a post.
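If it helps to picture how those two ingredients – the toxic/supportive labels and the votes – might combine, here’s a deliberately crude toy score in pandas. This is emphatically not Idibon’s formula; the sample data and the scoring rule are invented purely for illustration.

# Toy illustration only - NOT Idibon's actual formula.
# Idea: a community looks "toxic" when the comments it rewards (upvotes) are the toxic ones.
import pandas as pd

comments = pd.DataFrame({            # invented sample data
    "subreddit": ["a", "a", "a", "b", "b", "b"],
    "label":     ["toxic", "toxic", "supportive", "supportive", "supportive", "toxic"],
    "karma":     [12, 4, -1, 9, 3, -7],      # net upvotes minus downvotes
})

def toy_score(group):
    rewarded = group[group["karma"] > 0]     # comments the community approved of
    if rewarded.empty:
        return 0.0
    return (rewarded["label"] == "toxic").mean() - (rewarded["label"] == "supportive").mean()

print(comments.groupby("subreddit").apply(toy_score).sort_values(ascending=False))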

The final formula they actually used to score the subreddits is set out in their blog post.

The full results of their analysis are kindly available for interactive visualisation, raw data download and so on here.

But in case anyone is in need of some quick offence, here are the top 5 by algorithmic toxicity. It may not be advisable to visit them on a work computer.

Rank of bigotry | Subreddit name | Official description
1 | TheRedPill | Discussion of sexual strategy in a culture increasingly lacking a positive identity for men.
2 | Opieandanthony | The Opie and Anthony Show
3 | Atheism | The web’s largest atheist forum. All topics related to atheism, agnosticism and secular living are welcome.
4 | Sex | r/sex is for civil discussions about all facets of sexuality and sexual relationships. It is a sex-positive community and a safe space for people of all genders and orientations.
5 | Justneckbeardthings | A subreddit for those who adorn their necks with proud man fur. “Neckbeard: a man who is socially inept and physically unappealing, especially one who has an obsessive interest in computing” – Oxford Dictionary

[Edited to correct Ben Bell’s name and column title of table – my apologies!]

Free data: data.gov.uk – thousands of datasets from the UK government

Data.gov.uk is the official portal that releases what the UK government deems to be open data.

The government is opening up its data for other people to re-use. This is only about non-personal, non-sensitive data – information like the list of schools, crime rates or the performance of your council.

At the time of writing it has nearly 20k published datasets available, of varying quality and in formats both pleasant and unpleasant (XML, CSV, PDF, HTML etc.), covering the following list of topics:

  • Environment
  • Mapping
  • Government Spending
  • Society
  • Government
  • Towns & Cities
  • Health
  • Education
  • Transport
  • Business & Economy

Gephi basics: simple network graph analysis from spreadsheet data

Several interesting phenomena can be modelled and analysed using graph theory. Graph theory, which Wikipedia tells me first had a paper published about it in 1736 (!), can at its most basic perhaps be thought of as a set of mathematical techniques for analysing problems where one can represent the protagonists as a set of objects (nodes) and lines connecting them (edges).

A common example would be the analysis of social networks (each person is a node, each friendship connecting them an edge), referral schemes (the people involved are nodes, each act of referral an edge) or, in a more physical sense, transport (each airport a node, each flight between them an edge).

Most common business analysis tools I have seen do not really tackle the classic graph/network visualisations between objects very well. So far it seems this hasn’t been a traditional avenue of analysis for most businesses, but as the most obvious application, “social data”, becomes ever more interesting I’m sure interest will only rise.

Luckily there is a pretty awesome tool. Not only is it super-fully-featured in this sphere, but it’s licensed under the GNU GPL, and hence free! Kudos to the heroes who create and maintain Gephi.

Whilst it produces research-publication-level output and has no shortage of advanced features, coming to it originally as a total graph-novice I did not find it overly easy to use. This is no slight on the software; my sense is that its sheer power and target audience are not conducive to a hand-holding, wizard-type system, especially when it’s a labour of love!

Here then follow a few notes on how to get the most basic data from a spreadsheet on one’s computer into Gephi, and how to visualise a stupidly simple graph from it. Of course in reality the data involved will be far larger and more interesting than this fictional example, but hopefully it helps to get started.

Here’s the data to use in this example. First my set of nodes:

ID Label
1 A
2 B
3 C
4 D
5 E
6 F
7 G
8 H
9 I
10 J

Read this as saying I have 10 “objects” (e.g. people on Facebook), which have the names A, B, C etc.

and now the set of edges:

Source Target
1 2
1 3
1 4
1 7
1 9
2 8
2 7
2 1
2 10
3 6
3 8

Read this as saying I have 11 connections (e.g. friendships between people), which involve my 10 nodes.

Note that some nodes may have no connections to other nodes, and others may have very many. Here we see that node ID 1 (“A”) has a connection to node ID 2 (“B”). If such a connection did not necessarily imply that B is also connected to A, we would have a “directed graph” – for instance B might be a Twitter follower of A even though A is not following B. In this example the connections are mutual, making it an undirected graph, and Gephi naturally handles both kinds with ease.
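As an aside – Gephi needs none of this – the same node and edge lists can be sanity-checked in a few lines of Python with networkx, for example to confirm which nodes are isolated and which are the most connected before you start visualising:

# Optional aside: load the example node/edge lists into networkx to sanity-check the structure.
import networkx as nx

labels = {i + 1: chr(ord("A") + i) for i in range(10)}    # 1 -> A, 2 -> B, ... 10 -> J
edges = [(1, 2), (1, 3), (1, 4), (1, 7), (1, 9),
         (2, 8), (2, 7), (2, 1), (2, 10), (3, 6), (3, 8)]

G = nx.Graph()                 # undirected; use nx.DiGraph() for one-way relationships
G.add_nodes_from(labels)
G.add_edges_from(edges)

print("Isolated nodes:", [labels[n] for n in nx.isolates(G)])      # ['E']
print("Degrees:", sorted(((labels[n], d) for n, d in G.degree()), key=lambda x: -x[1]))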

Note that, for Gephi, the column headers are important.

  • Every node should have a column called “ID” and, if you want to show some sort of human-readable labels, then one called “Label”.
  • Every edge should have a column called “Source” and one called “Target” which are the ID numbers of the nodes that should be connected (for an undirected graph like this one it doesn’t really matter which goes in source and which in target – but for 1-way friendships it would).
  • You can add any other columns you like to the file, which you can then use in Gephi itself if you wish, but should always try and ensure you have at least the above ones.

The files Gephi imports must be in CSV format. So, if your data is currently in an Excel workbook, the first step is to save the data as a CSV instead of an XLSX.

The CSV format supports only 1 “worksheet”, so you will have to create 2 files, one for the nodes and one for the edges. Technically you can just use the edges file and have Gephi assume that every entry in the edges table relates to a node it should create (and that there are no other nodes) but I find it safer at first to approach it explicitly with the separate files.
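For completeness, here’s one way to generate those two files from the example tables above – hand-typing them in Excel or a text editor works just as well; the only thing that really matters is the column headers:

# Write the example tables out as the two CSV files Gephi expects.
import csv

nodes = [(i + 1, chr(ord("A") + i)) for i in range(10)]   # (1, "A") ... (10, "J")
edges = [(1, 2), (1, 3), (1, 4), (1, 7), (1, 9),
         (2, 8), (2, 7), (2, 1), (2, 10), (3, 6), (3, 8)]

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Label"])       # headers Gephi looks for in a nodes table
    writer.writerows(nodes)

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])  # headers Gephi looks for in an edges table
    writer.writerows(edges)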

Now it’s time to open Gephi!

First, create a new project (or use the existing one) and press the “Data Laboratory” button at the top of the screen

[Screenshot: the “Data Laboratory” button in the Gephi toolbar]

Now choose “Import spreadsheet” from the resulting screen.

[Screenshot: the “Import Spreadsheet” option within the Data Laboratory]

Select your nodes CSV file. Make sure the dialogue box is set to recognise your file as a “Nodes table” and not an “Edges table” or you will get an error about it being in the wrong format.

Click through to “Next” and “Finished”. You’ll see there are various importing options which are useful in more complex cases – but not needed for this example.

You’ll be returned to the “Data laboratory” screen, hopefully with your nodes data showing like this:

[Screenshot: the imported nodes table in the Data Laboratory]

Note that this data is now embedded in Gephi. If you change something in your CSV it will not automatically update. Likewise there are many features in Gephi to add, remove, filter or calculate new columns which are not passed back to the CSV. Be sure to save in Gephi often!

Now, repeat that process, but this time selecting to import your edges CSV as an “Edges table”. You should end up with something like this when clicking on the little “Edges” button at the top left of the screen.

[Screenshot: the imported edges table in the Data Laboratory]

Now for the fun part. Switch back to the Overview section using the button at the top left of the screen. You should see your data, visualised as a graph!

[Screenshot: the initial graph visualisation in the Overview section]

It may not look super pretty at first, but you can see that it’s accomplished its task. Perhaps in a later post I will go through a few of the formatting options (Gephi can produce very, very beautiful output if one is prepared to try!) but for a few quick tips now:

At the bottom left of the main display window there are some formatting controls.

[Screenshot: the formatting controls at the bottom left of the graph window]

Right click and drag to pan. Use the mouse scroll button to zoom.

The button with the letter “T” applies the labels to the nodes. You may need to zoom in or adjust the size slider to the right to see them properly. This section is also where you find the basic colours, sizes etc. (there are also functions to go into later that let you colour code each node based on a variable or characteristics).

If you don’t like where a node has been physically placed on the screen there are 2 key options.

  1. Click it and drag it to where you prefer. Gephi will keep the edge properly connected.
  2. If it’s more than the odd one, or you wish to experiment with several different positioning algorithms in order to find the most effective one for your data, then note the “layout” box to the left.

[Screenshot: the “Layout” panel]
Here you can have Gephi apply algorithms that perform various sorting, manipulation or positioning operations, and it’s very fast, even on a lot of data. The software comes with some built-in ones and more are available to download – more on this in a future followup. They’re largely non-destructive, so it’s quite possible to save your file and play with them – although note the lack of an actual “undo” in this software!

Some, like the one pictured, have parameters you can set; you then press “Run” to actually apply the algorithm. I have had some success with “Force Atlas 2” in making sense of datasets somewhat larger than this example.

Here’s an example of how Force Atlas 2 and a bit of formatting made it clear that in the test dataset I had one lonely node, “E”, who has no connections. It’s also easy to see that A and B are amongst the most heavily connected nodes.

[Screenshot: the graph after running Force Atlas 2, with node E left unconnected]

That’s the basics of how to get data in and on the screen covered. Gephi does far more than this; there are all sorts of formatting, partitioning, ranking, calculating, filtering and many more abilities to help get insights out of graphs – but popping the data in and having the gratification of seeing a visualised network is the first step.

Happy graphing!

Quantified-self products of the future – an exhibition

Dublin Science Gallery have got a very data-focused exhibition on until April 17th, on the sort-of hot topic of “lifelogging”, that ever-increasing hobby of tracking not only one’s Fitbit steps but almost anything one can vaguely quantify about one’s life.

It sounds like one of the funnest parts of it might be Karl Toomey’s “lifelogging products of the future” area. Here he imagines, albeit possibly slightly tongue-in-cheek, a set of future commercial products perfectly tailored to those of us who track. The gravestone looks to be a particular highlight, and one I wouldn’t be surprised if someone already went for.

A life with 92% positive eBay feedback is surely a life well lived. The only thing that’s missing really is the deceased’s Klout score.

The vision of buying Justin Bieber’s tracked data must surely be a winner, given there are already companies offering to buy and sell even the data of the non-famous, be it social, recreational, geographical or anything else crunchable.

The exhibition also apparently features the always-beautiful Feltron report and much more.

Calculating prior year differences with custom calendars – a Tableau challenge

Visualisations of KPIs always require some context in order to make the analysis conducive to decision-making rather than just looking pretty.

One that is very common within business and elsewhere would be to check some value, for example sales revenue, against the same value within the same time period last year. “Same time period last year” can be very important for any sort of analysis where annual seasonality is in play. For example, the period just before Christmas makes those weeks very profitable for some retailers, and it would be nonsense to take pride in having sold more in the week before Christmas than in a slow-sales summer holiday week.

Tableau and most of its competitors have nice features for dealing with dates in that context. Here’s a nice blog entry showing how simple it is in Tableau to do a prior year comparison in basically 1 click.

However, it appears to become a little more challenging when a) your data is not actually associated with a date, and b) your business calendar does not really tie up with a “normal” calendar as defined in Tableau or any other sort of standard way.

Specifically, the data I need to use is at a weekly granularity, with each record showing a fiscal year and a fiscal week.

In most circumstances this remains a simple operation, because one can easily filter on all the week [X]s and have the software calculate the % difference over the years for any given week.

However, reality complicates things further: to take into account leap years and other such annoyances, our fiscal calendar sometimes has 52 weeks in a year and sometimes 53. The fiscal calendar is set in stone.

  • 2015 has 52 weeks
  • 2014 had 53
  • 2013 had 52
  • 2012 also had 52

Note that just because a year is a leap year, that does not mean that it has 53 weeks in the fiscal calendar!

The rules are such that when a 53-week year is involved, it is no longer valid for my use case to compare week X this year with week X last year. Instead, it becomes necessary to compare with week X+1 in the previous year.

So, as a concrete example using the calendar above, let’s say I want to look at the sales data for fiscal week 3 of 2015 and see how it compares to prior years by calculating a like-for-like % difference. Here is what has to be done to get the % growth value desired.

First we have to identify the correct weeks to compare. We start with fiscal week 3 2015. Having established that 2014 had 53 fiscal weeks, we now know that the comparable week in 2014 was week 4. The comparable week in 2013 was also week 4 because 2013 had only 52 weeks, and so on. Notice that the correct figures we want are 9% growth between week 3 2015 and week 4 2014, and 18% growth between week 4 2014 and week 4 2013.
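Outside of Tableau the mapping rule itself is easy enough to pin down in code. Here’s a quick sketch – not something Tableau can use directly, just to make the “step back 52 weekly slots” logic explicit:

# Sketch of the like-for-like rule: step back exactly 52 weekly "slots" through a
# fiscal calendar in which some years have 53 weeks. Not a Tableau solution.
WEEKS_IN_FISCAL_YEAR = {2012: 52, 2013: 52, 2014: 53, 2015: 52}

def comparable_week(year, week):
    """Return the (year, week) sitting exactly 52 weekly slots earlier."""
    remaining = 52
    while remaining >= week:                  # step back over whole prior years
        remaining -= week
        year -= 1
        week = WEEKS_IN_FISCAL_YEAR[year]
    return year, week - remaining

print(comparable_week(2015, 3))   # (2014, 4)
print(comparable_week(2014, 4))   # (2013, 4)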

So the question is how to automate this rule in Tableau so that only these valid like-for-like comparisons are shown, not only for these years but for the rest of eternity.

So far I have tried 2 approaches. Both work for the limited scenarios I need them for – but neither seems satisfactory in a general sense. In the interests of self-reminding documentation, here they are shown below. If any kind internet-wanderer immediately notices I’ve missed a far more obvious and better solution, telling me so would be most welcome!

Attempt 1 – use a parameter to let someone choose the correct relative week to compare to.

I have a parameter called “prior like for like week modifier” that, when set to 0, means one should compare this year’s fiscal week with the same fiscal week last year. Right now in the GM dashboard it is set to 1, meaning add 1 to this year’s fiscal week number to understand which week it should be compared to last year.

Then I created a calculated field called “Selected like-for-like fiscal week flag” which should return “true” if the record we are considering is either the currently selected fiscal week or the equivalent one to compare to in the prior year.

([Fiscal week]=[Choose fiscal week] AND [Fiscal year]=[Choose fiscal year])
OR
([Fiscal week]=[Choose fiscal week] + [Prior year like-for-like week modifier] AND [Fiscal year]=[Choose fiscal year] - 1)

This is then used as a filter set to true for any worksheets that need it.

Here’s a picture of it in action, and note it is correctly displaying the 9% figure for comparing the sales in week 3 2015 vs the comparable period of week 4 2014.

Advantages:

  • It lets the user select exactly which week they want to compare to. In fact they could choose to do a different time frame calculation instead of year vs prior year if that was a valuable feature to them. It’s also possible to hide the worksheet with the parameter on before publishing to Tableau Server, so there is no risk that a “normal” viewer, who is not interested in why this works, will try and change it.

Disadvantages:

  • Requires manually changing when the difference alters (no more than once a year of course, but one has to remember to do it!)
  • Doesn’t scale to multiple years, i.e. if the dashboard has to allow you to compare 2015 with 2014, and 2014 to 2013, and 2013 to 2012, then the offset would be different for each year, and so can’t be represented with a single parameter. It may be possible to add a parameter for each year (e.g. “2015 modifier”, “2014 modifier”, etc.) but that seems like a lot of work if you want a dashboard covering the past 20 years, for instance.
  • Only really works intuitively when showing a 1 week summary; it’s not easy to see how to construct a time series out of this.

Attempt 2 – table calculation to take advantage of the fact that the duration between the valid comparison weeks always remains the same

The rules concerning 53- and 52-week years can be summarised by saying that, if one were to lay the weekly calendar out from start to finish, you would always want to compare your current weekly data with the data point 52 “spaces” in the past. This produces the correct outcome: in a year following a 52-week year you end up comparing to the same fiscal week number of the prior year, whereas in a year following a 53-week year you end up comparing to “fiscal week + 1” of the prior year.

Therefore, with a correct layout, one can adapt Tableau’s standard “% difference from” table calculation like this:

(ZN(SUM([Sales])) - LOOKUP(ZN(SUM([Sales])), -52)) / ABS(LOOKUP(ZN(SUM([Sales])), -52))

Here’s a picture of it in action; note the annotated points are displaying the correct % growth as defined above.

Advantages:

  • Does not require the user to input any parameter
  • If the “52 spaces before” rule works every year then it will never need to be updated

Disadvantages:

  • Tableau is not aware that these fiscal years and fiscal weeks numbers relate to actual dates. Therefore it is really looking 52 datapoints backwards, not 52 weeks. If there is an entry every week for a metric then 52 data points = 52 weeks so everything works nicely. But if there was a missing datapoint, for example imagine there was a week with no sales at all so there was no row for that week, then 52 data points would actually be 53 weeks. This would mean the calculation would be wrong and, worse yet, probably quite hard for the user to notice that.
  • It means you have to have all 52 weeks in the view of the chart – although one can use the “hide” function or a “Jedi”-style table calculation filter trick to make sure the end user doesn’t see them if necessary. I suspect this causes slightly poorer performance if you have a lot of data, but so far Tableau has been more than fast enough that this isn’t an issue in reality.

Summary and conclusion

Both of these methods seem to work for the precise use cases described, but they also both have limitations such that they can’t be described as a set-and-forget general solution to the problem. Better ideas are most welcome. An example Tableau file implementing both of the above can be found here if anyone is kind enough to dabble.

Why version 9 will {FIX} Tableau for me (and workarounds in the meantime)

Excitement builds within the dataviz world as the next version of Tableau gets close to launch, supposedly within the next 60 days.

It has many new features, which data geeks and other fans can see in previews being dripped out piece by piece on the Tableau blog, and in summaries elsewhere, but one has really caught my attention, even though it is probably on the less glamorous side of things.

The general topic is “level of detail” calculations: a new type of calculation in the forthcoming version of Tableau, which I first saw in action via the Information Lab’s wonderful preview video. All the ones I’ve read about so far (INCLUDE, EXCLUDE, ….) seem like they will be very useful indeed for such super-common analytics tasks as comparing something with a larger group of somethings it’s part of, but potentially the most thrilling to me is FIXED.

These calculations are interesting to me in that they provide a way to use data from other rows to calculate something in this row. Tableau table calculations can handle some common cases of this (e.g. TOTAL, WINDOW_SUM et al.) but are not yet flexible enough for what I imagine most in-depth analysts need on a regular basis: specifically, accessing data at a pre-defined granularity that is not included in the visual analysis design itself and is neither “current row” nor “all rows”, but something in between.

Of course, not proclaiming to be a leading Tableau expert, it could simply be that I don’t know the easy way of doing this in Tableau today, so comments are most welcome if I am missing something obvious – but even with various training and Tableau Doctor sessions a few questions remain unsolved in my mind.

To highlight this, I’m going to steal something very similar to (but far simpler than) the example Tableau gave in their 3rd example here, about sales by customer tenure. Imagine a very simple table of transactions, with a Customer ID, a Transaction date and a Transaction value on each row.

Now imagine that you want to know the total transaction value for these customers based on their first transaction date. The business question might be something to the effect of “Are our newer customers bringing in the same level of revenue as our older ones?”. Silly sample data of this size is trivial to work out in one’s head, but you can imagine many real-life equivalent tables that consist of millions of customers and hundreds of millions of transactions.

The sort of data structure that would make this wonderfully easy to calculate today in Tableau would be to add another column, e.g. “first transaction date”. This could then be used as a dimension so that one can pop it on rows, colours, shapes or what have you, put transaction values into the measures and play away.

Note the point here is that every unique customer has the same “first transaction date” in all their records, whereas the actual transaction date varies as they buy more products over time.
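Purely to make that target column concrete – and with invented sample data, since the whole point of this post is wanting to avoid leaving Tableau – this is the kind of derived column in question, computed here with a pandas groupby-transform:

# Illustration only, with invented data: the "first transaction date" column repeated
# on every row for a given customer, then used to group lifetime value by start year.
import pandas as pd

transactions = pd.DataFrame({
    "Customer ID": ["C1", "C1", "C2", "C2", "C2"],
    "Transaction date": pd.to_datetime(
        ["2001-03-01", "2003-07-15", "2002-01-10", "2002-06-30", "2004-02-01"]),
    "Transaction value": [100, 50, 80, 20, 60],
})

transactions["First transaction date"] = (
    transactions.groupby("Customer ID")["Transaction date"].transform("min")
)

lifetime_by_start_year = (
    transactions.groupby(transactions["First transaction date"].dt.year)["Transaction value"].sum()
)
print(lifetime_by_start_year)    # e.g. the 2001 cohort totals 150, the 2002 cohort 160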

That column is very easy to calculate in Excel and its ilk. But, today, it does not seem so easy to calculate in Tableau in aggregate, to answer such seemingly simple questions as “how much did customers whose first transaction was in 2001 bring us over their lifetime so far vs those starting in 2002?”.

To be fair, with small examples like this it’s usually easy enough to do this even in current Tableau by creating a series of calculated fields and using them in Tableau sets to represent all customers with a minimum transaction date of a certain year. I’m not going to write about that here, as hopefully I won’t need to do it ever again in the near future! I’d note though that, as well as not being as intuitive as most of Tableau is, in the real world this technique does not seem to scale at all well, for instance when you have a large number of customers who joined in many different years. I have even managed to get Tableau to tell me it’s out of memory whilst trying to do this sort of thing with too many records, which is a very rare occurrence indeed.

So personally I have tended to force myself to switch to other tools to pre-build this extra column when I need to do something like this in Tableau.

I’m all for using the right tool for the right job, and there are many great tools available. No serious data-person should restrict themselves to a single tool. However, I haven’t experienced an ad-hoc visualisation tool I like nearly as much as Tableau, so I end up spending time taking the data from Tableau out to another tool just to make a file to re-import back into Tableau.

Unquestionably, doing so interrupts the flow of analysis, and can be a proper time-sapping pain if the source of your data is, for instance, a TDE or something else that most other packages don’t natively open. Here’s a reference list of some tools and methods I’ve used to do this successfully in the past – and probably will continue using for the next 59 days or so.

Microsoft Excel: I probably don’t need to explain what this one is, to anyone who read as far down as here anyway!

Example method: many possible options, 1 of which would be this style of array formula if we are sure no transactions in the year 2099 or afterwards exist in the dataset.
{=MIN(IF(Sheet1!$A17=Sheet1!$A$13:$A$19,Sheet1!$B$13:$B$19,DATE(2099,1,1)))}

Pros:

  • Very easy to do, most analysts will be able to create a formula for this very quickly.
  • You probably already own the software.

Cons:

  • Excel cannot handle files with over 1 million rows, and it begins to struggle long, long before that.
  • Generally will have to re-do or extend the formula when the rest of the data updates. Excel has some live-linking abilities, but I’ve never found them terribly reliable.
  • As ever with Excel, because it’s so easy to use and supremely flexible about most things, it’s easy to make a mistake, not notice and hence create inaccurate analysis. Imagine if you forgot the dollar signs in the above formula!

SQL Server (or other RDBMS):

Example method: again many possible options, one of which could be:

-- Attach each customer's first (minimum) transaction date to every one of their rows
SELECT data.[Customer ID], data.[Transaction date], data.[Transaction value], data_copy.FirstTransactionDate
FROM data
LEFT JOIN
  (SELECT [Customer ID], MIN([Transaction date]) AS FirstTransactionDate
   FROM data
   GROUP BY [Customer ID]) data_copy
ON data.[Customer ID] = data_copy.[Customer ID]

Pros:

  • Very often my data is coming from a SQL server anyway, so why not retrieve and manipulate it in one go?
  • Performance CAN be great over even large datasets
  • You can include data updates just by refreshing the query

Cons:

  • SQL is not a visually intuitive language. Personally I would recommend most analysts who have access to relational databases to take the time to learn at least the basics – but beyond the basics it can certainly get rather complex and take a lot of thinking and debugging time that could be better spent in providing insights. Especially for the case where you are running against a database that was designed perhaps for operational or transactional reasons, and not aimed at analysts.
  • SQL servers in companies are often centrally maintained by a technology department and may be being used for many non-analytical yet mission-critical tasks by many other people. Business analysts do not always have a great amount of control over those environments; for instance, I have effectively low-priority, read-only access to a lot of the ones I use.
  • For the above reason, you may need a lot of documentation to understand where to look in the database; badgering the database managers for data dictionaries, entity relationship diagrams and other complexities takes time on both sides. It might well be that such documentation doesn’t exist. And then “educated guessing” is about the best you can do.
  • For the above reasons, performance CAN be terrible depending on what you want to do
  • Accessing a remote SQL server can’t be done if you have a dodgy internet connection, or, in some cases, if you are outside the firewall (think “travelling”, for instance).
  • Some SQL databases or licenses may be very expensive, if you don’t already have access to one.

Powerpivot

Example method:

=CALCULATE(MIN(Table1[Transaction date]),ALLEXCEPT(Table1,Table1[Customer ID]))

Pros:

  • Is a free download
  • It fits directly into Excel, so relatively user friendly and easy for any decent Excel user, especially those who pride themselves on pivot table skills.
  • It enables Excel to handle a much bigger volume of data and introduces new functions that can be used for many useful things that standard Excel pivot tables can’t do.
  • Everything seems to be precalculated so performance can be extremely good after the initial setup of your data is done.
  • The language it uses for formulae, DAX, is for the most part pretty similar to standard Excel functions where it can be.
  • The CALCULATE() function used above uses a syntax I find very intuitive, flexible and really like a lot; possibly my favourite implementation of a feature like the ones you need to do this task.

Cons:

  • You, and any other people who want to manipulate your Excel file, need to have the Powerpivot add-in installed. This needs downloading and temporary admin rights if you’re running Excel 2010. I have no experience with Excel 2013 variants.
  • Formulas in Powerpivot are written in “DAX” language. DAX is not a language that all that many people speak in my experience (although the basics aren’t complicated).
  • The version you can use seems dependent on other Microsoft infrastructure. For instance, I am restricted to the 2008 version due to the other software we run. 7 years is a LONG time for analysis tools!
  • Behind the scenes it is creating something similar to a SSAS cube. This means that when you connect Tableau to it as a datasource you suffer the loss of many of Tableau’s finest features, as it’s effectively an OLAP cube. Many of Tableau’s competitors won’t connect to it at all.
  • The big showstopper for me: it’s in-memory software, I believe, so when I try to use row-scanning operations like the CALCULATE() formula above, especially with a FILTER() over millions of rows, I have consistently had “out of memory” errors. This means it whirs away for half an hour, then gives the error message and totally refuses to do what you asked. I come up against this a LOT. To be fair, maybe a version that isn’t 7 years old has far better performance than this one, but sadly with our current infrastructure that’s not an option for me. It can handle data far bigger than native Excel without question, but it still seems to have a limit that one comes up against frequently even when not dealing with mega-terabyte-Hadoop type stuff. I would love to see if new versions have helped on that front at all, as it’s a great tool for its (non-existent) cost.

Alteryx:

Example method: Alteryx workflows look like visual flow charts. One option for the crux of the matter is probably a Multi-Row Formula tool with the following formula, grouped by Customer ID, after the data is sorted by Customer ID and then transaction date.

IF ISNULL([Row-1:Transaction date]) THEN [Transaction date] ELSE [Row-1:First transaction date] ENDIF

Pros:

  • It is blazing fast. Really, really fast. I have seen instances where the exact same task that made PowerPivot crash out of memory after 20 minutes was completed successfully in under 30 seconds in Alteryx.
  • After a little experience, the flowchart interface is easy to use and I find I make fewer mistakes here than in some other tools.
  • If you need to repeat it with updated data, just re-run the Alteryx routine and you replicate the results without needing to recreate or edit anything.

Cons:

  • If you don’t already have it, and most people don’t, Alteryx isn’t the cheapest of tools. I think the “normal” version is around £2500 per user these days. Definitely not over-priced for what the package can do – but obviously it is not cost effective if your only task really is to add a column to a table once!
  • The interface is very different from many “normal” tools, which means it might put many inexperienced users off trying it out unless they really need to.
  • Super-petty criticism, but on my computer it takes a relatively long time for Alteryx to load and get itself into a usable state; a definite “flow interrupter”.

Tableau 9 (my hope):

Example method: I didn’t get around to installing the beta test version yet, so this is assumed from reading the blog posts of many people wiser than myself – but if I understand correctly it will take just <=6 little words (depending on how you count).

{FIXED [Customer ID] : MIN([Transaction Date])}

Pros:

  • If this is true, then you won’t have to leave Tableau – and your flow of analysis – at all to do the task.
  • It will refresh as and when the data does, and work offline with extracts or online without.
  • Typing the above in will take approximately 5 seconds or less!

Cons:

  • None come to mind, although I have one potential concern: performance. Functions like this, I would assume, have to scan over a lot of rows to get the result. Tableau has (almost) always seemed a very fast tool to me so I’m optimistic – and Alteryx has shown these functions can run superfast – but I’m curious to know how well these types of “level of detail” functions work over medium-large datasets (100,000s – 10,000,000s of rows). I will look forward to finding this out in the not too distant future.

Either way this feature will surely save me, and I’m sure many, many others, an incredible amount of time on very common, and quite dull, analytical tasks. I am grateful indeed.