Future features coming to Alteryx 10.6 and beyond

One of my favourite parts of attending the ever-growing Alteryx Inspire 2016 conference, and events like it, is hearing about the fun new features that tools such as the wonderful Alteryx are going to make available soon. It’s always exciting to think about how such developments might improve our efficiency and job satisfaction, or enable whole new activities that have not been practical so far.

From this blog’s page view stats, it seems like others out in the great mysterious internet also find that sort of topic interesting, so below are a few notes I made from the various public sessions I was lucky enough to attend, about some of what Alteryx is thinking of adding over the next few versions.

In-database tools:

Since the addition of in-database tools, Alteryx has allowed analysts to push some of the heavy lifting / bandwidth hoggage back to the database servers that provide the data. If you’re an analyst who regularly uses moderate to large datasets obtained from databases, you should really look into this feature: without it, Alteryx spends time pulling data from the remote database down to your local machine. Anyway, a few new developments are apparently planned:

  • New in-database data sources.
  • New in-database predictive analytics (I believe SQL Server was on the list)
  • A makeover of the in-database connection tool to make it easier to use

New data sources:

New predictive tools: 

Many of these may be delivered via the Alteryx Predictive District (in fact it’s well worth looking there now for the existing tools – although I appreciate they don’t want to clog up your toolbar with thousands of icons, it’s not always easy to remember to check these fantastic districts! May I suggest an in-Alteryx search feature for these in the future?)

  • Time series model factory
  • Time series forecast factory
  • Time series factor sample
  • Cross validation model comparison
  • Model based value imputation
  • K medoids cluster analysis
  • Text classification tools, to enable e.g. sentiment analysis, key phrase extraction, language detection, topic modelling.

An analytic app that will allow you to install your own choice of R packages from CRAN.

Some “Getting started kits” that will help newcomers to predictive analytics, each focusing on a specific business question. Examples include:

  • How does a price change impact my bottom line?
  • How can I predict how much a customer will spend?
  • How can I predict whether a customer will buy the product I put on sale?

Prescriptive analysis tools:

Yes, the next stage after predicting something is to prescribe what we should then do. A new tool category will arrive with tools in this area, starting with:

  • Optimisation: have Alteryx maximise or minimise a value based on constraints for an optimum outcome. One example demonstrated was “what’s the best product mix to stock on a shop shelf to maximise profits, whilst ensuring the shelf has no more than 1 of any particular item?” (see the sketch after this list).
  • Simulation: think here of things like Monte Carlo simulations, and, in the future, agent-based simulations.
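To make the optimisation idea concrete, here’s a deliberately tiny, hypothetical sketch in Python of the shelf-stocking example – a brute-force search rather than anything Alteryx itself will do, with invented product names, profits and space requirements:

from itertools import combinations

# Invented illustrative data: name -> (profit, shelf space used)
products = {
    "tea":      (2.0, 1),
    "coffee":   (3.5, 2),
    "biscuits": (1.5, 1),
    "cake":     (4.0, 3),
}
shelf_capacity = 4   # total space available on the shelf

best_profit, best_mix = 0.0, ()
for r in range(len(products) + 1):
    for mix in combinations(products, r):        # at most 1 of each item
        space = sum(products[p][1] for p in mix)
        profit = sum(products[p][0] for p in mix)
        if space <= shelf_capacity and profit > best_profit:
            best_profit, best_mix = profit, mix

print(best_mix, best_profit)   # ('tea', 'coffee', 'biscuits') 7.0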

Improvements to existing tools:

  • Formula tool: will include
    • autocomplete,
    • inline search for functions & fields,
    • suggestions of common options based on context such as field type,
    • a data preview to show you right away what the results of your formula will be on a sample record.

This one makes me happy! Without meaning to cause offence, the current incarnation of the formula tool, which has to be one of the most used tools for almost everyone, is a little… erm… “old fashioned” to those of us spoilt by the autocomplete and lookup features other vendors have offered in recent times when typing code. No more digging around trying to remember if a function to create a date is under “date/time” or “conversion” etc.

  • Smarter data profiling tools
  • Improved reporting output tools
  • Web based scheduling

Alteryx server updates:

I must admit to not being a server user, so I am not 100% sure whether these are new features. But it seemed so:

  • Row level security on data, i.e. different users see different records in the same datasource.
  • Version history

Estimated release dates:

Version 10.6 may be around the end of this month. Version 11 towards the end of the year (no promises made). I did not note which features were planned for which version.


Accessing Adobe Analytics data with Alteryx

Adobe Analytics (also known as Site Catalyst, Omniture, and various other names both past and present) is a service that tracks and reports on how people use websites and apps. It’s one of the leading solutions for organisations who are interested in studying how people are actually using their digital offerings.

Studying real-world usage is often far more insightful, in my view, than surveying people before or after the fact. Competitors to Adobe Analytics would include Google Analytics and other such services that allow you to follow web traffic, and answer questions from those as simple as “how many people visited my website today?” up to “can we predict how many people from New York will sign up to my service after having clicked button x, watched my promo video and spent at least 10 minutes reading the terms and conditions?”

In their own words:

What is Adobe Analytics?
It’s the industry-leading solution for applying real-time analytics and detailed segmentation across all of your marketing channels. Use it to discover high-value audiences and power customer intelligence for your business.

I use it a lot, but until recently have always found that it suffers from a key problem. Please pardon my usage of the 4-letter “s word” but, here, at least, the Adobe digital data has always pretty much remained in a silo. Grrr!

There are various native solutions, some of which are helpful for certain use cases (take a look at the useful Excel addin or the – badly named in my opinion, and somewhat temperamental – “data warehouse” functionality for instance). We have also had various technology teams working on using native functionality to move data from Adobe into a more typical and accessible relational database, but that seems to be a time-consuming and resource-intensive operation to get in place.

None of the above solutions has yet really met my need to extract reasonably large volumes of data quickly and easily, on an ad-hoc basis, for integration with other data sources in a refreshable manner. And without that, in a world that moves ever more towards digital interactions, it’s hard to get a true overall view of your customers’ engagement.

So, imagine how the sun shone and the angels sang in my world when I saw the Alteryx version 10.5 release announcement.

…Alteryx Analytics 10.5 introduces new connectors to Google Sheets, Adobe Analytics, and Salesforce – enhancing the scope of data available for analytic insights

I must admit that I had had high hopes that this would happen, insomuch as when looking at the detailed schedule agenda for this year’s Alteryx Inspire conference (see you there?) I noticed that there was mention of Adobe Analytics within a session called “How to process and visualise data in the cloud”. But yesterday it actually arrived!

It must be said that the setup is not 100% trivial, so below I have outlined the process I went through to get a successful connection, in case it proves useful for others to know.

Firstly, the Adobe Analytics data connector is not actually automatically installed, even when you install the full, latest version of Alteryx. Don’t let this concern you. The trick, once you have updated Alteryx to at least version 10.5, is to download the connector separately from the relevant page of the Alteryx Analytics gallery. It’s the blue “Adobe Analytics install” file you want to save to your computer; there’s no need to press the big “Run” button on the website itself.

(If you don’t already have one, you may have to create an Alteryx gallery user account first, but that’s easy to do and free of charge, even if you’re not an Alteryx customer. And whilst you’re there, why not browse through the manifold other goodies it hosts?).

You should end up with a small file called “AdobeAnalytics.yxi” on your computer. Double click that, Alteryx will load up, and you’ll go through a quick and simple install routine.


Once you’ve gone through that, check on your standard Alteryx “Connectors” ribbon and you should see a new tool called “Adobe Analytics”.

Just like any other Alteryx tool you can drag and drop that into your workflow and configure it in the Configuration pane. Once configured correctly, you can use it in a similar vein to the “Input data” tool.

The first thing you’ll need to configure is your sign-in method, so that Alteryx becomes authorised to access your Adobe Analytics account.

This isn’t necessarily as straightforward as with most other data connectors, because Adobe offers a plethora of different types of account or means of access, and it’s quite possible the one that you use is not directly supported. That was the case for me at least.

Alteryx have provided some instructions as to how to sort that out here. Rather than use my standard company login, instead I created a new Adobe ID (using my individual corporate email address), logged into marketing.adobe.com with it, and used the “Get access” section of the Adobe site to link my company Adobe Analytics login to my new Adobe ID.

That was much simpler than it sounds, and you may not need to do it if you already have a proper Adobe ID or a Developer login, but that’s the method I successfully used.

Then you can log in, via the tool’s configuration panel.


Once you’re happily logged in (using the “User login” option if you followed the same procedure as I did above), you get to the juicy configuration options to specify what data you want your connector to return from the Adobe Analytics offerings.

Now a lot of the content of what you’ll see here is very dependent on your Adobe setup, so you might want to work with the owner of your Adobe install if it’s not offering what you want, unless you’re also multitasking as the Adobe admin.

In essence, you’re selecting a Report Suite, the metrics (and dimensions, aka “elements”) you’re interested in, the date range of significance and the granularity. If you’re at all familiar with the web Adobe Analytics interface, it’s all the same stuff with the same terminology (but, if it offers what you want, so much faster and more flexible).

Leave “Attempt to Parse Report” ticked, unless for some reason you prefer the raw JSON the Adobe API returns instead of a nice Alteryx table.

Once you’ve done that, then Alteryx will consider it as just another type of datasource. The output of that tool can then be fed into any other Alteryx tool – perhaps start with a Browse tool to see exactly what’s being returned from your request. And then you’re free to leverage the extensive Alteryx toolkit to process, combine, integrate, analyse and model your data from Adobe and elsewhere to gain extra insights into your digital world.

Want an update with new data next week? Just re-open your workflow and hit run, and see the latest data flow in. That’s a substantial time and sanity saving improvement on the old-style battle-via-Excel to de-silo this data, and perhaps even one worth buying Alteryx for alone if you do a lot of this!

Don’t forget that with the Alteryx output data tool, and the various enhanced output options including the in-database tools and Tableau output options from the latest version, you could also use Alteryx simply to move data from Adobe Analytics to some other system, whether for visualisation in Tableau or integration into a data warehouse or similar.

A use case might simply be to automatically push up web traffic data to a datasource hosted in Tableau Server, for instance, so that any number of your licensed analysts can use it in their own work. You can probably find a way to do a simple version of this “for free” using the native Adobe capabilities if you try hard enough, but anything that involves a semblance of transform or join, at least in our setup, seems far easier to do with external tools like Alteryx.

Pro-tip: frustrated that this tool, like most of the native ones, restricts you to pulling data from one Adobe Report Suite at a time? Not a problem – just copy and paste the workflow once for each report suite and use an Alteryx Union tool to combine the results into one long table.

Here are screenshots of an example workflow and results (not from any real website…) to show that in action. Let’s answer a simple question: how many unique visitors have we had to 2 different websites, each represented by a different report suite, over the past week?

Capture

Capture

Performance: in my experience, although Adobe Analytics can contain a wealth of insightful information, I’ve found the speed of accessing it to be “non-optimal” at times. The data warehouse functionality for instance promises/threatens that:

Because of the complexity of Data Warehouse reports, they are not immediately available, can take up to 72 hours to generate, and are accessible via email, FTP or API delivery mechanisms.

The data warehouse functionality surely allows complexity that’s an order of magnitude beyond what a simple workflow like this does, but just for reference, this workflow ran in about 20 seconds. Pulling equivalent data for 2 years took about 40 seconds. Not as fast as you’d expect a standard database to perform, but still far quicker than making a cup of tea.

Sidenote: the data returned from this connector appears to come in string format, even when it’s a column of a purely numeric measure. You might want to use a Select tool or other method in order to convert it to a more processable type if you’re using it in downstream tools.
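If you end up doing that conversion outside Alteryx instead – say after exporting the connector’s output to a CSV for use in Python – the equivalent is a one-liner in pandas. A minimal sketch only, with an invented file and column name:

import pandas as pd

# "uniquevisitors" is a made-up column name standing in for any numeric
# measure that arrived as text.
df = pd.read_csv("adobe_output.csv", dtype=str)
df["uniquevisitors"] = pd.to_numeric(df["uniquevisitors"], errors="coerce")
print(df.dtypes)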

Overall conclusion: HOORAY!

Setting up the automatic “upload to Tableau Online / Server” feature in Alteryx 10.1

Alteryx released their new version 10.1 this week – license holders/demo seekers can pop along to the download page to install it.

Amongst the new features it promised was:

Automatically update Tableau Server and Tableau Online with a new Tableau macro

The concept is simple: process/model/advanced-analyse your data in Alteryx and then it will push up the results as a Tableau data source which all your Tableau Server or Tableau Online users can use in their Tableau workbooks. If you set it up nicely, you could therefore have it auto-update a bunch of complicated datasources with one click of a button.

As the community of Tableau users gets ever more integrated with the community of Alteryx users, this is a pretty exciting development. In fairness, the geniuses at the Information Lab had already enabled this via a couple of nice macros they were kind enough to distribute – but I guess to see this made a feature of the core software itself at least shows that what they did was highly desired!

After quickly installing 10.1 though, I couldn’t find the tool immediately. It turns out you need to download and install it. It’s very easy to do, but below are some notes on the process to follow, and how to configure the tool if you use Tableau Online, rather than Tableau Server.

Steps to install the “Publish to Tableau Server” tool

Firstly, make sure you have installed Alteryx 10.1. Get it here.

Then go download the “Publish to Tableau Server” tool from the Alteryx Analytics Gallery.

Double-click to run it once you have downloaded it and it will open in Alteryx.

Press the toolbar button for “Run as an analytic app”. This is not the usual “play” triangle, but rather the one that looks like a wand next to it.

From the resulting wizard, choose “Install”

Close and restart Alteryx – and now you should have the sparkly new Publish to Tableau Server tool available.

Configuring the tool to upload to Tableau Online

(which is really pretty much the same as for Tableau Server, but there are a couple of extra bits to bear in mind for those using the cloud-hosted version of Tableau.)

As with any other Alteryx tool, pop it into your workflow such that the data you want uploading to Tableau Online/Server is going into its input. The output it passes on will be the same as whatever input you gave it, so you can continue processing the data in Alteryx after the upload has happened.

Capture

Click on the Publish tool and you will see you have to configure 2 tabs: Connection and Output.

They’re quite self explanatory, but note, for Tableau Online:

When configuring with a server URL, do not just enter online.tableau.com. Instead you have to include the subdomain before the “online” bit. To find out what that is just log onto your Tableau Online site in your web browser and note the first part of the address. The one I tested on for instance is called: https://10az.online.tableau.com

The “Use default site” option will not work for Tableau Online. This is presumably because there is not really a default site when a single server is being used by a large number of different customers! Manually type in the name of your Tableau Online site, which you can find after the “/site/” part of the URL in your web browser when you log into Tableau Online. It’s probably the name of your company or something similar (if you’re working in a company).

You can then use the “Output” tab of the configuration to specify the project, data source and so on to save to. Then just hit the normal “run” button, wait a couple of minutes until Alteryx tells you it’s finished, and you should find you have a nice new datasource in your Tableau Online site.
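As an aside, if you ever want to script the same kind of upload outside Alteryx, the separate tableauserverclient Python library can also publish a data source to Tableau Online. A rough sketch only, with made-up credentials, site, project and file names:

import tableauserverclient as TSC

# All names below are invented placeholders.
auth = TSC.TableauAuth("me@example.com", "my-password", site_id="mycompanysite")
server = TSC.Server("https://10az.online.tableau.com", use_server_version=True)

with server.auth.sign_in(auth):
    # Publish into the project called "Default", overwriting any existing copy.
    project = next(p for p in TSC.Pager(server.projects) if p.name == "Default")
    datasource = TSC.DatasourceItem(project.id, name="My uploaded data")
    server.datasources.publish(datasource, "my_data.hyper",
                               TSC.Server.PublishMode.Overwrite)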

 

Awesome Alteryx cache tool from the Alteryx community

Alteryx is a superb tool for data manipulation and it’s generally very fast at what it does. However this only encourages us to put large volumes of data through its manipulation capabilities, which can cause annoying pauses during workflow development. Perhaps it’s because your source database is none too fast, or simply because whatever function you’re asking Alteryx to perform over a billion rows of data is complicated.

Being inclined to somewhat iterative development, I’ve often had to go and make several cups of tea whilst it whirs away. I had desperately longed for a fictional feature that would let me right-click on a tool and say “run from here”. In my dream world, this would take a copy of the data coming into the selected tool’s input stream from the last time it ran, and process it through (only) the tools to the right of the selected tool. Thus if you wanted, for instance, to run some R routine on a big pile of previously manipulated data, you could right-click run-from the R tool, and would not have to wait for the data to be collected and transformed again.

My dream world is not quite here, but in the mean time there’s a super-useful tool available from Alteryx employee MacRo on their community site – the Cache Dataset Macro.

Clear instructions are shown on their site, but essentially you pop it just before the tool you want to “run from here”. Set it to “Write” mode, which will then write out a file to disk of all the data coming into it the next time you run the workflow.

Then switch it to “Read” mode, disable all the tools to the left of it, and then when you next run the workflow it will ignore everything to the left of the cache tool and feed your “written” file into the tool it’s connected to.

It’s really equivalent to manually using an output tool to save out a file, and then an input tool to read it back in, but it’s life-changingly faster and less hassle than actually doing that.
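The underlying pattern – write the expensive intermediate result once, then read it back on every later run – is easy to picture in code. A minimal Python sketch of the same idea, with an invented cache file name and a stand-in for the slow upstream step:

import os
import pandas as pd

CACHE_FILE = "expensive_step_cache.pkl"   # invented file name

def slow_upstream_step() -> pd.DataFrame:
    # Stand-in for the slow database pull / heavy transformation.
    return pd.DataFrame({"customer": ["A", "B"], "value": [1, 2]})

if os.path.exists(CACHE_FILE):
    df = pd.read_pickle(CACHE_FILE)   # "Read" mode: skip the slow step entirely
else:
    df = slow_upstream_step()
    df.to_pickle(CACHE_FILE)          # "Write" mode: cache it for next time

# ...carry on developing the downstream logic against df...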

Right now the macro doesn’t delete the files it writes (apparently that’s an upcoming feature – but in the mean time be sure to clear any massive ones out manually) and there’s no way for it to disable the tools to the left of it automatically – you can use a tool container to do that – but nonetheless it’s a huge step up for my Alteryx productivity and one I am most grateful for!

Bonus point: I assume the icon MacRo chose for the tool is based on the obvious homophone-based pun, which is surely enough to brighten up the day of any long-suffering analyst.

A first look at Alteryx 10’s Network Analysis tool

Alteryx version 10 was recently released, with all sorts of juicy new features in realms such as usability, data manipulation and statistical modelling. Perhaps one of the most interesting ones for me, though, is the new Network Analysis tool.

This provides an easy way to make network graph visualisations natively, something that many general purpose analytical tools don’t do (or require workarounds). Behind the scenes, it uses R, but, as per the other Alteryx R tools, you don’t need to worry about that.

Until now, I had used Gephi for such work; it’s a great free open-source program which is tremendously capable at this style of analysis, but not always particularly friendly or easy to use, and it requires data to be exported into it.

In a previous post I wrote about the basics of getting data into Gephi and visualising it. The very simple example I gave there is easily replicable in Alteryx. Here’s how:

First create your tables of nodes (the dots) and edges (the lines between the dots).

The documentation states that your nodes must have a unique identifier with the fieldname “_name_”, and the edges must have fields “from” and “to”. In practice I actually found it often works fine even without those specific field names, but it is easy to rename columns in Alteryx (use the Select tool, for instance), so one might as well follow the instructions where possible.

So for a basic example, here’s our table of nodes:

_name_   label   Category
1        A       Cat1
2        B       Cat1
3        C       Cat1
4        D       Cat2
5        E       Cat2
6        F       Cat2
7        G       Cat3
8        H       Cat3
9        I       Cat3
10       J       Cat3

And edges:

From   To
1      2
1      3
1      4
1      7
1      9
2      8
2      7
2      1
2      10
3      6
3      8

Pop a “Network Analysis” tool onto the canvas. It’s in the Predictive section of the Alteryx toolbar. Then hook up your nodes file to the N input and edges file to the E input.

Alteryx network viz workflow

There’s some configuration options on the Network Analysis tool I’ll mention briefly shortly, but for now, that’s it, job done! Press the run button and enjoy the results.

The D output of the tool gives you a data table with 1 row per node, showing various graph-related statistics: betweenness, degree, closeness, pagerank and evcent (eigenvector centrality). You can then use these statistics directly later in your workflow.
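Behind the scenes this is standard graph theory, so for the curious here’s a rough Python sketch that reproduces the same sort of per-node statistics with the networkx package on the toy edge list above (the Alteryx tool itself uses R, so this is purely illustrative):

import networkx as nx

edges = [(1, 2), (1, 3), (1, 4), (1, 7), (1, 9),
         (2, 8), (2, 7), (2, 1), (2, 10), (3, 6), (3, 8)]

G = nx.Graph()              # undirected; use nx.DiGraph() for directed edges
G.add_edges_from(edges)

stats = {
    "betweenness": nx.betweenness_centrality(G),
    "degree": dict(G.degree()),
    "closeness": nx.closeness_centrality(G),
    "pagerank": nx.pagerank(G),
    "evcent": nx.eigenvector_centrality(G),
}

for node in sorted(G.nodes()):
    print(node, {name: round(values[node], 3) for name, values in stats.items()})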

The I output gives you an interactive graphical representation of your network, with cool features like the ability to search for a given node, tooltips on hover, clicking to drag/highlight nodes, some summary stats, and a histogram of the various graph statistics that describe the characteristics of your network, like this:


Although for most tools the “auto-browse” function of Alteryx 10 negates the need for a Browse tool, you will need one connected to the I output if you want to see the graphic representation of your network.

There are some useful configuration options in the Network Analysis tool itself, in 3 categories: nodes, edges and layout.

Perhaps the 3 most interesting ones are:

  • ability to size nodes either based on their network statistics or another variable,
  • ability to have directed (A connects to B, B might not connect to A) or undirected (A connects to B implies B connects to A) edges.
  • ability to group nodes by either network statistics or another variable (e.g. to differentiate between Facebook friends and Facebook groups).

Here, for example, is the above diagram with the nodes sized by degree (# connections), coloured by my variable “Category”, and the edges set to directed.

Options for network viz tool

Network viz with options


Sidenote 1: There seems to be a trick to getting the group-by-variable to work though, which I’m not sure is intentional(?). I found that the tool would only recognise my grouping variables if they were specifically of type “String”.

Alteryx text from an input file usually defaults to type “V_String”, but the Network Viz tool would not let me select my “Category” field to group nodes by if I left it at that. However, it’s very easy to convert from V_String to String by use of a Select tool.

Select tool to string

Sidenote 2: For people like me who are locked down to an old version of Internet Explorer (!) – the happy news is that the Alteryx network viz works even in that situation. In previous versions of Alteryx I found that the “interactive” visualisations tended to fail if one had an old version of IE installed.


Overall, the tool seems to work well, and is as quick and easy to use as users of Alteryx have probably come to expect. It even, dare I say it, has an element of fun to it.

It’s not going to rival – and probably never will try to – the flexibility of Gephi for those hand-crafting large complex networks with a need for in-depth customisation options and output. Stick with that if you need the more advanced features (or if you can’t afford to buy Alteryx!).

But for many people, I believe it contains enough features, even in this first version, to do the basics of what most analysts probably want a network viz for, and it will save you hours compared with finding and learning another package.

At least for relatively small numbers of nodes, anyway; on my first try I found it hard to gain much insight from the display of a larger network, as the viewing area was quite small – but some of this is innate to the nature of the visualisation type itself. I have also not yet experimented very much with the different layout options available, some of which might dramatically improve things if they have an impact similar to Gephi’s layout options. Picking the optimum location to display each node is a distinctly non-trivial task for software!

Remember also that as the “D” output gives a data table of network stats per node, one could always use that output to pre-filter another incarnation of the network viz tool and show only the most “interesting” nodes if that was more useful.

In general this new Alteryx tool is so easy to use and gives such quick results that I hope to see it promote effective use of such diagrams in real-world situations where they can be useful. At the very least, I’m sure it’ll convince a few new “data artisans” to give network analysis a try.

4 ways to make Alteryx tools easier to find

Alteryx is surely the king of easy-to-use data manipulation tools, and as a bonus handles predictive and spatial analysis via a very friendly set of tools. One method by which it makes things easy to use is that it has a huge array of built-in tools to drag and drop, thus saving you the task of programming procedures yourself.

This means each tool is easy to use for its intended purpose, but there is a downside: one has to actually locate the tool one wants among what are probably over a hundred to choose from (I haven’t counted, and besides, you can add more if you wish).

The default tool-picker interface looks like this:

Capture

There’s a thin list of tool categories (In/out, Preparation etc.) and then, within each category, a larger list of actual tools (Browse, Date Time Now etc.). The list is so long it cannot fit on one screen, even on a reasonable widescreen monitor.

I understand that the programmers behind Alteryx are already considering improvements to this interface in order to make it easier to find what you need in a hurry, but there are already a few ways that – whilst not life-changing – may help you save a couple of seconds of looking for the correct tool many times a day.

1: Tool search

You see on the top left of the above screenshot that there’s a search box. If you know the tool you are after, just type its name in here and the tools list will instantly filter down to those that match what you typed.

Tool search

In fact, it’s cleverer than it seems. Tools will pop up even if the name doesn’t match as long as Alteryx thinks it “sort of” relates to what you type. This means, on a good day, it’s also a quick way to find a tool you might have forgotten the name of.

For example, perhaps you need to create what Excel would call a pivot table. There’s no tool called “pivot” in default Alteryx, so what to do? Just type “pivot” into the tool search box and it shows you 3 tools it thinks might relate to pivoting – cross tab, transpose and table. Sure enough, its first suggestion, “cross tab”, is the one you probably want to use.

pivot

2: Tool favourites

You’ve probably noticed already the “Favorites” tool category. This contains a few tools aggregated from other tool sections. The defaults aren’t bad, but perhaps you want to change them to fit your personal workflow.

It’s as simple as right-click -> remove from favorites once you’re in the Favorites section to get rid of one you don’t want (it stays in its main category section), and right-click -> add to favorites from any of the non-favorite tool categories to add the tool you clicked on to your favourites. In this way you can construct your own custom ribbon containing the tools you use most often, no matter which section they come from.

Favorites section

For those who are not fans of the right-click, if you hover over any of the tools in the other sections, you will see a faint star appear to the top right of the tool. Clicking on that will also add the tool to your favourites section.

3: Setting default section tools

You see the icons showing the tool sections available to you on the upper thin ribbon? They’re more than just a way to open a tool category. Note how each one is labelled with the icon of one of the tools within that category. For example, the “Preparation” section is labelled with the formula tool’s conical-flask icon.

You can actually drag these small icons straight down to your workflow, and you’ll get the default tool for that category without having to switch to its tool category first.

In the example here, clicking and dragging the Preparation category icon will give you a formula tool to use, without changing the selected ribbon category away from “Transform”.

Drag default tool

But what if you don’t use the formula tool much and wish you could drag the Filter tool down without opening that tool’s section? Just right-click the Filter tool and click “Set default tool”. You’ll notice the icon next to the “Preparation” category changes to resemble the Filter tool’s icon – and now you can click and drag that into your workflow without switching away from another tool section.

Change default tool

4: Pin your preferred categories to the left

Perhaps you are less bothered about individual tools and more bothered about categories as a whole. By default, some of the categories sit off the right-hand side of the screen unless you have a giant monitor. If you’re a constant “Social Media Tools” user, you might be frustrated to find you have to scroll right to reach that category every time you wish to use it.

You can, however, right-click on the category icon and choose “Pin Social Media” in this case. That will move it to the left and keep it nice and handy next to your Favorites category.

Pin section

Social media pinned

Basic text tokenisation with Alteryx

Free text analytics seems a fashionable pastime at present. The most commonly seen form in the wild might be the very basic text visualisation known as the “word cloud”. Here, for instance is the New York Times’ “most searched for terms” represented in such a cloud.

When confronted with a body of human-written text, one of the first steps for many text-related analytical techniques is tokenisation. This is the process where one decomposes the lengthy scrawlings received from a human into suitable units for data mining; for example sentences, paragraphs or words.

Here we will consider splitting text up into individual words with a view to later processing.

My (trivial) sample data is:

ID   Free text 1             Free text 2    Free text 3
1    Here is some            Free text!     But it needs seperating
2    Into individual words   Can Alteryx    Do it?
3    Of course ALTERYX can   ,              no problem

Let’s imagine I want to know how many times each word occurs. For instance, “Alteryx” appears in there twice.

Just to complicate things marginally, I’d also like to:

  • remove all the “common” words that tend to dominate such analysis whilst adding little insight – “I”, “You” etc.
  • ignore capitalisation: the word ALTERYX should be considered the same as Alteryx for example.

Alteryx has the tools to do a basic version of this in just a few seconds. Here’s a summary of one possible approach.

First I used an Input tool to connect to my datasource which contained the free text for parsing.

Then I used a formula tool to pull all the free text fields into one single field, delimited by a delimiter of my choice. The formula is simply a concatenation:
[Free text 1] + "," + [Free text 2] + "," + [Free text 3]

The key manipulation was then done using the “Text To Columns” tool. Two configuration options were particularly helpful here.

  • Despite its name, it has a configuration option to split your text into rows rather than columns. This is great for this type of thing because each field might contain a different, and unknown, number of words – and for most analytic techniques it is easier to handle a table of unknown length than one of unknown width.

    You will still be able to track which record each row originally came from as Alteryx preserves the fields that you do not split on each record, similar to how the “Transpose” tool works.

  • You can enter several delimiters – and it will consider each of them independently. The list I used was ', !;:.?"' , meaning I wanted to consider that a new word started or ended whenever I saw a comma, space, full stop, question mark and so on. You can add as many as you like, according to how your data is formatted. Note also the advanced options at the bottom if you want to ignore delimiters in certain circumstances.

    Text to columns tool

When one runs the tool, if there are several delimiters next to each other, this will (correctly) cause rows with blank “words” to be generated. These are easily removed with a Filter tool, set to exclude any record where the text IsEmpty().

Next I wanted to remove the most common basic words, such that I don’t end up with a frequency table filled with “and” or “I” for instance. These are often called stopwords. But how to choose them?

In reality, stopwords are best defined by language and domain. Google will find you plenty based on language, but you may need to edit them to suit your specific project. Here, I simply leveraged the work of the clever developers of the famous Python Natural Language Toolkit.

This toolkit contains a corpus of default stopwords in several languages. The English ones can be extracted via:

from nltk.corpus import stopwords
# the corpus may need downloading first, via nltk.download('stopwords')
stopwords.words('english')

which results in a list of 127 such words – I, me, my, myself etc. You can see the full current list on my spreadsheet: NLTK stopwords.

I loaded these into an Alteryx Text Input tool, and used a Join tool to connect the words generated from my text (on the left side) to the words in this stopword corpus (on the right side), and took the left-hand output of the Join tool.

Join tool

The overall effect of this is what relational database fans might recognise as a LEFT JOIN filtered to where the right-hand side is NULL (an anti-join): an output that gives me all of the records from my processed text that do not match anything in the stopword corpus.

(Alteryx has a nice “translate from SQL to Alteryx” page in the knowledgebase for people looking to recreate something they did in SQL in this new-fangled Alteryx tool).

The output of that Join tool is therefore the end result; 1 record for each word in your original text, tokenised, that is not in the list of common words. This is a useful format for carrying on to analyse in Alteryx or most other relevant tools.

If you wanted to do a quick frequency count within Alteryx for instance, you can do this in seconds by dropping in a Summarise tool that counts each record, grouping by the word itself.
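For reference, the whole pipeline – concatenate the free-text fields, split on the delimiters, drop blanks and stopwords, then count word frequencies – can also be sketched in a few lines of plain Python. This is just an illustration of the same logic, not the Alteryx workflow itself; the sample records are the ones from the table above.

import re
from collections import Counter
from nltk.corpus import stopwords

records = [
    ["Here is some", "Free text!", "But it needs seperating"],
    ["Into individual words", "Can Alteryx", "Do it?"],
    ["Of course ALTERYX can", ",", "no problem"],
]

stop = set(stopwords.words('english'))

words = []
for fields in records:
    text = ",".join(fields)                       # the Formula tool's concatenation
    for token in re.split(r'[, !;:.?"]+', text):  # the Text To Columns step
        token = token.lower()                     # ignore capitalisation
        if token and token not in stop:           # the Filter and Join steps
            words.append(token)

print(Counter(words).most_common())               # the Summarise frequency count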

You can download the full workflow shown above here.

Extracting SPSS variable labels and factors with Alteryx

SPSS is a nice statistics/analytics package that, since 1968 (!), seems to have been a well-regarded program for classic statistics. It now has many new bits and pieces that target the predictive modelling market too.

In my experience it was previously mostly used in academia, especially the social sciences, but these days it seems to have made inroads into business and government data too. But it’s not yet common enough that we all have a copy, sadly, and hence I faced the challenge of how to use a large SPSS-format data file within the other tools at our disposal.

Importing the data itself was not a big issue. Both Alteryx and Tableau can read the files natively, which is great (this is a new feature in the latest release of Tableau – choose “Statistical file” in the “Connect” data box).

However, they import the data as-is, and don’t seem to be able to access all the special metadata of an SPSS file all that well. This is very fair – but quite annoying if we are talking about 300,000 questionnaire responses with 400 fields somebody else designed, all helpfully encoded and code-booked as per best data practice 🙂

My first issue was that SPSS has the concept of:

  • variable labels: e.g. a field called “QUESTION13PARTB” might have the label “What is your gender?”.
  • factors (a.k.a. value labels): the answers to that question might be saved in the data as 0 or 1, when they actually represent “female” or “male”.

My goal was to undo the variable labels and value factors so that instead of:
“QUESTION13PARTB:0” I could get “What is your gender?: Female”.

Alteryx’s default settings can get you part of the way there very easily. When used on an SPSS (*.SAV) file, the “Input Data” tool has an option to “expand factors”. If not ticked you’ll get 0 or 1 in the above example; if ticked, you get female or male. Super easy.

Input Data tool with SPSS file

But I could not find a way to extract the variable labels, i.e. translate QUESTION13PARTB into “What is your gender?”.

Alteryx Analytics Gallery to the rescue! It has a freely downloadable SPSS Input tool there which neatly slots into your Alteryx Designer software, courtesy of Neil Ryan.

Upon use you get the following options, which you can see pretty perfectly coincide with the task at hand.

SPSS input tool

The concept of SPSS variable labels is pretty much the same as Alteryx field descriptions, so the second tickbox there basically recreates the original SPSS setup within Alteryx. You will then see the field descriptions in Alteryx tools like the Select tool – or there is the first option, which literally replaces the coded field names with their SPSS descriptions, if you prefer.

The gallery does warn that this tool is not super fast. I would agree – the default Alteryx input tool is much faster – but even with a recordset in the hundreds of thousands, with a width of a few hundred columns, I only had to remain patient for about 5 minutes before it was done.

Those without Alteryx can do all of this in R (indeed that is apparently how this Alteryx tool works behind the scenes) – but it will quite possibly be a much less friendly experience if you’re coming to it from scratch! Here are some docs on the read.spss command that anyone R-curious might start from.
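And for anyone who prefers Python to R, the same metadata can be extracted with the (entirely unrelated to Alteryx) pyreadstat package. A rough sketch only, assuming a file called survey.sav:

import pyreadstat

# apply_value_formats=True swaps coded values (0/1) for their value labels
# (female/male), much like the "expand factors" option in Alteryx.
df, meta = pyreadstat.read_sav("survey.sav", apply_value_formats=True)

# meta.column_names_to_labels maps field names to variable labels,
# e.g. {"QUESTION13PARTB": "What is your gender?"}
labels = {name: label for name, label in meta.column_names_to_labels.items() if label}
df = df.rename(columns=labels)
print(df.head())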

Why version 9 will {FIX} Tableau for me (and workarounds in the mean time)

Excitement builds within the dataviz world as the next version of Tableau gets close to launch, supposedly within the next 60 days.

It has many new features, which data geeks and other fans can see previewed, dripped out piece by piece on the Tableau blog and in summaries elsewhere, but one has really caught my attention, even though it is probably on the less glamorous side of things.

The general topic is “level of detail” calculations: a new type of calculation in the forthcoming version of Tableau, which I first saw in action via the Information Lab’s wonderful preview video. All the ones I’ve read about so far (INCLUDE, EXCLUDE, ….) seem like they will be very useful indeed for such super-common analytics tasks as comparing something with the larger group of somethings it’s part of, but potentially the most thrilling to me is FIXED.

These calculations are interesting to me in that they provide a way to use data from other rows to calculate something for this row. Tableau table calculations can do some common cases of this, e.g. TOTAL, WINDOW_SUM et al., but they are not yet flexible enough for what I imagine most in-depth analysts need on a regular basis: specifically, accessing data at a pre-defined granularity not included in the visual analysis design itself – something that is neither “current row” nor “all rows” but in between.

Of course, not proclaiming to be a leading Tableau expert, it could simply be that I don’t know the easy way of doing this in Tableau today, so comments are most welcome if I am missing something obvious – but even after various training and Tableau Doctor sessions, a few questions remain unsolved in my mind. To highlight this, I’m going to steal something very similar to (but far simpler than) the example Tableau gave in their 3rd example here, about sales by customer tenure. Imagine a very simple table like this:

And now say that you want to know the total transaction value for these customers based on their first transaction date. The business question might be something to the effect of “Are our newer customers bringing in the same level of revenue as our older ones?”. This silly sample data is so small it’s trivial to calculate in one’s head, but you can imagine many real-life equivalent tables that consist of millions of customers and hundreds of millions of transactions. The sort of data structure that would make it wonderfully easy to calculate this today in Tableau would be to add another column, e.g. “first transaction date”. This could then be used as a dimension so that one can pop it on rows, colours, shapes or what have you, put transaction values into the measures, and play away.

Note the point here is that every unique customer has the same “first transaction date” in all their records, whereas the actual transaction date varies as they buy more products over time.
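For a concrete picture of the extra column I mean, here’s a tiny Python/pandas sketch with invented data – any of the tools discussed below produce the same result:

import pandas as pd

df = pd.DataFrame({
    "Customer ID":       ["A", "A", "B", "B", "B"],
    "Transaction date":  pd.to_datetime(["2001-03-01", "2003-07-15",
                                         "2002-01-10", "2002-06-30", "2004-02-02"]),
    "Transaction value": [100, 50, 75, 20, 60],
})

# Every row for a given customer gets that customer's earliest transaction date.
df["First transaction date"] = df.groupby("Customer ID")["Transaction date"].transform("min")
print(df)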

That column is very easy to calculate in Excel and its ilk. But, today, it does not seem so easy to calculate in Tableau, in aggregate, to answer seemingly simple questions like “how much did customers whose first transaction was in 2001 bring us over their lifetime so far, versus those starting in 2002?”.

To be fair, with small examples like this it’s usually easy enough to do this even in current Tableau by creating a series of Tableau calculated fields and using them in Tableau sets to represent all customers with a minimum transaction date in a certain year. I’m not going to write about that here, as hopefully I won’t need to do it ever again in the near future! I’d note though that, as well as not being as intuitive as most of Tableau is, in the real world this technique does not seem to scale at all well, for instance when you have a large number of customers who joined in many different years. I have even managed to get Tableau to tell me it’s out of memory whilst trying to do this sort of thing with too many records, which is a very rare occurrence indeed.

So personally I have tended to force myself to switch to other tools to pre-build this extra column when I need to do something like this in Tableau.

I’m all for using the right tool for the right job, and there are many great tools available. No serious data-person should restrict themselves to a single tool. However, I haven’t experienced an ad-hoc visualisation tool I like nearly as much as Tableau, so I end up spending time taking the data from Tableau out to another tool just to make a file to re-import back into Tableau.

Unquestionably, doing so interrupts the flow of analysis, and can be a proper time-sapping pain if the source of your data is, for instance, a TDE or something else that most other packages don’t natively open. Here’s a reference list of some tools and methods I’ve used to do this successfully in the past – and probably will continue using for the next 59 days or so.

Microsoft Excel: I probably don’t need to explain what this one is, to anyone who read as far down as here anyway!

Example method: many possible options, 1 of which would be this style of array formula if we are sure no transactions in the year 2099 or afterwards exist in the dataset.
{=MIN(IF(Sheet1!$A17=Sheet1!$A$13:$A$19,Sheet1!$B$13:$B$19,DATE(2099,1,1)))}

Pros:

  • Very easy to do, most analysts will be able to create a formula for this very quickly.
  • You probably already own the software.

Cons:

  • Excel cannot handle files with over 1 million rows, and it begins to struggle long, long before that.
  • Generally you will have to re-do or extend the formula when the rest of the data updates. Excel has some live-linking abilities, but I’ve never found them terribly reliable.
  • As ever with Excel, because it’s so easy to use and supremely flexible about most things, it’s easy to make a mistake, not notice and hence create inaccurate analysis. Imagine if you forgot the dollar signs in the above formula!

SQL Server (or other RDBMS):

Example method: again many possible options. 1 of which could be:

SELECT data.[Customer ID], data.[Transaction date], data.[Transaction value], data_copy.FirstTransactionDate
FROM data
LEFT JOIN
(SELECT [Customer ID], MIN([Transaction date]) AS FirstTransactionDate FROM data GROUP BY [Customer ID]) data_copy
ON data.[Customer ID] = data_copy.[Customer ID]

Pros:

  • Very often my data is coming from a SQL server anyway, so why not retrieve and manipulate it in one go?
  • Performance CAN be great over even large datasets
  • You can include data updates just by refreshing the query

Cons:

  • SQL is not a visually intuitive language. Personally I would recommend that most analysts who have access to relational databases take the time to learn at least the basics – but beyond the basics it can certainly get rather complex and take a lot of thinking and debugging time that could be better spent providing insights. Especially in the case where you are running against a database that was designed for operational or transactional reasons, and not aimed at analysts.
  • SQL servers in companies are often centrally maintained by a technology department and may be used for many non-analytical yet mission-critical tasks by many other people. Business analysts do not always have much control over those environments; for instance, I effectively have low-priority, read-only access to a lot of the ones I use.
  • For the above reason, you may need a lot of documentation to understand where to look in the database; badgering the database managers for data dictionaries, entity relationship diagrams and other complexities takes time on both sides. It might well be that such documentation doesn’t exist. And then “educated guessing” is about the best you can do.
  • For the above reasons, performance CAN be terrible depending on what you want to do
  • Accessing a remote SQL server can’t be done if you have a dodgy internet connection, or in some cases from outside the firewall (think ‘travelling’, for instance).
  • Some SQL databases or licenses may be very expensive, if you don’t already have access to one.

Powerpivot

Example method:

=CALCULATE(MIN(Table1[Transaction date]),ALLEXCEPT(Table1,Table1[Customer ID]))

Pros:

  • Is a free download
  • It fits directly into Excel, so relatively user friendly and easy for any decent Excel user, especially those who pride themselves on pivot table skills.
  • It enables Excel to handle a much bigger volume of data and introduces new functions that can be used for many useful things that standard Excel pivot tables can’t do.
  • Everything seems to be precalculated so performance can be extremely good after the initial setup of your data is done.
  • The language it uses for formulae, DAX, is for the most part pretty similar to standard Excel functions where it can be.
  • The CALCULATE() function used above uses a syntax I find very intuitive, flexible and really like a lot; possibly my favourite implementation of a feature like the ones you need to do this task.

Cons:

  • You, and anyone else who wants to manipulate your Excel file, need to have the PowerPivot add-in installed. This needs downloading, plus temporary admin rights, if you’re running Excel 2010. I have no experience with Excel 2013 variants.
  • Formulas in Powerpivot are written in “DAX” language. DAX is not a language that all that many people speak in my experience (although the basics aren’t complicated).
  • The version you can use seems dependent on other Microsoft infrastructure. For instance, I am restricted to the 2008 version due to other software we run. 7 years is a LONG time for analysis tools!
  • Behind the scenes it is creating something similar to a SSAS cube. This means that when you connect Tableau to it as a datasource you suffer the loss of many of Tableau’s finest features, as it’s effectively an OLAP cube. Many of Tableau’s competitors won’t connect to it at all.
  • The big showstopper for me: it’s in-memory software, I believe, so when I try row-scanning operations like the CALCULATE() formula above, especially with a FILTER() over millions of rows, I have consistently had “out of memory” errors. This means it whirs away for half an hour, then gives the error message and totally refuses to do what you asked. I come up against this a LOT. To be fair, maybe a version that isn’t 7 years old has far better performance, but sadly with our current infrastructure that’s not an option for me. It can handle data far bigger than native Excel without question, but it still seems to have a limit that one comes up against frequently, even when not dealing with mega-terabyte Hadoop-type stuff. I would love to see whether newer versions have helped on that front at all, as it’s a great tool for its (non-existent) cost.

Alteryx:

Example method: Alteryx workflows look like visual flow charts, so below is an example picture. One option for the crux of the matter is probably a Multi-Row Formula tool with the following formula, grouped by Customer ID, after the data is sorted by Customer ID and then transaction date.

IF ISNULL([Row-1:Transaction date]) THEN [Transaction date] ELSE [Row-1:First transaction date] ENDIF

Pros:

  • It is blazing fast. Really, really fast. I have seen instances where the exact same task that made PowerPivot crash out of memory after 20 minutes was completed successfully in under 30 seconds in Alteryx.
  • After a little experience, the flowchart interface is easy to use and I find I make fewer mistakes here than in some other tools.
  • If you need to repeat it with updated data, just re-run the Alteryx routine and you replicate the results without needing to recreate or edit anything.

Cons:

  • If you don’t already have it, and most people don’t, Alteryx isn’t the cheapest of tools. I think the “normal” version is around £2,500 per user these days. Definitely not over-priced for what the package can do – but obviously it is not cost effective if your only task really is to add a column to a table once!
  • The interface is very different from many “normal” tools, which means it might put many inexperienced users off trying it out unless they really need to.
  • Super-petty criticism, but on my computer it takes a relatively long time for Alteryx to load and get itself into a usable state; a definite “flow interrupter”.

Tableau 9 (my hope):

Example method: I didn’t get around to installing the beta test version yet, so this is assumed from reading the blog posts of many people wiser than myself – but if I understand correctly it will take just <=6 little words (depending on how you count).

{Fixed [Customer ID]: MIN([Transaction Date])} 

Pros:

  • If this is true, then you won’t have to leave Tableau – and your flow of analysis – at all to do the task.
  • It will refresh as and when the data does, and work offline with extracts or online without.
  • Typing the above in will take approximately 5 seconds or less!

Cons:

  • None come to mind, although I have one potential concern: that of performance. Functions like this, I would assume, have to scan over a lot of rows to get the result. Tableau has (almost) always seemed a very fast tool to me so I’m optimistic – and Alteryx has shown these functions can run super fast – but I’m curious to know how well these types of “level of detail” functions work over medium-large datasets (100,000s – 10,000,000s of rows). I will look forward to finding out in the not-too-distant future.

Either way this feature will surely save me, and I’m sure many many others, an incredible amount of time on very common, and quite dull, analytical tasks. I am grateful indeed.