Actually you can use variables, CTEs and other fancy SQL with Tableau after all

A few months ago, I blogged about how you can use Tableau parameters when connecting to many database datasources in order to exert the same sort of flexibility that SQL coders can build into their queries using SQL variables.

This was necessary because Tableau does not let you use SQL variables, common table expressions, temp table creation or other such fanciness when defining the data you wish to analyse, even via custom SQL. You can copy and paste a SQL query that uses those features, one that works fine in Microsoft’s SQL Server Management Studio or Oracle’s SQL Developer, into Tableau’s custom SQL datasource box, and you’ll get nothing but errors.

But it turns out that there are actually ways to use these features in Tableau, as long as you only need to run them once per session – upon connection to the database.

Why do this? Well, if you have a well-designed data warehouse or other analytically-friendly datasource hopefully you’ll never need to. However, if you’re accessing “sub-optimally” designed tables for your task, or large raw unprocessed operational data tables, or other such scenarios that consistently form one of the banes of an analyst’s life, then you might find manipulating or summarising data before you use it in Tableau makes life much easier, much faster, or both.

To do this, we’ll use the “Initial SQL” feature of Tableau. This is supported on some, but not all, of the database types Tableau can connect to. You’ll know if your data connection does support it, because an option to use it will appear if you click on the relevant connection at the top left of the data source screen.

The Initial SQL option on the data source screen

If you click that option, you’ll get a text box where you can go crazy with your SQL in order to set up and define whatever final data table you’d like to connect to. Note that you’ll be restricted to the features of your database and the permissions your login gives you – so no standard table creation if you don’t have write-permissions etc. etc. But, to take the example of SQL Server, in my experience most “normal” users can write temporary tables.

Tableau isn’t great at helping you write this initial SQL – so if you’re not super-proficient and intend to write something complicated then you might want to play with it in a friendlier SQL tool first and paste it into Tableau when you know it works, or ask a friendly database-guru to do it for you.

Below then is an example of how one can use variables, CTEs and temporary tables in order to pre-build a data table you can then analyse in Tableau, by virtue of pasting code into the initial SQL box. This code will be re-run, and hence the data refreshed, every time you open the workbook. But as the name “initial SQL” suggests, it will not be re-run every time you create or modify a chart, unless you’ve re-connected to the datasource in between.

(For what it’s worth, this uses the default demo database in SQL Server – AdventureWorks. It’s the equivalent of Tableau’s famous Superstore dataset, in the Microsoft world 🙂 )

DECLARE @granularity AS varchar(50);
SET @granularity = 'yearly';

-- A cut-down version of the sales order table, with only the columns we need
WITH OrdersCTE AS
(
    SELECT OrderDate, SalesOrderID, TotalDue
    FROM [SalesLT].[SalesOrderHeader]
)

-- Aggregate to the requested date granularity and store the result in a temp table.
-- Every CASE branch returns a varchar so the Period column has a consistent type.
SELECT
    CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar, YEAR(OrderDate))
         WHEN @granularity = 'monthly' THEN CONVERT(varchar, YEAR(OrderDate)) + '-' + CONVERT(varchar, MONTH(OrderDate))
         WHEN @granularity = 'daily' THEN CONVERT(varchar, OrderDate, 23)
    END AS Period,
    COUNT(SalesOrderID) AS CountOfSales,
    SUM(TotalDue) AS TotalDue
INTO #initialSQLDemo
FROM OrdersCTE
GROUP BY
    CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar, YEAR(OrderDate))
         WHEN @granularity = 'monthly' THEN CONVERT(varchar, YEAR(OrderDate)) + '-' + CONVERT(varchar, MONTH(OrderDate))
         WHEN @granularity = 'daily' THEN CONVERT(varchar, OrderDate, 23)
    END;

Now, this isn’t supposed to be an SQL tutorial, so I’ll not explain in detail what the above does. But for those of you already familiar with (T)SQL, you’ll note that I set a variable defining how granular I want my dates to be, use a CTE to build a cut-down version of a sales table with fewer columns, and then aggregate that sales data to the level of date I asked for in my variable, storing the result in a temporary table I called #initialSQLDemo.

In the above, it’s set to summarise sales by year. This means Tableau will only ever receive 1 row per year. You’ll not be able to drill down into more granular dates – but if this is detailed enough for what you want, then this might provide far better performance than if you connected to a table with 1 row per minute and, implicitly or explicitly, asked Tableau to constantly aggregate it in expensive ways.

Later on, perhaps I realise I need daily data. In which case, I can just change the second line above to:

SET @granularity = 'daily';

…which will expose data at a daily level to Tableau, and I’m good to go.

SQL / AdventureWorks aficionados will probably realise that my precise example is very contrived, but hopefully it’s clear how the initial SQL feature works, and hence in the real world you’ll be able to think of cases that are truly useful. Note though that if your code does something that is slow, then you will experience this slowness whenever you open your workbook.

A quick note on temporary tables:

If you’re following along, you might notice something. Although I created a temp table called #initialSQLDemo (which, being temporary, will be deleted from the database as soon as you close your session), it never appears in the Table list on the left of the Tableau data source screen.

The table list on the data source screen

Why so? Well, in SQL Server, temporary tables are created in a database called “tempdb” that Tableau doesn’t seem to show. This isn’t a showstopper though, as they’re still accessible via the “New Custom SQL” option shown above.

In my example then, I dragged New Custom SQL to the data pane and entered a simple “give me everything” query into the resulting box.
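
In case it helps, that “give me everything” query can be something as minimal as the below, using the temporary table name created in the initial SQL above:

SELECT *
FROM #initialSQLDemo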

The New Custom SQL box

Now, there is a downside in that I understand using custom SQL can reduce performance in some cases, as Tableau is limited in how it can optimise queries. But a well-designed temporary table might in any case be intrinsically faster to use than an over-sized, ill-designed permanent table.

When venturing back into Tableau’s main viz-design screen, you’ll then see that your temporary table is treated just as any other table source would be. Tableau doesn’t care that it’s temporary.

The temporary table in the main Tableau screen

We should note though that if you want to use the table in a different Tableau workbook in future, you’d have to have that workbook run the initial SQL again, as standard temporary tables are not shareable and do not persist between database sessions.

 

Workaround for the Tableau “custom SQL” restriction on SQL Server variables

SQL Server (among other databases) has a handy feature for easy re-use of queries, in the guise of variables. You can declare variables and use them repeatedly in any query in the current session. That’s pretty handy for any complicated query forms you use repeatedly, where each time you might need to change some basic criteria.

As a trivial example: here’s how to select all the customers I have that live in postcode 12345:

DECLARE @postcode varchar(5)
SET @postcode = '12345'

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

I can then run it another day for postcode 23456 by just changing the second line to

SET @postcode = '23456'

This example is a bit silly because you might as well just type 12345 or 23456 into the query itself – but imagine a more complex SQL statement, or set of coordinated SQL statements, with screenfuls of code, and you might see how it makes it much easier to maintain.
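
As a small sketch of that idea, reusing the MY_CUSTOMERS example from above, the same variable can drive several coordinated statements at once, so changing the criteria means editing just one line:

DECLARE @postcode varchar(5)
SET @postcode = '12345'

-- a summary count and the detailed rows, both driven by the same variable
SELECT COUNT(*) AS CUSTOMER_COUNT FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode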

But, sad to say, you can’t use this sort of variable as part of the custom SQL feature in Tableau. If you try, then you’ll get this slightly unhelpful message:

[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near the keyword 'DECLARE'.
[Microsoft][SQL Server Native Client 11.0][SQL Server]Incorrect syntax near ')'.

But rather than have to hardcode your queries and modify them for every extract, you can use a very simple alternative: our friend, the Tableau parameter.

For each SQL variable, create a Tableau parameter of the same data type. You can put in a default value as the “Current value” when setting it up.

Setting up the Tableau parameter

Then, when connecting to your database, paste the original SQL statement you had into the Custom SQL box, but remove the variable declaration and SET statement from the top:
SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS] WHERE POSTAL_CODE = @postcode

The Custom SQL box

One at a time, highlight each SQL variable there, i.e. the @postcode part of the above, press the “Insert Parameter” button under the text field and select the appropriate Tableau parameter.

(There is actually the option to create a new parameter at this stage, if you didn’t do so in advance).

You’ll end up with a similar looking SQL statement, but with the variables now referred to in this sort of way:

SELECT POSTAL_CODE, CUSTOMER_NO FROM [MY_CUSTOMERS]
WHERE POSTAL_CODE = <Parameters.Postcode>

And it’ll work!

You can then use the standard Tableau functionality to change the parameters in Tableau, and behind the scenes your SQL statement will update appropriately.

Note firstly, for performance, that each change of this sort of parameter is going to cause the query to be re-run on the database (unless it’s already been cached).

Secondly, if you have created a Tableau Data Extract from this datasource, then altering the parameter won’t change anything until you refresh the extract.

Finally, the Tableau parameter is simply a normal Tableau parameter, just like anything else. So there’s nothing to stop a Tableau user altering it, which depending on your setup might be either exactly what you want, or exactly what you don’t want. If the latter, you might consider keeping the datasource that involves parameters in a separate workbook from the user-facing dashboard (publishing to Tableau Server to share), or at least making sure the parameter isn’t visible on the end-user dashboards if they have no reason to change it.

Quick sidenote: I wondered if the “Initial SQL” feature might work here, to declare variables upon connection to the datasource which could then be used in custom SQL later. But sadly not; I guess it doesn’t count as the same session. If you try, you’ll see a message like:
[Microsoft][SQL Server Native Client 11.0][SQL Server]Must declare the scalar variable "@postcode"

Quick tip: find the location of a column in a SQL database

Now that we all have nice access to large, sprawling relational databases, the next problem is locating which table you need to look in to find the data you’re after. For instance, which of the 100 tables available contains my sales?

Of course the proper way would be to consult the comprehensive, accurate, well-governed and accessible data dictionary to understand the entities involved and the precise location and definition of what you need.

However, those of us cursed to live in the real world do not always (haha) have that luxury.

Fortunately, if you can guess roughly what the field name of the data you want might be, then you can often use the database’s metadata to find where to look.

In Microsoft SQL Server for instance, running the below query will show you a table of results telling you the schema, table and column name of every field that has “sales” in its name.

SELECT TABLE_SCHEMA,TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%sales%'

Notes:

  • I believe it will only show tables to which you have access. If you don’t find what you want, perhaps you don’t have permissions.
  • It looks only for the pattern you type. If you ask for “sales”, but the field is actually called “revenue”, you’re out of luck.
  • Of course it gives no clues as to the definition of the field and whether it matches what you hoped it did. For instance is it net or gross sales, how are vouchers integrated, are records ever obsoleted, and so on.
  • The query is just normal SQL, so you can use any valid “WHERE” clause to filter. The % signs there for instance are acting as wildcards; so you could remove them if you only want to see columns that are called precisely “sales” and not “product sales” for instance (see the sketch after this list).
  • Other metadata about the fields is available in that INFORMATION_SCHEMA.COLUMNS table such as datatype etc. Here’s the official documentation on the entirety of that table.
  • This is a table in the same sense as the explicit data tables – so you could always connect tools like Tableau etc. to it if you want to produce some fun, or perhaps even useful, meta-visualisations.
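
For example, here’s a sketch of a variant that drops the wildcards to look only for columns called exactly “sales”, and pulls back the data type information at the same time:

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = 'sales'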

Given the absence of any more proper clues, this might still be better than manually reviewing thousands of tables if you’re willing to accept the caveats. Good luck.

Note that although the above example is from SQL Server, a lot of other mainstream RDBMSs have similar capabilities. For example, here’s some documentation on equivalents for MySQL and Oracle.
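
As a rough sketch of what those equivalents can look like: MySQL exposes an INFORMATION_SCHEMA.COLUMNS table much like SQL Server’s, and Oracle has the ALL_TAB_COLUMNS view, so something along these lines should work (adjust to your own setup):

-- MySQL
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%sales%';

-- Oracle (identifiers are usually stored in upper case)
SELECT OWNER, TABLE_NAME, COLUMN_NAME
FROM ALL_TAB_COLUMNS
WHERE LOWER(COLUMN_NAME) LIKE '%sales%';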

 

Exporting CSV data from SQL Server Management Studio

SQL Server Management Studio is a commonly-used bit of the Microsoft SQL Server install, and a decent enough tool for browsing, querying and managing the data.

Sometimes though you might have the urge to extract a big chunk of data – most often I do this to generate a big text-file dump for import into other data analysis tools. Even when said tools can link to and query the SQL server directly themselves, I often find it useful to do this to ensure my extract is guaranteed to remain static, portable, fast, local and not subject to the whims of far-away database admins, especially for larger extracts.

There are at least 3 methods to do this. They can all produce the same results but some have more options than others. For simple CSV exports though, take your pick. They work nicely on biggish extracts too – they regularly serve me up datafiles with hundreds of millions of rows and a bunch of columns with minimal stress.

Option 1: Right clicking the data output grid

If you already ran your query such that the results are in the grid output at present, it’s possible to simply right click anywhere on the resulting data and choose “Save results as”.

Save results as

In the following box you get an option to save what was shown in the grid as a CSV or a tab delimited file.

Quick and easy, but a couple of downsides:

  • no option here to customise the delimiter or use a fixed-width text file format. This is fine with nice regular numeric data, but beware of anything with text in it that might have the dreaded extra commas or equivalent within the field contents.
  • you need to have already got the results of the query output to the grid. This is fine for small result sets but I don’t especially like to make this happen if it’s going to generate millions of rows (and if it’s really small, then copy and paste to Excel or wherever is another simple option!).

Option 2: Output the query to a file instead of the grid

If you have no need to see the results of the query in the interactive grid, why waste your time? Instead output the results of the SQL directly to a file.

The first thing to do is to go to the Query -> Query Options menu. Find the Results -> Text part of the options and check you’re happy with the delimiter, whether to include column headers, the maximum number of characters and so on.

Query options

Then, OK on that, and go back to the main SSMS screen.

Then press the toolbar button “Results to file” (or Ctrl + Shift + F).

Now when you execute your query it will prompt you for a filename. Enter in the name/location of where you want your text file and it will happily churn away until it’s finished generating it.

Of course you can open a new query tab and carry on working if you want to get stuff done whilst waiting for it to complete.

Warning: SQL Server is apt to add an annoying “count of records” line at the bottom of extracts performed in this way. This creates a mostly-blank and invalid record as far as any software importing the output is concerned. To stop this happening, make the first line of your SQL query:
SET NOCOUNT ON
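
For instance, a sketch using the AdventureWorks table from the earlier example:

SET NOCOUNT ON;

SELECT SalesOrderID, OrderDate, TotalDue
FROM [SalesLT].[SalesOrderHeader];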

Option 3: Use the “data export” task 

For the maximum levels of configuration, you can use the SQL Import and Export Wizard to get data out into CSV (or a lot of other things).

For this, in the Object Explorer window, right click on the name of the database concerned. It must be the database name; right clicking the table, columns, keys or elsewhere will not give you the correct options.

From this choose Tasks -> Export Data

Export data

It will then open up an Import/Export wizard.

Fill in the name of your SQL server and the database concerned if it isn’t already selected.

Database details

Choose the destination for your data. There are many options here, including copying it to another SQL server database if you have one dedicated to analysis or similar. For a CSV style output pick “Flat File Destination”.

Browse to tell it the filename you want it to output. Then make sure the format, column names options and so on are set as you wish. Don’t worry that you can’t change the delimiter itself on this screen; that comes later.

Destination

The next screen gives you the chance to copy data from a set of tables or views. But if you have already written some custom SQL then pick the “Write a query to specify the data to transfer” option.

You’ll get a blank box into which you can type or paste your SQL query, exactly as you’d run it in the main SSMS environment.

Once that’s done, you’ll get to a screen where you can pick both a row and column delimiter, assuming you plumped for the delimited output type. Leave “source query” as “Query”.

Destination file configuration

Here you have buttons to quickly preview your data. If you want to perform some basic manipulations, then hit the “Edit Mappings” button first.

Column mappings

Here you can set options as to whether you want to create a new CSV, or delete/append to an existing one if you happened to pick a filename that already exists.

More interestingly though, you can see all the output fields and their characteristics. Click into the grid to change anything that you don’t like – or if it got the data type wrong (which it does sometimes). If you don’t need the column at all in your output, drop down the “destination” field for that column and choose “<ignore>”.

Once you’re done there, you get to the save/run screen. Here you can either choose to run it immediately, or save the job as an SSIS package if it’s something you need to store and run or re-run in future.

Run or save option

Hitting “next” will give you a nice summary of the job you set up. Finally “Finish” will start the job going, assuming that’s the option you asked for. It will list the stages it needs to go through and you can follow whereabouts it is. Note of course if it is a large and complicated query, this stage may take a long, long time.

Hopefully you’ll end up with something like the below, and the file you wanted has been created.

Successful export

If things aren’t good though, you may get some sort of error message. Of course Google remains everyone’s best friend should one pop up that is not self-explanatory, but my experience to date is that the most common cause is an incorrect field type having been selected (often automatically…) in the “Edit Mappings” stage above. Go back and correct it if so.

Which option is best?

Surprise, surprise – it depends on your exact task.

I find option 2 is the one I use the most, as a nice compromise between speed (minimal configuration, mapping etc.) and customisation (you can pick your delimiter etc.).

But if you need any re-mapping or the ability to save the job for the future, you’ll need option 3.

Option 1 is super easy if you just want to export data that is already on your screen into a classic CSV format.

Bonus pro-tip…make sure your delimited field text doesn’t contain delimiters within its legitimate content!

Most of us have probably had the joy of receiving data in supposed CSV format that started off looking something like this:

ID Comment
1 Your site is nice
2 Your site is horrible, and I don’t like it

and was later translated into “CSV” something like:

ID,Comment
1,Your site is nice
2,Your site is horrible, and I don't like it

where the third row above has 2 commas in it – which many programs will interpret as meaning it has 3 columns when in fact it has 2.

People have a surprising propensity to fill their text-fields with commas or even pipes and dollars even when you don’t expect them to – and none of the above export methods have by default (in my experience) reliably escaped them properly to the satisfaction of other analytical tools.

If you are not bothered about keeping those extra characters, then you can just process them out within the query SQL, in which case you’re guaranteed to be safe from this CSV curse.

This for example will replace any commas with spaces.
SELECT REPLACE([name_of_your_field],',',' ') AS name_of_your_field
FROM ....
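
If the field might also contain the pipes or other troublesome characters mentioned above, the REPLACEs can simply be nested. A sketch:

SELECT REPLACE(REPLACE([name_of_your_field],',',' '),'|',' ') AS name_of_your_field
FROM ....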

Of course this solution is no good if you need to keep the delimiter characters as part of your analysis – but if you can live with removing them it works a treat.