Extracting the date and time a UUID was created with BigQuery SQL (with a brief foray into the history of the Gregorian calendar)

I was recently working with records in a database that were identified by a Universally Unique Identifier, aka a UUID. These IDs are strings of characters that look something like “31ae75f0-cbe0-11e8-b568-0800200c9a66”.

I needed to know which records were generated during a particular time period, but sadly there was no date field to be found. Unambiguously a database design flaw given what I needed to do – but it did lead me to discover that at least “version 1” UUIDs have the date and time of their creation embedded within them.

So how do we get from 31ae75f0-cbe0-11e8-b568-0800200c9a66 to the date and time the UUID was generated? I’d say “simple”, but it isn’t exactly. Thanks to Wikipedia and famkruithof.net for the behind-the-scenes info on how this works.

So, the key components of the UUID to note are, by position: the first block (characters 1–8), the second block (characters 10–13), and the last three characters of the third block (characters 16–18). The leading “1” of that third block is the UUID version number, so it gets left out.

31ae75f0-cbe0-11e8-b568-0800200c9a66

Take those parts of the UUID, reverse the order of the chunks, and stick them together, so as to get:

1e8cbe031ae75f0

There’s your “60-bit timestamp, being the number of 100-nanosecond intervals since midnight 15 October 1582” (thanks Wikipedia).
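If it helps to see that rearrangement as code before we get to the full query, here’s a minimal BigQuery sketch using the example UUID as a string literal – the character positions are the same ones the SUBSTR calls in the final query below rely on:

SELECT CONCAT(
  SUBSTR('31ae75f0-cbe0-11e8-b568-0800200c9a66', 16, 3), -- high chunk (the version digit at position 15 is skipped)
  SUBSTR('31ae75f0-cbe0-11e8-b568-0800200c9a66', 10, 4), -- middle chunk
  SUBSTR('31ae75f0-cbe0-11e8-b568-0800200c9a66', 1, 8)   -- low chunk
) AS hex_timestamp; -- returns '1e8cbe031ae75f0'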

Rambling sidenote alert:

In case you’re wondering why 15 October 1582 is so special, it’s because that was the date on which the previous “Julian” calendar system was first replaced by the newer “Gregorian” calendar system, now the most widely used calendar throughout the world, under the diktat of Pope Gregory XIII.

Why? The Julian calendar had worked on the premise that the average year is 365.25 days long (the 0.25 days being compensated for by the existence of a leap day every 4 years).

However, that’s slightly different to the true length of the solar year, which is 365 days, 5 hours, 48 minutes, 45.25 seconds. Consequently, it was noticed that there was some “drift”, whereby the dates on which the calendar said the equinoxes should occur slowly became out of sync with real-life observations of the equinox. Effectively, as the Britannica notes, this discrepancy causes the “calendar dates of the seasons to regress almost one day per century”. This is relevant to religions such as Catholicism in that it affects the calculation of, for instance, when to celebrate Easter.

The Gregorian calendar made a couple of changes, perhaps most notably introducing the rule that a century year (e.g. 1900) only counts as a leap year if its number is divisible by 400 with no remainder, instead of adhering to the Julian system where it only needs to be divisible by 4. On average, this shortens the length of the measured year by 0.0075 days, which keeps it in better sync with the reality of the solar year. It’s still not perfect, but leads to a much lower rate of drift of around 1 day per 3,030 years.

In order to account for the drift that had occurred by the time of this realisation, the Pope also advocated fast-forwarding the date by ten days to catch up with reality. So for Gregorian advocates, there were literally never any dates between 5 and 14 October 1582. Overnight, the date skipped from October 4th 1582 straight to October 15th 1582, at least in countries where this system was accepted right away (and subsequently by the relevant ISO standards that most computing systems adhere to).

Not everywhere was keen to adopt this system right away – perhaps unsurprisingly, Catholic countries were quicker to take the leap. But version 1 UUIDs presumably don’t care too much about religious politics.

End of rambling side note

Note that the base value of this timestamp, 15 October 1582, is a different date from the classic January 1st, 1970-based timestamps you may know and love from Unix-type systems, which many databases, including Google BigQuery, work with. So it needs conversion.

Let’s start by getting it into decimal (using one of many web-based converters – I picked this one for no particular reason).

1e8cbe031ae75f0 hex = 137583952401430000 decimal
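As it happens, you don’t need a web converter for this step: BigQuery can do the hex-to-decimal conversion itself if you prefix the string with 0x, which is the same trick the final query below relies on. A quick sanity check:

SELECT CAST('0x1e8cbe031ae75f0' AS INT64) AS decimal_timestamp; -- returns 137583952401430000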

This is a count of 100-nanosecond intervals. I really don’t care about nanosecond granularity for my use case, so let’s divide it by 10,000,000 to get to seconds.

137583952401430000 100-nanosecond intervals = 13758395240.1 seconds

This is now the number of seconds that have elapsed between the start of October 15 1582 and the date / time my UUID was created.

To get it into a more conventional 0 = 1970-01-01 based timestamp format, we then need to subtract from it the number of seconds between October 15 1582 and January 1st 1970 (12,219,292,800, as it happens – that’s 141,427 days of 86,400 seconds each, which is where the 141427 * 24 * 60 * 60 term in the query further down comes from):

13758395240 - 12219292800 = 1539102440
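In BigQuery terms, those last two steps look something like this:

SELECT CAST(
  137583952401430000 / 10000000  -- 100-nanosecond intervals to seconds
  - (141427 * 24 * 60 * 60)      -- rebase from 1582-10-15 to 1970-01-01
AS INT64) AS unix_seconds; -- returns 1539102440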

So now you have the number of seconds between Jan 1st 1970 and the time your UUID was generated. Pop it into a Unix timestamp interpreter (this one, for example) to translate it into something human-understandable, and you’ll discover I generated that UUID on October 9th 2018. Hooray.
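Again, BigQuery can do that last translation itself via TIMESTAMP_SECONDS(), if you’d rather not leave your SQL console:

SELECT TIMESTAMP_SECONDS(1539102440) AS extracted_timestamp; -- returns 2018-10-09 16:27:20 UTC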

Sidenote: I generated my test UUID on this site, if you ever need a valid one in a hurry.

Flicking between websites and calculators is clearly a painful and inefficient way to do this operation, especially if you have a whole column of UUIDs to decode. Luckily many databases, including BigQuery, have the necessary functions to do it en masse.

Below is some BigQuery Standard SQL code that works for me – with most thanks and credit due to Roland Bouman, whose MySQL version was nice and easy to adapt to the BigQuery SQL dialect.

In the below, imagine your UUID is stored in the field “uuid”. Pop this into your query, and the “extracted_timestamp” field will now contain the timestamp showing when the UUID was generated.

TIMESTAMP_SECONDS(
  CAST(
    (
      CAST(
        CONCAT('0x', -- starting the value with 0x will ensure BigQuery realises it's a hex string
          SUBSTR(uuid, 16, 3), -- extract relevant parts of the UUID needed to recreate the timestamp
          SUBSTR(uuid, 10, 4),
          SUBSTR(uuid, 1, 8)
        )
      AS INT64) -- convert hex to dec
      / 10000000 -- divide to get from 100-nanosecond periods to seconds
    )
    - (141427 * 24 * 60 * 60) -- rebase from start of Gregorian calendar (1582-10-15) to unixtime (1970-01-01)
  AS INT64)
) AS extracted_timestamp
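And for context, here’s roughly how that might sit inside a complete query. The project, dataset and table names below are just made-up placeholders – substitute in wherever your UUID column actually lives:

SELECT
  uuid,
  TIMESTAMP_SECONDS(
    CAST(
      CAST(CONCAT('0x', SUBSTR(uuid, 16, 3), SUBSTR(uuid, 10, 4), SUBSTR(uuid, 1, 8)) AS INT64) / 10000000
      - (141427 * 24 * 60 * 60)
    AS INT64)
  ) AS extracted_timestamp
FROM `my_project.my_dataset.my_table`; -- hypothetical table with a STRING column called uuid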

 


Actually you can use variables, CTEs and other fancy SQL with Tableau after all

A few months ago, I blogged about how you can use Tableau parameters when connecting to many database datasources in order to exert the same sort of flexibility that SQL coders can build into their queries using SQL variables.

This was necessary because Tableau does not let you use SQL variables, common table expressions, temp table creation or other such fanciness when defining the data you wish to analyse, even via custom SQL. Copy and paste a SQL query that uses those features – one that works fine in Microsoft’s SQL Server Management Studio or Oracle’s SQL Developer – into Tableau’s custom SQL datasource box, and you’ll get nothing but errors.

But it turns out that there are actually ways to use these features in Tableau, as long as you only need to run them once per session – upon connection to the database.

Why do this? Well, if you have a well-designed data warehouse or other analytically-friendly datasource, hopefully you’ll never need to. However, if you’re accessing “sub-optimally” designed tables for your task, or large raw unprocessed operational data tables, or other such scenarios that are consistently among the banes of an analyst’s life, then you might find that manipulating or summarising data before you use it in Tableau makes life much easier, much faster, or both.

To do this, we’ll use the “Initial SQL” feature of Tableau. This is supported on some, but not all, of the database types Tableau can connect to. You’ll know if your data connection does support it, because an option to use it will appear if you click on the relevant connection at the top left of the data source screen.


If you click that option, you’ll get a text box where you can go crazy with your SQL in order to set up and define whatever final data table you’d like to connect to. Note that you’ll be restricted to the features of your database and the permissions your login gives you – so no standard table creation if you don’t have write permissions, and so on. But, to take the example of SQL Server, in my experience most “normal” users can create temporary tables.

Tableau isn’t great at helping you write this initial SQL – so if you’re not super-proficient and intend to write something complicated then you might want to play with it in a friendlier SQL tool first and paste it into Tableau when you know it works, or ask a friendly database-guru to do it for you.

Below then is an example of how one can use variables, CTEs and temporary tables in order to pre-build a data table you can then analyse in Tableau, by virtue of pasting code into the initial SQL box. This code will be re-run, and hence refreshed, every time you open the workbook. But as the name “initial SQL” suggests, it will not be re-run every time you create or modify a chart, unless you’ve re-connected to the datasource in between.

(For what it’s worth, this uses the default demo database in SQL server – AdventureWorks. It’s the equivalent of Tableau’s famous Superstore dataset, in the Microsoft world 🙂 )

DECLARE @granularity AS varchar(50);
SET @granularity = 'yearly';

WITH OrdersCTE AS
(
  SELECT OrderDate, SalesOrderID, TotalDue
  FROM [SalesLT].[SalesOrderHeader]
)

SELECT
  CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar, YEAR(OrderDate))
       WHEN @granularity = 'monthly' THEN CONVERT(varchar, YEAR(OrderDate)) + '-' + CONVERT(varchar, MONTH(OrderDate))
       WHEN @granularity = 'daily' THEN CONVERT(varchar(10), CONVERT(date, OrderDate)) -- wrapped in varchar so every branch returns the same type, avoiding the CASE being implicitly typed as date
  END AS Period,

  COUNT(SalesOrderID) AS CountOfSales,
  SUM(TotalDue) AS TotalDue
INTO #initialSQLDemo
FROM OrdersCTE

GROUP BY
  CASE WHEN @granularity = 'yearly' THEN CONVERT(varchar, YEAR(OrderDate))
       WHEN @granularity = 'monthly' THEN CONVERT(varchar, YEAR(OrderDate)) + '-' + CONVERT(varchar, MONTH(OrderDate))
       WHEN @granularity = 'daily' THEN CONVERT(varchar(10), CONVERT(date, OrderDate))
  END;

Now, this isn’t supposed to be an SQL tutorial, so I’ll not explain in detail what the above does. But for those of you already familiar with (T)SQL, you’ll note that I set a variable for how granular I want my dates to be, use a CTE to build a cut-down version of a sales table with fewer columns, and then aggregate that sales table, at the date granularity I asked for in my variable, into a temporary table called #initialSQLDemo.

In the above, it’s set to summarise sales by year. This means Tableau will only ever receive 1 row per year. You’ll not be able to drill down into more granular dates – but if this is detailed enough for what you want, then it might provide far better performance than if you connected to a table with 1 row per minute and, implicitly or explicitly, asked Tableau to constantly aggregate it in expensive ways.

Later on, perhaps I realise I need daily data. In which case, I can just change the second line above to:

SET @granularity = 'daily';

…which will expose data at a daily level to Tableau, and I’m good to go.

SQL / Adventureworks aficionados will probably realise that my precise example is very contrived, but hopefully it’s clear how the initial SQL feature works, and hence in the real world you’ll be able to think of cases that are truly useful. Note though that if your code does something that is slow, then you will experience this slowness whenever you open your workbook.

A quick note on temporary tables:

If you’re following along, you might notice something. Although I created a temp table called #initialSQLDemo (which, being temporary, will be deleted from the database as soon as you close your session), it never appears in the table list on the left of the Tableau data source screen.


Why so? Well, in SQL Server, temporary tables are created in a database called “tempdb” that Tableau doesn’t seem to show. This isn’t fatal though, as they’re still accessible via the “New Custom SQL” option shown above.

In my example then, I dragged New Custom SQL to the data pane and entered a simple “give me everything” query into the resulting box.
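If you’re wondering what went into that box, it’s just a one-liner along these lines (assuming you kept the temporary table name from the code above):

SELECT * FROM #initialSQLDemo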


Now, there is a downside in that, as I understand it, using custom SQL can reduce performance in some cases, as Tableau is limited in how it can optimise queries. But a well-designed temporary table might in any case be intrinsically faster to use than an over-sized, ill-designed permanent table.

When venturing back into Tableau’s main viz-design screen, you’ll see that your temporary table is treated just as any other table source would be. Tableau doesn’t care that it’s temporary.


We should note, though, that if you want to use the table in a different Tableau workbook in the future, you’d have to have that workbook run the initial SQL again, as standard temporary tables are not shareable and do not persist between database sessions.