Extracting the date and time a UUID was created with Bigquery SQL (with a brief foray into the history of the Gregorian calendar)

I was recently working with records in a database that were identified by a Universally Unique Identifier, aka a UUID. These IDs are strings of characters that look something like “31ae75f0-cbe0-11e8-b568-0800200c9a66”.

I needed to know which records were generated during in a particular time period, but sadly there was no field about dates to be found. Unambiguously a database design flaw given what I needed to do – but it did lead me to discover that at least “version 1 UUIDs” have a date and time of creation embedded within them.

So how do we get from 31ae75f0-cbe0-11e8-b568-0800200c9a66 to the date and time the UUID was generated? I’d say “simple”, but it isn’t exactly. Thanks to Wikipedia, and famkruithof.net for the behind-the-scenes info of how this works.

So, the key components of the UUID to note by position are those highlighted below:

31ae75f0-cbe0-11e8-b568-0800200c9a66

Take the highlighted parts of the UUID aside, reversing the order of the chunks, so as to get:

1e8cbe031ae75f0

There’s your “60-bit timestamp, being the number of 100-nanosecond intervals since midnight 15 October 1582” (thanks Wikipedia).

Rambling sidenote alert:

In case you’re wondering why 15 October 1582 is so special, then it’s because that was the date that the previous “Julian” calendar system was first replaced with the newer “Gregorian” calendar system, now the most widely used calendar throughout the world, under the diktat of Pope Gregory XIII.

Why? The Julian calendar had worked on the premise that the average year is 365.25 days long (the 0.25 days being compensated for by the existence of a leap day every 4 years).

However, that’s slightly different to the true length of the solar year, which is 365 days, 5 hours, 48 minutes, 45.25 seconds. Consequently. it was noticed that there was some “drift” whereby the date the calendar noted that the equinoxes should occur slowly became out of sync with real life observations of the equinox. Effectively, as the Britannica notes, this discrepancy causes the “calendar dates of the seasons to regress almost one day per century”. This is relevant to religions such as Catholicism in that it affects the calculation of, for instance, when to celebrate Easter.

The Gregorian calendar made a couple of changes, perhaps most notably introducing the rule that a century year (e.g. 1900) only counts as a leap year if its number is divisible by 400 with no remainder, instead of adhering to the Julian system where it only needs to be divisible by 4.  On average, this shortens the length of the measured year by  0.0075 days, which keeps it in better sync with the reality of the solar year. It’s still not perfect, but leads to a much lower rate of drift of around 1 day per 3,030 years.

In order to account for the drift that had occurred by the time of this realisation, the Pope also advocated for fast forwarding the date by a few days to catch up with reality. So for Gregorian advocates, there was literally never a 14 October 1582. Overnight, the date skipped from October 4th 1582 through to October 15 1582, at least in countries where this system was accepted right away (and subsequently by the relevant ISO standards that most computing systems adhere to).

Not everywhere was keen to adopt this system right away – perhaps unsurprisingly Catholic countries were more likely to take the leap quicker. But type 1 UUIDs presumably don’t care too much about religious politics.

End of rambling side note

Note that the base value of this timestamp, 15 October 1582, is a different date than the classic  January 1st, 1970-based timestamp you may know and love from Unix-type systems, which many databases, including Google BigQuery, work with. So it needs conversion.

Let’s start by getting it into decimal (using one of many web-based converters – I picked this one for no particular reason).

1e8cbe031ae75f0 hex = 137583952401430000 decimal

This is in units of 100 nanoseconds. I really don’t care about nanosecond granularity for my use case, so let’s divide it by 10,000,000 to get to seconds.

137583952401430000  100-nanoseconds intervals = 13758395240.1 seconds

This is now the number of seconds that have elapsed between the start of October 15 1582 and the date / time my UUID was created.

To get it into a more conventional  0 = 1970-01-01 based timestamp format, we then need to subtract the number of seconds between October 15 1582 and January 1st 1970 from it (12,219,292,800, as it happens):

13758395240 - 12219292800 = 1539102440

So now you have the number of seconds since Jan 1st 1970 and the time your UUID was generated. Pop it into a Unix timestamp interpreter (this one, for example) to translate it into something human-understandable, and you’ll discover I generated that UUID on October 9th 2018. Hooray.

Sidenote: I generated my test UUID on this site, if you ever need a valid one a hurry.

Flicking between websites and calculators is clearly a painful and inefficient way to do this operation, especially if you have a whole column of UUIDs to decode. Luckily many databases, including Bigquery, have the necessary functions to do it en masse.

Below is some Bigquery Standard SQL code that works for me – with most thanks and credit due to Roland Bouman, whose MySQL code version was nice and easy to adapt to the Bigquery SQL dialect.

In the below, imagine your UUID is stored in the field “uuid”. Pop this into your query, and the “extracted_timestamp” field will now contain the timestamp showing when the UUID was generated.

TIMESTAMP_SECONDS (
CAST (
(
CAST(
CONCAT('0x', -- starting the value with 0x will ensure Bigquery realises it's a hex string
SUBSTR(uuid, 16, 3), -- extract relevant parts of the UUID needed to recreate timestamp
SUBSTR(uuid, 10, 4),
SUBSTR(uuid, 1, 8)
)
AS INT64) -- convert hex to dec
/ 10000000) -- divide to get from 100 nanoseconds periods to seconds
- (141427 * 24 * 60 * 60) -- rebase from start of Gregorian calendar (1582-10-15) to unixtime (1970-01-01)
AS INT64 )
) AS extracted_timestamp

 

Advertisements

Free dataset: all Reddit comments available for download

As terrifying a thought as it might be, Jason from Pushshift.io has extracted pretty much every Reddit comment from 2007 through to May 2015 that isn’t protected, and made it available for download and analysis.

This is about 1.65 million comments, in JSON format. It’s pretty big, so you can download it via a torrent, as per the announcement on archive.org.

If you don’t need a local copy, Reddit user fhoffa has loaded most of it into Google BigQuery for anyone to use.

If you have an account over there, then as Tableau now has a native BigQuery connector you can visualise it directly in Tableau – which Mr Hoffa has indeed done and shared with the world at Tableau Public.

Although you get a certain amount of uploading and usage from BigQuery for free, you will most likely need a paid account to integrate it directly into a Tableau (or equivalent) project like this, as you’ll want to create a BigQuery dataset to connect Tableau to.

However, if you only need to run some SQL on the freely available dataset to get some output – which you can then manually download and integrate into whatever you like – your free monthly allowance of BigQuery usage might well be enough.

Here’s the link to the data in BigQuery – at least one of the tables. You’ll see the rest in the interface on the left as per this screenshot:

BigQuery reddit data

You can then run some BigQuery SQL over it using the web interface – for free, up to a point, and retrieve whichever results you need.

For example:

SELECT * FROM [fh-bigquery:reddit_comments.2007] LIMIT 10

will give you 10 Reddit comments from (surprise surprise) 2007.

BigQuerySQL

As you can see on the bottom right, you can save results into a BigQuery table (this requires a dataset for which you need to enable billing on your BigQuery account) or download as CSV / JSON to do whatever you want with.