Retrieving Adobe SiteCatalyst data with R

Adobe SiteCatalyst (part of Adobe Analytics) is a nicely comprehensive tool for tracking user interactions upon one’s website, app and more. However, in the past I’ve had a fair amount of trouble de-siloing its potentially immensely useful data into external tools, such that I could connect, link and process it for insights over and above those you can get within the default web tool (which, to be fair, is itself improving over time).

I’ve written in the past about my happiness when Alteryx released a data connector allowing one to access SiteCatalyst data from inside their own tool. That’s still great, but the tool is necessarily constrained to the specific tasks its creator designed it to do, and is subject to the same API limits as everyone else. I have no doubt that there are ways and means to get around that in Alteryx (after all, it speaks R). But sometimes, when trying to automate operations, coding in something like R directly might actually be easier…or at least cheaper!

With that in mind, after having a recent requirement to analyse web browsing data at an individual customer level, I successfully experimented with the awesome RSiteCatalyst package, and have some notes below as to the method which worked well for me.

Note that RSiteCatalyst is naturally subject to the usual Adobe API limits – the main one that causes me sadness being the inability to retrieve more than 50k rows at a time – but, thanks to the flexibility of R and the comprehensiveness of this package, I’ve not yet encountered a problem I couldn’t work around.

So, how to set up?

Install the RSiteCatalyst package

First, open up R, or your favourite R environment, and install the RSiteCatalyst package (the first time you use it) or load the library (each new session).

if(!require(RSiteCatalyst)) install.packages("RSiteCatalyst")
library(RSiteCatalyst)

Log in to your SiteCatalyst account

You next need to authenticate against your Adobe Sitecatalyst installation, using the SCAuth function. There’s an old way involving a shared secret, and a new way using OAuth. The latter is to be preferred, but at the time I first looked at it there seemed to be a web service issue that prevented the OAuth process completing. At the moment then, I can confirm that the old method still works!

 key <- "username:company>"
 secret <- "https://sc.omniture.com/p/suite/1.3/index.html?a=User.GetAccountInfo >"
 SCAuth(key, secret)

 

Retrieve metadata about your installation

Once you’re in, there’s a plethora of commands to retrieve useful metadata, and then to run and retrieve various types of report data. For several such commands, you’ll need to know the ID of the Adobe Report Suite concerned, which is fortunately as easy as:

suites <- GetReportSuites()

whereby you’ll receive a dataframe containing all available report suites by title and ID.

If you already know the title of the report suite you’re interested in then you can grab the ID directly with something like:

my.rsid <- suites[suites$site_title=="My favourite website",1]

You can then find out which elements are available within your report suite:

elements.available <- GetElements(my.rsid)

and, later on, which metrics are available for a given element:

metrics.available <- GetMetrics(my.rsid, elements = "<<myFavouriteElement>>")

Retrieve Adobe Analytics report data

There are a few different RSiteCatalyst functions you can call, depending on the type of report you’re interested in. In each case their names start with “Queue”, as what you’re actually doing is submitting a data request to the SiteCatalyst reporting queue.

If your request is pretty quick, you can wait a few seconds and get the results sent back to you immediately in R. If it’s going to take a very long time, you can instead store a report request ID and then use the GetReport function to go back and get it later, once it’s finished.
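For illustration, the deferred version of a request looks something like the below. Note that I’m going from memory of the package documentation for the enqueueOnly argument, so do check the help pages for your version; the QueueOvertime function itself is covered properly further down.

report.id <- QueueOvertime(my.rsid,
                           date.from = "2016-01-01",
                           date.to = "2016-06-30",
                           metrics = "pageviews",
                           date.granularity = "day",
                           enqueueOnly = TRUE  # just submit the request and hand back a report ID
                           )

# ...then come back later, once the queue has done its work, and collect the results
myReport <- GetReport(report.id)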

The current list of queue functions, which are named such that Adobe aficionados will probably be able to guess which type of data they facilitate, is:

  • QueueDataWarehouse
  • QueueFallout
  • QueueOvertime
  • QueuePathing
  • QueueRanked
  • QueueSummary
  • QueueTrended

Here I’ll show just a couple of examples – but the full official documentation for all of them and much more besides is available on CRAN.

Firstly, an “over time” report to get me the daily pageview and visit counts my site received in the first half of 2016.

Here’s how the documentation describes this type of report:

“A QueueOvertime report is a report where the only granularity allowed is time. This report allows for a single report suite, time granularity, multiple metrics, and a single segment”

An example would be a day by day count of total pageviews to your site.

Remember that above we set the variable “my.rsid” to the Report Suite ID of the report suite I am looking at. So:

dailyStats <- QueueOvertime(my.rsid,
                            date.from = "2016-01-01",
                            date.to = "2016-06-30",
                            metrics = c("pageviews","visits"),
                            date.granularity = "day"
                            )

The parameters are fairly self-evident, especially in combination with the documentation. There are many more available, including the ability to isolate by segment and to forecast data.
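For illustration, restricting the same report to a single segment might look something like this – the segment.id argument is the one I believe the package uses for this, and the segment ID itself is just a made-up placeholder; yours will come from your own Adobe setup.

dailyStatsSegmented <- QueueOvertime(my.rsid,
                                     date.from = "2016-01-01",
                                     date.to = "2016-06-30",
                                     metrics = c("pageviews","visits"),
                                     date.granularity = "day",
                                     segment.id = "s300000000_example"  # placeholder - use a real segment ID from your suite
                                     )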

Your site will probably have some different metrics available to mine, depending on how the Adobe admin set it up, but the basics like page views and visits should be accessible pretty much anywhere.

What you’ll get back with the above command, i.e. the contents of the dailyStats variable after it has run, is a dataframe in this sort of format:

datetime     name               year  month  day  segment.id  segment.name  pageviews  visits
01/01/2016   Fri. 1 Jan. 2016   2016  1      1                              100        75
02/01/2016   Sat. 2 Jan. 2016   2016  1      2                              200        150
03/01/2016   Sun. 3 Jan. 2016   2016  1      3                              250        180

Which you can then go on to process as you would any other such snippet of data in R.
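For instance, a quick and entirely illustrative bit of base R to roll those daily figures up into monthly totals:

# sum the daily pageviews and visits up to one row per year/month combination
monthlyStats <- aggregate(cbind(pageviews, visits) ~ year + month,
                          data = dailyStats,
                          FUN = sum)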

My second, slightly more complex, example concerns the QueueRanked function. It’s described like this:

A QueueRanked report is a report that shows the ranking of values for one or more elements relative to a metric, aggregated over the time period selected.

That’s a bit more vague, but was useful to me in my original goal of identifying specifically which customers logged into my website in December, and how many times they visited.

A key feature of this function is that you can ask for the results from rank [x] to rank [y], instead of just the top [n].

This is super-useful where, as in my case, you expect to get over 50k rows. 50k is the maximum number of rows you can retrieve in one request via the Adobe API, which this R package uses. But R is full of the typical programming language features like loops, allowing one to iterate through the commands to retrieve, for instance, results 1–50,000, then results 50,001–100,000, then 100,001–150,000 and so on.

So, I built a loop that would generate these “ranked reports”, starting at row ‘i’ and giving me the next ‘count.step’ records, where count.step = 50000, the maximum I’m allowed to retrieve in one go.

Thus, I’d call the function repeatedly, each time asking for the next 50,000 records. At some point, when there were no more customers to download, I’d get an empty report back. At that point I know I have everything, so the loop can exit.

I wanted to retrieve the ID of the customer using the website, which in my setup is stored in a custom element called “prop1”. All that sort of detail is controlled by your Adobe SiteCatalyst administrator, so if you have a similar requirement it’s best to go and ask them which element to look in – there’s no real chance your setup is identical to mine at that level.

Nonetheless, the code pattern below could likely be used without much modification in order to iterate through any SiteCatalyst data that exceeds the API row limits.

 

count.limit <- 500000 #the max number of records we're interested in
count.step <- 50000 #how many records to retrieve per request, must not exceed 50k
count.start <- 1 #which record number to start with
CustomerList <- NULL #a variable to store the results in
fromDate <- "2016-12-01"
toDate <- "2016-12-31"

for(i in seq(1, count.limit, by = count.step)) {
  print(paste("Requesting rows",i, "through", i + count.step - 1))

  tempCustomerList <- QueueRanked(my.rsid,
                          date.from = fromDate,
                          date.to = toDate,
                          metrics = "visits",
                          elements = "prop1",
                          top = count.step,
                          start = i
                          )

  if (nrow(tempCustomerList) == 0) {   # no more rows were returned - presumably we have them all now
    print("Last batch had no rows, exiting loop")
    break
  }

  tempCustomerList$batch.start.row <- i   # record which batch each row came from

  CustomerList <- rbind(CustomerList, tempCustomerList)

}

 

After running this, you should end up with a “CustomerList” dataframe that looks something like this, where “name” is the value of prop1, in my case the customer ID:

name    url    visits    segment.id    segment.name    batch.start.row
ID123          10                                      1
ID234          5                                       1

which, again, you can process as though it was any standard type of R data.
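As a trivial and purely illustrative example, you might rank your customers by how often they visited:

# sort the combined results so the most frequent visitors appear first
CustomerList <- CustomerList[order(-CustomerList$visits), ]
head(CustomerList[, c("name", "visits")])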


Accessing Adobe Analytics data with Alteryx

Adobe Analytics (also known as Site Catalyst, Omniture, and various other names both past and present) is a service that tracks and reports on how people use websites and apps. It’s one of the leading solutions for organisations who are interested in studying how people are actually using their digital offerings.

Studying real-world usage is often far more insightful, in my view, than surveying people before or after the fact. Competitors to Adobe Analytics include Google Analytics and other such services that allow you to follow web traffic, and to answer questions ranging from the simple “how many people visited my website today?” up to “can we predict how many people from New York will sign up to my service after having clicked button x, watched my promo video and spent at least 10 minutes reading the terms and conditions?”

In their own words:

What is Adobe Analytics?
It’s the industry-leading solution for applying real-time analytics and detailed segmentation across all of your marketing channels. Use it to discover high-value audiences and power customer intelligence for your business.

I use it a lot, but until recently have always found that it suffers from a key problem. Please pardon my usage of the 4-letter “s word” but, here, at least, the Adobe digital data has always pretty much remained in a silo. Grrr!

There are various native solutions, some of which are helpful for certain use cases (take a look at the useful Excel addin or the – badly named in my opinion, and somewhat temperamental – “data warehouse” functionality for instance). We have also had various technology teams working on using native functionality to move data from Adobe into a more typical and accessible relational database, but that seems to be a time-consuming and resource-intensive operation to get in place.

So none of the above solutions has yet really met my need to extract reasonably large volumes of data quickly and easily, on an ad-hoc basis, for integration with other data sources in a refreshable manner. And without that, in a world that moves ever more towards digital interactions, it’s hard to get a true overall view of your customers’ engagement.

So, imagine how the sun shone and the angels sang in my world when I saw the Alteryx version 10.5 release announcement.

…Alteryx Analytics 10.5 introduces new connectors to Google Sheets, Adobe Analytics, and Salesforce – enhancing the scope of data available for analytic insights

I must admit that I had had high hopes that this would happen, insomuch as when looking at the detailed schedule agenda for this year’s Alteryx Inspire conference (see you there?) I noticed that there was mention of Adobe Analytics within a session called “How to process and visualise data in the cloud”. But yesterday it actually arrived!

It must be said that the setup is not 100% trivial, so below I have outlined the process I went through to get a successful connection, in case it proves useful for others to know.

Firstly, the Adobe Analytics data connector is not actually automatically installed, even when you install even the full, latest version of Alteryx. Don’t let this concern you. The trick is, after you have updated Alteryx to at least version 10.5, is to go and download the connector separately from the relevant page of the Alteryx Analytics gallery. It’s the blue “Adobe Analytics install” file you want to save to your computer, there’s no need to press the big “Run” button on the website itself.

(If you don’t already have one, you may have to create an Alteryx Gallery user account first, but that’s easy to do and free of charge, even if you’re not an Alteryx customer. And whilst you’re there, why not browse through the manifold other goodies it hosts?).

You should end up with a small file called “AdobeAnalytics.yxi” on your computer. Double click that, Alteryx will load up, and you’ll go through a quick and simple install routine.

Once you’ve gone through that, check on your standard Alteryx “Connectors” ribbon and you should see a new tool called “Adobe Analytics”.

Just like any other Alteryx tool you can drag and drop that into your workflow and configure it in the Configuration pane. Once configured correctly, you can use it in a similar vein to the “Input data” tool.

The first thing you’ll need to configure is your sign-in method, so that Alteryx becomes authorised to access your Adobe Analytics account.

This isn’t necessarily as straightforward as with most other data connectors, because Adobe offers a plethora of different types of account or means of access, and it’s quite possible the one that you use is not directly supported. That was the case for me at least.

Alteryx have provided some instructions as to how to sort that out here. Rather than use my standard company login, instead I created a new Adobe ID (using my individual corporate email address), logged into marketing.adobe.com with it, and used the “Get access” section of the Adobe site to link my company Adobe Analytics login to my new Adobe ID.

That was much simpler than it sounds, and you may not need to do it if you already have a proper Adobe ID or a Developer login, but that’s the method I successfully used.

Then you can log in, via the tool’s configuration panel.

Once you’re happily logged in (using the “User login” option if you followed the same procedure as I did above), you get to the juicy configuration options to specify what data you want your connector to return from the Adobe Analytics offerings.

Now a lot of the content of what you’ll see here is very dependent on your Adobe setup, so you might want to work with the owner of your Adobe install if it’s not offering what you want, unless you’re also multitasking as the Adobe admin.

In essence, you’re selecting a Report Suite, the metrics (and dimensions, aka “elements”) you’re interested in, the date range of significance and the granularity. If you’re at all familiar with the Adobe Analytics web interface, it’s all the same stuff with the same terminology (but, if it offers what you want, so much faster and more flexible).

Leave “Attempt to Parse Report” ticked, unless for some reason you prefer the raw JSON the Adobe API returns instead of a nice Alteryx table.

Once you’ve done that, then Alteryx will consider it as just another type of datasource. The output of that tool can then be fed into any other Alteryx tool – perhaps start with a Browse tool to see exactly what’s being returned from your request. And then you’re free to leverage the extensive Alteryx toolkit to process, combine, integrate, analyse and model your data from Adobe and elsewhere to gain extra insights into your digital world.

Want an update with new data next week? Just re-open your workflow and hit run, and see the latest data flow in. That’s a substantial time and sanity saving improvement on the old-style battle-via-Excel to de-silo this data, and perhaps even one worth buying Alteryx for alone if you do a lot of this!

Don’t forget that with the Alteryx output data tool, and the various enhanced output options including the in-database tools and Tableau output options from the latest version, you could also use Alteryx simply to move data from Adobe Analytics to some other system, whether for visualisation in Tableau or integration into a data warehouse or similar.

A use case might simply be to automatically push up web traffic data to a datasource hosted in Tableau Server for instance, so that any number of your licensed analysts can use it in their own work. You can probably find a way to do a simple version of this “for free” using the native Adobe capabilities if you try hard enough, but anything that involves a semblance of transform or join, at least in our setup, seems far easier to do with external tools like Alteryx.

Pro-tip: frustrated that this tool, like most of the native ones, restricts you to pulling data from one Adobe Report Suite at a time? Not a problem – just copy and paste the workflow once for each report suite and use an Alteryx Union tool to combine the results into one long table.

Here’s screenshots of an example workflow and results (not from any real website…) to show that in action. Let’s answer a simple question: how many unique visitors have we had to 2 different websites, each represented by a different report suite, over the past week?

[Screenshots: example workflow and results]

Performance: in my experience, although Adobe Analytics can contain a wealth of insightful information, I’ve found the speed of accessing it to be “non-optimal” at times. The data warehouse functionality for instance promises/threatens that:

Because of the complexity of Data Warehouse reports, they are not immediately available, can take up to 72 hours to generate, and are accessible via email, FTP or API delivery mechanisms.

The data warehouse functionality surely allows complexity that’s an order of magnitude beyond what a simple workflow like this does, but just for reference, this workflow ran in about 20 seconds. Pulling equivalent data for 2 years took about 40 seconds. Not as fast as you’d expect a standard database to perform, but still far quicker than making a cup of tea.

Sidenote: the data returned from this connector appears to come in string format, even when it’s a column of a purely numeric measure. You might want to use a Select tool or other method in order to convert it to a more processable type if you’re using it in downstream tools.

Overall conclusion: HOORAY!