Retrieving Adobe SiteCatalyst data with R

Adobe SiteCatalyst (part of Adobe Analytics) is a nicely comprehensive tool for tracking user interactions upon one’s website, app and more. However, in the past I’ve had a fair amount of trouble de-siloing its potentially immensely useful data into external tools, such that I could connect, link and process it for insights over and above those you can get within the default web tool (which, to be fair, is itself improving over time).

I’ve written in the past about my happiness when Alteryx released a data connector allowing one to access SiteCatalyst data from inside their own tool. That’s still great, but the tool is necessarily constrained to the specific tasks its creator designed it to do, and subject to the same API limits as everyone else is. I have no doubt that there are ways and means to get around that in Alteryx (after all, it speaks R). But sometimes, when trying to automate operations, coding in something like R might actually be easier…or at least cheaper!

With that in mind, after having a recent requirement to analyse web browsing data at an individual customer level, I successfully experimented with the awesome RSiteCatalyst package, and have some notes below as to the method which worked well for me.

Note that RSiteCatalyst is naturally subject to the usual Adobe API limits – the main one that causes me sadness being the inability to retrieve over 50k rows at a time – but, due to the incessant flexibility of R and the comprehensiveness of this package, I’ve not encountered a problem I couldn’t solve just yet.

So, how to set up?

Install the RSiteCatalyst package

First, open up R, or your favourite R environment, and install the RSiteCatalyst package (the first time you use it) or load the library (each new session).

if(!require(RSiteCatalyst)) install.packages("RSiteCatalyst")
library(RSiteCatalyst)

Log in to your SiteCatalyst account

You next need to authenticate against your Adobe SiteCatalyst installation, using the SCAuth function. There’s an old way involving a shared secret, and a new way using OAuth. The latter is to be preferred, but at the time I first looked at it there seemed to be a web service issue that prevented the OAuth process completing. At the moment then, I can confirm that the old method still works!

key <- "<username:company>"   # replace with your own username:company credentials
secret <- "<shared secret>"   # your shared secret, found via https://sc.omniture.com/p/suite/1.3/index.html?a=User.GetAccountInfo
SCAuth(key, secret)

 

Retrieve metadata about your installation

Once you’re in, there’s a plethora of commands to retrieve useful metadata, and then to run and retrieve various types of report data. For several such commands, you’ll need to know the ID of the Adobe Report Suite concerned, which is fortunately as easy as:

suites <- GetReportSuites()

whereby you’ll receive a dataframe containing all available report suites by title and ID.

If you already know the title of the report suite you’re interested in then you can grab the ID directly with something like:

my.rsid <- suites[suites$site_title=="My favourite website",1]

You can then find out which elements are available within your report suite:

elements.available <- GetElements(my.rsid)

and, later on, which metrics are available for a given element:

metrics.available <- GetMetrics(my.rsid, elements = "<<myFavouriteElement>>")

Retrieve Adobe Analytics report data

There are a few different RSiteCatalyst functions you can call, depending on the type of report you’re interested in. In each case they start with “Queue”, as what you’re actually doing is submitting a data request to the SiteCatalyst reporting queue.

If your request is pretty quick, you can wait a few seconds and get the results sent back to you immediately in R. If it’s going to take a very long time, you can instead store a report request ID and then use the GetReport function to go back and get it later, once it’s finished.
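As a minimal sketch of that “come back later” pattern – using QueueOvertime (one of the functions listed below) as the example, and assuming your version of the package supports the enqueueOnly argument on the Queue* functions – it looks something like this:

report.id <- QueueOvertime(my.rsid,
                           date.from = "2016-01-01",
                           date.to = "2016-06-30",
                           metrics = "pageviews",
                           date.granularity = "day",
                           enqueueOnly = TRUE)  # returns a report ID instead of waiting for the data

# ...go away and do something else, then collect the finished report later
dailyStats <- GetReport(report.id)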

The current list of queue functions, which are named such that Adobe aficionados will probably be able to guess which type of data they facilitate, is:

  • QueueDataWarehouse
  • QueueFallout
  • QueueOvertime
  • QueuePathing
  • QueueRanked
  • QueueSummary
  • QueueTrended

Here I’ll show just a couple of examples – but the full official documentation for all of them, and much more besides, is available on CRAN.

Firstly, an “over time” report to get me the daily pageview and visit counts my site received in the first half of 2016.

Here’s how the documentation describes this type of report:

“A QueueOvertime report is a report where the only granularity allowed is time. This report allows for a single report suite, time granularity, multiple metrics, and a single segment”

An example would be a day by day count of total pageviews to your site.

Remember that above we set the variable “my.rsid” to the Report Suite ID of the report suite I am looking at. So:

dailyStats <- QueueOvertime(my.rsid,
                            date.from = "2016-01-01",
                            date.to = "2016-06-30",
                            metrics = c("pageviews","visits"),
                            date.granularity = "day"
                            )

The parameters are fairly self-evident, especially in combination with the documentation. There are many more available, including the ability to isolate by segment and to forecast data.
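For example, here’s a rough sketch of a segmented version of the same report. Note that the segment ID below is purely a placeholder – GetSegments() should list the real segment IDs available in your own report suite:

segments <- GetSegments(my.rsid)  # look up the segment IDs available to you

dailyStatsMobile <- QueueOvertime(my.rsid,
                                  date.from = "2016-01-01",
                                  date.to = "2016-06-30",
                                  metrics = c("pageviews", "visits"),
                                  date.granularity = "day",
                                  segment.id = "s300000000_mobileVisitors")  # hypothetical segment ID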

Your site will probably have some different metrics available to mine, depending on how the Adobe admin set it up, but the basics like page views and visits should be accessible pretty much anywhere.

What you’ll get back with the above command, i.e. the contents of the dailyStats variable after it has run, is a dataframe in this sort of format:

datetime     name               year  month  day  segment.id  segment.name  pageviews  visits
01/01/2016   Fri. 1 Jan. 2016   2016  1      1                              100        75
02/01/2016   Sat. 2 Jan. 2016   2016  1      2                              200        150
03/01/2016   Sun. 3 Jan. 2016   2016  1      3                              250        180

Which you can then go on to process as you would any other such snippet of data in R.
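For instance – assuming the column names match the dataframe above – a couple of quick base R manipulations might look like:

colSums(dailyStats[, c("pageviews", "visits")])        # total pageviews and visits for the half year
aggregate(pageviews ~ month, data = dailyStats, sum)   # pageviews per month
plot(dailyStats$datetime, dailyStats$pageviews,
     type = "l", xlab = "Date", ylab = "Pageviews")    # quick look at the daily trend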

My second, slightly more complex, example concerns the QueueRanked function. It’s described like this:

A QueueRanked report is a report that shows the ranking of values for one or more elements relative to a metric, aggregated over the time period selected.

That’s a bit more vague, but was useful to me in my original goal of identifying specifically which customers logged into my website in December, and how many times they visited.

A key feature of this function is that you can ask for the results from rank [x] to rank [y], instead of just the top [n].

This is super-useful where, as I did, you expect to get over 50k rows. 50k is the maximum number of rows you can retrieve in one request via the Adobe API, which this R package uses. But R is full of typical programming language features like loops, thus allowing one to iterate through the commands to retrieve, for instance, results 1-50,000, then results 50,001-100,000, then 100,001-150,000 and so on.

So, I built a loop that would generate these “ranked reports”, starting at row ‘i’ and giving me the next ‘count.step’ records, where count.step = 50000, the maximum I’m allowed to retrieve in one go.

Thus, I’d call the function repeatedly, each time asking for the next 50,000 records. At some point, when there were no more customers to download, I’d get a blank report sent back to me. At that point, I know I have everything, so I quit the loop.

I wanted to retrieve the ID of the customer using the website, which in my setup is stored in a custom element called “prop1”. All that sort of detail is controlled by your Adobe SiteCatalyst administrator, should you have exactly the same sort of requirement as I did – so best go ask them which element to look in, as there’s no real chance your setup is identical to mine at that level.

Nonetheless, the code pattern below could likely be used without much modification in order to iterate through any SiteCatalyst data that exceeds the API row limits.

 

count.limit <- 500000 #the max number of records we're interested in
count.step <- 50000 #how many records to retrieve per request, must not exceed 50k
count.start <- 1 #which record number to start with
CustomerList <- NULL #a variable to store the results in
fromDate <- "2016-12-01"
toDate <- "2016-12-31"

for(i in seq(1, count.limit, by = count.step)) {
  print(paste("Requesting rows",i, "through", i + count.step - 1))

  tempCustomerList <- QueueRanked(my.rsid,
                          date.from = fromDate,
                          date.to = toDate,
                          metrics = "visits",
                          elements = "prop1",
                          top = count.step,
                          start = i
                          )

  if  (nrow(tempCustomerList) == 0 ) {   # no more rows were returned - presumably we have them all now
    print("Last batch had no rows, exiting loop")
    break
  }
   
  tempCustomerList$batch.start.row <- i

  CustomerList <- rbind(CustomerList, tempCustomerList)

}

 

After running this, you should end up with a “CustomerList” dataframe that looks something like this, where “name” is the value of prop1, in my case the customer ID:

name    url  visits  segment.id  segment.name  batch.start.row
ID123        10                                1
ID234        5                                 1

which, again, you can process as though it was any standard type of R data.
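As a quick sketch of the sort of sanity checks I mean, assuming the column names shown above:

nrow(CustomerList)                    # how many customer rows came back in total
sum(duplicated(CustomerList$name))    # check no customer ID was returned in more than one batch
head(CustomerList[order(-CustomerList$visits), c("name", "visits")], 10)  # top 10 most frequent visitors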

Clustering categorical data with R

Clustering is one of the most common unsupervised machine learning tasks. In Wikipedia‘s current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some ability to cluster built in. For example, Alteryx has K-Centroids Analysis; R, Python, SPSS, Statistica and any other proper data-sciencey tools all likely have many methods – and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility. You can do the calculations in Excel, should you really want to (although why not cheat and use a nice add-in if you want to save time?).

However, many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better suited to clustering objects that have quantitative numeric fields, rather than those that are categorical. I’m not going to delve into the details of why here, but, simplistically, they tend to be based on concepts like Euclidean distance – and in that domain, it’s conceptually difficult to say that [bird] is Euclideanly “closer” to [fish] than [animal]; vs the much more straightforward task of knowing that an income of £100k is nearer to one of £90k than it is to 50p. IBM has a bit more about that here.

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has a ton of libraries that might help you out. Below are a couple I’ve used, and a few notes as to the very basics of how to use them – not that it’s too difficult once you’ve found them. The art of selecting the optimum parameters for the very finest of clusters though is still yours to master, just like it is on most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see 🙂 ? A paper called ‘Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values‘ by Huang gives the gory details.

Luckily though, an R implementation is available within the klaR package. The klaR documentation is available in PDF format here and is certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID  FieldA  FieldB  FieldC  FieldD
1         0       0       0       1
2         0       0       0       0
3         0       0       0       1
4         1       1       0       0

Here’s how you might read it in, and cluster the records based on the contents of fields “FieldA”, “FieldB”, “FieldC”, and “FieldD”.

install.packages("klaR")
library(klaR)
setwd("C:/Users/Adam/CatCluster/kmodes")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <- kmodes(data.to.cluster[, 2:5], 3, iter.max = 10, weighted = FALSE) # don't use the record ID as a clustering variable!

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want so you have a few variations to compare the quality or real-world utility of.
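If you do want to compare a few values, one rough approach – sketched below, not an official recipe – is to rerun kmodes over a range of cluster counts and compare the total within-cluster simple-matching distance each one produces (lower meaning tighter clusters).

set.seed(123) # kmodes picks random initial modes, so fix the seed for repeatable results
total.withindiff <- sapply(2:6, function(k) {
  sum(kmodes(data.to.cluster[, 2:5], k, iter.max = 10, weighted = FALSE)$withindiff)
})
names(total.withindiff) <- 2:6
total.withindiff # total within-cluster simple-matching distance for k = 2 to 6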

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables
    in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a
    number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted version of this distance.

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster

Here’s an example of what it looks like when output to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
 FieldA FieldB FieldC FieldD
1 1 0 0 0
2 1 0 1 1
3 0 0 0 0

Clustering vector:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
 3 3 3 1 3 1 2 3 3 3 2 2 2 3 3 2 1 3 3 3

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters onto the original dataset, you can just add the cluster back onto your original dataset as a new column, and perhaps write it out as a file to analyse elsewhere, like this:

cluster.output <- cbind(data.to.cluster,cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)

 

The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again, a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters reflecting how the algorithms differ.

Here’s what you’d do to cluster the same data as above, and write it back out, this time with the Rock clusters appended. Note here that ideally you’re specifically passing in a matrix to the rockCluster function.

install.packages("cba")
library(cba)
setwd("C:/Users/Adam/CatCluster/rock")
data.to.cluster <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-rockCluster(as.matrix(data.to.cluster[,2:5]), 3 )
cluster.output <- cbind(data.to.cluster,cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)

The full list of parameters to the relevant function, rockCluster, is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.

This is the output, which is of class “rock”, when printed to the screen:

data: x 
 beta: 0.5 
theta: 0.5 
 fun: dist 
 args: list(method = "binary") 
 1 2 3 
14 5 1

The object is a list, and its most useful component is probably “cl”, which is a factor containing the cluster assignments for your data.
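For instance, a couple of quick ways to inspect it:

table(cluster.results$cl)                 # how many records fell into each cluster
barplot(table(cluster.results$cl),
        xlab = "Cluster", ylab = "Records")  # a very quick visual check of cluster sizes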

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re free to visualise in R, or any other tool.

Kruskal Wallis significance testing with Tableau and R

Whilst Tableau has an increasing number of advanced statistical functions – a case in point being the newish analytics pane from Tableau version 9 – it is not usually the easiest tool to use to calculate any semi-sophisticated function that hasn’t yet been included.

Various clever people have tried to work some magic around this, for instance by attempting “native” Z-testing. But in general, it’s sometimes a lot of workaroundy effort for little gain. However, a masterstroke of last year’s Tableau version was that it included the ability to interface with R.

R is a statistical programming language that really does have every statistical or modelling function you’ll ever need. When someone invents new ones in the future, I have no doubt an R implementation will be forthcoming. Its flexibility and comprehensiveness (and price…£0) are incredible. However it is not easy to use for the average person. For a start, the default interface is a command line, and – believe it or not – there are people alive today who never had the singular joy of the command line being the only way to use a computer.

It must be said that in some ways Tableau has not really made R super easy to use. You still need to know the R language in order to use R functionality inside Tableau. But using the Tableau interface into R has 2 huge benefits.

  1. It integrates into your existing Tableau workspace. The Tableau interface will push any data you have in Tableau, with all your filters, calculations, parameters and so on, to R and wait for a result. There’s no need to export several files with different filters etc. and read.csv(“myfile1.csv”) them all into R. Once you’ve set up the R script, it reruns automatically whenever your data changes in Tableau.
  2. It visualises the results of the R function immediately and dynamically, just like any other Tableau viz. R has a lot of cool advanced graphics stuff, but it can be hard to remember quite how to use, and most of the basic components offer no interactivity. No drag and drop pills or “show me” to be seen there!

So recently when I needed to perform a Kruskal Wallis significance test over data I was already analysing in Tableau, this seemed the obvious way to do it.

Kruskal Wallis is fully explained in Wikipedia. To over-simplify for now, it allows you to detect whether there is a difference in data when it comes from multiple groups – like ANOVA more famously does, but KW doesn’t require the data to be parametric.

This is classic hypothesis testing. With my questionnaire, we start with the hypothesis that there is no difference in the way that different groups answer the same question, and see if we can disprove that.

In honesty, the fact it’s a KW test is not all that important to this post. Using R to perform a certain test in Tableau is pretty much the same as it would be for any other test, aside from needing to know the name and syntax of the test in R. And Google will never let you down on that front.

In this case, I wanted to analyse questionnaire results scored on a Likert-type scale to determine, on several dimensions, whether the answering patterns were significantly different or not. For instance, is there a real difference in how question 1 is answered based on the respondents’ age? Is there a real difference in how question 2 is answered based on the respondents’ country of origin? How confident can we be that any difference in averages truly represents a real difference and is not the result of random noise?

I wanted to do this for about 30 questions over 5 different dimensions; a total of 150 Kruskal-Wallis tests, and I didn’t overly fancy doing it manually.

Had I got only a copy of pure R to hand, this might be how I would have tackled the question – assuming that I had output the results of question 1, dimensioned by group, to a CSV file beforehand in a format like this:

RespondentID  Group    Score
R1            Group A  4
R2            Group B  5
R3            Group B  5
R4            Group A  3
R5            Group C  1


setwd('../desktop/questionnaire')
q1_data<-read.csv('q1_responses.csv')
q1_dataframe<-setNames(data.frame(q1_data[3], q1_data[2]), c('scores','groups'))
kruskal.test(scores ~ groups, data=q1_dataframe)

This would give me output like the below.


Kruskal-Wallis rank sum test
data: scores by groups
Kruskal-Wallis chi-squared = 2.8674, df = 2, p-value = 0.2384

To understand the syntax and what it returns, see the man page for kruskal.test here.

But to pick a singular point, it shows that the “p-value”, the probability of the differences in scores between the groups being purely down to the random fluctuations of chance, is 0.2384, aka 24%. Numbers above 5-10% are often held to mean that any difference seen has not been statistically proven. Here there’s about a 1 in 4 chance that any differences in these answers are purely due to random chance.

There are endless discussions/controversies as to the use of P values, and statistical vs “clinical” differences – but this is the end point we are aiming for right now.

I could then have done the above process for the other 149 combinations I wanted to test and constructed a table of P results (or scripted to do the same), but it seemed more fun to use Tableau.
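(For the record, the scripted pure-R version might have looked something like the sketch below – assuming, purely for illustration, one CSV per question named q1_responses.csv through q30_responses.csv in the format shown above, and a single grouping dimension.)

# Hypothetical file names - adjust to however your exports are actually named
p.values <- sapply(1:30, function(q) {
  responses <- read.csv(sprintf("q%d_responses.csv", q))
  responses <- setNames(data.frame(responses[3], responses[2]), c("scores", "groups"))
  kruskal.test(scores ~ groups, data = responses)$p.value
})
data.frame(question = 1:30, p.value = p.values)  # one p value per question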

First you have to set up Tableau to interface with R correctly. This means running an R server (called Rserve, handily enough), which is of course also free, and can be started from within the R interface if you have it open. Here are Tableau’s handy instructions, but it can be boiled down to:
install.packages("Rserve")
library(Rserve)
Rserve()

A few notes:

  • It’s fine to run the Rserve on the same computer as Tableau is on.
  • You only have to enter the first line the first time you use Rserve.
  • Capitalisation of Rserve matters for some commands!
  • If you are going to use Rserve a lot more than you use R you can start it from an exe file instead, as detailed in the above Tableau article.
  • Workbooks using R will not work in Tableau Reader or Tableau Online.

Then carry on the process of setting up R as per the Tableau article. Notably you have to look in the Help menu (a strange place for it perhaps) for “Settings and performance” then “Manage R connection” and then type in the name of your Rserve host. If you’re running it on your computer then choose “localhost” with no user/password.

[Image: R connection]

Next, assuming you have your Tableau workbook already set up with your questionnaire data, create a new worksheet which filters to show only a single question, and has each unique person that answered that question and the grouping variable on the rows. This layout is important, as will be discussed later.

[Image: Layout]

Now it’s time to write the formula that interfaces between Tableau and R.

This Tableau blog post explains in detail how to do that in general – but the essence is to use a calculated field with one of the Tableau SCRIPT_ functions that corresponds to the type of result (string, integer etc.) that you want to retrieve from R. You then pass it literal R code, using placeholder variables .arg1, .arg2, ….argN to tell it which data you want Tableau to send to R.

The calculated field will then send the R script, with the relevant data inserted as R vectors into the placeholders, to the R server, and wait for a result. The result can then be used inside Tableau just as any other table calculation can be.

For the Kruskal Wallis test, when we retrieve the P value from a KW test, we saw above that it is in a decimal number format. This is what Tableau calls a “real” number, so we use the SCRIPT_REAL function.

Within that function we enter R code resembling the last line in my R-only attempt, which was the one that actually did the calculations.

You can put more than one R code line in here if you want to. However note that a Tableau R function can only handle returning a single value or vector by default. Here I have therefore specifically asked it to return just the p value as it would be hard to handle the general table of results and associated text. Read the R documentation for any function to understand which other types of value are possible to return from that function.

So, in Tableau, here’s the calculated field to create.

Kruskal Wallis p score:


SCRIPT_REAL(
"
kruskal.test(.arg1~.arg2, data=data.frame(.arg1,.arg2 ))$p.value
",
AVG([Score]),ATTR([Group]))

If you compare it to my first R script, you can see that I want it to replace “.arg1” with the first argument I gave after the R script (AVG([Score])) and “.arg2” with the second (ATTR([Group])). It’s safe, and necessary, to use ATTR() as you have to pass it an aggregation. Because we set the layout to be 1 row per respondent, we know each row will have only a single group associated with that respondent (assuming the relationship between respondents and groups is 1:many, which it needs to be for a successful KW test).

You could use .arg3, .arg4… etc. if you needed to pass more data to R, but that’s not necessary for the Kruskal Wallis test.

You can then double click your new calculated field and it’ll appear in the text pill and display the p value we saw above (remember that 0.2384 number from above? Hooray, it matches!) repeatedly, once per respondent. The nature of the KW test is that it returns one value that describes the whole population.

[Image: KW test]

This being Tableau, we can filter to show another single question (see below), change the respondent pool, highlight, exclude and do all sorts of other things to the data, and the new P value for the given breakdown is immediately shown (subject to the rules of table calculations…remember that SCRIPT_ functions are table calculations!)

But what if we want the Kruskal P value for all the questions in the survey in one table, and not have to iterate manually through every question?

We can’t just unfilter or add questions willy-nilly, because a single p value will be calculated for the whole dataset of ( question 1 + question 2 ) responses, instead of one being calculated for ( question 1) and another for ( question 2 ). This is unlikely to be useful.

[Image: KW whole population]

See how there’s no 0.2384 value in the above. Instead the same number is repeated for every single question and respondent. It compares the scores as though they were all for the same question.

By virtue of the SCRIPT_ Tableau functions being a table calculation though, this one is relatively easy to solve.

Right clicking on the R calculation field gives the normal Edit Table Calculation option. Pick that, and then select “Advanced” under Compute Using.

Simplistically, we want to partition the P values that are calculated and returned to 1 per survey question, so put “Question” in the partition box and the other fields into the addressing box.

[Image: Advanced compute by]

Hit OK, Tableau refreshes its R stuff, and all is good. That 0.2384 is back (and the 0 for question two is the result of rounding a number so small that the p value shows significance at all practical levels).

[Image: 1 row per response]

However, we’re getting 1 row for every respondent to every question instead of a nice summary of one row per question.

We can’t just remove the respondent ID from the Tableau view because, whilst this gives the layout we want, the SCRIPT_REAL calculation will pass only the aggregated values of the respondent data to R, i.e. one average value per question in this case, not 1 per person answering the question.

This will give either an error or a wrong answer, depending on which R function you’re trying to call. In this case it would be passing the average response to the question to R to perform the KW test on, whereas R needs each individual response to the question to provide its answers.

The solution to this issue is the same as the classic trick they teach in Tableau Jedi school that allows data to be dynamically hidden yet remain within scope of a table calculation. Namely that when you use a table calculation as a filter, the calculations are mostly done before a filter is applied, in contrast to a non-table calculation where the calculations are done on the post-filtered data.

So, create a simple dummy table calculation that returns something unique for the first row. For instance, create a calculated field just called “Index” that has the following innocuous function in it.


INDEX()

Now if you put that on filters and set it to only show result “1” (the first result) you will indeed get 1 result shown – and the 0.2384 P value remains correct.

[Image: 1 row]

But we actually wanted the first result per question, right, not just the single first result?

Luckily our index function is also a table calculation, so you can do exactly the same compute-by operation we did above to our R-based field.

Right click the index filter pill, go back to “Edit table calculation”, and fill the Compute Using -> Advanced box in in the same way as we previously did to our R calculation function, putting “Question” into the partition box and the rest into addressing.

Tada!

[Image: 1 row per question]

In the interests of neatness we can hide the useless respondent ID and group columns as they don’t tell us anything useful. Right click the respective Rows pill and untick “Show header” to turn the display off, whilst leaving it in the pills section and hence the level of detail.

[Image: Clean table]

Bonus tips:

  • You can send the values of Tableau parameters through the R function, controlling either the function itself or the data it sends to R – so for instance allowing a novice user to change the questions scored or the groups they want to compare without having to mess with any calculated fields or filters.
  • The R-enabled calculation can be treated as any other table calculation is in Tableau – so you can use it in other fields, modify it and so on. In this example, you could create a field that highlights questions where the test showed the differences were significant at the 95% level, like this.

Highlight 95% significance:


IF [Kruskal Wallis test p value] <= 0.05 THEN "Significant" ELSE "Not significant" END

Pop it on the colours card and you get:

[Image: Highlight significance]

Now it’s easy to see that, according to Kruskal-Wallis, there was a definite difference in the way that people answered question 2 depending on which group they were in, whereas at this stringent level, there was no proven difference in how they answered question 1.

You can download the complete workbook here.