Clustering categorical data with R

Clustering is one of the most common unsupervised machine learning tasks. In Wikipedia‘s current words, it is:

the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups

Most “advanced analytics” tools have some ability to cluster in them. For example, Alteryx has K-Centroids AnalysisR, Python, SPSS, Statistica and any other proper data sciencey tools all likely have many methods – and even Tableau, although not necessarily aimed at the same market, just added a user-friendly clustering facility.  You can do the calculations in Excel, should you really want to (although why not cheat and use a nice addin if you want to save time?).

However, many of the more famous clustering algorithms, especially the ever-present K-Means algorithm, are really better for clustering objects that have quantitative numeric fields, rather than those that are categorical. I’m not going delve into the details of why here, but, simplistically, they tend to be based on concepts like Euclidean distance – and in that domain, it’s conceptually difficult to say that [bird] is Euclideanly “closer” to [fish] than [animal]; vs the much more straightforward task of knowing that an income of £100k is nearer to one of £90k than it is to 50p. IBM has a bit more about that here.

But, sometimes you really want to cluster categorical data! Luckily, algorithms for that exist, even if they are rather less widespread than typical k-means stuff.

R being R, of course it has a ton of libraries that might help you out. Below are a couple I’ve used, and a few notes as to the very basics of how to use them – not that it’s too difficult once you’ve found them. The art of selecting the optimum parameters for the very finest of clusters though is still yours to master, just like it is on most quantitative clustering.

The K-Modes algorithm

Like k-means, but with modes, see 🙂 ? A paper called ‘Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values‘ by Huang gives the gory details.

Luckily though, a R implementation is available within the klaR package. The klaR documentation is available in PDF format here and certainly worth a read.

But simplistically, you’re looking at passing a matrix or dataframe into the “kmodes” function.

Imagine you have a CSV file something like:

RecordID FieldA FieldB FieldC FieldD
1 0 0 0 1
2 0 0 0 0
3 0 0 0 1
4 1 1 0 0

Here’s how you might read it in, and cluster the records based on the contents of fields “FieldA”, “FieldB”, “FieldC”, and “FieldD”.

setwd("C:/Users/Adam/CatCluster/kmodes") <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-kmodes([,2:5], 3, iter.max = 10, weighted = FALSE ) #don't use the record ID as a clustering variable!

Here I’ve asked for 3 clusters to be found, which is the second argument of the kmodes function. Just like k-means, you can specify as many as you want so you have a few variations to compare the quality or real-world utility of.

This is the full list of parameters to kmodes, per the documentation.

kmodes(data, modes, iter.max = 10, weighted = FALSE)
  • data: A matrix or data frame of categorical data. Objects have to be in rows, variables
    in columns.
  • modes: Either the number of modes or a set of initial (distinct) cluster modes. If a
    number, a random set of (distinct) rows in data is chosen as the initial modes.
  • iter.max: The maximum number of iterations allowed.
  • weighted: Whether usual simple-matching distance between objects is used, or a weighted version of this distance.

What do you get back?

Well, the kmodes function returns you a list, with the most interesting entries being:

  • cluster: A vector of integers indicating the cluster to which each object is allocated.
  • size: The number of objects in each cluster.
  • modes: A matrix of cluster modes.
  • withindiff: The within-cluster simple-matching distance for each cluster

Here’s an example what it looks like when output to the console:

K-modes clustering with 3 clusters of sizes 3, 5, 12

Cluster modes:
 FieldA FieldB FieldC FieldD
1 1 0 0 0
2 1 0 1 1
3 0 0 0 0

Clustering vector:
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
 3 3 3 1 3 1 2 3 3 3 2 2 2 3 3 2 1 3 3 3

Within cluster simple-matching distance by cluster:
[1] 2 2 8

Available components:
[1] "cluster" "size" "modes" "withindiff" "iterations" "weighted"

So, if you want to append your newly found clusters onto the original dataset, you can just add the cluster back onto your original dataset as a new column, and perhaps write it out as a file to analyse elsewhere, like this:

cluster.output <- cbind(,cluster.results$cluster)
write.csv(cluster.output, file = "kmodes clusters.csv", row.names = TRUE)


The ROCK algorithm

Some heavy background reading on Rock is available in this presentation by Guha et al.

Again, a benevolent genius has popped an implementation into R for our use. This time you can find it in package “cba”. The PDF docs for cba are here.

But the most simplistic usage is very similar to k-modes, albeit with different optional parameters based on the how the algorithms differ.

Here’s what you’d do to cluster the same data as above, and write it back out, this time with the Rock clusters appended. Note here that ideally you’re specifically passing in a matrix to the rockCluster function.

setwd("C:/Users/Adam/CatCluster/rock") <- read.csv('dataset.csv', header = TRUE, sep = ',')
cluster.results <-rockCluster(as.matrix([,2:5]), 3 )
cluster.output <- cbind(,cluster.results$cl)
write.csv(cluster.output, file = "Rock clusters.csv", row.names = TRUE)

The full list of parameters to the relevant function, rockCluster is:

rockCluster(x, n, beta = 1-theta, theta = 0.5, fun = "dist", funArgs = list(method="binary"), debug = FALSE)
  • x: a data matrix; for rockLink an object of class dist.
  • n: the number of desired clusters.
  • beta: optional distance threshold
  • theta: neighborhood parameter in the range [0,1).
  • fun: distance function to use.
  • funArgs: a list of named parameter arguments to fun.
  • debug: turn on/off debugging output.

This is the output, which is of class “rock”, when printed to the screen:

data: x 
 beta: 0.5 
theta: 0.5 
 fun: dist 
 args: list(method = "binary") 
 1 2 3 
14 5 1

The object is a list, and its most useful component is probably “cl”, which is a factor containing the assignments of clusters to your data.

Of course once you have the csv files generated in the above ways, it’s just bog-standard data – so you’re free to visualise in R, or any other tool.