Free dataset: all Reddit comments available for download

As terrifying a thought as it might be, Jason from Pushshift.io has extracted pretty much every Reddit comment from 2007 through to May 2015 that isn’t protected, and made it available for download and analysis.

This is about 1.65 billion comments, in JSON format. It’s pretty big, so you can download it via a torrent, as per the announcement on archive.org.

If you don’t need a local copy, Reddit user fhoffa has loaded most of it into Google BigQuery for anyone to use.

If you have an account over there then, as Tableau now has a native BigQuery connector, you can visualise it directly in Tableau – which Mr Hoffa has indeed done and shared with the world on Tableau Public.

Although you get a certain amount of uploading and usage from BigQuery for free, you will most likely need a paid account to integrate it directly into a Tableau (or equivalent) project like this, as you’ll want to create a BigQuery dataset to connect Tableau to.

However, if you only need to run some SQL on the freely available dataset to get some output – which you can then manually download and integrate into whatever you like – your free monthly allowance of BigQuery usage might well be enough.

Here’s the link to the data in BigQuery – or at least to one of the tables; you’ll see the rest in the interface on the left, as per this screenshot:

[Screenshot: BigQuery Reddit data]

You can then run some BigQuery SQL over it using the web interface – for free, up to a point – and retrieve whichever results you need.

For example:

SELECT * FROM [fh-bigquery:reddit_comments.2007] LIMIT 10

will give you 10 Reddit comments from (surprise surprise) 2007.

[Screenshot: BigQuery SQL query]

As you can see on the bottom right, you can save results into a BigQuery table (this requires a dataset for which you need to enable billing on your BigQuery account) or download as CSV / JSON to do whatever you want with.
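
If you’d rather script this step than click around the web interface, the results can also be pulled straight into Python. This is only a minimal sketch, assuming you have a Google Cloud project with BigQuery access enabled and the pandas-gbq package installed; the project ID and output filename below are placeholders:

import pandas as pd

# Count comments per subreddit in the 2007 table (legacy SQL, as used above)
query = """
SELECT subreddit, COUNT(*) AS comments
FROM [fh-bigquery:reddit_comments.2007]
GROUP BY subreddit
ORDER BY comments DESC
LIMIT 20
"""

# 'my-project-id' is a placeholder for your own billing-enabled project
df = pd.read_gbq(query, project_id="my-project-id", dialect="legacy")

# Save locally to feed into Tableau, Alteryx or anything else
df.to_csv("reddit_2007_top_subreddits.csv", index=False)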

Basic text tokenisation with Alteryx

Free text analytics seems a fashionable pastime at present. The most commonly seen form in the wild might be the very basic text visualisation known as the “word cloud”. Here, for instance, is the New York Times’ “most searched for terms” represented in such a cloud.

When confronted with a body of human-written text, one of the first steps for many text-related analytical techniques is tokenisation. This is the process of decomposing the lengthy scrawlings received from a human into units suitable for data mining: for example, sentences, paragraphs or words.

Here we will consider splitting text up into individual words with a view to later processing.
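
To make the idea concrete outside of any particular tool, here is roughly what word-level tokenisation means in plain Python – a toy sketch using the first row of the sample data below. Note that naive whitespace splitting leaves punctuation stuck to the words, which is why delimiters matter later:

text = "Here is some Free text! But it needs seperating"

# Crudest possible word tokenisation: split on whitespace only
words = text.split()
print(words)
# ['Here', 'is', 'some', 'Free', 'text!', 'But', 'it', 'needs', 'seperating']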

My (trivial) sample data is:

ID | Free text 1                         | Free text 2 | Free text 3
1  | Here is some                        | Free text!  | But it needs seperating
2  | Into individual words               | Can Alteryx | Do it?
3  | Of course ALTERYX can , no problem  |             |

Let’s imagine I want to know how many times each word occurs. For instance, “Alteryx” appears in there twice.

Just to complicate things marginally, I’d also like to:

  • remove all the “common” words that tend to dominate such analysis whilst adding little insight – “I”, “You” etc.
  • ignore capitalisation: the word ALTERYX should be considered the same as Alteryx for example.

Alteryx has the tools to do a basic version of this in just a few seconds. Here’s a summary of one possible approach.

First, I used an Input tool to connect to my data source, which contained the free text for parsing.

Then I used a Formula tool to pull all the free text fields into one single field, separated by a delimiter of my choice. The formula is simply a concatenation:
[Free text 1] + "," + [Free text 2] + "," + [Free text 3]

The key manipulation was then done using the “Text To Columns” tool. Two configuration options were particularly helpful here.

  •  Despite its name, it has a configuration option to split your text to rows, rather than columns. This is great for this type of thing because each field might contain a different, unknown number of words – and for most analytic techniques it is often easier to handle a table of unknown length than one of unknown width.

    You will still be able to track which record each row originally came from, as Alteryx preserves, on each record, the fields that you do not split – similar to how the “Transpose” tool works.

  • You can enter several delimiters – and it will consider any of them independently. The list I used was ', !;:.?"', meaning I wanted to consider that a new word started or ended whenever I saw a comma, space, full stop, question mark and so on. You can add as many as you like, according to how your data is formatted. Note also the advanced options at the bottom, if you want to ignore delimiters in certain circumstances.

    [Screenshot: Text to Columns tool]

When one runs the tool, several delimiters next to each other will (correctly) cause rows with blank “words” to be generated. These are easily removed with a Filter tool, set to exclude any record where the text IsEmpty().
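
For anyone wanting to replicate those two steps outside Alteryx, a regular-expression split does the same job. Here is a minimal Python sketch using the same delimiter list; just as in Alteryx, consecutive delimiters produce empty strings, which are then filtered out (the Filter tool equivalent):

import re

text = "Here is some,Free text!,But it needs seperating"

# Split whenever any one of the delimiters , space ! ; : . ? " appears
tokens = re.split(r'[, !;:.?"]', text)

# Adjacent delimiters leave empty strings behind - drop them
words = [t for t in tokens if t != ""]
print(words)
# ['Here', 'is', 'some', 'Free', 'text', 'But', 'it', 'needs', 'seperating']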

Next I wanted to remove the most common basic words, such that I don’t end up with a frequency table filled with “and” or “I” for instance. These are often called stopwords. But how to choose them?

In reality, stopwords are best defined by language and domain. Google will find you plenty based on language, but you may need to edit them to suit your specific project. Here, I simply leveraged the work of the clever developers of the famous Python Natural Language Toolkit (NLTK).

This toolkit contains a corpus of default stopwords in several languages. The English ones can be extracted via:

from nltk.corpus import stopwords
# requires the stopword corpus to be downloaded first: nltk.download('stopwords')
stopwords.words('english')

which results in a list of 127 such words – I, me, my, myself etc. You can see the full current list on my spreadsheet: NLTK stopwords.
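
In Python terms, the equivalent of the stopword removal described next is to lowercase each token and keep only those not in the NLTK list – a quick sketch, again assuming the stopword corpus has already been downloaded:

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

words = ['Here', 'is', 'some', 'Free', 'text', 'But', 'it', 'needs', 'seperating']

# Lowercase everything so ALTERYX and Alteryx are treated identically,
# then drop anything appearing in the stopword list
kept = [w.lower() for w in words if w.lower() not in stop]
print(kept)
# ['free', 'text', 'needs', 'seperating']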

I loaded these into an Alteryx Text Input tool, and used a Join tool to connect the words generated from my text (on the left side) to the words in this stopword corpus (on the right side), taking the left-hand output of the Join tool.

[Screenshot: Join tool]

The overall effect of this is what relational database fans might call a left anti-join – a LEFT JOIN that keeps only the unmatched rows: an output that gives me all of the records from my processed text that do not match those in the stopword corpus.
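
Here is a pandas version of the same anti-join logic, for anyone working outside Alteryx; the stopword table in this sketch is just a tiny illustrative subset:

import pandas as pd

# One row per tokenised word (left side of the join)
words = pd.DataFrame({"word": ["free", "text", "but", "it", "needs", "seperating"]})

# A tiny subset of the stopword corpus (right side of the join)
stop = pd.DataFrame({"word": ["but", "it"]})

# Left join, then keep only the rows with no match on the right -
# equivalent to taking the left-hand output of the Alteryx Join tool
joined = words.merge(stop, on="word", how="left", indicator=True)
result = joined[joined["_merge"] == "left_only"].drop(columns="_merge")
print(result["word"].tolist())
# ['free', 'text', 'needs', 'seperating']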

(Alteryx has a nice “translate from SQL to Alteryx” page in the knowledgebase for people looking to recreate something they did in SQL in this new-fangled Alteryx tool).

The output of that Join tool is therefore the end result: one record for each word in your original text, tokenised, that is not in the list of common words. This is a useful format for carrying on to analyse in Alteryx or most other relevant tools.

If you wanted to do a quick frequency count within Alteryx, for instance, you could do this in seconds by dropping in a Summarise tool that counts the records, grouped by the word itself.
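
The Python equivalent of that Summarise step is a one-liner with collections.Counter, run over the stopword-filtered words from the whole sample dataset above:

from collections import Counter

# Lowercased, stopword-filtered words from all three sample records
words = ['free', 'text', 'needs', 'seperating',
         'individual', 'words', 'alteryx',
         'course', 'alteryx', 'problem']

counts = Counter(words)
print(counts.most_common(1))
# [('alteryx', 2)] - matching the observation that Alteryx appears twice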

You can download the full workflow shown above here.

The most toxic place on Reddit

Reddit, the “front page of the internet” – and a network I hardly ever dare enter for fear of being sucked into reading hundreds of comments for hours on highly pointless yet entertaining things – has had its share of controversies over the years.

The site is structurally divided up into “subreddits”, which one can imagine as simple, quite old-school forums where anyone can leave links and comments, and anyone else can upvote or downvote them according to whether or not they approve.

Reddit users were themselves busily engaged in a chat regarding “which popular subreddit has a really toxic community” when Ben Bell of Idibon (a company big into text analysis) decided to tackle the same question with a touch of data science.

But what is “toxic”? Here’s their definition.

Ad hominem attack: a comment that directly attacks another Redditor (e.g. “your mother was a hamster and your father smelt of elderberries”) or otherwise shows contempt/disagrees in a completely non-constructive manner (e.g. “GASP are they trying CENSOR your FREE SPEECH??? I weep for you /s”)

Overt bigotry:  the use of bigoted (racist/sexist/homophobic etc.) language, whether targeting any particular individual or more generally, which would make members of the referenced group feel highly uncomfortable

Now, text sentiment analysis isn’t all that perfect as of today. The CTO of Datasift, who has a very cool social-media-data-acquiring tool, was claiming around 70% accuracy as being about the peak possible a couple of years ago. The CEO of the aforementioned Idibon claimed about 80% was possible today.

No-one is claiming anywhere near 100%, especially on such subtle determinations as toxicity and its chosen opposite, supportiveness. The learning process was therefore a mix of machine learning and human involvement, with the Idibon sentiment analysis software highlighting, via the Reddit API, the subreddits most likely to be extreme, and humans classifying a subset of the posts into those categories.

But what is a toxic community? It’s not simply a place with a lot of toxic comments (although that’s probably not a bad proxy). It’s a community where such nastiness is approved of or egged on, rather than ignored, frowned upon or punished. Here Reddit provides a simple mechanism to indicate this, as each user can upvote (approve of) or downvote (disapprove of) a post.

The final formula they used to score the subreddits is shown in their blog post.

The full results of their analysis are kindly available for interactive visualisation, raw data download and so on here.

But in case anyone is in need of a quick offending, here were the top 5 by algorithmic toxicity. It may not be advisable to visit them on a work computer.

Rank of bigotry | Subreddit name      | Official description
1               | TheRedPill          | Discussion of sexual strategy in a culture increasingly lacking a positive identity for men.
2               | Opieandanthony      | The Opie and Anthony Show
3               | Atheism             | The web’s largest atheist forum. All topics related to atheism, agnosticism and secular living are welcome.
4               | Sex                 | r/sex is for civil discussions about all facets of sexuality and sexual relationships. It is a sex-positive community and a safe space for people of all genders and orientations.
5               | Justneckbeardthings | A subreddit for those who adorn their necks with proud man fur. “Neckbeard: A man who is socially inept and physically unappealing, especially one who has an obsessive interest in computing” – Oxford Dictionary

[Edited to correct Ben Bell’s name and column title of table – my apologies!]