data.world: the place to go for your open data needs?

Somewhere in my outrageously long list of data-related links to check out I found “data.world”. Not only is that a nice URL, it also contains a worthy service that I can imagine being genuinely useful in future, if it takes off like it should. At first glance, it’s a platform for hosting data – biased towards the “open” variant of data, although they also offer to host private data – with some social bits and pieces overlaid.

What’s interesting to me about this particular portal, over and above the bunch of other sites with masses of open data available on them, is:

  1. Anyone can upload any conventional dataset (well, any that they are legally allowed to share) – so right now it contains anything from World Bank GDP info through to a list of medieval battles, and much more besides. It presumably seeks to be a host for all the world’s useful data, rather than that of a certain topic or producer. Caveat user etc. presumably applies, but the vision is nice.
  2. You can actually do things with the data on the site itself. For example, you can join one set of data to another hosted on the site, even if it comes from a totally different project by a totally different author. You can run queries or see simple visualisations.
  3. It’s very easy to get the data out, and hence use it in other tools should you want to do more complicated stuff later on.
  4. It’ll also host data documentation and sample queries (for example, SQL that you can run live) to provide contextual information and shortcuts for analysts who need to use data that they might not be intimately familiar with.
  5. There’s a social side. It allows chat conversations between authors, users and collaborators, and you can see what’s already been asked or answered about each dataset. You can “follow” individual people, or curated collections of subject-based datasets.

So let’s test a few features out with a simple example.

The overriding concept is that of a dataset. A dataset is more than a table; it can include many tables and a bunch of other sorts of non-data files that aid with the use of the data itself – for instance documentation, notebooks or images. Each user can create datasets, name and describe them appropriately, and decide whether they should be public or private.

Here’s one I prepared earlier (with a month of my Fitbit step count data in it, as it happens).

[Screenshot: the dataset creation screen, with a preview of the imported data at the bottom]

You can make your dataset open or private, attach a license to be explicit about its re-use, and add tags to aid discovery. You can even add data via a URL, and later refresh it if the contents of that URL change.

As you can see, after import it shows a preview of the data it read in at the bottom of the screen. If there were multiple files, you’d be able to filter or sort them to find the one you want.

If you hit the little “i” icon next to any field name, you get a quick summary visualisation and data description, dependent on data type. This is very useful to get a quick overview of what your field contains, and if it was read in correctly. In my view, this sort of thing should be a standard feature in most analytical tools (it already is in some).

[Screenshot: the field summary shown by the “i” icon, with a quick visualisation and description of the column]

I believe tags, field names and descriptions are searchable – so if you do a nice job with those then it’ll help people find what you’re sharing.

Plenty of other common actions are available once you’ve uploaded or discovered a data table of interest.

You can also “explore” the data. This expands the data table to take up most of the screen, enabling easier sorting, filtering and so on. More interestingly, you can open a chart view where you can make basic charts to understand your data in more detail.

Now, this isn’t going to replace your dedicated visualisation tool – it has only the most basic of customisations available at the moment – but it handles simple exploration requirements in a way that is substantially less time-consuming than downloading and importing your data into another tool.

It even suggests some charts you might like to make, allowing 1-click creation. On my data, for example, it offered to make me a chart of “Count of records by Date” or “Count of records by Steps”. It seems to take note of the data types, for instance defaulting to a line chart for the count by date, and a histogram for the count by steps.

Here’s the sort of output the 1-click option gives you:

[Screenshot: an automatically generated 1-click chart of the step data]

OK, that’s not a chart you’re going to send to Nature right away, but it does quickly show the range of my data, let me check for impossible outliers, and give some quick insights into the distribution. Apparently I commonly do between about 5000 and 7500 steps…and I don’t make the default Fitbit 10k step target very often. Oops.

These charts can then immediately be downloaded or shared as PNG or PDF, with automatically generated URLs like https://data.world/api/chart/export/d601307c3e790e5d05aa17773f81bd6446cdd148941b89b243d9b78c866ccc3b.png

Here I would quite like a 1-click feature to save and publish any particularly interesting chart with the dataset itself – but I understand why that’s probably not a priority unless the charting aspect becomes more of a dedicated visualisation feature rather than a quick-explore mechanic.

For now, you could always export the graphic and include it as, for example, an image file in the dataset. Here, for example, is a dataset where the author has taken the time to provide a great description, with some findings and related charts, alongside the set of tables they uploaded.

One type of artefact you can save online with the dataset is the query. Yes, you can query your files live onsite, with either (a variant of) SQL or SPARQL. Most people are probably more familiar with SQL, so let’s start there.

Starting a new query will give you a basic SELECT * LIMIT query, but you’re free to use many (but not all) standard SQL features to change up your dataset into a view that’s useful to you.

Let’s see, did I ever make my 10k step goal in December? If so, on which days?
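Something like the following would do it. This is only a minimal sketch – the real table and column names depend on how the file was imported, though mine presumably arrived with date and steps fields – but it shows the shape of the thing:

-- December days where the step count hit the 10k target
SELECT date, steps
FROM fitbit_data
WHERE steps >= 10000
  AND date BETWEEN '2016-12-01' AND '2016-12-31'
ORDER BY date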

[Screenshot: the query and its results – the December days that hit the 10k target]

Apparently I did, on a whopping four days, the dates of which are outlined above. I guess I had a busy Christmas Eve.

These results then behave just like a data table, so they can then be exported, linked to or visualised as a chart.

Once you’re happy with your query, if you think it’s useful for the future you can save it, or if it might aid other people, then you can publish it. A published query remains with the dataset, so next time someone comes to look at the dataset, they’ll see a list of the queries saved which they can re-use or adapt for their own needs. No more need for hundreds of people to transform a common dataset in exactly the same way again and again!

Interestingly, you can directly query between different datasets in the same query, irrespective of data table, data set, or author. Specifying the schemas feels a little fiddly at the moment, but it’s perfectly doable once you understand the system (although there’s no doubt room for future UI improvement here).

Imagine for instance that, for no conceivable reason, I was curious as to which celebrities sadly died on the days I met my 10k steps goal. Using a combination of my dataset and the 2016 celebrity deaths list from popculture, I can query like this:
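In essence it’s just a join between the two datasets on the date. Here’s a hedged sketch of the idea – the schema and column names (celebrity_deaths, celeb_death, date_of_death and so on) are illustrative guesses rather than the exact ones on the site:

-- which celebrity deaths coincided with my 10k-step days?
SELECT c.name, c.date_of_death, f.steps
FROM fitbit_data f
INNER JOIN celebrity_deaths.celeb_death c
  ON f.date = c.date_of_death
WHERE f.steps >= 10000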

[Screenshot: the actual cross-dataset join query as run on data.world]

…only to learn the sad news that a giant panda called Pan Pan expired during one of my goal-meeting days.

Of course, these query results can be published, shared, saved, explored and so on just like we saw previously.

Now, that’s a silly example, but the idea of not only being able to download open data, but having subject matter experts combine and publish useful data models as a one-time effort for data consumers to use in future, is an attractive one. Together with the ability to upload documentation, images or even analytical notebooks, you can see how this could become an invaluable resource of data and experience – even within a single organisation, let alone as a global repository of open data.

Of course, as with most aggregation or social sites, there’s a network effect: how useful this site ends up being depends on factors such as how many people make active use of it, how much data is uploaded to it and what the quality of the data is.

If one day it grew to the point where it was the default place to look for public data, without becoming a nightmare to find those snippets of data gold amongst its prospectively huge collection, it would potentially be an incredibly useful service.

The “nightmare to find” aspect is not a trivial point – there are already several open data portals (for instance, government-based ones) which offer a whole load of nice datasets, but it is often hard to find data with the exact content and granularity you’re after even when you know it exists – and those sites are often quite domain-limited, which in some ways makes the job easier. data.world already has a global search (which includes the ability to search specifically on recency, table name or field name if you wish), plus tags and curated collections, which I think shows the site takes the issue seriously.

For analyst confidence, some way of understanding data quality would also be useful. The previews of field types and contents already help here. Social features, to try and surface a concept similar to “institutional knowledge”, might also be overlaid. There’s already a basic “like” facility. Of course this can be a challenging issue for any data catalogue that, almost by definition, needs to offer upload access to all.

For browser-haters, it isn’t necessary to use the site directly in order to make use of its contents. There’s already an API which gives you the ability to programmatically upload, query and download data. This opens up some interesting future possibilities. Perhaps, if data.world does indeed become a top place to look for the data of the world, your analytics software of choice might in future include a feature such that you can effectively search a global data catalogue from the comfort of your chart-making screen, with a 1-click import once you’ve found your goal. ETL / metadata tools could provide an easy way to publish the results of your manipulations, and so on.

The site is only in preview mode at present, so it’s not something to stake your life on. But I really like the concept, and the execution so far is way beyond some other efforts I’ve seen in the past. If I find I’ve created a public dataset I’d like to share, I would certainly feel happy to distribute it and all supporting documents and queries via this platform. So best of luck to data.world in the, let’s say, “ambitious” mission of bringing together the world’s freely available – yet often sadly undiscoverable – data in a way that encourages people to actually make valuable use of it.


Free up-to-date UK postcode latitude longitude data

Unless your data comes pre-geocoded, if you’re trying to do some UK-based geospatial analysis you’ll probably need some easy way of translating addresses into latitude/longitude pairs or some similar co-ordinate system.

Whilst full-address geocoders are available, if you don’t actually need that level of precision then looking up a full postcode is often good enough and far faster (note to US readers: UK postcodes are way more precise than US zipcodes – BPH Postcodes quotes an average of about 15 properties per postcode).

A few years ago this data could be a little challenging to obtain – for free, anyway. But now there are various sources offering it up sans charge, in various formats and levels of up-to-dateness; some more useful than others.

My current favourite is probably Doogal, where Chris Bell has provided all sorts of address-based tools. Need to generate a random postcode or calculate the elevation of any route? He has it all.

Most interesting to me are the big CSVs of the UK’s postcodes, past and present, obtainable by pressing “download” at the top right of this page.

It’s a full postcode to long/lat mapping, and it includes the below columns too, which allow for some very useful groupings or lookups (see the join sketch after the list).

  • Postcode
  • In Use?
  • Latitude
  • Longitude
  • Easting
  • Northing
  • GridRef
  • County
  • District
  • Ward
  • DistrictCode
  • WardCode
  • Country
  • CountyCode
  • Constituency
  • Introduced
  • Terminated
  • Parish
  • NationalPark
  • Population
  • Households
  • Built up area
  • Built up sub-division
  • Lower layer super output area
  • Rural/urban
  • Region
  • Altitude
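Once the CSV is loaded into any SQL-capable tool, the lookup itself is just a join. A minimal sketch, assuming a hypothetical addresses table of your own, with the Doogal file imported as postcodes (column names as listed above):

-- enrich an address list with co-ordinates via the postcode
-- (in practice, normalise case and spacing on both sides first)
SELECT a.address, p.Latitude, p.Longitude
FROM addresses a
LEFT JOIN postcodes p
  ON a.postcode = p.Postcode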

From his blog and various appreciative comments, it sounds like it’s kept up to date nice and regularly, which is truly a service to mankind.

If you have any problems with his file, then some other alternatives would include the free version of Codepoint Open (although it seems to provide Eastings/Northings rather than long/lat – it is possible to convert between the two), or a UK-Postcode API.

All this is overkill if you just have <100 such lookups to make – you can find many web-based batch converters that’ll do the job for you very quickly, and often to address-level accuracy. Doogal’s is here.

Free interactive data and analysis tools from Public Health England

Public Health England is an agency sponsored by the UK Department of Health whose aim is to “protect and improve the nation’s health and wellbeing, and reduce health inequalities”.

They use, generate or distribute a bunch of interesting health-related data. They’ve a collection of many, many “data and analysis tools”, linked to from this page.

There’s something for everyone there, at least if you’re interested in various health stats based mainly on UK geography. Be careful, it’s easy to lose hours clicking around all the various tools, if you have even a vague interest in such things 🙂

There are general indicators such as health inequality maps, tools to allow comparison between health practices, and a lot of information on specific health-related concerns such as cancer, maternal mortality, mental health, obesity and many, many more.

Quite often the links are to interactive web tools, such as the one pictured below, or to PDF summaries. If you’re after data tables that you can use to integrate in your own analysis, it might be quicker to start at the Public Health England bit of the statistics subsection of the gov.uk website.

[Screenshot: one of Public Health England’s interactive data tools]

New website launch from the Office for National Statistics

Yesterday, the UK Office for National Statistics, the institution that is “responsible for collecting and publishing statistics related to the economy, population and society”, launched its new website.

As well as a new look, they’ve concentrated on improving the search experience and making it accessible to mobile device users.

The front page is a nice at-a-glance collection of some of the major time series one sees in the news (employment rate, CPI, GDP growth etc.). And there’s plenty of help-yourself downloadable data; they claim to offer 35,000 time series which you can explore and download with their interactive time series explorer tool.

Kaggle now offers free public dataset and script combos

Kaggle, a company most famous for facilitating competitions that allow organisations to solicit the help of teams of data scientists to solve their problems in return for a nice big prize, recently introduced a new section useful even for the less competitive types: “Kaggle Datasets“.

Here they host “high quality public datasets” you can access for free. But what is especially nice is that as well as the data download itself, they host any scripts, code and results that people have already written to handle them, plus some general discussion.

For example, on the “World Food Facts” page you can see a script that “ByronVergoesHouwens” wrote to see which countries ate the most sugar, and also a chart that the script produced. In fact you can even execute scripts online, thanks to their “Kaggle Scripts” product.

It looks like the datasets will be added to regularly, but right now the list is:

  • Amazon Fine Food Reviews
  • Twitter US Airline Sentiment
  • SF Salaries
  • First GOP debate Twitter Sentiment
  • 2013 American Community Survey
  • US Baby Names
  • May 2015 Reddit Comments
  • 2015 Notebook UX Survey
  • NIPS 2015 Papers
  • Iris (yes, the one you will have seen many times already if you’ve read ANY books/tutorials on clustering in R or similar!)
  • Meta Kaggle
  • Health Insurance Marketplace
  • US Dept of Education: College Scorecard
  • Ocean Ship Logbooks (1750-1850)
  • World Development Indicators
  • World Food Facts
  • Hillary Clinton’s Emails (sounds fun…:-))

Microsoft Academic Graph: papers, journals, authors and more

The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals and conference “venues” and fields of study.

Microsoft have been good enough to structure and release a bunch of web-crawled data around scientific papers, journals, authors, URLs, keywords, references between them and so on, for free, here. Perfect for understanding all sorts of network relationships between these nodes of academia.

The current version is 30GB of downloadable text files. It includes data on the following entities.

  • Affiliations
  • Authors
  • ConferenceSeries
  • ConferenceInstances
  • FieldsOfStudy
  • Journals
  • Papers
  • PaperAuthorAffiliations
  • PaperKeywords
  • PaperReferences
  • PaperUrls

The data is web-scraped and comes with a warning that it has only been minimally processed, so users should beware that the quality is not perfect – but it’s apparently the biggest chunk of bibliographic data of this kind that has been released for the public to do what it will with.
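To make the relational structure concrete: once loaded into a database, the files connect along ID columns, with PaperAuthorAffiliations acting as the linking table between Papers, Authors and Affiliations. A rough sketch – the column names here are illustrative guesses based on the entity names above, not the documented schema:

-- most prolific authors, found by walking the linking table
SELECT au.AuthorName, COUNT(*) AS paper_count
FROM Authors au
INNER JOIN PaperAuthorAffiliations paa
  ON paa.AuthorID = au.AuthorID
GROUP BY au.AuthorName
ORDER BY paper_count DESC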

Free dataset: all Reddit comments available for download

As terrifying a thought as it might be, Jason from Pushshift.io has extracted pretty much every Reddit comment from 2007 through to May 2015 that isn’t protected, and made it available for download and analysis.

This is about 1.65 billion comments, in JSON format. It’s pretty big, so you can download it via a torrent, as per the announcement on archive.org.

If you don’t need a local copy, Reddit user fhoffa has loaded most of it into Google BigQuery for anyone to use.

If you have an account over there, then as Tableau now has a native BigQuery connector you can visualise it directly in Tableau – which Mr Hoffa has indeed done and shared with the world at Tableau Public.

Although you get a certain amount of uploading and usage from BigQuery for free, you will most likely need a paid account to integrate it directly into a Tableau (or equivalent) project like this, as you’ll want to create a BigQuery dataset to connect Tableau to.

However, if you only need to run some SQL on the freely available dataset to get some output – which you can then manually download and integrate into whatever you like – your free monthly allowance of BigQuery usage might well be enough.

Here’s the link to the data in BigQuery – at least one of the tables. You’ll see the rest in the interface on the left as per this screenshot:

[Screenshot: the reddit_comments tables listed in the BigQuery interface]

You can then run some BigQuery SQL over it using the web interface – for free, up to a point, and retrieve whichever results you need.

For example:

SELECT * FROM [fh-bigquery:reddit_comments.2007] LIMIT 10

will give you 10 Reddit comments from (surprise surprise) 2007.

[Screenshot: the BigQuery web interface showing the query and its results]

As you can see on the bottom right, you can save results into a BigQuery table (this requires a dataset for which you need to enable billing on your BigQuery account) or download as CSV / JSON to do whatever you want with.
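Aggregations work just as easily. For example, something along these lines should surface the busiest subreddits of 2007 – assuming the tables carry a subreddit column, which is worth confirming in the schema pane first:

-- top 10 subreddits of 2007 by comment volume
SELECT subreddit, COUNT(*) AS num_comments
FROM [fh-bigquery:reddit_comments.2007]
GROUP BY subreddit
ORDER BY num_comments DESC
LIMIT 10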

Free data: Constituency Explorer – UK demographics, politics, behaviour

From some combination of the Office for National Statistics, the House of Commons and Durham library comes Constituency Explorer.

[Screenshot: the Constituency Explorer interface]

Billing itself as “reliable evidence for politicians and journalists – data for everyone”, it allows interactive visualisation of many interesting demographic/behavioural/political attributes by UK political constituency. It’s easy to view distributions and compare between a specific constituency, the region and the country on topics like

  • 2010 election results (turnout and results)
  • vehicle ownership
  • age
  • ethnicity
  • travel to work
  • household composition
  • qualifications
  • etc. etc.

Each chart has also a “download this data” link at the bottom left, which I would assume should give you a nice integratable spreadsheet/xml/something – but at the time of writing unfortunately one gets a “not found” error…

There’s also a fun “how well do you know your constituency” quiz which is nice for comparing one’s media-fueled perception of a given area to reality.

The most toxic place on Reddit

Reddit, the “front page of the internet” – and a network I hardly ever dare enter for fear of being sucked into reading 100s of comments for hours on highly pointless yet entertaining things – has had its share of controversies over the years.

The site is structurally divided up into “subreddits”, which one can imagine as simple, quite old-school forums where anyone can leave links and comments, and anyone else can up- or downvote them according to whether they approve or not.

Reddit users were themselves busily engaged in a chat regarding “which popular subreddit has a really toxic community” when Ben Bell of Idibon (a company big into text analysis) decided to tackle the same question with a touch of data science.

But what is “toxic”? Here’s their definition.

Ad hominem attack: a comment that directly attacks another Redditor (e.g. “your mother was a hamster and your father smelt of elderberries”) or otherwise shows contempt/disagrees in a completely non-constructive manner (e.g. “GASP are they trying CENSOR your FREE SPEECH??? I weep for you /s”)

Overt bigotry:  the use of bigoted (racist/sexist/homophobic etc.) language, whether targeting any particular individual or more generally, which would make members of the referenced group feel highly uncomfortable

Now, text sentiment analysis isn’t all that accurate as of today. The CTO of Datasift (a company with a very cool social-media-data-acquiring tool) was claiming, a couple of years ago, that around 70% accuracy was about the peak possible. The CEO of the afore-mentioned Idibon claimed about 80% is possible today.

No-one is claiming near 100%, especially on determinations as subtle as toxicity and its chosen opposite, supportiveness. The learning process was therefore a mix of machine learning and human involvement, with the Idibon sentiment analysis software highlighting, via the Reddit API, the subreddits most likely to be extreme, and humans classifying a subset of the posts into those categories.

But what is a toxic community? It’s not simply a place with a lot of toxic comments (although that’s probably not a bad proxy). It’s a community where such nastiness is approved of or egged on, rather than ignored, frowned upon or punished. Here Reddit provides a simple mechanism to indicate this, as each user can upvote (approve of) or downvote (disapprove of) a post.

Their final formula for judging the subreddits, combining these signals, is given in their blog post.

The full results of their analysis are kindly available for interactive visualisation, raw data download and so on here.

But in case anyone is in need of a quick offending, here were the top 5 by algorithmic toxicity. It may not be advisable to visit them on a work computer.

Rank of bigotry | Subreddit name | Official description
1 | TheRedPill | Discussion of sexual strategy in a culture increasingly lacking a positive identity for men.
2 | Opieandanthony | The Opie and Anthony Show
3 | Atheism | The web’s largest atheist forum. All topics related to atheism, agnosticism and secular living are welcome.
4 | Sex | r/sex is for civil discussions about all facets of sexuality and sexual relationships. It is a sex-positive community and a safe space for people of all genders and orientations.
5 | Justneckbeardthings | A subreddit for those who adorn their necks with proud man fur. “Neckbeard: A man who is socially inept and physically unappealing, especially one who has an obsessive interest in computing” – Oxford Dictionary

[Edited to correct Ben Bell’s name and column title of table – my apologies!]

Free data: data.gov.uk – thousands of datasets from the UK government

Data.gov.uk is the official portal that releases what the UK government deems to be open data.

The government is opening up its data for other people to re-use. This is only about non-personal, non-sensitive data – information like the list of schools, crime rates or the performance of your council.

At the time of writing it has nearly 20k published datasets available, of various qualities and in various formats both pleasant and unpleasant (xml, csv, pdf, html etc.), covering the following list of topics:

  • Environment
  • Mapping
  • Government Spending
  • Society
  • Government
  • Towns & Cities
  • Health
  • Education
  • Transport
  • Business & Economy