data.world: the place to go for your open data needs?

Somewhere in my outrageously long list of data-related links to check out, I found “data.world”. Not only is that a nice URL, it also contains a worthy service that I can imagine being genuinely useful in future, if it takes off as it should. At first glance, it’s a platform for hosting data – seemingly biased towards the “open” variant, although they offer to host private data too – with some social bits and pieces overlaid.

What’s interesting to me about this particular portal, over and above the many other sites with masses of open data available on them, is:

  1. Anyone can upload any conventional dataset (well, any that they’re legally allowed to share) – so right now it contains anything from World Bank GDP info through to a list of medieval battles, and much more besides. It presumably seeks to be a host for all the world’s useful data, rather than that of a certain topic or producer. Caveat user etc. presumably applies, but the vision is nice.
  2. You can actually do things with the data on the site itself. For example, you can join one dataset to another hosted there, even if it’s from a totally different project by a different author. You can run queries or build simple visualisations.
  3. It’s very easy to get the data out, and hence use it in other tools should you want to do more complicated stuff later on.
  4. It’ll also host data documentation and sample queries (for example, SQL that you can run live) to provide contextual information and shortcuts for analysts who need to use data that they might not be intimately familiar with.
  5. There’s a social side: chat conversations between authors, users and collaborators; you can see what’s already been asked or answered about each dataset; and you can “follow” individual people or curated collections of subject-based datasets.

So let’s test a few features out with a simple example.

The overriding concept is that of a dataset. A dataset is more than a table; it can include many tables and a bunch of other sorts of non-data files that aid with the use of the data itself – for instance documentation, notebooks or images. Each user can create datasets, name and describe them appropriately, and decide whether they should be public or private.

Here’s one I prepared earlier (containing a month of my Fitbit step-count data, as it happens).

[Screenshot: the dataset page, with data preview]

You can make your dataset open or private, attach a licence to be explicit about its re-use, and add tags to aid discovery. You can even add data via a URL, and later refresh it if the contents of that URL change.

As you can see, after import it shows a preview of the data it read in at the bottom of the screen.  If there were multiple files, you’d be able to filter or sort them to find the one you want.

If you hit the little “i” icon next to any field name, you get a quick summary visualisation and data description, dependent on data type. This is very useful for getting a quick overview of what a field contains and whether it was read in correctly. In my view, this sort of thing should be a standard feature in most analytical tools (it already is in some).

[Screenshot: the field summary popup]

I believe tags, field names and descriptions are searchable – so if you do a nice job with those then it’ll help people find what you’re sharing.

Several common actions are available once you’ve uploaded or discovered a data table of interest – the most obvious being to download it for use elsewhere.

You can also “explore” the data. This expands the data table to take up most of the screen, enabling easier sorting, filtering and so on. More interestingly, you can open a chart view where you can make basic charts to understand your data in more detail.

Now, this isn’t going to replace your dedicated visualisation tool – it has only the most basic of customisations available at the moment – but it handles simple exploration requirements in a way that is substantially less time consuming than downloading and importing your data into another tool.

It even suggests some charts you might like to make, allowing 1-click creation. On my data, for example, it offered to make me a chart of “Count of records by Date” or “Count of records by Steps”. It seems to take note of the data types, for instance defaulting to a line chart for the count by date, and a histogram for the count by steps.

Here’s the sort of output the 1-click option gives you:

[Screenshot: the suggested 1-click chart of step counts]

OK, that’s not a chart you’re going to send to Nature right away, but it does quickly show the range of my data, let me check for impossible outliers, and give some quick insights into the distribution. Apparently I commonly do between about 5000 and 7500 steps…and I don’t make the default Fitbit 10k steps target very often. Oops.

These charts can then immediately be downloaded or shared as png or pdf, with automatically generated URLs like https://data.world/api/chart/export/d601307c3e790e5d05aa17773f81bd6446cdd148941b89b243d9b78c866ccc3b.png

Here I would quite like a 1-click feature to save and publish any particularly interesting chart with the dataset itself – but I understand why that’s probably not a priority unless the charting aspect becomes more of a dedicated visualisation feature rather than a quick explore mechanic.

For now, you could always export the graphic and include it as, for example, an image file in the dataset. Here, for example, is a dataset where the author has taken the time to provide a great description, with some findings and related charts, alongside the set of tables they uploaded.

One type of artefact you can save online with the dataset is a query. Yes, you can query your files live onsite, with either (a variant of) SQL or SPARQL. Most people are probably more familiar with SQL, so let’s start there.

Starting a new query will give you a basic SELECT * LIMIT query, but you’re free to use many (though not all) standard SQL features to transform your dataset into a view that’s useful to you.

Let’s see, did I ever make my 10k step goal in December? If so, on which days?
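A query along these lines does the job – treat it as a sketch only, since the actual table and column names depend on what data.world called my uploaded file:

SELECT step_date, steps
FROM fitbit_steps
WHERE steps >= 10000
AND step_date BETWEEN '2016-12-01' AND '2016-12-31'
ORDER BY step_date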

[Screenshot: the query editor and its results]

Apparently I did, on a whopping four days, the dates of which are outlined above. I guess I had a busy Christmas Eve.

These results then behave just like a data table, so they can then be exported, linked to or visualised as a chart.

Once you’re happy with your query, if you think it’s useful for the future you can save it, or if it might aid other people, then you can publish it. A published query remains with the dataset, so next time someone comes to look at the dataset, they’ll see a list of the queries saved which they can re-use or adapt for their own needs. No more need for hundreds of people to transform a common dataset in exactly the same way again and again!

Interestingly, you can directly query between different datasets in the same query, irrespective of data table, data set, or author. Specifying the schemas feels a little fiddly at the moment, but it’s perfectly doable once you understand the system (although there’s no doubt room for future UI improvement here).

Imagine for instance that, for no conceivable reason, I was curious as to which celebrities sadly died on the days I met my 10k steps goal. Using a combination of my dataset and the 2016 celebrity deaths list from popculture, I can query like this:
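In spirit it’s just a plain SQL join, something like the sketch below – the table and column names are placeholders for whatever the real files are called, and the query editor helps with the business of pointing at a table that lives in someone else’s dataset:

SELECT deaths.name, deaths.death_date, fitbit.steps
FROM fitbit_steps AS fitbit
JOIN celebrity_deaths_2016 AS deaths
ON deaths.death_date = fitbit.step_date
WHERE fitbit.steps >= 10000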

[Screenshot: the cross-dataset query and its results]

…only to learn that a giant panda called Pan Pan sadly expired on one of my goal-meeting days.

Of course, these query results can be published, shared, saved, explored and so on just like we saw previously.

Now, that’s a silly example, but the idea of not only being able to download open data, but having subject-matter experts combine and publish useful data models as a one-time effort for data consumers to use in future, is an attractive one. Together with the ability to upload documentation, images or even analytical notebooks, you can see how this could become an invaluable resource of data and experience – even within a single organisation, let alone as a global repository of open data.

Of course, as with most aggregation or social sites, there’s a network effect: how useful this site ends up being depends on factors such as how many people make active use of it, how much data is uploaded to it and what the quality of the data is.

If one day it grew to the point where it was the default place to look for public data, without becoming a nightmare for finding those snippets of data gold amongst its prospectively huge collection, it would be an incredibly useful service.

The “nightmare to find” aspect is not a trivial point. There are already several open data portals (for instance, government-based ones) offering a whole load of nice datasets, yet it is often hard to find data with the exact content and granularity you’re after, even when you know it exists – and those are sites that are often quite domain-limited, which in some ways makes the job easier. data.world already has a global search (which includes the ability to search specifically on recency, table name or field name if you wish), tags and curated collections, which I think shows the site takes the issue seriously.

For analyst confidence, some way of understanding data quality would also be useful. The previews of field types and contents are already useful here. Social features, to try and surface a concept similar to “institutional knowledge”, might also be overlaid. There’s already a basic “like” facility. Of course this can be a challenging issue for any data catalogue that, almost by definition, needs to offer upload access to all.

For browser-haters, it isn’t necessary to use the site directly in order to make use of its contents. There’s already an API which gives you the ability to programmatically upload, query and download data. This opens up some interesting future possibilities. Perhaps, if data.world does indeed become a top place to look for the data of the world, your analytics software of choice might in future include a feature whereby you can search a global data catalogue from the comfort of your chart-making screen, with a 1-click import once you’ve found your goal. ETL/metadata tools could provide an easy way to publish the results of your manipulations, and so on.

The site is only in preview mode at present, so it’s not something to stake your life on. But I really like the concept, and the execution so far is way beyond some other efforts I’ve seen in the past. If I create a public dataset I’d like to share, I would certainly feel happy to distribute it and all supporting documents and queries via this platform. So best of luck to data.world in the, let’s say, “ambitious” mission of bringing together the world’s freely available – yet often sadly undiscoverable – data in a way that encourages people to actually make valuable use of it.

#VisualizeNoMalaria: Let’s all help build an anti-Malaria dataset

As well as just being plain old fun, data can also be an enabler for “good” in the world. Several organisations are clearly aware of this; both Tableau and Alteryx now have wings specifically for doing good. There are whole organisations set up to promote beneficial uses of data, such as DataKind, and a bunch of people write reports on the topic – for example Nesta’s report “Data for good”.

And it’s not hard to get involved. Here’s a simple task you can do in a few minutes (or a few weeks if you have the time) from the comfort of your home, thanks to a collaboration between Tableau, PATH and the Zambian government: Help them map Zambian buildings.

Why so? For the cause of eliminating the scourge of malaria from Zambia. In order to effectively target resources at malaria hotspots (and, in future, to predict where the disease might flare up), they’re:

developing maps that improve our understanding of the existing topology—both the natural and man-made structures that are hospitable to malaria. The team can use this information to respond quickly with medicine to follow up and treat individual malaria cases. The team can also deploy resources such as indoor spraying and bed nets to effectively protect families living in the immediate vicinity.

Zambia isn’t like Manhattan. There’s no nice straightforward grid of streets that even a crazy tourist could understand with minimal training. There’s no Google-Earth-style 3D building-level resource available. The task at hand is therefore to establish, from satellite photos, a detailed map of where buildings – and hence people – are. One day, no doubt, an AI will be employed for this job, but right now it remains one for us humans.

Full instructions are in the Tableau blog post, but honestly, it’s pretty easy:

  • If you don’t already have an OpenStreetMap user account, make a free one here.
  • Go to http://tasks.hotosm.org/project/1985 and log in with your OpenStreetMap account
  • Click a square of the map, choose “Edit in iD editor”, scan around looking for buildings, and have fun drawing boxes on top of them.

It may not be a particularly fascinating activity over the long term, but it’s more fun than a game of Threes – and you’ll be helping to build a dataset that may one day save a serious number of lives, amongst other potential uses.

Well done to all concerned for making it so easy! And if you’ve never poked around the fantastic collaborative project that is OpenStreetMap itself, there’s a bunch of interesting stuff available there for the geographically-inclined data analyst.

 

Accessing Adobe Analytics data with Alteryx

Adobe Analytics (also known as Site Catalyst, Omniture, and various other names both past and present) is a service that tracks and reports on how people use websites and apps. It’s one of the leading solutions for organisations who are interested in studying how people are actually using their digital offerings.

Studying real-world usage is often far more insightful, in my view, than surveying people before or after the fact. Competitors to Adobe Analytics would include Google Analytics and other such services that allow you to follow web traffic, and answer questions from those as simple as “how many people visited my website today?” up to “can we predict how many people from New York will sign up to my service after having clicked button x, watched my promo video and spent at least 10 minutes reading the terms and conditions?”

In their own words:

What is Adobe Analytics?
It’s the industry-leading solution for applying real-time analytics and detailed segmentation across all of your marketing channels. Use it to discover high-value audiences and power customer intelligence for your business.

I use it a lot, but until recently have always found that it suffers from a key problem. Please pardon my usage of the 4-letter “s word” but, here, at least, the Adobe digital data has always pretty much remained in a silo. Grrr!

There are various native solutions, some of which are helpful for certain use cases (take a look at the useful Excel addin or the – badly named in my opinion, and somewhat temperamental – “data warehouse” functionality for instance). We have also had various technology teams working on using native functionality to move data from Adobe into a more typical and accessible relational database, but that seems to be a time-consuming and resource-intensive operation to get in place.

So none of the above solutions really met my need to extract reasonably large volumes of data quickly and easily, on an ad-hoc basis, for integration with other datasources in a refreshable manner. And without that, in a world that moves ever more towards digital interactions, it’s hard to get a true overall view of your customers’ engagement.

So, imagine how the sun shone and the angels sang in my world when I saw the Alteryx version 10.5 release announcement.

…Alteryx Analytics 10.5 introduces new connectors to Google Sheets, Adobe Analytics, and Salesforce – enhancing the scope of data available for analytic insights

I must admit that I had high hopes this would happen: when looking through the detailed agenda for this year’s Alteryx Inspire conference (see you there?), I noticed Adobe Analytics mentioned within a session called “How to process and visualise data in the cloud”. But yesterday it actually arrived!

It must be said that the setup is not 100% trivial, so below I have outlined the process I went through to get a successful connection, in case it proves useful for others to know.

Firstly, the Adobe Analytics data connector is not automatically installed, even when you install the full, latest version of Alteryx. Don’t let this concern you. The trick, after you have updated Alteryx to at least version 10.5, is to download the connector separately from the relevant page of the Alteryx Analytics gallery. It’s the blue “Adobe Analytics install” file you want to save to your computer; there’s no need to press the big “Run” button on the website itself.

(If you don’t already have one, you may have to create an Alteryx gallery user account first, but that’s easy to do and free of charge, even if you’re not an Alteryx customer. And whilst you’re there, why not browse through the manifold other goodies it hosts?)

You should end up with a small file called “AdobeAnalytics.yxi” on your computer. Double click that, Alteryx will load up, and you’ll go through a quick and simple install routine.

[Screenshot: the connector install routine]

Once you’ve gone through that, check your standard Alteryx “Connectors” ribbon and you should see a new tool called “Adobe Analytics”.

Just like any other Alteryx tool you can drag and drop that into your workflow and configure it in the Configuration pane. Once configured correctly, you can use it in a similar vein to the “Input data” tool.

The first thing you’ll need to configure is your sign-in method, so that Alteryx becomes authorised to access your Adobe Analytics account.

This isn’t necessarily as straightforward as with most other data connectors, because Adobe offers a plethora of different types of account or means of access, and it’s quite possible the one that you use is not directly supported. That was the case for me at least.

Alteryx have provided some instructions as to how to sort that out here. Rather than use my standard company login, instead I created a new Adobe ID (using my individual corporate email address), logged into marketing.adobe.com with it, and used the “Get access” section of the Adobe site to link my company Adobe Analytics login to my new Adobe ID.

That was much simpler than it sounds, and you may not need to do it if you already have a proper Adobe ID or a Developer login, but that’s the method I successfully used.

Then you can log in, via the tool’s configuration panel.

[Screenshot: the sign-in options]

Once you’re happily logged in (using the “User login” option if you followed the same procedure as I did above), you get to the juicy configuration options to specify what data you want your connector to return from the Adobe Analytics offerings.

Now, a lot of the content you’ll see here is very dependent on your Adobe setup, so you might want to work with the owner of your Adobe install if it’s not offering what you want, unless you’re also multitasking as the Adobe admin.

In essence, you’re selecting a report suite, the metrics (and dimensions, aka “elements”) you’re interested in, the date range of significance and the granularity. If you’re at all familiar with the Adobe Analytics web interface, it’s all the same stuff with the same terminology (but, if it offers what you want, so much faster and more flexible).

Leave “Attempt to Parse Report” ticked, unless for some reason you prefer the raw JSON the Adobe API returns instead of a nice Alteryx table.

Once you’ve done that, then Alteryx will consider it as just another type of datasource. The output of that tool can then be fed into any other Alteryx tool – perhaps start with a Browse tool to see exactly what’s being returned from your request. And then you’re free to leverage the extensive Alteryx toolkit to process, combine, integrate, analyse and model your data from Adobe and elsewhere to gain extra insights into your digital world.

Want an update with new data next week? Just re-open your workflow and hit run, and see the latest data flow in. That’s a substantial time and sanity saving improvement on the old-style battle-via-Excel to de-silo this data, and perhaps even one worth buying Alteryx for alone if you do a lot of this!

Don’t forget that with the Alteryx output data tool, and the various enhanced output options including the in-database tools and Tableau output options from the latest version, you could also use Alteryx simply to move data from Adobe Analytics to some other system, whether for visualisation in Tableau or integration into a data warehouse or similar.

A use case might simply be to automatically push web traffic data up to a datasource hosted in Tableau Server, for instance, so that any number of your licensed analysts can use it in their own work. You can probably find a way to do a simple version of this “for free” using the native Adobe capabilities if you try hard enough, but anything that involves a semblance of transformation or joining, at least in our setup, seems far easier to do with external tools like Alteryx.

Pro-tip: frustrated that this tool, like most of the native ones, restricts you to pulling data from one Adobe Report Suite at a time? Not a problem – just copy and paste the workflow once for each report suite and use an Alteryx Union tool to combine the results into one long table.

Here are screenshots of an example workflow and results (not from any real website…) to show that in action. Let’s answer a simple question: how many unique visitors have we had to two different websites, each represented by a different report suite, over the past week?

[Screenshot: the example workflow]

[Screenshot: the example results]

Performance: in my experience, although Adobe Analytics can contain a wealth of insightful information, I’ve found the speed of accessing it to be “non-optimal” at times. The data warehouse functionality for instance promises/threatens that:

Because of the complexity of Data Warehouse reports, they are not immediately available, can take up to 72 hours to generate, and are accessible via email, FTP or API delivery mechanisms.

The data warehouse functionality surely allows complexity that’s an order of magnitude beyond what a simple workflow like this does, but just for reference, this workflow ran in about 20 seconds. Pulling equivalent data for 2 years took about 40 seconds. Not as fast as you’d expect a standard database to perform, but still far quicker than making a cup of tea.

Sidenote: the data returned from this connector appears to come in string format, even when it’s a column of a purely numeric measure. You might want to use a Select tool or other method in order to convert it to a more processable type if you’re using it in downstream tools.

Overall conclusion: HOORAY!

Kaggle now offers free public dataset and script combos

Kaggle, a company most famous for facilitating competitions that allow organisations to solicit the help of teams of data scientists to solve their problems in return for a nice big prize, recently introduced a new section useful even for the less competitive types: “Kaggle Datasets”.

Here they host “high quality public datasets” you can access for free. But what is especially nice is that as well as the data download itself, they host any scripts, code and results that people have already written to handle them, plus some general discussion.

For example, on the “World Food Facts” page you can see a script that “ByronVergoesHouwens” wrote to see which countries ate the most sugar, and the chart that script produced. In fact you can even execute scripts online, thanks to their “Kaggle Scripts” product.

It looks like the datasets will be added to regularly, but right now the list is:

  • Amazon Fine Food Reviews
  • Twitter US Airline Sentiment
  • SF Salaries
  • First GOP debate Twitter Sentiment
  • 2013 American Community Survey
  • US Baby Names
  • May 2015 Reddit Comments
  • 2015 Notebook UX Survey
  • NIPS 2015 Papers
  • Iris (yes, the one you will have seen many times already if you’ve read ANY books/tutorials on clustering in R or similar!)
  • Meta Kaggle
  • Health Insurance Marketplace
  • US Dept of Education: College Scorecard
  • Ocean Ship Logbooks (1750-1850)
  • World Development Indicators
  • World Food Facts
  • Hillary Clinton’s Emails (sounds fun…:-))

How many teachers do we need? The official Governmental model

How do we know how many teachers are required to keep the UK’s schools in good working order? It’s an interesting question, with obvious implications for Governmental education policy with regards to teacher compensation, incentives, training places and so on.

The “official” requirements are calculated via the Government’s “Teacher Supply Model”, which, happily, in the name of transparency you can get a copy of here.

But rather than have to read through the 61 page user guide and two big fat Excel files, below are some basic notes on what factors go into its calculations. Most of this is summarised or reproduced from their manual (it is hefty, but I appreciate their openness in preparing it!).

Firstly we should define what exactly the model tries to calculate, and then predict.

Target variable

One of the key outputs is the number of teacher training places required, so it works from a top-down approach of “how many teachers do we want overall?” to get to that figure.

The “how many teachers do we need to enter the profession each year?” question is the focus of this post.

It’s referred to as “model part 1” in the official documentation. Model part 2 works from this to get to the actual number of NQT training places required – needed because there are other routes to increase teacher numbers besides the typical NQT route.

Anyway, being such a model, part 1 necessarily involves a mix of data and assumptions.

It’s set up in a way that lets you tweak the assumptions to show, for instance, the implications if more or fewer teachers quit than expected. Where figures from this model are mentioned in future posts, they come from the default “central” scenario unless noted otherwise.

The authors are (very) keen to highlight that any assumptions made in the model do not equate to or suggest knowledge of future Government policy! Rather, it’s what these domain experts predict is likely to happen.

Model scope:

  • Only applies to England
  • Only concerns itself with qualified teachers
  • Includes state-funded schools: primary (including those with nurseries attached), secondary, academies and free schools, and key stage 5 (aka sixth-form) teaching within secondary schools
  • Does not include special schools, referral units, independent schools, early years schools or standalone further/6th form colleges

Input variables used:

(*) Wastage is the slightly unpleasant-sounding term for teachers leaving for reasons other than death or retirement.

Assumptions implied:

General

The active stock of teachers in November 2014 (when the census is conducted) will not change significantly by the end of the 2014-15 academic year.

Teachers are categorised into what subject they actually teach, not what they were employed to teach. For example, if they are officially a science teacher, but spend 25% of the week teaching maths, then that’s 0.75 of a science teacher and 0.25 of a maths teacher.

Hours spent teaching PSHE are excluded.

Long term, the rate of change of key stage 5 pupil numbers will match the rate of change of the national 16-19-year-old population. Short term, the same increases in post-16 participation seen over the past 3 years will continue.

Wastage

The proportion of teachers who will leave as wastage going forward (per age group, per gender) is calculated from a weighted average of the wastage rates in the past 4 years’ worth of data.

This data is also broken down into groups of subjects (but not individual subjects).

The groups are as follows:

  1. Group 1: EBacc Science and Mathematics subjects – including Biology, Chemistry, Computing, Mathematics, and Physics.
  2. Group 2: EBacc non-Science and Mathematics subjects – including Classics, English, Geography, History, and Modern Foreign Languages.
  3. Group 3: All other subjects – including drama, music, PE, and RE among others.

The below table shows the assumed wastage rates based on subject/age. In general, group 1 subject teachers are more likely to leave than group 2, and then group 3 are the least likely to leave.

[Table: assumed wastage rates by subject group and age]
Projected wastage rates also factor in economic variables via the “econometric wastage model”, e.g. looking at historical relationships between teacher wastage and economic growth, unemployment and so on.

Retirement / deaths

Uses weighted historical retirement and death-in-service rates from the past 4 years of data. Rates are calculated by age group and gender.

The model assumes retirement/death-in-service rates are the same across all subjects and all future time periods (though it does account for the fact that some subjects tend to have older teachers, or a different gender mix, than others, and for projected changes in teacher demographics).

Method to estimate future teacher stocks needed

Start by projecting how the pupil teacher ratio will change going forward as pupil numbers change.

This is not as simple as “if the number of pupils doubles, so should the number of teachers”. They show via data that, historically, when the pupil population increased, some of the extra demand was absorbed by increasing the pupil-teacher ratio as well as by recruiting new teachers.

The model therefore assumes that for every 1% increase in pupil population from now, the pupil-teacher ratio will increase by only 0.5 percentage points (primary) or 0.6 percentage points (secondary). It is, however, capped at a level related to previous historical maximums.

Here’s their “historical” chart on the subject:

[Chart: historical pupil-teacher ratios]

Knowing future pupil numbers and future pupil-teacher ratios allows calculation of the FTE teachers needed.

The model assumes that the ratio of unqualified teachers to qualified teachers will remain constant at today’s rates (by phase and subject). Demand met by unqualified teachers is therefore removed from this model.

FTE teacher requirements are converted into actual physical headcount via multiplying by the current FTE rate for teachers (with the implicit assumption that this ratio remains constant going forward).

Then to calculate teacher need by subject:

Calculate FTE rate based on current needs; e.g. if 10% of teaching time in secondary schools is spent teaching English (irrespective of how many English teachers there are) then 10% of the workforce needs to be English teachers.

Different subjects are more or less popular as pupil options at the distinct Key Stages, and the proportion of pupils in each Key Stage also changes.

The model therefore estimates the quantity of teaching time needed per pupil per subject at KS3, 4 and 5, and scales upwards.

Then, add adjustment for anticipated education policies:

If any change of educational policy is expected to adjust the need for teachers by more than 100 FTE, it is added to the model. Right now there are 7 such policies in the secondary teacher section.

The policies they address are as follows:

  • Hold 2016-17 ITT places for all EBacc subjects at 2015/16 levels or above (to support the “EBacc for all” policy).
  • Assume increases in EBacc subject take-up.
  • Remove option to just take Core Science (Core Science is to be replaced by Combined Science, meaning that 10% of KS4 students will need double the science teaching time).
  • Add extra maths teaching time for a new core mathematics policy.
  • Assume continuing increases in uptake of Maths and Further Maths A-levels due to the enhanced further mathematics support programme.
  • Impact of new Maths GCSE will require a greater amount of Mathematics teaching per pupil at KS3 & 4.
  • Impact of new English GCSE will require a greater amount of English teaching per pupil at KS4.

Estimating new teachers needed to enter stock each year

Once one knows the above figures, it’s a simple calculation:

Need for entrant teachers in year x = Teacher need in year x  - Stock of teachers at the end of previous year + No. of teachers expected to leave in year x
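To make that concrete with some entirely made-up illustrative numbers (not figures from the model): if 20,000 FTE teachers of a subject are needed in year x, 19,000 were in stock at the end of the previous year, and 1,500 are expected to leave during year x, then:

Entrants needed in year x = 20,000 - 19,000 + 1,500 = 2,500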

This is calculated per subject per phase using variables on the left of the below diagram, iterating for years beyond 2016/17 with the process on the right of the diagram.

[Diagram: estimating new entrant teachers needed per subject]

One key assumption is made here when it comes to modelling future years (2017-18 onwards).

The model assumes that if we determined we needed [x] new teachers by the end of year [y], then indeed that many will have been successfully acquired and added to the active stock. That will then be the starting stock for year [y+1].

There’s no consideration of events that would lead to fewer teachers being active in future years than the model says are required, e.g. if recruitment efforts fail.

 
That’s the end of part 1 of the model, which calculates the total and new teachers required.

Possibly more interesting will be to see the actual numbers behind the above calculations, which give indications of trends in KPIs affecting teacher requirements. More on that soon.

Are station toilets profitable?

After being charged 50p for the convenience of using a station convenience, I became curious as to whether the owners were making much money from this most annoying expression of a capitalistic monopoly on the needs of many humans.

It turns out data on those managed by Network Rail is available in the name of transparency – so please click through and enjoy interacting with a quick viz on the subject.

Train station toilet viz

Microsoft Academic Graph: paper, journals, authors and more

The Microsoft Academic Graph is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals and conference “venues” and fields of study.

Microsoft have been good enough to structure and release a bunch of web-crawled data around scientific papers, journals, authors, URLs, keywords, references between them and so on, for free, here. Perfect for understanding all sorts of network relationships between these nodes of academia.

The current version is 30GB of downloadable text files. It includes data on the following entities:

  • Affiliations
  • Authors
  • ConferenceSeries
  • ConferenceInstances
  • FieldsOfStudy
  • Journals
  • Papers
  • PaperAuthorAffiliations
  • PaperKeywords
  • PaperReferences
  • PaperUrls

Being web-scraped and only minimally processed, the data comes with a warning that its quality is not perfect – but it’s apparently the biggest chunk of bibliographic data of this kind that has been released for the public to do with as it will.
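To give a flavour of how the files hang together, here’s the sort of query you could run after loading a few of them into a SQL database of your choice. The join keys below (PaperId, AuthorId) reflect my reading of the documented schema rather than anything I’ve verified, so do check the accompanying documentation before relying on it:

SELECT a.AuthorName, COUNT(DISTINCT p.PaperId) AS paper_count
FROM Papers p
JOIN PaperAuthorAffiliations paa ON paa.PaperId = p.PaperId
JOIN Authors a ON a.AuthorId = paa.AuthorId
GROUP BY a.AuthorName
ORDER BY paper_count DESC
LIMIT 20

That would give a rough list of the most prolific authors in the graph – modulo author-name disambiguation, which is exactly the sort of imperfection they warn about.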

Free dataset: all Reddit comments available for download

As terrifying a thought as it might be, Jason from Pushshift.io has extracted pretty much every Reddit comment from 2007 through to May 2015 that isn’t protected, and made it available for download and analysis.

This is about 1.65 billion comments, in JSON format. It’s pretty big, so you can download it via a torrent, as per the announcement on archive.org.

If you don’t need a local copy, Reddit user fhoffa has loaded most of it into Google BigQuery for anyone to use.

If you have an account over there, then as Tableau now has a native BigQuery connector you can visualise it directly in Tableau – which Mr Hoffa has indeed done and shared with the world at Tableau Public.

Although you get a certain amount of uploading and usage from BigQuery for free, you will most likely need a paid account to integrate it directly into a Tableau (or equivalent) project like this, as you’ll want to create a BigQuery dataset to connect Tableau to.

However, if you only need to run some SQL on the freely available dataset to get some output – which you can then manually download and integrate into whatever you like – your free monthly allowance of BigQuery usage might well be enough.

Here’s the link to the data in BigQuery – at least one of the tables. You’ll see the rest in the interface on the left as per this screenshot:

[Screenshot: the Reddit comment tables listed in the BigQuery interface]

You can then run some BigQuery SQL over it using the web interface – for free, up to a point, and retrieve whichever results you need.

For example:

SELECT * FROM [fh-bigquery:reddit_comments.2007] LIMIT 10

will give you 10 Reddit comments from (surprise surprise) 2007.

[Screenshot: BigQuery SQL query and results]

As you can see on the bottom right, you can save results into a BigQuery table (this requires a dataset for which you need to enable billing on your BigQuery account) or download as CSV / JSON to do whatever you want with.
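Aggregations are where it gets more interesting. For instance, something like the below should list the busiest subreddits of 2007 – I believe the tables include a subreddit column, but check the schema shown in the interface before taking my word for it:

SELECT subreddit, COUNT(*) AS comments
FROM [fh-bigquery:reddit_comments.2007]
GROUP BY subreddit
ORDER BY comments DESC
LIMIT 10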

Free data: Constituency Explorer – UK demographics, politics, behaviour

From some combination of the Office for National Statistics, the House of Commons and Durham library comes Constituency Explorer.

[Screenshot: Constituency Explorer]

Billing itself as “reliable evidence for politicians and journalists – data for everyone”, it allows interactive visualisation of many interesting demographic, behavioural and political attributes by UK political constituency. It’s easy to view distributions and compare a specific constituency with its region and the country on topics like:

  • 2010 election results (including turnout)
  • vehicle ownership
  • age
  • ethnicity
  • travel to work
  • household composition
  • qualifications
  • etc. etc.

Each chart also has a “download this data” link at the bottom left, which I would assume should give you a nice, easily integrated spreadsheet/XML/something – but at the time of writing one unfortunately gets a “not found” error…

There’s also a fun “how well do you know your constituency” quiz which is nice for comparing one’s media-fueled perception of a given area to reality.

Free data: data.gov.uk – thousands of datasets from the UK government

Data.gov.uk is the official portal for releasing what the UK government deems to be open data.

The government is opening up its data for other people to re-use. This is only about non-personal, non-sensitive data – information like the list of schools, crime rates or the performance of your council.

At the time of writing it has nearly 20k published datasets available, of varying quality and in various formats both pleasant and unpleasant (XML, CSV, PDF, HTML etc.), covering the following topics:

  • Environment
  • Mapping
  • Government Spending
  • Society
  • Government
  • Towns & Cities
  • Health
  • Education
  • Transport
  • Business & Economy