Beware! Killer robots swim among us

In a further sign of humanity’s inevitable journey towards dystopia, live trials of an autonomous sea-based killer robot made the news recently. If all goes well, it could be released into the wild within a couple of months.

Here’s a picture. Notice its cute little foldy-out arm at the bottom, which happens to contain the necessary ingredients to provide a lethal injection to its prey.

 

[Image: the COTSbot]

Luckily for us, this is the COTSbot, which, in a backwards version of nominative determinism, has a type of starfish called the “Crown Of Thorns Starfish” (COTS) as its sole target.

[Image: a crown-of-thorns starfish]

The issue with this type of starfish is that they have got a bit out of hand around the Great Barrier Reef. Apparently, at a certain population level they live in happy synergy with the reef. But when the population increases to the size it is today (quite possibly due to human farming techniques), they start causing a lot of damage to the reef.

Hence the Australian Government wants rid of them. It’s a bit fiddly to have divers perform the necessary operation, so some Queensland University of Technology roboticists have developed a killer robot.

The notable feature of the COTSbot is that it may – possibly – be the first robot that autonomously decides whether or not to kill a lifeform.

It drives itself around the reef for up to eight hours per session, using computer vision and a plethora of processing and data science techniques to find the right starfish, wherever they may be hiding, and give them a lethal injection. No human is needed to make the kill / don’t kill decision.

Want to see what it looks like in practice? Check out the heads-up-display:

[Image: the COTSbot’s heads-up display]

If that looks kind of familiar to you, perhaps you’re remembering this?

[Image: the Terminator’s heads-up display]

Although that one is based on technology from the year 2029 and is part of a machine that looks more like this.

[Image: the Terminator]

(Don’t panic, this one probably won’t be around for a good 13 years yet – well, bar the time-travel side of things.)

Back to present day: in fact, for the non-squeamish, you can watch a video of the COTS-destroyer in action below.

How does it work then?

A paper by Dayoub et al., presented at the IEEE/RSJ International Conference on Intelligent Robots and Systems, explains the approach.

Firstly it should be noted that the challenge of recognising these starfish is considerable. The paper informs us that, whilst COTS look like starfish when laid out on flat terrain, they tend to wrap themselves around or hide in coral – so it’s not as simple as looking for nice star shapes. Furthermore, they vary in colour, look different depending on how deep they are, and have thorns that can have the same sort of visual texture as the coral they live in (go evolution). The researchers therefore extract features describing the COTS via various clever techniques detailed in the paper.

Once the features have been extracted, a random forest classifier, which has been trained on thousands of known starfish / no-starfish photos, is used to determine whether what the robot can see through its camera should be exterminated or not.

A random forest classifier is a popular data science classification technique, essentially an ensemble of decision trees.

Decision trees are one of the more human-understandable classification techniques. Simplistically, you can imagine a single tree as providing branches to follow depending on certain variables – rules it learns automatically by having previously processed a stack of inputs that it has been told are either one thing (a starfish) or another thing (not a starfish).

Behind the scenes, an overly simple version of a tree (with slight overtones of doomsday added for dramatic effect) might have a form similar to this:

[Image: a simple decision tree]
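
In code, a single tree of that sort boils down to little more than nested conditionals. Here’s a purely hypothetical Python sketch – the features and thresholds are invented for illustration, and have nothing to do with the real COTSbot model:

def single_tree_decision(colour_score, texture_score, spikiness):
    # A hand-written "tree": each branch tests one feature, just as a learned
    # decision tree would, but with entirely made-up rules and thresholds.
    if spikiness > 0.7:
        if colour_score > 0.5:
            return "starfish: exterminate"
        return "not a starfish: spare it"
    if texture_score > 0.6:
        return "starfish: exterminate"
    return "not a starfish: spare it"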

The random forest classifier takes a new image and runs many different decision trees over it – each tree has been trained independently, and hence is likely to have established different rules and potentially to make different decisions. The “forest” then looks at the decision from each of its trees and, in a fit of machine-learning democracy, takes the most popular decision as the final outcome.
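
For a feel of how that works in practice, here’s a minimal scikit-learn sketch. It is not the researchers’ actual pipeline – the feature vectors and labels here are random stand-ins – but the train-then-vote mechanism is the same idea:

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Stand-in training data: in reality each row would be a feature vector
# extracted from an image patch, and each label 1 (COTS) or 0 (not COTS).
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)

# A forest of 100 independently trained decision trees.
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)

# For a new image patch, each tree votes and the majority wins.
new_patch = np.random.rand(1, 20)
print(forest.predict(new_patch))        # e.g. array([1]) -> a "kill" decision
print(forest.predict_proba(new_patch))  # roughly, the share of tree votes per class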

The researchers claim to have approached 99.9% accuracy in this detection – to the point where it will even refuse to go after 3D-printed COTS, preferring the product that nature provides.

This is probably not the type of killer robot that the Campaign to Stop Killer Robots campaigns against, or whose implications the UN debates; but if it is the first autonomous killer robot, it still conjures up the beginnings of some ethical dilemmas (even beyond the killing of the starfish itself… after all, deliberate eradication or introduction of species to solve other problems has not always gone well, even in the pre-robotic stage of history – but one assumes this has been considered in depth before we got to this point!).

Although 99.9% accuracy is highly impressive, it’s not 100%. It’s very unlikely that any non-trivial classification model can ever truly claim 100% over the vast range of complex scenarios that the real world presents. Data-based classifications, predictions and so on are almost always a compromise between concepts like precision vs recall, sensitivity vs specificity, type 1 vs type 2 errors, accuracy vs power, and whatever other names no doubt exist to refer to the general idea that a decision model may:

  • Identify something that is not a COTS as a COTS (and try to kill it)
  • Identify a real COTS as not being a COTS (and leave it alone to plunder the reef)

Deciding on the acceptable balance between these two types of error is an important part of designing models. Without actually knowing the details, it sounds like the researchers sensibly erred on the side of caution here, such that if the robot isn’t very sure, it will send a photo to a human and await a decision.
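
As an illustration only – the threshold value and the “defer to a human” logic below are my own sketch, using the hypothetical forest from earlier, not the researchers’ actual code – that caution might look something like:

def decide(forest, features, kill_threshold=0.99):
    # Probability (roughly, the tree vote share) that this patch contains a COTS.
    p_cots = forest.predict_proba([features])[0][1]
    if p_cots >= kill_threshold:
        return "inject"        # very confident it is a COTS: act autonomously
    elif p_cots <= 1 - kill_threshold:
        return "ignore"        # very confident it is not a COTS
    else:
        return "ask a human"   # unsure: send the photo off for a human decision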

It’s also the case that the intention is not to have the robot kill every single COTS, which suggests that false negatives might be less damaging than false positives. One should also note that it’s not going to be connected to the internet, making it hard for the average hacker to remotely take it over and go on a tourist-injection mission or similar.

However, given it’s envisaged that one day a fleet of 100 COTSbots, each armed with 200 lethal shots, might crawl the reef for 8 hours per session, it’s very possible that a wrong decision will be made at some point.

Happily, it’s unlikely to accidentally classify a human as a starfish and inject them with poison (plus, although I’m too lazy to look it up, I imagine that a starfish dose of starfish poison is not enough to kill a human) – the risk the researchers see is more that the injection needle may be damaged if the COTSbot tries to inject a bit of coral.

Nonetheless, a precedent may have been set for a fleet of autonomous killer robot drones. If it works out well, perhaps it starts moving the needle slightly towards the world of handily-acronymed “Lethal Autonomous Weapons Systems” that the US Defense Advanced Research Projects Agency is supposedly working on today.

If that fills you with unpleasant stress, there’s no need to worry for the moment. Take a moment of light relief and watch this video of how good the 2015 entrants to the DARPA robotics challenge were at stumbling back from the local student bar – sorry, traversing human terrain.

Characteristics of England’s secondary school teachers

In exploring the data behind England’s teacher supply model, it became apparent that the split of teachers by gender and age shows certain patterns by subject. Click through and use the viz below interactively to answer questions such as:

  • How many secondary school teachers are there in England?
  • What percentage of all teachers are female?
  • Are there certain subjects where female teachers are over-represented, and others where male teachers are over-represented? Have we overcome the historic gender stereotypes?
  • What proportion of teachers are below the age of 25? What subjects do they tend to teach?
  • What age groups are particularly over-represented among female art and design teachers?
  • …and many more.

Use the first tab for a general overview and ranking of subjects on these indices, and the second tab for an easy comparison of your chosen subject vs all others.

[Image: teacher characteristics dashboard]

 

Quick tip: find the location of a column in a SQL database

Now that we all have nice access to large, sprawling relational databases, the next problem is locating the table that contains the data you’re after. For instance, which of the 100 available tables contains my sales?

Of course, the proper way would be to consult the comprehensive, accurate, well-governed and accessible data dictionary to understand the entities involved and the precise location and definition of what you need.

However, those of us cursed to live in the real world do not always (haha) have that luxury.

Fortunately, if you can guess roughly what the field name of the data you want might be, then you can often use the database’s metadata to find where to look.

In Microsoft SQL Server, for instance, running the query below returns a table of results telling you the schema, table and column names where you can find fields that have “sales” in their name.

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%sales%'

Notes:

  • I believe it will only show tables to which you have access. If you don’t find what you want, perhaps you don’t have permissions.
  • It looks only for the pattern you type. If you ask for “sales”, but the field is actually called “revenue”, you’re out of luck.
  • Of course it gives no clues as to the definition of the field and whether it matches what you hoped it did. For instance, is it net or gross sales? How are vouchers integrated? Are records ever obsoleted? And so on.
  • The query is just normal SQL, so you can use any valid “WHERE” clause to filter. The % signs are acting as wildcards, so you could remove them if you only want to see columns called precisely “sales” rather than, say, “product sales” – see the example just after this list.
  • Other metadata about the fields, such as the data type, is available in that INFORMATION_SCHEMA.COLUMNS table. Here’s the official documentation on the entirety of that table.
  • This is a table in the same sense as the explicit data tables – so you could always connect tools like Tableau etc. to it if you want to produce some fun, or perhaps even useful, meta-visualisations.
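
For example, a stricter version of the query might drop the wildcards and filter on a particular schema too (here using “dbo”, SQL Server’s default schema, purely as an illustration):

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = 'sales'
AND TABLE_SCHEMA = 'dbo'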

Given the absence of any more proper clues, this might still be better than manually reviewing thousands of tables if you’re willing to accept the caveats. Good luck.

Note that although the above example is from SQL Server, a lot of other mainstream RDBMSs have similar capabilities. For example, here’s some documentation on equivalents for MySQL and Oracle.
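
MySQL exposes the same INFORMATION_SCHEMA.COLUMNS table, so the query is near-identical; Oracle’s closest equivalent is the ALL_TAB_COLUMNS view, where names are typically stored in upper case. Something along these lines should work, although do check the documentation for your particular version:

-- MySQL
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME LIKE '%sales%';

-- Oracle
SELECT OWNER, TABLE_NAME, COLUMN_NAME
FROM ALL_TAB_COLUMNS
WHERE COLUMN_NAME LIKE '%SALES%';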

 

Comparing partial week-so-far data with prior year values in Tableau

Happy new year to all! And now that’s over, back to work.

The first drama of 2016 was opening up some spiffy new Tableau dashboards someone had made to see how business was doing in the first week of the year, only to find apparently catastrophic results. Sales this week were way below sales of the same week in the previous year – unbelievably so, in fact. Luckily it was so unbelievable that we checked the calculation and realised it was obviously wrong.

The purpose of the table was to compare sales in the first fiscal week of 2016 with those in the first fiscal week of 2015.

  • The first fiscal week of 2016 runs from 3rd January 2016 to 9th January 2016.
  • The first fiscal week of 2015 runs from 4th January 2015 to 10th January 2015.
  • We opened the dashboard on the 4th January 2016.

Can you guess what went wrong?

The dashboard was labelled such that it appeared to be comparing all of the first fiscal week of 2016 to all of the first fiscal week of 2015, even though only 2 days of the 2016 week had elapsed so far. It was therefore actually comparing 2 days’ worth of data this year to 7 days’ worth last year – and unless your business has some serious growth, that’s never going to produce a positive result!

What should have been happening, to generate a more useful result, is that the dashboard should have checked how many days had actually elapsed in the week selected for analysis, and then compared only the same number of days in the corresponding week of the prior year. As only two days of the 2016 week had elapsed, it should have compared them to just the first two days of the 2015 week.

To slightly add to the complexity: interactivity is always a crowd-pleaser, so the user is supposed to be able to select any given week, past or present, to compare data from, via a Tableau parameter.

If they select a week where we do have 7 days’ worth of data, e.g. week 52 of 2015, then it should compare that to the whole 7 days’ worth of data from week 52 of 2014. Thus, the dashboard must adapt on the fly based on what data is available at run time, not design time.

As ever, there’s more than one way to have Tableau do this. Below is a quick method one can use which seems to do the trick, relying on my favourite feature of Tableau 9, the “Level Of Detail” calculations.

This method can be realised via approximately 2 calculated fields, but for the sake of clarity I’ll split it into a few more here.

Firstly, here’s some data (fictional, of course). The column on the right is displaying the sales achieved on the given day.

[Image: example data, showing sales per day]

Note how if I thoughtlessly ask it to show the sum of sales in both years for fiscal week 2, then the 2016 result is worse than the 2015 result.

[Image: misleading sum of sales by year for fiscal week 2]

This is because it’s comparing the 2 available days for fiscal week 2 2016 with the 7 available days for fiscal week 2 2015; very misleading.

My first step to prevent this happening was to have Tableau establish which weekdays are actually available in the fiscal year/week the user selects. In the dashboard UI, the user selects this week via two parameters: “Choose year” and “Choose week”.

1: Find out which weekdays in the week the user selected actually have data available

Create a formula called “Weekday in selected week”, like this:

IF [Fiscal week] = [Choose week] AND [Fiscal year] = [Choose year] THEN
DATEPART('weekday',[Date])
END

DATEPART returns the part of the date (duh) specified by its first parameter – here the “weekday”, which Tableau encodes such that Sunday = 1, Monday = 2 etc. through to Saturday = 7.

This formula therefore returns the weekday number for every record that is in the year and week the user has selected in the parameters, but – importantly – returns nothing if the record corresponds to a date in any other year/week.

2: Determine the most recent weekday that has data in the week the user selected

The next formula aggregates that field to find the maximum day of week, using a basic level of detail expression.

Last weekday in selected week:

{ FIXED : MAX([Weekday in selected week])}

This looks for the maximum “Weekday in selected week” over the whole dataset, no matter which filters might be being applied in the view (*).

Given that the “Weekday in selected week” field only ever contains information for the week the user has selected, this is equivalent to asking “what is the most recent weekday available in the week the user selected?”.

If the complete week is available, all 7 days of it, then it will return “7”, which is equivalent to Saturday. If only two days are present, Sunday and Monday, then it will return “2”, meaning data is only present up to Monday.

Whatever it finds, this value will be replicated down the table for every record, due to the “FIXED” level of detail.

(*) Actually, context or data source filters would affect the results, but we’ll leave that for another time, except to note that many of the top Tableau Jedi tricks I’ve seen recently make use of this fact!

3: Classify each record as to whether it forms part of the selected week, the relevant part of the same fiscal week a year ago, or is not relevant at all

Now we know which week the user selected, and the maximum weekday in that week, we can easily classify every record as to whether it’s in the selected week, or in the relevant part of the same fiscal week a year ago.

Week selection status:

IF [Fiscal year] = [Choose year] AND [Fiscal week] = [Choose week]
THEN 'Selected week'

ELSEIF [Fiscal year] = ([Choose year] -1) AND [Fiscal week] = [Choose week]
AND DATEPART('weekday', [Date]) <= [Last weekday in selected week]
THEN 'Prior year'

ELSE 'Not relevant to analysis'

END

So what?

After this, like-for-like comparisons become easy; you can just use the “week selection status” as a dimension in your analysis, or create measures based on it such as:

Current year sales:

IF [Week selection status]= 'Selected week' THEN [Sales]
ELSE 0
END

Prior year sales:

IF [Week selection status]= 'Prior year' THEN [Sales]
ELSE 0
END
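
If you want a single headline figure, a further calculated field along these lines (my own addition, and the name is just a suggestion) gives the like-for-like percentage change – set its default number format to percentage:

Year on year sales change %:

(SUM([Current year sales]) - SUM([Prior year sales])) / SUM([Prior year sales])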

Making a quick dashboard using those measures allows you to see that the sales results are now being compared correctly.

[Image: the misleading comparison (top) vs the corrected comparison (bottom)]

See how, in the bottom table above, the current year sales are indeed higher than the prior year sales for the same week once we take into account the fact that only 2 days of sales are available so far this year – in contrast to the top (misleading) table, which was comparing the 2 days of sales this year to the 7 days of sales in the same fiscal week last year.

How does it work?

Level of detail expressions are a relatively new Tableau feature and may be confusing if you are not used to tools that allow using data “outside of the current row”. If confused, I often find it helpful to make Tableau produce a text table at the most granular level and have it show the results of these sorts of expressions on every row.

Here’s what it would look like to do that for the above set of formulae, given the example data, when the user has selected to see fiscal week 2 in year 2016.

[Image: granular text table showing each calculated field per row]

Note how:

  • Fiscal year, week, day of date are all in the original data supplied.
  • “Weekday in selected week” is null for every row except those that are in the year/week that the user specified in the parameters.
  • “Last weekday in selected week” is the same for every row, and shows the highest number from the previous column “weekday in selected week”.
  • “Week selection status” shows “not relevant” except where either a) the record has a week / year combination that matches the values the user specified in the parameter, or b) it is of the same fiscal week in the prior year and has a weekday that is no greater than the maximum weekday available in the current year’s version.

You can download the example workbook, including all of the above, here if you wish.

Sidenote: it looks like I wasn’t the only person risking a shock via the incomplete-weeks issue! Vizpainter recently published a nice post on a similar topic, concentrating more on good ways to visualise partial time series data than on calculating comparisons. Recommended.