Basic text tokenisation with Alteryx

Free text analytics seems a fashionable pastime at present. The most commonly seen form in the wild might be the very basic text visualisation known as the “word cloud”. Here, for instance, is the New York Times’ “most searched for terms” represented in such a cloud.

When confronted with a body of human-written text, one of the first steps for many text-related analytical techniques is tokenisation. This is the process of decomposing the lengthy scrawlings received from a human into units suitable for data mining; for example, sentences, paragraphs or words.

Here we will consider splitting text up into individual words with a view to later processing.

My (trivial) sample data is:

ID | Free text 1           | Free text 2  | Free text 3
1  | Here is some          | Free text!   | But it needs seperating
2  | Into individual words | Can Alteryx  | Do it?
3  | Of course             | ALTERYX can ,| no problem

Let’s imagine I want to know how many times each word occurs. For instance, “Alteryx” appears in there twice.

Just to complicate things marginally, I’d also like to:

  • remove all the “common” words that tend to dominate such analysis whilst adding little insight – “I”, “You” etc.
  • ignore capitalisation: the word ALTERYX should be considered the same as Alteryx for example.

Alteryx has the tools to do a basic version of this in just a few seconds. Here’s a summary of one possible approach.

First I used an Input tool to connect to my datasource which contained the free text for parsing.

Then I used a Formula tool to pull all the free text fields into one single field, separated by a delimiter of my choice. The formula is simply a concatenation:
[Free text 1] + "," + [Free text 2] + "," + [Free text 3]
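For illustration, the same concatenation step can be sketched in plain Python (field names taken from the sample data above):

```python
# One record from the sample data above
record = {
    "Free text 1": "Here is some",
    "Free text 2": "Free text!",
    "Free text 3": "But it needs seperating",
}

# Equivalent of the Alteryx formula: join the fields with a comma
combined = ",".join(
    record[f] for f in ("Free text 1", "Free text 2", "Free text 3")
)
# → "Here is some,Free text!,But it needs seperating"
```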

The key manipulation was then done using the “Text To Columns” tool. Two configuration options were particularly helpful here.

  • Despite its name, it has a configuration option to split your text to rows rather than columns. This is great for this type of task because each field might contain a different, unknown number of words – and for most analytic techniques it is easier to handle a table of unknown length than one of unknown width.

    You will still be able to track which record each row originally came from as Alteryx preserves the fields that you do not split on each record, similar to how the “Transpose” tool works.

  • You can enter several delimiters – and it will consider each of them independently. The list I used was ', !;:.?"', meaning I wanted a new word to start or end whenever I saw a comma, space, full stop, question mark and so on. You can add as many as you like, according to how your data is formatted. Note also the advanced options at the bottom if you want to ignore delimiters in certain circumstances.

When you run the tool, several delimiters next to each other will (correctly) generate rows with blank “words”. These are easily removed with a Filter tool, set to exclude any record where the text IsEmpty().
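Outside Alteryx, this split-then-filter step can be sketched in Python with a regular expression; the delimiter set below mirrors the one used in the Text To Columns configuration:

```python
import re

text = "Of course ALTERYX can , no problem"

# Split on any single delimiter character: comma, space, !;:.?"
tokens = re.split(r'[, !;:.?"]', text)

# Adjacent delimiters produce empty "words" - drop them, as the
# Filter tool's IsEmpty() test does in the workflow
words = [t for t in tokens if t]
# → ['Of', 'course', 'ALTERYX', 'can', 'no', 'problem']
```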

Next I wanted to remove the most common basic words, such that I don’t end up with a frequency table filled with “and” or “I” for instance. These are often called stopwords. But how to choose them?

In reality, stopwords are best defined by language and domain. Google will find you plenty based on language, but you may need to edit them to suit your specific project. Here, I simply leveraged the work of the clever developers of the famous Python Natural Language Toolkit (NLTK).

This toolkit contains a corpus of default stopwords in several languages. The English ones can be extracted via:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # the corpus must be downloaded once first
stopwords.words('english')

which results in a list of 127 such words – I, me, my, myself etc. You can see the full current list on my spreadsheet: NLTK stopwords.

I loaded these into an Alteryx Text Input tool, and used a Join tool to connect the words generated from my text (on the left side) to the words in this stopword corpus (on the right side), taking the left-hand output of the Join tool.

The overall effect of this is what relational database fans might recognise as a LEFT JOIN filtered to the non-matching rows – an anti-join: an output that gives me all of the records from my processed text that do not match anything in the stopword corpus.
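A minimal Python sketch of that anti-join behaviour, using a made-up handful of tokens and a tiny stand-in for the stopword corpus:

```python
# Hypothetical tokens from the earlier steps, already lowercased
words = ["alteryx", "can", "do", "it", "no", "problem"]

# A tiny stand-in for the NLTK stopword corpus
stopwords = {"can", "do", "it", "no"}

# The "left unjoined" output: records from the left input with
# no match on the right - an anti-join in relational terms
kept = [w for w in words if w not in stopwords]
# → ['alteryx', 'problem']
```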

(Alteryx has a nice “translate from SQL to Alteryx” page in the knowledgebase for people looking to recreate something they did in SQL in this new-fangled Alteryx tool).

The output of that Join tool is therefore the end result: one record for each word in your original text, tokenised, that is not in the list of common words. This is a useful format for carrying on to analyse in Alteryx or most other relevant tools.

If you wanted to do a quick frequency count within Alteryx for instance, you can do this in seconds by dropping in a Summarise tool that counts each record, grouping by the word itself.
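As a sketch of that Summarise step, Python’s collections.Counter does the same grouping and counting (the token list here is invented for illustration):

```python
from collections import Counter

# Tokens surviving the stopword filter (invented, already lowercased)
words = ["alteryx", "free", "text", "words", "alteryx", "problem"]

# Equivalent of a Summarise tool counting records grouped by word
counts = Counter(words)
print(counts.most_common(2))  # [('alteryx', 2), ('free', 1)]
```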

You can download the full workflow shown above here.

From restaurant-snobbery to racism: some perils of data-driven decision-making

Wired recently wrote a piece explaining how OpenTable, a leading “reserve a restaurant over the internet” service, was starting to permit customers to pay for their meal via an app at their leisure, rather than flag down a waiter and awkwardly fiddle around with credit cards.

There’s an obvious convenience to this for the restaurant patron, but, as with most useful “free” services, the cost is one’s personal data. Right now, it is possible, at least unofficially, for the data-interested to access some OpenTable restaurant data – but soon it may become a lot more personal.

Wired writes:

Among information the company purports to collect: name and contact information, current and prior reservation details, order history, dining preferences, demographics and precise location data. The company pairs such user data with information from “Third Party Platforms.” The wording here is purposefully vague, but it is certainly plausible that the company could use outside research firms to cobble together a whole host of personal information like income, age and spending habits.

For users who make payments via the mobile app, OpenTable reserves the right to share its customer dossier with the restaurant “for its own purposes.”

In a utopian world, this could be great. It might be nice to turn up at a restaurant where they already know who you are, which your favourite table is and what drinks you might want on ice, and to be given a personalised menu highlighting your favourites and removing any foods to which you are allergic.

This sort of technology-experience is already in use in a few places. For instance, in the magical land of Disneyworld, the Disney MagicBand allows you to pre-order your food and have it ready by the time you turn up.

From another Wired article:

If you’re wearing your Disney MagicBand and you’ve made a reservation, a host will greet you at the drawbridge and already know your name—Welcome Mr. Tanner! She’ll be followed by another smiling person—Sit anywhere you like! Neither will mention that, by some mysterious power, your food will find you.

The hostess, on her modified iPhone, received a signal when the family was just a few paces away. Tanner family inbound! The kitchen also queued up: Two French onion soups, two roast beef sandwiches! When they sat down, a radio receiver in the table picked up the signals from their MagicBands and triangulated their location using another receiver in the ceiling. The server—as in waitperson, not computer array—knew what they ordered before they even approached the restaurant and knew where they were sitting.

But the first Wired article highlights also a less delightful side to the experience.

Restaurants have always doled out preferential treatment to the “best” customers—but now, they’ll be able to brand them with a specific dollar sign the second they walk in the door.

A concern is whether knowledge of a customer’s normal habits will be used to score them on some metric that leads to differential treatment. Maybe if they know upon approach that you buy cheap dishes, aren’t likely to have excessive social influence or – most horrific of all – tip badly, you will get a worse level of service than if they scored you in the opposite way.

Some might argue you can see the consequences of this sort of “pre-knowledge” in a very basic way already. Discount voucher sites like Groupon offer cheap deals on meals that often require you to tell the waiter in advance that you will be using such a voucher.

There are very many anecdotal reports (mild example) that this can lead to worse service (and very many other reports that this might well be because voucher users aren’t understanding that the server probably should get a tip based on the non-voucher price!).

The Brainzooming Group summarises some more formal research:

Additional research revealed a direct link between the use of Groupon and a negative service experience. The above graph is from a study conducted by Cornell researchers who studied over 16,000 Groupon Deals in 20 US cities between January and July this year. The study found, among other things, that Groupon users averaged a 10% lower rating than those who didn’t use Groupon.

However, it’s clearly not the case that “restaurant prejudice” would be a new thing – it existed well before OpenTable thought about opening up its data treasure trove. It happened even in ye olde times before Groupon. In fact the author of the original Wired article quoted at the top was themselves an assistant to a “VIP” of the sort that never had trouble getting even the toughest of tables with his secret phone numbers and possibly a bit of name infamy.

My first boss, an infamous man-about-town, kept a three-ring binder of unpublished phone numbers unlocking some of the toughest tables in the city.

The issue is – as with a lot of big data concerns – not that it introduces a brand new issue but that it allows it at a massively enhanced scale.

Now instead of a judgement that is basically a 2-way 0.1% “top VIP” vs 99.9% “normal person” categorisation, we could have “5* tipper who earns £8k a month and just paid off her mortgage” vs “unlucky downtrodden victim of capitalism who normally only spends a bare minimum and drinks tap water” scored for each and every one of the OpenTable users individually.

But at the end of the day, how much should we care? A suboptimal restaurant experience is hardly the worst thing life can bring. And some would say that, if the data is reliable and the models are accurate (hmmm…), such businesses have every right to treat their customers in line with their behaviour, so long as they remain within consumer law.

Of course, this type of data-driven behaviour is not limited to restaurant-going – and there are risks of a far less recreational nature to consider.

Insurance is one such example. Insurance actuaries, analysts and the like have, for more years than most of us, used data to determine the services and prices they will offer to people. The basic theory is: if there is more chance that this person will make an expensive insurance claim than average then they will be charged more (or refused) for the insurance product.
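That basic theory can be sketched as a simple expected-value calculation (all numbers here are invented for illustration – real actuarial pricing is far more involved):

```python
# Illustrative only: probabilities and claim costs are invented
def pure_premium(p_claim, avg_claim_cost):
    """Expected claim cost per policy, before expense/profit loading."""
    return p_claim * avg_claim_cost

# The group judged riskier is charged more for the same cover
low_risk = pure_premium(0.05, 3000)   # 150.0
high_risk = pure_premium(0.10, 3000)  # 300.0
```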

There is obvious business sense in this, but the need to rely on segmentation, generalised patterns and so on means that someone with little relevant history who happens to fall into a “bad group” may be unfairly penalised.

The classic example bandied around here in the UK (where we have no mass-market need yet of health insurance) is car insurance.

Gender was traditionally an influencing factor in insurance premiums. Male drivers would be charged more than female drivers on the premise that they on average take more risks, apparently get into more accidents, are less likely to wear seat belts, are more likely to get caught driving drunk and so on.

In a population this therefore might seem fair. At the individual level, less so in some cases – some men are safer drivers than some women, but of course the insurer has no way at first to determine which ones, so charges all men more. Well, they did until it became illegal.

As per the BBC in 2012:

the European Court of Justice has decided that insurers will no longer be allowed to take the gender of their customers into account when setting their insurance premiums.

There are many arguments as to the fairness of this ruling which we won’t go into here, other than to note that it did not please insurance companies overly – so some seem to be using a different sort of data to semi-bypass that ruling.

According to Stephen McDonald from the Newcastle University Business School, it looks like insurance companies are now using a proxy for gender: occupation.

Some jobs may legitimately have higher risks regarding driving than others. Also, even in 2015, it still happens that some jobs have a higher proportion of males or females in than others. Can you see where we’re going with this?

Some examples from Stephen’s study:

The occupations represent different mixes of males to females, with social workers and dental nurses being mostly female, civil engineers and plumbers mostly male, and solicitors and sports and leisure assistants being gender neutral.

And how is this relevant to insurance premiums? Well:

Comparing prices from before and after the change of law, he finds that after the ban:

  • Prices are the same for men and women.
  • But prices become relatively lower for young (21 year old) drivers in occupations mostly held by women, but higher for those in predominantly male occupations. For example, premiums increased by an estimated 13% for 21 year old civil engineers, but decreased by 10% for dental nurses of the same age.

To summarise: (especially young) people who tell insurance companies that they have a job which is commonly done by males are charged more than those who inform the same company that they have a job commonly done by females.

So, for anyone who really has no idea what career they’d like to follow – if you like driving and like money, perhaps dental nurse is the way to go!

Before we leave driving, let’s have a look at those dangerous motorist obstacles: potholes. A few years ago, a (somewhat dubiously sourced) survey estimated that hitting potholes caused nearly £500 million worth of damage to the UK’s car population in a year. The associated swerving and other consequential manoeuvres can be even a matter of life and death in some cases.

So they surely need fixing – but how can we locate where they are, so the requisite council can move in and fix?

In reality, few motorists are likely bothered enough to take the time to call or visit the authorities, even if they know which the relevant one is. This being 2015, the fashionable answer in many cases is of course an app. “Fill that hole” is one example (now backed by the Government) where you can quickly and easily report any pothole dangers.

Street Bump goes one step further, and tries to use a combination of your smartphone’s accelerometer and GPS sensors to automatically detect when you hit a pothole and then report it directly, avoiding the need for a user to manually log it.

But how you really want to be doing this doesn’t involve faffing around with a smartphone at all. Why not just get one of Jaguar’s latest developments – a Land Rover that not only detects and reports potholes automatically whilst driving but also could potentially alert the driver of the vehicle in advance to give them time to avoid them, tell other nearby cars about their existence and even – as driving becomes ever more automated – take over the controls and safely navigate around them automatically.

So let’s all get brand new Jaguars and pothole problems will be forever solved! Except we won’t, will we? And why not? Well, the #1 reason may be that cutting-edge car tech from Jaguar is not going to be cheap. It will probably be beyond the price range of most of us (for now). Just as smartphones are out of the price, knowledge or “interested in learning about” range of a still-significant proportion of the population. Which leads to a further “data risk”, especially when we are talking about important public services.

As Tim Harford notes in the Financial Times:

Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. Street Bump offers us “N = All” in the sense that every bump from every enabled phone can be recorded. That is not the same thing as recording every pothole.

There’s an obvious loop here. Areas that are more economically deprived, or where there is less digital literacy, are less likely to have people busily reporting potholes on their fancy smartphones, let alone their cutting edge Jags. But there’s little evidence that they have fewer potholes – just fewer reports.

To steal a famous saying, here’s a case where “absence of evidence is not evidence of absence”.

If, however, we wrongly assume that areas with the most digital pothole reports are those with the most potholes and prioritise repairs accordingly, then one can imagine a situation where the most deprived areas get the least maintenance and become ever more troubled and undesirable, as their reputation as “bad areas” spirals further out. Those advocates of the broken windows theory might believe it could even lead to higher crime.

Potholes might seem like a relatively trivial example (if you’ve never hit one), but in an era where even the most vulnerable are being forced by the Government to magic up a computer and learn how to use the internet to access the welfare they desperately need to live – despite some surveys suggesting this will range between hard and impossible for many of those who most need them – it’s not hard to imagine more serious reinforcements of inequality along these lines.

Or even matters of life and death.

Three years ago, Hurricane Sandy struck New York. It was big and bad enough that many people dialling 911 to request urgent help could not get through. Luckily though, the NY Fire Department had a Twitter account and a member of staff to monitor it for urgent requests. Although, in this case, they explicitly requested people not tweet for emergency help, this fine employee did only what anyone with some humanity would do – and passed on these urgent requests to the front-line rescuers.

“My friends’ parents who are trapped on #StatenIsland are at 238 Weed Ave. Water almost up to 2nd flr.,” posted Michael Luo, an investigative reporter for The New York Times.

“I have contacted dispatch,” Rahimi responded within minutes. “They will try to send help as soon as they can.”

Fantastic stuff, social media activity potentially actually saving lives – but for those people unfortunate enough to not have a smartphone suitably fancy enough to tweet, or perhaps lacking the knowledge or desire needed to do so – it potentially introduces an innate bias in resource deployment that favours some people over others for no legitimate reason.

Moving on, but staying on the subject of reinforcing inequalities: many predictive data models derive their answers by learning from, and adapting to, what happened in the past. It’s the obvious way to test such systems if nothing else: if they can predict what did happen properly, they may stand a good chance of predicting what will happen.

This means, however, that data-driven efforts are not immune from reinforcing historical prejudice either.

Let’s take a couple of serious biggies: race and gender discrimination.

Surely big, open, accessible data is exactly the tool needed to combat the horrible biases of some in society? Not necessarily; it depends entirely upon application.

As long ago as 1988, the Commission for Racial Equality (now merged into the Equality and Human Rights Commission) found St George’s Hospital Medical School guilty of racial and gender discrimination in choosing whom to admit to its school. Some rogue racist recruiter? No, not directly anyway.

‘…a computer program used in the initial screening of applicants for places at the school unfairly discriminated against women and people with non-European sounding names’.

Why so?

‘The program was written after careful analysis of the way in which staff were making these choices and was modified until…it was giving a 90-95% correlation with the gradings of the [human] selection panel’

The data and ruleset that led to this bias wasn’t as explicit as one might imagine at first sight. The computer was never given the race of the applicants – it wasn’t even recorded on the application form. However, it seemingly “learned” to effectively proxy it based on surname and place of birth. The overall effect was to reduce the chances of those applicants who seemed female or foreign to be included on the “invite to interview” list.

‘Women and those from racial minorities had a reduced chance of being interviewed independent of academic considerations’

This wasn’t a programmer or data scientist (or whatever they were called in 1988!) secretly typing in a prejudiced line of code, and nor did the computer suddenly decide it didn’t like foreigners or women. The computer was not racist; but, by the nature of the algorithms it ran, it reflected a process that probably was.

‘This point is important: the program was not introducing new bias but merely reflecting that already in the system’

Fast forward more than a quarter of a century, and machine learning algorithms can now run over outrageously large amounts of data, testing and learning which of thousands of variables can automate accurate decision-making that in the past took a slow, expensive human to do.

As shown above, it’s not as simple as removing “race” or other likely prejudices from the variable set.

Two years ago, Kosinski et al. published a paper in the Proceedings of the National Academy of Sciences that simply looked at what people had pressed ‘Like’ on in Facebook. This is information that is often publicly available, attributable to an individual, and accessible for data-related uses.

Using nearly 60,000 volunteers, they built a regression model which produced some very personal insights.

‘The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait “Openness”, prediction accuracy is close to the test–retest accuracy of a standard personality test’

Yes, from you pressing ‘Like’ a few times on Facebook, this model purports to be able to determine with reasonable accuracy your physical and mental traits.

The researchers were kind enough to publish a supplement that showed some of the more predictive “likes” – including such unintuitive gems as an association between high IQ and liking curly fries.

But are businesses, employers and other authorities really going to use Facebook data for important decisions? Undoubtedly. They already do.

Loan companies are an obvious example. A quick Google will reveal many organisations purporting to credit score, or have a model for credit scoring, that depends at least partially on your digital data stream.

To pick one at random, Venturebeat reports on a company called Kreditech.

‘By analyzing a pool of publicly available online information, Kreditech can predict how creditworthy you are in a matter of seconds.

The big data technology doesn’t require any external credit bureau data. Instead, it relies on Facebook, e-commerce shopping behavior, mobile phone usage, and location data.’

And whilst I have no idea about the workings of their model, even if race is not a specific variable in the decision, should it choose to segment based on, for instance, people who have liked “Bonfires” on Facebook then, in accordance with the Facebook Likes study above, it will de facto be adjusting for race (liking bonfires apparently being more predictive of being white than black).

Why pick the credit score example? Because some big finance companies have bad form on this sort of prejudice – the US DOJ for instance reached a $335 million settlement a few years ago because:

‘Countrywide discriminated by charging more than 200,000 African-American and Hispanic borrowers higher fees and interest rates than non-Hispanic white borrowers in both its retail and wholesale lending. The complaint alleges that these borrowers were charged higher fees and interest rates because of their race or national origin, and not because of the borrowers’ creditworthiness or other objective criteria related to borrower risk.’

Whatever the motivating factor behind those original prejudiced decisions was, if this dataset of successful lending to applicants in the past is used to teach a machine how to credit-score automatically then one can see the same risk of unjust outcomes being created, just like St George’s inadvertently reproduced racism and sexism in its data-driven recruitment filter.

Even for those cold, hard capitalists with no interest in social conscience – by not taking the time to consider what your model is actually doing to generate its scoring you could jeopardise your own profits.

As noted above, some of the potential problems come from the classic ‘use the past to predict the future’ methods that underpin a lot of predictive work. Those familiar with Clayton Christensen’s much heralded book regarding the “Innovator’s Dilemma” will immediately see a problem.

From Wikipedia:

‘Christensen’s book suggests that successful companies can put too much emphasis on customers’ current needs, and fail to adopt new technology or business models that will meet their customers’ unstated or future needs.’

Limiting your marketing efforts to approach people as customers just because they are the same sort of people that previously were customers may artificially restrict your business to a homogeneous, perhaps ever-declining, population of people. That is probably not really your only opportunity for radical growth.

So what’s to be done about this?

Unfortunately, when presented with “computer says no” type consumer scoring, most people are not necessarily going to understand why the computer said no. Sometimes it is literally impossible to determine that through any realistic method. Most of the responsibility therefore has to lie with those developing such models.

I recently had the pleasure of attending a fascinating talk by Hilary Mason, the Founder of Fast Forward Labs and previously a data scientist with bit.ly.

In it, and also in the book ‘Data Driven’ she co-authored, she went through a set of questions that she likes to be asked about any data science project. One was ‘What’s the most evil thing that can be done with this?’

Partly this was to encourage more open thinking – and they do advise not to ask it if you actually work with people who are evil! – but she also noted that:

‘One of the challenges with data is the power that it can unleash for both good and bad, and data scientists may find themselves making decisions that have ethical consequences. It is essential to recognize that just because you can, doesn’t mean you should.’

Openness is an important, if tricky, issue. Where possible, good practice should mandate that predictive model builders provide clear, understandable documentation as to how exactly their products work.

There are at least two tricky factors to take into account here though:

  1. If this is a commercial model, companies are not going to want to reveal the details of their secret sauce. Experian, for instance, may publish “some factors that may affect credit scores“, but they are never going to publish the full model workings for obvious reasons. However, this does not mean that Experian’s data scientists should not be producing at least clear and explicit documentation for their fellow employees, under NDAs if necessary.
  2. Some types of model are simply more impenetrable than others. A decision tree is quite easy to represent in a way that non-data-scientists can understand; a neural network is very much harder, though it may sometimes produce a far more accurate model. Either way, it’s not hard to document what went into the model, even if you can’t fully explain what came out!

Although this is somewhat idealistic, it would be nice if users of such models were also enabled, and proactively made an attempt, to understand them as far as possible.

The report on St. George’s prejudiced recruitment model above made this clear:

‘A major criticism of the staff at St. George’s was that many had no idea of the contents of the program and those who did failed to report the bias.’

In reality, one can’t determine from the article whether this is a fair criticism of the users, or actually something that should be aimed elsewhere. But it would not be impossible for any given organisation to internally promote understanding of such systems.

The report also implies one possible monitoring solution. It’s good practice to regularly monitor the output of your model for scoring accuracy – but one can also monitor the consequences of the decisions it’s making, irrespective of accuracy. This is especially relevant if you are, for legal or ethical reasons, particularly concerned about certain biases, even over pure model accuracy.

Worried that your model is outputting racist decisions? Well, if you can, why not correlate its results with “race” and look for patterns? Even before designing it you could remove any variables that you have determined to have a certain degree of cross-correlation with race, if you wanted to be really careful. But know that – depending on the task at hand – this might jeopardise model accuracy and hence be a hard sell in many environments.
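A minimal sketch of what such a check might look like in plain Python – the scores and group labels here are entirely invented:

```python
# Toy bias check: compare a model's average score across two groups.
# Scores and group labels are invented purely for illustration.
scores = [0.9, 0.8, 0.85, 0.4, 0.5, 0.45]
group = ["A", "A", "A", "B", "B", "B"]

def mean_score(g):
    """Average model score for members of group g."""
    vals = [s for s, grp in zip(scores, group) if grp == g]
    return sum(vals) / len(vals)

gap = mean_score("A") - mean_score("B")
# A large gap doesn't prove discrimination, but it is a flag
# worth investigating before the model is deployed
```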

Few would argue that the main point of many of these predictive models is anything other than optimising for accuracy. But be aware of exactly what “accuracy” means. History is full of prejudice, and models can certainly reproduce it.

If you’re a luxury brand trying to target the customers with the most spare money then an obvious variable might be “pay received”. But don’t forget, according to the Fawcett Society:

In modern Britain, despite the 1970 Equal Pay Act, women still earn less than men. The gender pay gap remains the clearest and most dramatic example of economic inequality for women today.

So optimising for pay may have some correlation with favouring men over women. Does this matter to you? If, in your context, it does then tread with care!

If you’re a police department, trying to target people likely to commit the most serious of crimes, then “length of prison sentence” might be a sensible input. But beware – the Wall Street Journal reports:

Prison sentences of black men were nearly 20% longer than those of white men for similar crimes in recent years, an analysis by the U.S. Sentencing Commission found.

So optimising for length of prison sentence may be unfairly biasing your “crime score” to select black people over white people.

So, try to evade potential “judge bias” by just using the fact of arrest? Sorry, that probably won’t work much better. Here’s one example why, from Jonathan Rothwell of the Brookings Institution:

Blacks remain far more likely than whites to be arrested for selling drugs (3.6 times more likely) or possessing drugs (2.5 times more likely).

Here’s the real shock: whites are actually more likely than blacks to sell drugs and about as likely to consume them.

So optimising for historical arrest statistics may also entail unfairly selecting black people over white people, because – before big data came near to police departments – some already were unfairly selecting black people over white people for other reasons. Does this matter to you? If so, once more, tread with care!

There’s probably no simple, generic solution that avoids these issues. We are already at the point where most human users don’t really understand, in any depth, what produces the output of most data models.

The point is simply this: never assume that, just because a model uses a mathematically rigorous algorithm, an unbiased methodology and data that is considered a fair view of reality, its outcome will respect “equal rights”. There is, in some cases, an implicit conflict between optimising for model accuracy – which may involve internally recreating parts of the world that we don’t especially like – and using data for the common good.

Exporting CSV data from SQL Server Management Studio

SQL Server Management Studio is a commonly-used bit of the Microsoft SQL Server install, and a decent enough tool for browsing, querying and managing the data.

Sometimes though you might have the urge to extract a big chunk of data – most often I do this to generate a big text-file dump for import into other data analysis tools. Even when said tools can connect to and query the SQL server directly themselves, I often find it useful to do this to ensure my extract is guaranteed to remain static, portable, fast, local and not subject to the whims of far-away database admins – especially for larger extracts.

There are at least 3 methods to do this. They can all produce the same results but some have more options than others. For simple CSV exports though, take your pick. They work nicely on biggish extracts too – they regularly serve me up datafiles with hundreds of millions of rows and a bunch of columns with minimal stress.

Option 1: Right clicking the data output grid

If you have already run your query such that the results are currently in the output grid, you can simply right click anywhere on the resulting data and choose “Save results as”.

Save results as

In the following box you get an option to save what was shown in the grid as a CSV or a tab delimited file.

Quick and easy, but a couple of downsides:

  • no option here to customise the delimiter or use a fixed-width text file format. This is fine with nice regular numeric data, but beware of any text fields that might contain the dreaded extra commas or equivalent within their contents.
  • you need to have already got the results of the query output to the grid. This is fine for small result sets but I don’t especially like to make this happen if it’s going to generate millions of rows (and if it’s really small, then copy and paste to Excel or wherever is another simple option!).

Option 2: Output the query to a file instead of the grid

If you have no need to see the results of the query in the interactive grid, why waste your time? Instead output the results of the SQL directly to a file.

The first thing to do is to go to the Query -> Query Options menu. Find the Results -> Text part of the options and check that you are happy with the settings there – the delimiter, whether to include column headers, the maximum number of characters and so on.

Query options

Then, OK on that, and go back to the main SSMS screen.

Then press the toolbar button “Results to file” (or Ctrl + Shift + F).

Results to file

Now when you execute your query it will prompt you for a filename. Enter the name and location you want for your text file and it will happily churn away until it’s finished generating it.

Of course you can open a new query tab and carry on working if you want to get stuff done whilst waiting for it to complete.

Warning: SQL server is apt to add an annoying “count of records” line at the bottom of extracts performed in this way. This creates a mostly-blank and invalid record as far as any software importing the output is concerned. To stop this happening, make the first line of your SQL query:
SET NOCOUNT ON
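As a sketch, with a made-up table name standing in for your own, the start of such a query might look like the below – the first line suppresses the trailing “(n rows affected)” message so it never reaches the output file:

SET NOCOUNT ON
SELECT *
FROM name_of_your_table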

Option 3: Use the “data export” task 

For the maximum levels of configuration, you can use the SQL Import and Export Wizard to get data out into CSV (or a lot of other things).

For this, in the Object Explorer window, right click on the name of the database concerned. It must be the database name; right clicking the table, columns, keys or elsewhere will not give you the correct options.

From this choose Tasks -> Export Data

Export data

It will then open up an Import/Export wizard.

Fill in the name of your SQL server and the database concerned if it isn’t already selected.

Database details

Choose the destination for your data. There are many options here, including copying it to another SQL server database if you have one dedicated to analysis or similar. For a CSV style output pick “Flat File Destination”.

Browse to tell it the filename you want it to output. Then make sure the format, column names options and so on are set as you wish. Don’t worry that you can’t change the delimiter itself on this screen, that comes later.

Destination

The next screen gives you the chance to copy data from a set of tables or views. But if you have already written some custom SQL then pick the “Write a query to specify the data to transfer” option.

You’ll get a blank box into which you can type or paste your SQL query, exactly as you’d run it in the main SSMS environment.

Once that’s done, you’ll get to a screen where you can pick both a row and column delimiter, assuming you plumped for the delimited output type. Leave “source query” as “Query”.

Destination file configuration

Here you have buttons to quickly preview your data. If you want to perform some basic manipulations, then hit the “Edit Mappings” button first.

Column mappings

Here you can set options as to whether you want to create a new CSV, or delete/append to an existing one if you happened to pick a filename that already exists.

More interestingly though, you can see all the output fields and their characteristics. Click into the grid to change anything that you don’t like – or if it got the data type wrong (which it does sometimes). If you don’t need a column at all in your output, drop down the “Destination” field for that column and choose “<ignore>”.

Once you’re done there, you get to the save/run screen. Here you can either choose to run it immediately, or save the job as an SSIS package if it’s something you need to store and re-run in future.

Run or save option

Hitting “Next” will give you a nice summary of the job you set up. Finally, “Finish” will start the job going, assuming that’s the option you asked for. It will list the stages it needs to go through and you can follow whereabouts it is. Note of course that if it is a large and complicated query, this stage may take a long, long time.

Hopefully you’ll end up with something like the below, and the file you wanted has been created.

Successful export

If things aren’t good though, you may get some sort of error message. Google remains everyone’s best friend should one pop up that isn’t self-explanatory, but my experience to date has been that the most common cause is an incorrect field type having been selected (often automatically…) in the “Edit Mappings” stage above. Go back and correct it if so.

Which option is best?

Surprise, surprise – it depends on your exact task.

I find option 2 is the one I use the most, as a nice compromise between speed (minimal configuration, mapping etc.) and customisation (you can pick your delimiter etc.).

But if you need any re-mapping or the ability to save the job for the future, you’ll need option 3.

Option 1 is super easy if you just want to export data that is already on your screen into a classic CSV format.

Bonus pro-tip…make sure your delimited field text doesn’t contain delimiters within its legitimate content!

Most of us have probably had the joy of receiving data in supposed CSV format that started off looking something like this:

ID Comment
1 Your site is nice
2 Your site is horrible, and I don’t like it

and was later translated into “CSV” as something like:

ID,Comment
1,Your site is nice
2,Your site is horrible, and I don't like it

where the third row above has 2 commas in it – which many programs will interpret as meaning it has 3 columns when in fact it has 2.
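For reference, a correctly escaped version of that file would wrap the offending field in double quotes – per the common RFC 4180 CSV convention – so that the embedded comma is treated as content rather than as a delimiter:

ID,Comment
1,Your site is nice
2,"Your site is horrible, and I don't like it"

Sadly, not every program that writes or reads “CSV” respects this convention.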

People have a surprising propensity to fill their text fields with commas, or even pipes and dollars, when you least expect it – and none of the above export methods, by default (in my experience), reliably escape them to the satisfaction of other analytical tools.

If you are not bothered about keeping those extra characters, then you can just process them out within the query SQL, in which case you’re guaranteed to be safe from this CSV curse.

This for example will replace any commas with spaces.
SELECT REPLACE([name_of_your_field],',',' ') AS name_of_your_field
FROM ....

Of course this solution is no good if you need to keep the delimiter characters as part of your analysis – but if you can live with removing them it works a treat.
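If there are several characters that could cause trouble – commas and pipes, say – the REPLACE calls can simply be nested, one per character. The field name here is, as above, just a placeholder for your own:

SELECT REPLACE(REPLACE([name_of_your_field],',',' '),'|',' ') AS name_of_your_field
FROM ....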

Stephen Few’s new book “Signal” is out

Stephen Few’s latest, “Signal: Understanding What Matters in a World of Noise”, has just been released – or at least it has in the US; it seems to be stuck on pre-order on Amazon UK at present.

Not many reviews seem to be floating around just yet, but the topic is ultra-fascinating:

In this age of so-called Big Data, organizations are scrambling to implement new software and hardware to increase the amount of data they collect and store.

However, in doing so they are unwittingly making it harder to find the needles of useful information in the rapidly growing mounds of hay.

If you don’t know how to differentiate signals from noise, adding more noise only makes things worse. When we rely on data for making decisions, how do we tell what qualifies as a signal and what is merely noise?

In and of itself, data is neither. Assuming that data is accurate, it is merely a collection of facts. When a fact is true and useful, only then is it a signal. When it’s not, it’s noise. It’s that simple.

In Signal, Stephen Few provides the straightforward, practical instruction in everyday signal detection that has been lacking until now. Using data visualization methods, he teaches how to apply statistics to gain a comprehensive understanding of one’s data and adapts the techniques of Statistical Process Control in new ways to detect not just changes in the metrics but also changes in the patterns that characterize data.

Data science vs rude Lego

Data science moves onwards each day, helping (perhaps) solve more and more of the world’s problems. But apparently there’s at least one issue for which we don’t yet have a great machine-learning/AI solution – identifying penises made out of Lego.

Indeed this is apparently the problem that plagued the potential-Minecraft-beater “Lego Universe” nearly 5 years ago.

The internet is awash with re-tweets of ex-Lego-Universe developer Megan Fox’s amusing stories from yesteryear. Thanks to Exquisite Tweets for collecting.

Funny story – we were asked to make dong detection software for LEGO Universe too. We found it to be utterly impossible at any scale.

Players would hide the dongs where the filtering couldn’t see, or make them only visible from one angle / make multi-part penis sculptures…

They actually had a huge moderation team that got a bunch of screenshots of every model, every property. Entirely whitelist-based building.

YOU could build whatever you wanted, but strangers could never see your builds until we’d had the team do a penis sweep on it.

It was all automated, but the human moderators were IIRC the single biggest cost center for LEGO Universe’s operational costs. Or close to.

To be fair, this was a few years ago and progress on image recognition data science did not stop.

Lego itself just released “Lego Worlds” recently which seems to be a similar type of thing – whether they have solved the problem I do not know.

Humanity does seem to be making decent progress on such tasks in general. Microsoft Research recently published a paper, “Delving Deep into Rectifiers”, in which they detail their algorithmic achievement: perhaps the first program to classify images in the ImageNet Large Scale Visual Recognition Challenge 2012 dataset more accurately than the competing human managed.

In the consumer space, both Flickr, and very recently, Google have opened up features that allow anyone to upload large numbers (or in Google’s case, apparently infinite) photographs and then keyword search for “dog”, “Billy”, “Paris” etc. to show all your photos of dogs, Billy or taken in Paris without you having to provide any manual tagging or contextual information.

Flickr’s attempt has been around a bit longer and has caused a little controversy – as all in the field of data will know, the sort of machine learning and classification processes this extremely hard problem requires do not have any inbuilt sense of politeness or decency.

Misclassifying this photo of Auschwitz as “sport”, as reported by the Guardian, is surely just a confused algorithm rather than a deliberate attempt to offend.

Flickr staff are open that mistakes will be made and that there is an inbuilt process to learn from them – but it’s obvious why a “normal” viewer can find these classification errors offensive, especially when they might relate to photos of their children for instance.

This surely poses a dilemma for the sort of companies that provide these services. The idea behind them is a great one, and pretty essential in these days where we all take thousands of photos a year and need some way to retrieve the few we are particularly interested in. But how understanding present-day consumers will be towards the mistakes inherent in the process – particularly at the start of any such effort – remains to be seen.

In any case I’m sure it won’t be long before someone tests how good Google Photo is at autotagging Lego genitalia (or much worse…).