30 May 2016

Poisonous statistics: How bad numbers could influence a generation's future

£350m per week. According to some, that's the amount of money the UK gives to the EU. Of course, it's not. We instead pay around £250m per week, due to the rebate that reduces the amount we pay - but that doesn't stop people saying and believing the first number.

I've spent the last few weeks working with Full Fact to check some of the statistics in the EU referendum. From household income to immigration, jobs to red tape, we haven't yet found a claim that we can fully endorse - they're either completely wrong or at least misleading.

These claims are coming from major politicians with huge followings. Prime Minister David Cameron; ex-London Mayor Boris Johnson; Labour In leader Alan Johnson; Ukip leader Nigel Farage.

All of these people are getting away with twisting numbers to suit their own ends. Politicians have always done this - and they will always do so.

But there's something wrong when campaigns can keep on repeating the same incorrect statistic - that the UK sends £350 million a week to the EU - without any consequences.

A quick explanation

A quick explanation of why this figure is wrong and misleading: the UK’s rebate, or discount, reduces what we would otherwise pay. In 2015, we paid the EU £13 billion - working out at £250 million a week.

But then there are the EU payments given to the government, which bring our net contribution down to around £8.5 billion, or £160 million a week. That is still a big cost, but less than half the figure that many people now believe is true.
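
The arithmetic behind these weekly figures is easy to check. Here is a minimal sketch in Python, using only the annual totals quoted above and assuming a simple 52-week year. The function name is mine, for illustration only.

```python
WEEKS_PER_YEAR = 52  # assumption: a simple 52-week year

def weekly_millions(annual_billions):
    """Convert an annual figure in £ billions to £ millions per week."""
    return annual_billions * 1000 / WEEKS_PER_YEAR

after_rebate = weekly_millions(13)  # paid to the EU in 2015, after the rebate
net = weekly_millions(8.5)          # after EU payments to the government

print(f"After rebate: £{after_rebate:.0f}m a week")
print(f"Net contribution: £{net:.0f}m a week")
print(f"Net as a share of the £350m claim: {net / 350:.0%}")
```

The net figure comes out at roughly £163m a week, which I have rounded to £160m, and it is a little under half of the £350m claim.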

This can be balanced against other ways in which the EU contributes to the UK: grants to British researchers, for example. The remain camp would then argue that it can also be weighed against advantages in business, trade and employment. Full Fact's guide to EU contributions goes into all of this in more detail.

Our chart showing how much the UK actually sends the EU annually (Telegraph Graphics)

Why does it matter?

The number's been featured on the side of the Vote Leave bus for weeks. It's been repeated by numerous public figures and campaigners, plastered all over social media. My own friends and family have repeated the number at me when the subject arises. It's become a fact for people.

But the problem is that it's not a fact. The UK Statistics Authority itself has said so. Sir Andrew Dilnot, chair of the UK Statistics Authority, said he was disappointed by the Brexit campaign's repetition of the claim, saying it is "misleading and undermines trust in official statistics".

And yet the Leave campaign is still going around saying it without any consequences. Every time it's repeated, "£350m per week" gains traction. It gets spread around more people and slowly becomes reality. Just this week, the figure was repeated live on TV during a BBC EU debate, allowing thousands of people to be persuaded by a dodgy statistic.

Where is the accountability for politicians and campaigns using poisonous statistics? They could influence the history of the United Kingdom - based on the misuse of numbers.

Tim Harford has previously written a piece on how politicians have poisoned statistics, and his points are only made clearer by what we're seeing in the EU campaign. Still, he gives us a gleam of light in the face of this misuse of statistical 'evidence'. He concludes:

But despite all this despair, the facts still matter. There isn’t a policy question in the world that can be settled by statistics alone but, in almost every case, understanding the statistical background is a tremendous help.

So the facts do still matter. That's reassuring. We just have to figure out which facts matter - and hopefully before the EU referendum vote on 23 June.

And for the future, there needs to be accountability for politicians and their use of statistics. They shouldn't be able to get away, as the Leave campaign might, with altering the history of a country through the misuse of data.

21 Feb 2016

Mapping with CartoDB: Solutions to problems faced by the journalist user

CartoDB is a great mapping tool for journalists. I've used it personally and professionally, to help tell the data-driven stories I produce.

It can be used to visually improve the stories you seek to tell, using interactive maps to help readers engage with your stories. You can produce these maps with no coding knowledge, all with the simple upload of an Excel file to the website, which will do all the hard work for you.

As soon as you upload your data, CartoDB will often show you a map of it immediately, which can be customised easily to improve it. This customisation can vary from simply changing the map type or information window, to editing the CSS to play with how the map shows your data.

But CartoDB is not perfect. There are drawbacks with using this tool, as I'll try to explain below, as well as highlight some ways to get around these issues.

Area names not matching

In the cleaning process, data journalists know that they have to look at their data's consistency - and this applies to area names.

Unless you're mapping large, major areas, such as countries, the chances are that you'll have to merge datasets - matching up area names in your dataset with shape files you've had to download yourself (often in the form of .kml files).

Often, when mapping UK constituencies or local authorities (shape file here), area names can prove an issue, as there are different ways to spell or present them. The text strings that you have in your data (even if it's downloaded from government websites) may not match up with the text strings in the shape file you downloaded - and so you won't be able to merge them.

This can be the difference between having "York Central" and "Central York"; "Wyre and Preston North" and "Wyre & Preston North"; "Weston-Super-Mare" and "Weston Super Mare". If any of these inconsistencies are present between your two spreadsheets when you come to merge them in CartoDB, you will have gaps in your map.

This means you have to be careful when you are preparing your dataset, before you upload it to CartoDB. Look at your shape file and your dataset, and see whether the two match up for area names. Checking will save you time later. The IF function could be useful here, to ask Excel if your two columns are the same (once you have brought the two datasets together for your test).

Alternatively, using area codes overlooks this altogether - avoiding the possibility of variations in human input and making your work more reliable.
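
If you'd rather run this check outside Excel, the same idea can be sketched in Python. This is only an illustration under my own assumptions: the function names and the normalisation rules (lower-casing, treating "&" as "and", treating hyphens as spaces) are hypothetical choices, not anything CartoDB does for you.

```python
def normalise(name):
    """Lower-case a name and iron out common punctuation differences."""
    cleaned = name.lower().replace("&", "and").replace("-", " ")
    return " ".join(cleaned.split())  # collapse repeated spaces

def unmatched_names(data_names, shape_names):
    """Return names from the dataset with no counterpart in the shape file."""
    known = {normalise(n) for n in shape_names}
    return [n for n in data_names if normalise(n) not in known]

# Example values taken from the mismatches described above
data_names = ["York Central", "Wyre and Preston North", "Weston-Super-Mare"]
shape_names = ["Central York", "Wyre & Preston North", "Weston Super Mare"]

missing = unmatched_names(data_names, shape_names)
print(missing)
```

Simple normalisation takes care of the ampersand and hyphen cases, but a reordered name such as "York Central" is still flagged and has to be fixed by hand - another argument for using area codes where you can.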

Area codes are individual and less susceptible to human error, and therefore are the best options to use in geolocating

Regional differences

Living in the UK? In England or Wales? Scotland? Northern Ireland? Thanks to devolution, each of these regions has a different statistical agency and so maps comparing a variable across the whole of the UK are rare.

This can be an issue if you're writing for a British newspaper, wanting to show differences across the whole country. You may get shape files and data for English regions, and perhaps Scotland and Wales - but often Northern Ireland may be missing.

Feel free to email me if you can't get a shape file for Northern Ireland - I have one somewhere. You may have to add this as a different layer on your map, if you don't wish to merge the files yourself (this can have implications for the map's size).

Even if you manage to get the shape files, consistent data across the whole of the UK can be rare. Be wary of comparing regional differences from different spreadsheets. These datasets may not be comparable, and so it's always best to try and seek the data you want all from one source for reliability.

As Simon Rogers says in his book Facts are Sacred:
"It is often easier to get across statistics from European countries via Eurostat than it is to get figures for the whole of the UK at a local level. That is because Eurostat has a single operation to combine data from across the European Union into single accessible datasets by coordinating all the national statistics agencies.
"The UK, with increasingly disparate data sources, needs that now. And it's kind of what we expect from the Office for National Statistics. The title says it all."
A choropleth map of the UK without Northern Ireland is an all too common sight

Mobile responsiveness

CartoDB maps are easy to embed in frames on your website, but they can be tricky to view on mobile. A map can often take over the whole screen, making it hard to scroll past to get to the rest of your story.

The map itself can also show too much information, or have the wrong focus, which means it's actually just confusing for your mobile audience, instead of helping them understand your story.

Advice to combat this would include:
  • Check the dimensions that are described in your iframe code. If the height is too much, the map can be hard to scroll past on mobile.
  • Keep the map free of clutter, such as lots of shapes, lines or dots. On mobile, too many of these can make the map unusable. 
  • The same goes for CartoDB's optional add-ons, such as sharing options and a search map. Unless it's important for your story, cut it. 
  • If your information windows have lots of information, they can dominate the screen when the reader clicks on them. This can crowd out what is underneath the window, which may be important context, and can hinder the reader. Omit all but essential information, and reserve what else you wish to tell for elsewhere in your story. 
CartoDB has now teamed up with Nutiteq, described as "pioneers in native mobile mapping", which could see developments in how their maps are viewed and engaged with on our smaller handheld screens.

Other geocoding issues

Other visual journalists I have spoken to have mentioned that it can sometimes be tricky to geocode data in CartoDB. There are options available for this in the tool, but they can be difficult.

I would suggest you geocode your data before entering it into the CartoDB platform. Either line up your shape file and check its suitability with your data, or create longitude and latitude columns before uploading your dataset.

This geocoding resource is invaluable: you can input a list of addresses into it, and it will automatically return the longitude and latitude of each point for you. It will also present them on a map for you, so you can quickly check the geographical distribution (and have a quick, first check to make sure it's accurate).
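
That first accuracy check can also be done in code before you upload anything. Below is a rough sketch in Python, assuming your file has columns named "latitude" and "longitude" (hypothetical names) and that all your points should fall inside a deliberately generous box around the UK. The box limits are my own approximation, not an official boundary.

```python
import csv
import io

# Assumption: a deliberately generous bounding box around the UK
UK_LAT = (49.0, 61.0)
UK_LON = (-9.0, 2.0)

def suspicious_rows(rows):
    """Return rows whose coordinates are missing, non-numeric or outside the box."""
    flagged = []
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            flagged.append(row)
            continue
        if not (UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]):
            flagged.append(row)
    return flagged

# Hypothetical sample standing in for your own CSV file
sample = io.StringIO(
    "place,latitude,longitude\n"
    "London,51.5074,-0.1278\n"
    "Swapped,-0.1278,51.5074\n"
    "Blank,,\n"
)
flagged = suspicious_rows(csv.DictReader(sample))
print([row["place"] for row in flagged])
```

A check like this catches the most common slips, such as swapped latitude and longitude columns or rows that failed to geocode, but it won't tell you whether a point inside the box is in the right town. You still need to look at the map.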

When merging spreadsheets by a column in order to geolocate your data against a shape file, make sure that the area names match up exactly

The restrictions of a third-party tool

Using a tool you haven't created yourself has obvious restrictions, in that you haven't planned and developed it with your specific needs in mind. You won't be able to do everything you want with its browser tool.

There are ways to get around this, however. You can improve your maps within the tool by using HTML to make the data inside the information windows flow better, or by using the filter and SQL query options.

CartoDB is also much more than just a browser tool: it is open source, so you can adapt it to your own ends. There is also the CartoDB.js library, which has several uses, such as wrapping its APIs into complete visualisations.

7 Feb 2016

"Data for data's sake": When is it the right time to invest hours in big visualisation projects?

When I started as a data journalist at The Telegraph, I was warned of "doing data for data's sake".

At the time, I wasn't exactly sure what this meant. Was I supposed to not take lots of time analysing complicated datasets, in favour of quick, snappy data-led pieces?

Now I understand better: it means that, as a data journalist, there's a vast array of storytelling approaches for you to select. There's a choice between longer investigative or explanatory pieces, or quicker, news-driven posts, or approaches in between the two.

Among these, there are visualisations that can be produced in a matter of seconds with chart builders, and then there are others that have been specially designed and honed for one particular story.

Selecting the right approach is incredibly important. Avoiding "data for data's sake" means not taking the latter of these options, spending hours or days on complex graphical representations, when the story doesn't require it.

Why is it important to really think about how you are presenting your data?

News organisations are businesses. And, as an employee of that business, you need to be an efficient and productive employee. 

This can pose an issue for a data journalist, who by definition will often spend longer on stories - in order to check statistics, produce authoritative context and present stories in quality ways. We analyse vast quantities of information, process the important bits and show them to our readers. And this takes time.

When it comes to presenting the story to your readers, data journalists have to think about the methods they use. They have to consider the amount of time they're spending on one specific project, and whether the hours they're using to create one visualisation are helping the story.

Does it assist the readers' understanding? Is it too complex? Is there a valuable return for this resource-intensive project? Is there a simpler, or reusable, option which tells the same story in an equally effective way?

Ultimately, as journalists, we need to write for our audience

As fun as they can be, complex visualisations - no matter how well designed they are or how many hours have been invested in them - can actually hinder a reader's experience.

Sometimes, a "data for data's sake" approach can lead to data journalists producing a resource-intensive interactive because it's attractive. They get carried away and forget the core principle: is this helping to tell the story to the reader?

If a reader's first reaction to a visualisation is "that's confusing" or "that's complex", it probably isn't aiding their experience.

Investing time is important - for the right project

Of course, there is always the right moment to invest lots of time in data sourcing, analysis and visualisation.

The presentation of data shouldn't be "dumbed down". I am not saying that every data-led story can be reduced to simplistic bar charts.

It's just about assessing the specific project you're working on and working out the suitable strategy for it. This may be a comprehensive visualisation that shows the reader the full picture on the topic at hand; it could be a couple of line charts highlighting a couple of the story's main aspects; or it may be a more traditional report, in the form of words, based on the data you've found.

If a story is complicated and important to your reader, a sophisticated interactive may be the best approach. If a graphic helps communicate a story and makes it more accessible to the reader, then data visualisation is worth the time.

Some interactives, however, risk being overly sophisticated, intricate or complex. While none of these features are bad on their own, if they hamper the ability to tell the story, then they're not doing their job. 

When is it right to not visualise data?

A common mistake for students of data journalism is to get over-excited by visualisations. They are, after all, often the most fun, interactive and creative part of data-led projects.

But sometimes the best way to tell a data-led story is to write it. With words.

When it comes to the later stages of working out how to present your story, design for your audience and focus on what you're trying to tell. Ask yourself whether your visualisation aids the story, no matter how much you've personally invested in one particular idea.

If the answer is no, then perhaps the most suitable way to present your data is to write about it in an accessible manner. This approach to data journalism is often overlooked, but can be a powerful way to communicate data-led stories.

Avoid data for data's sake, and instead use data for what it's good at: opening up new possibilities in journalism and storytelling, in order to improve the reader's experience.

6 Nov 2015

Five Things I learned at Web Summit 2015

Data scientists (and journalists) need to "do code"

Wes McKinney’s talk at the data summit was all about data scientists - and the shortage of them. It’s a job title that means different things to different people, but he helped define what the specific title means.

One thing is certain: code is key. You can’t analyse lots of data unless you’re embedded in the process of continually learning the newest programming languages, be they languages for sourcing and sifting information, such as R and SQL, or visualisation tools such as D3.

These set you apart and are the skills that take you further in data and, while Wes was talking about data scientists and not journalists, I’ll certainly be continuing to learn about them.

News brands have a space in the future: By telling truth

Platforms such as Facebook Instant and Apple News pull in content from publishers. This limits brands' potential for ad revenue, as it can pull people away from their own platforms (and ads).

There is a silver lining, according to the Washington Post's Stephen Hills. These platforms give you content, but they fail in another area. Readers can't trust content when there’s no visible brand from whom they’re consuming the information. News brands can provide this.

Several conversations I had at the Web Summit, with journalists and non-journalists alike, touched on the necessity for brands to build rapport with readers. A person needs to trust (and enjoy) a news brand so they’ll go back again, which is how publications can thrive in the future.

Big data is… big

PJ Hagerty said: "Big data is one of the biggest buzzwords out there... It's overused: chances are, you don't have it".

He’s right. "Big data" is another buzzword in the jungle of data-related buzzwords. And there are a lot of them.

Just what is big data? What is the threshold when "data" becomes "big"? This Forbes piece does a good job of defining its complexities, but if you want a quick stab at a definition, I'll give it a shot: extremely large datasets that need computational processes to analyse them. I've not yet delved into this realm, with datasets that large, but I am keen to in the future.

Privacy is even more important - and more under threat - than you think

Tim Budden gave a talk on "wrangling the world’s largest data set". Namely: getting data from social media.

You can learn a lot of useful information from social media. It contains demographic, geographic, lexical, emotional, personal information - just to name a few types.

My friend Clara Guibourg recently let herself get hacked by ethical hackers in order to learn about cyber security. They found out a lot from her social media presence, which serves as a good reminder: you are publishing information when you’re on social media, and this information can be accessed by anyone.

Dublin’s great

It’s not part of the summit itself - but the event really is improved by its Irish setting.

The friendly hospitality of the Irish, twinned with the picturesque environment of the city, really does enhance the whole event. Dublin's pubs are great for the events - with such a variety of settings that cater for everyone’s needs.

It’s certainly made me want to come back to Ireland - next time on holiday.
