11 Jun 2018

How we made our first data-led game to forecast the World Cup

Last week, my team at The Telegraph launched our first data-led game. It lets our readers forecast the Russia 2018 World Cup by rating six factors according to how important they think each one is in the world’s biggest footballing event.


The result is a personalised projection of the competition based on what the reader thinks is important, as opposed to an austere "The Telegraph Predicts The World Cup" style of forecasting.

To do this, Patrick Scott and I had to collect a variety of data on six key areas for each team. Each of these six areas took in anything from one to six metrics, to ensure that the final summary figure was a robust and accurate representation of how good each team is in that area.

The data behind the interactive

You can now see this data here. It gives each of the 32 teams a summary score for each of the six areas, scaled to provide a figure between zero and one.

Below, we’ve detailed the figures we had to dig into to arrive at a final score for each of the six factors. This includes the relative weightings we applied to each of the individual metrics, as some were deemed more important than others. (For the curious, a rough code sketch of how these weightings combine follows the lists.)

Form
  • Qualifying record: Points won out of possible total in World Cup qualifiers (1.7x weighting)
  • Elo rating: Similar to a FIFA ranking, Elo ratings are a measure of a team’s strength based on their results. Teams gain or lose points after each game and are given more points for getting positive results against teams who are ranked higher than them. Home advantage and scoreline are also factored in (1.1x)
  • Weighted qualifying results: Qualifying record ranked by net Elo gains/losses in these games. This ensures that teams with tougher routes to the World Cup are rewarded more (1.2x)
  • Momentum: 12-month net Elo rating change (0.3x)

History
  • World Cup pedigree: Finishing positions at previous World Cups (2x weighting)
  • Performance vs expectation: How teams have over- or under-performed at World Cups. This is determined by the stage each team should have reached based on pre-tournament Elo ratings. Within this measure we’ve also factored in how teams from different continents fare in European-based World Cups, given Russia some home advantage and penalised the holders, Germany (1x)

Players
  • Transfer value: Estimated transfer value of each squad (1.5x weighting)
  • Club quality: Players are ranked based on the strength of the league in which they play their domestic football (based on Club Elo rankings) (1x)
  • Experience: Total caps and previous World Cup appearances (0.5x)
  • FIFA 18 rating: The average Overall Rating score for each squad (0.8x)
  • Star player: The FIFA 18 Overall Rating for each team’s best player (1x)
  • Room to grow: The difference between a squad’s average Overall Rating on FIFA 18 and its average Potential rating (1.2x)

Manager
  • Honours won: Trophies won weighted by tournament Elo ratings (0.8x weighting)
  • World Cup experience: How each manager has fared at previous World Cups (0.7x)
  • Record: Results with their current national side over the course of their tenure, including length of time in charge (1.5x)
  • Signs of improvement: Elo points gain/loss under current manager (2x)

Odds
  • Latest odds: Who should win based on what the bookmakers think (only metric)

Luck
  • A random number: A randomly generated number. Teams can have good or bad luck (only metric)
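
For those curious about the mechanics, here’s a rough sketch (in Python) of how a factor score comes together - this isn’t our production code, and the metric values are illustrative placeholders:

```python
# A minimal sketch (not our production code) of how a factor score is
# assembled: each team's metrics are combined as a weighted average using
# the weightings listed above, then min-max scaled across all teams so the
# final figure sits between zero and one. All metric values here are
# illustrative placeholders.

def factor_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of one team's metrics for a single factor."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * weights[name] for name in weights) / total_weight

def min_max_scale(scores: dict[str, float]) -> dict[str, float]:
    """Rescale every team's score for a factor to the zero-to-one range."""
    lo, hi = min(scores.values()), max(scores.values())
    return {team: (score - lo) / (hi - lo) for team, score in scores.items()}

# Example: the 'Form' factor, using the weightings from the list above
form_weights = {"qualifying": 1.7, "elo": 1.1, "weighted_qualifying": 1.2, "momentum": 0.3}
raw_scores = {
    "Germany": factor_score(
        {"qualifying": 1.00, "elo": 0.90, "weighted_qualifying": 0.85, "momentum": 0.40},
        form_weights,
    ),
    "Brazil": factor_score(
        {"qualifying": 0.88, "elo": 0.95, "weighted_qualifying": 0.90, "momentum": 0.60},
        form_weights,
    ),
    "Spain": factor_score(
        {"qualifying": 0.95, "elo": 0.85, "weighted_qualifying": 0.80, "momentum": 0.70},
        form_weights,
    ),
}
print(min_max_scale(raw_scores))
```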

How it worked

Once we’d worked out all of these weightings, we had one final score for each of the six factors across the 32 teams.

This is where the reader comes into play. We asked the reader to score the importance of each of these factors from one to five, which then provides a multiplier. This introduces the “game” element of the forecaster.

To generate the winner of the tournament, we add the six newly weighted figures together to get a final score. Each weighting carries a random margin of five percentage points either way, to help mimic the randomness of any one World Cup game (this is in addition to the randomness of the actual “random luck” category).
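
As a rough illustration, the scoring step might look something like this - a minimal sketch rather than the game’s actual code, with the five-point margin applied multiplicatively as one plausible reading of the mechanic:

```python
import random

# A minimal sketch, not the game's actual implementation, of the scoring
# step: each factor's zero-to-one score is multiplied by the reader's
# importance rating (one to five), a small random margin is applied to
# mimic footballing chance, and the six figures are summed.

FACTORS = ["form", "history", "players", "manager", "odds", "luck"]
MARGIN = 0.05  # five percentage points either way

def team_score(factor_scores: dict[str, float], reader_ratings: dict[str, int]) -> float:
    """Combine a team's six factor scores using the reader's ratings."""
    total = 0.0
    for factor in FACTORS:
        weighted = factor_scores[factor] * reader_ratings[factor]
        total += weighted * (1 + random.uniform(-MARGIN, MARGIN))
    return total
```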

Once this is done, we simulate the progression of the World Cup, with the team with the highest score in any one game beating the lower-scoring team until we get to the final.
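
The knockout logic, in equally sketchy form - the real game simulates the full group-and-knockout format, whereas this boils it down to a single-elimination bracket:

```python
import random

# A minimal sketch of the knockout logic: the higher-scoring team in any
# one game goes through, round after round, until one team is left. The
# `scores` dict is assumed to come from the scoring step sketched above,
# and the small random margin is re-rolled for each game.

def play_match(home: str, away: str, scores: dict[str, float]) -> str:
    """Apply a fresh random margin for this game, then compare scores."""
    home_score = scores[home] * (1 + random.uniform(-0.05, 0.05))
    away_score = scores[away] * (1 + random.uniform(-0.05, 0.05))
    return home if home_score > away_score else away

def simulate_tournament(scores: dict[str, float]) -> str:
    """Run a single-elimination bracket until one team remains."""
    teams = list(scores)
    random.shuffle(teams)  # stand-in for the real draw
    while len(teams) > 1:
        teams = [play_match(a, b, scores) for a, b in zip(teams[::2], teams[1::2])]
    return teams[0]
```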

So far, Germany have come out on top in 42 per cent of simulations, while Brazil are second-best on 25 per cent. England round down to zero per cent, but the Three Lions have still won several dozen simulations.

Making a game: An element of randomness

Even though Germany and Brazil collectively account for two-thirds of the outcomes in our forecasting game, we worked hard to make sure that these two favourites didn’t dominate the game.

After all, countries like Spain, France, Portugal or Argentina all have a decent shot at the tournament. And with luck on their side, any team technically could win their next seven games and lift the trophy.

This is why we added two elements of randomness: first, the actual randomness score, and secondly, the ten-percentage-point randomness range (five points either way) applied to any one weighting.

We also ensured that we had categories that favoured other teams. While favourites Brazil and Germany dominate odds and manager pedigree, France performs best at player pedigree, and Spain's recent good run means that their form score is highest.

So Germany will indeed likely win if you score the importance of the manager pedigree and odds categories highly. But you’d most likely see France (the winner in 10 per cent of simulations) lift the trophy if you gave player pedigree the highest weighting.


You get the idea. We wanted this to be a game, so it was important that readers could go back, change a couple of key inputs and see noticeable and interesting changes in how the World Cup progressed. Our analytics currently indicate that readers are indeed running the simulation multiple times to see who else they can crown as champions.

After every stage of the tournament, we’ll update our data to reflect the latest picture. This will most likely affect the form and players categories. If, for example, Argentina score five goals in their first game against Iceland, or if Messi gets injured in that game, their chances of progressing - both in the World Cup and in our simulations - will change accordingly. This way we ensure that our game stays relevant throughout the competition.

If you’re interested in finding out more about how this all works, check out our data or contact me or Patrick.


2 Jan 2018

Why cartograms are great - but not always necessary

Cartograms are great - but they’re not the only way of visualising geographical data.

Geographically accurate maps, the same ones that humans have been charting for hundreds of years, are still incredibly useful tools.

I can’t quite believe that these words need to be typed - but a tweet from The Weekly Standard’s David Byler, and the derision that geographical maps attract, got me thinking about this topic.

My own opinion is simple, but still needs saying: despite the popularity of cartograms within the data community due to their clear and accurate nature, geographical maps are often still the best way to help people engage with your geographical data.

Why cartograms are needed

If we look at the following maps (two results maps that we created for the UK’s 2017 General Election), there is a clear contrast between them.

The one on the left is a map that most of us are familiar with: a visual representation of the landmass of Great Britain.

The one on the right is a cartogram: a map that seeks to make the visualisation more accurate and representative by making each of the areas proportionate to a certain metric. In this case, it’s Parliamentary seats, and therefore each area is equal - but we could make them representative of population, for example.



The differences between the two maps are clear. In the one on the right, London is inflated to many times the size it occupies on the geographical map. Conversely, Scotland has been effectively halved.

This is important as the two visualisations tell two different stories: the geographical map on the left overstates how well the Conservative Party (in blue) did. The Labour Party (in red) tends to do well in urban constituencies, which by definition have a higher population density and are therefore smaller on a geographical map.

While London has a total of 73 constituencies (14 more than Scotland), we can barely see it on a conventional map because the capital packs so many seats into such a small, densely populated area.




But if we make each of its constituencies the same size, we weight each one equally - regardless of the geographical size. Such a cartogram makes sense in this case, because in the UK’s parliamentary system, each seat is worth one MP in Parliament.

Seats are the only thing that matters in this vote, and each one carries equal weight in Parliament - so rendering each as an identical diamond helps people understand this story. They are all the same, and should be presented as such.

But why they aren’t always necessary

So far, so good. We should all be able to understand why weighting each constituency to be the same size is useful.

Likewise, weighting areas by population size can also be handy when we are comparing geographical patterns and need to show smaller, dense areas.

But there are many reasons why geographically-representative maps are also important.

Going back to David’s original tweet, I understand why he says that this opinion is “unpopular”, but it really shouldn’t be.

It simply comes down to audience.

The simple fact is that many people get maps.

This isn’t only because people can find themselves on a geographical map - it’s because they understand the visual representation of space in a graphic that resembles the maps they’ve been looking at for their entire lives.

FT data journalist John Burn-Murdoch said in reply to David that it “depends entirely on how you're prioritising the many different functions of a choropleth map” - which is correct.

Many times, the primary function of our maps is to help people understand the world around them. If they do not understand a cartogram, we are not doing our job.

Compared with a recognisable geographical map, cartograms are harder to comprehend. This doesn't mean that we shouldn't make them - but to insist that we only create cartograms can be elitist, catering merely for a data-literate audience.

Compromises are needed in mapping

In the long term, better education is needed to boost comprehension of statistics, visualisation and general data literacy. Until we have that, it’s important to make a compromise.

That compromise is to assist engagement with and understanding of our graphics, while also producing accurate and honest visualisations.

This is exactly what we did at The Telegraph when it came to the General Election: our results page had a toggle to switch from a geographical map to a "proportional view".

The default was a geographical map, as all our readers would understand and (hopefully) engage with that. Then, if they wished, they could click a button to morph the map into a cartogram - allowing them to understand the wider context of the election and how close it actually was.

29 Jul 2017

How each UK newsroom visualised the General Election in different ways

General Elections are a time when all UK - and many international - newsrooms get to roll out the best of their data, graphics, development and innovation teams to cover a key political event.

We all create some graphics that are the same, or at least very similar, to cover the key statistics of the results night. These include some form of bar chart to show the total number of seats; a map to show the geographical distribution of the vote; and usually another bar chart or similar graphic to show the total vote share and swing between parties.

But the most interesting things for me, as a data journalist, are what the newsrooms do differently.

How did each individual team think to innovate and change how they cover the election? How did they learn from what every team did in the last election? And how have we all collectively, in the data and graphics community, developed political reporting further?

And also - how did each newsroom do all of this within the few weeks' notice that a snap election gives us?

The FT's slope charts were a great way of showing swing

The FT's team created several slope charts on their General Election results page, which helped tell the individual stories of constituencies: who won them, whether they swung, and how big the movement was.

Not only did they use slope charts to highlight the swing at a national level, but the graphics were also used for individual constituencies.

Their results page immediately surfaces some of the key stories of the night, with slope charts accurately and quickly portraying, for example, Kensington's shock swing from Conservative to Labour.

It shows the huge gap between the two major parties before the election and the razor-thin margin between them afterwards. The slope chart conveys this movement accurately in an easy-to-understand way, and is probably the best visualisation of such movement that I saw in the election.
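
If you want to try the form yourself, a slope chart takes only a few lines - this minimal sketch uses placeholder vote shares, not the real Kensington figures:

```python
import matplotlib.pyplot as plt

# A minimal sketch of a slope chart in the spirit of the FT's graphics.
# The vote shares below are illustrative placeholders, not the real
# Kensington results.

elections = ["2015", "2017"]
conservative = [52.0, 42.0]  # placeholder vote shares (%)
labour = [34.0, 42.1]        # placeholder vote shares (%)

fig, ax = plt.subplots(figsize=(3, 5))
ax.plot(elections, conservative, marker="o", color="#0087dc", label="Conservative")
ax.plot(elections, labour, marker="o", color="#d50000", label="Labour")
ax.set_ylabel("Vote share (%)")
ax.set_title("Two elections, one constituency")
ax.legend()
plt.tight_layout()
plt.show()
```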


Ternary from The Times

The Times' digital team has already written about their results page, telling us the decisions they went through in order to achieve the final outcome.

An election graphic I valued from them was the ternary plot the team used to show "how Britain was more polarised than ever". A gif of the plot was used on social media to show the movement between elections.

Starting from 2005, the gif plots the movement between each General Election. In the shift between 2015 and 2017, for example, the ternary plot effectively highlights the big picture of the night - the movement towards Labour across the board, the movement towards the Tories in certain seats, and the big swings towards the Conservatives in Scottish seats.

Using multiple axes in this way captures the many angles of an election: we are not living in a two-party political system, so a ternary plot was a good call to show the complexity and multiplicity of movements in vote share between the parties.


Bloomberg's visualisation immediately shows how close the Tories were to a majority

Instead of using a bar chart or stacked bar chart, which most newsrooms used, Bloomberg opted for a grid plot.

It is a simple visualisation, which fulfils a very similar function to the bar charts (with a majority target line) that we used at the Telegraph.

However, the way in which they broke up each bar into boxes helps portray the composition of Parliament and its 650 constituencies. It shows - as it should - that the election is really 650 individual elections, comprising 650 first-past-the-post races, which together build the new Parliament.

With a line in the middle of the boxes, we can see the exact number of seats each party needed to claim a majority - the Tories were just nine seats short of a majority (excluding the Speaker).

The Economist's electoral dashboard goes back to 2010

The Economist's stand-out quality was the tabulated format of their results dashboard, which allowed the reader to toggle between the General Elections of 2017, 2015 and 2010 - as well as the 2016 EU referendum.

This is a simple but effective way of contextualising the story of the 2017 General Election, allowing the reader to easily compare it with previous electoral events in the UK. You could, for example, look up the voting history of your own constituency and find out how much of a shock its 2017 result was.

It also now acts as a great dashboard for anyone wanting to find out what happened nationally or in any constituency for the last four national votes in the UK, giving people the facts and stories they need to make sense of the modern political landscape.


The BBC's coverage emphasised localised content

The BBC's visual team are always among the best when it comes to electoral results visualisations - and again their output was consistent, accurate and to the point.

Their individual constituency pages simply and effectively communicate the results of each seat - for example, how Labour's Stephen Kinnock won 22,662 votes in Aberavon. I really enjoyed how they married such graphical content with a localised live blog for each constituency, really bringing the General Election home to each of their readers.

The BBC's visual team didn't do anything incredibly out of the box for the election, but this is to be commended, as they didn't need to do anything differently - all their graphics and their website structure worked effectively and successfully at telling the story of the night.


The Guardian's battleground checklist told the story very well by the end

The Guardian's results page used an interesting visualisation to group constituencies together based on their relative 'safety' - their potential to swing between parties.




Each party's seats were grouped in small-multiple gridplots - split up between vulnerable (majority of 0-10%), safer (majority of 10-20%) and safest (majority of 20-30%) constituencies.

One difficulty I had with this graphic was trying to make sense of it as it auto-updated throughout the night: most of the boxes were still grey, so it was hard to see the point. By the morning, though, it was a great way to show the story of the night.

The graphic perfectly shows how, contrary to expectations, Labour held most of its seats; how the Tories unexpectedly lost many of their vulnerable seats and even, startlingly, one of their safest; and how the SNP suffered a bloodbath, losing seats with majorities of over 20%.

It was the best way I saw of communicating such information.

The Telegraph's box plots told the story of the geographical splits

Our own results dashboard at The Telegraph also used small-multiple gridplots like the Guardian's, but grouped constituencies by their geography instead of their safety.

Using our own Telegraph-style diamonds, our idea was to show the geographical divides that dominate the UK: the Tories' dominance in the south; the supposedly-under-siege Labour strongholds of the north; and Scotland's backing of the SNP.

Of course, not all of these stories came to pass, and our visualisation shows this too. For example, we can easily see that the Scotland gridplot is a lot more diverse - and blue - than expected, meaning that the night was a bad one for the yellow of Nicola Sturgeon's SNP.

I think grouping constituencies in this way was effective, as it pushed the geographical divisions to the fore - it gave readers answers that they'd otherwise have to spend a lot of time searching for in a map or even a cartogram. By grouping constituencies by region in their own grid, the reader is immediately able to see which party could claim dominance in specific areas, in a way that would take much longer with a map.

For the next election, I'd like to go further with this type of visualisation. It would be good to show the swings of certain seats in these regions, as The Times and The Economist did so effectively, while also integrating the Guardian's idea of grouping constituencies by the safety of their previous majority.


4 Apr 2017

What I learned from teaching data journalism

I have been teaching an Introduction to Data Journalism module to a group of MA Journalism students at City University London this year.

They were a varied group of Interactive, Investigative, Finance and Erasmus students - of many different backgrounds, from many different countries and with many different interests. What united them was their interest in using data to help find new stories and improve their current storytelling.

Through teaching them some data-led techniques to help achieve this aim, I also learned about how best to start understanding the process of data journalism. It helped me clarify some thoughts I already had about data journalism, while also challenging some other assumptions I had settled into.


Build the foundation


It's important to build the groundwork first, and this involves finding data: how to source information, where to find open data, and where to go if you can't find information on your subject.

There's no point giving your students data that's already been sourced and cleaned unless they understand how the data got to this point in the first place. They need to know what the starting point of data journalism is, and like all other strands of journalism, that's your source.

This means taking them through the process of using open data portals such as the World Bank's, as well as talking about how we can find our own data through scraping, information requests and other means.
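
To give a flavour of what this looks like in practice, here's a minimal sketch of pulling figures from the World Bank's public API - the indicator and country codes are just examples:

```python
import requests

# A minimal sketch of pulling open data programmatically, using the World
# Bank's public API (v2) as the example source. SP.POP.TOTL is the total
# population indicator; swap in whichever indicator and country you need.

url = "https://api.worldbank.org/v2/country/GBR/indicator/SP.POP.TOTL"
response = requests.get(url, params={"format": "json", "per_page": 10})
response.raise_for_status()

metadata, records = response.json()  # the API returns [metadata, data]
for record in records:
    print(record["date"], record["value"])
```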

Take it slow


Analysing data is the bread and butter of the practice. This is where we find our stories, how we prise valuable and engaging information from otherwise untapped and uninteresting data.

It's the essential - and fun - part where we find out what our story is. And so it's important that we invest the necessary time to look into this section of the data journalism process.

The variety of my students' skillsets when they started the course was surprising. While a minority had used statistical tools such as R to crunch data, many more had come from arts-based degrees and were daunted by the prospect of "lots of numbers in a spreadsheet".

Accommodating both groups to find common ground in data analysis was key, and I think it was the right decision to spend several weeks on different statistical analysis platforms, providing a variety of tools with which to analyse data.

Don't go straight in with the fun stuff


Almost everyone, when they want to get into data journalism, wants to learn how to visualise data. This usually involves wanting to create a pretty choropleth map or a complex interactive as soon as possible.

This, of course, is a mistake. Visualisation skill is useless in itself unless it's twinned with visual and data literacy.

There are plenty of bad visualisations online, partly because of people who have the skills to build graphics but not the understanding of how to communicate statistics. I therefore spent the whole first term trying to help my students understand best practice for visualising data, without going into detail on the many different tools that could be used for doing so.

Simple and complete is better than complicated and unrefined


While, as I said above, it's important to emphasise statistical literacy before going in with the "fun stuff", that's not to say aspiring data journalists cannot tell visual-led stories.

There's a lot that can be told through simple visualisations such as bar charts and line charts - and minor adaptations of these. And so while I focused on visual literacy instead of an arsenal of tools in my first term of teaching, I still highlighted some platforms for creating basic visualisations.
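
To give a sense of how little code a simple chart needs, here's a minimal sketch in Python (matplotlib, rather than the point-and-click tools mentioned below) - the seat totals are placeholder figures, not real results:

```python
import matplotlib.pyplot as plt

# A minimal sketch of the kind of simple-but-complete chart students can
# start with: a horizontal bar chart with a labelled axis and a reference
# line. The seat totals are illustrative placeholders, not real results.

parties = ["Party A", "Party B", "Party C", "Party D"]
seats = [330, 230, 55, 35]  # placeholder figures

fig, ax = plt.subplots()
ax.barh(parties, seats, color="#555555")
ax.axvline(326, linestyle="--", color="#222222", label="Majority (326 seats)")
ax.set_xlabel("Seats won")
ax.set_title("A simple bar chart can carry the whole story")
ax.legend()
plt.tight_layout()
plt.show()
```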

As then-Guardian Data Editor Simon Rogers has previously said, "anyone can do" data journalism. Introducing students - all keen to create many different types of visualisations - to free visualisation platforms such as Highcharts and Datawrapper allowed them to start producing data-led stories without getting carried away with overly complicated - and potentially flawed - visualisations.

This allowed them to practise the basics first, learning the best visualisation practices while doing so, before moving on to more advanced (and fun) stuff.

What this says about data-driven journalism


None of these points above are ground-breaking, but they do reinforce an important point: you have to get the basics right first.

It's important to remember that the core of data-led journalism is in the analysis. In the finding of stories that other reporters couldn't find. In uncovering stories in vast quantities of information that the ordinary population does not have time to discover for themselves.

Data journalism can often be beautiful, attractive and technically brilliant, but none of this matters if the foundation isn't there.

Students will be champing at the bit to work on the huge interactive visualisations that are probably the reason they're interested in the practice in the first place. But I think it's important to focus on the sourcing and analysis of data first - as this is our starting point and the way that we discover groundbreaking stories.