Leicester City: Need for Speed?

Originally published on StatsBomb.

Leicester City’s rise to the top of the Premier League has led to many an analysis by now. Reasons for their ascent have mainly focused on smart recruitment and their counter-attacking style of play, as well as a healthy dose of luck. While their underlying defensive numbers leave something to be desired, their attack is genuinely good. The pace and directness of their attack has regularly been identified as a key facet of their style by writers with analytical leanings.

Analysis by Daniel Altman has been cited in both the Economist and the Guardian, with the crux being that the ‘key’ to stopping Leicester is to ‘slow them down’. Using slightly different metrics, David Sumpter illustrated this further at the recent Opta Pro Forum and on the Sky Sports website, where his analysis surmised that:

For Leicester, it’s about the speed of the attack.

An obvious and somewhat unaddressed question here is whether the pace of Leicester’s attack really is the key to their increased effectiveness this season. Equating style with success in football is often a fraught exercise; the frequently tedious and pale imitations of Guardiola’s possession-orientated approach are a recent example across football.

Below are a raft of numbers comparing various facets of Leicester’s style and effectiveness this season with last season.


Comparison between Leicester City’s speed of attack and shot profile from ‘fast’ possessions. A possession is a passage of play where a team maintains unbroken control of the ball. Possessions moving at greater than 5 m/s on average are classed as ‘fast’. All values are for open-play possessions only. Data via Opta.

The take-home message here is that the average pace of Leicester’s play has barely shifted this season compared to last. Over the past four years, only Burnley in 2014/15 and Aston Villa in 2013/14 have attacked at a greater pace than Leicester have this season.

The proportion of their shots generated via fast paced possessions has risen this year (from 27.5% to 32.1%) and Leicester currently occupy the top position by this metric over this period. In terms of counter-attacking situations, their numbers have barely changed this season (20.1%) compared to last season (20.8%), with only the aforementioned Aston Villa having a greater proportion (21.3%) than them in my dataset.

What has altered is the effectiveness of their attacks this season, as we can see that their expected goal figures have risen. Below are charts comparing their shots from counter-attacking situations, where we can see more shots in the central zone of the penalty area this season and several better quality chances.


Comparison of Leicester City’s shots from ‘fast’ and ‘deep’ attacks in 2014/15 and 2015/16. Points are coloured by their expected goal value (red = higher xG, lighter = lower xG). Any resemblance to the MK Shot Maps is entirely intentional. Data via Opta.

Their improvement this year sees them currently rank first and second in expected goals per game from fast-attacks and counter-attacks respectively over the past four seasons (THAT Liverpool team rank second and first). Based on my figures, Leicester’s goals from these situations are also closely in line with expectations (N.B. my expected goal model doesn’t explicitly account for counter-attacking moves).

The figure below shows how this has evolved over the past two seasons, where we see fast-attacks helping to drive their improved attack at the end of 2014/15, which continued into this season. There has been a gradual decline since an early-season peak, although their expected goals from fast-attacks have reduced more than their overall attacking output in open-play, indicating some compensation from other forms of attack.


Rolling ten-match samples of Leicester City’s expected goals for in 2014/15 and 2015/16. All data is for open-play shots only. Data via Opta.

The effectiveness of these attacks has gone a long way to improving Leicester’s offensive numbers. According to my expected goal figures in open-play, they’ve improved from 0.70 per game to 0.94 per game this season. About half of that improvement has come from ‘fast’ paced possessions, with many of these possessions starting from deep areas in their own half.

Examining the way these chances are being created highlights that Leicester are completing more through-balls during their build-up play this season. The absolute numbers are small, with an increase from 11 to 17 through-balls during ‘fast’ possessions and from 6 to 12 during ‘fast’ possessions from their own half, but they do help to explain the increased effectiveness of their play. Approximately 27% of their shots from counter-attacks include a through-ball during their build-up this season, compared to just 11% last season. Through-balls are an effective means of opening up space and increasing the likelihood of scoring during these fast-paced moves. Leicester’s counter-attacks are also far less reliant on crosses this season, with just 2 of these attacks featuring a cross during build-up compared to 9 last season, which will further increase the likelihood of scoring.

Speed is an illusion. Leicester’s doubly so.

Overall, attacking at pace is a difficult skill to master, but the rewards can be high. The pace and verve of Leicester’s attack has been eye-catching, but it is the execution of these attacks, rather than their actual speed, that has been the most important factor. Slowing Leicester down isn’t the key to stopping them; rather, the focus should be on either denying them those potential counter-attacking situations or diluting their impact should you find yourself on the receiving end of one.

Whether they can sustain their attacking output from these situations is a difficult question to answer. If we examine how well output is maintained from one year to the next, the correlation for expected goals from counter-attacks is reasonable (0.55), while goal expectation per shot is lower (0.30). Many factors will determine the values here, not least the relatively small number of shots of this type per season, as well as a host of other intrinsic football factors. For fast-attacks, the correlations rise to 0.59 for expected goals and 0.52 for expected goals per shot. For comparison, the values for all open-play shots in my dataset are 0.91 and 0.63.
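As a rough illustration of how these year-to-year correlations are computed, here is a minimal Pearson correlation sketch. The team-season values below are invented for illustration and are not from my actual dataset:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pairs: each team's counter-attack xG per game in one
# season (year1) and the same team's value the following season (year2)
year1 = [0.20, 0.10, 0.15, 0.05, 0.25, 0.12]
year2 = [0.15, 0.12, 0.10, 0.08, 0.22, 0.09]
print(round(pearson(year1, year2), 2))
```

A high correlation indicates the metric is well maintained from one season to the next; a low one suggests it is noisier and less repeatable.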

Examining the data in a little more depth suggests that the better counter-attacking and/or fast-paced teams tend to maintain their output, particularly if they retain managerial and squad continuity. Leicester have a good attack overall that is excellent at exploiting space with fast-attacking moves.

Retaining and perhaps even supplementing their attacking core over the summer would likely go a long way to maintaining a style of play that has brought them rich rewards.


Counting counters

Over on StatsBomb, I’ve written about Leicester’s attacking exploits this season, specifically focusing on the style and effectiveness of their attack. That required a fair amount of research into the speed and directness of teams’ attacks, something I’ve been looking into since I started examining possessions and expected goals.

One output of all that is a bunch of numbers at the team and player level stretching back over the past four seasons about fast-attacks and counter-attacks, some of which I will post below along with some comments.

As a brief reminder, a possession is a passage of play where a team maintains unbroken control of the ball. I class a possession moving at greater than 5 m/s on average as ‘fast’, based on a bunch of diagnostics relating to all possessions, i.e. not just those ending with a shot. The threshold is fairly arbitrary as I went with a round number rather than a precisely calculated one, but the interpretation of the results didn’t shift much when altering the boundary. Looking at the data, there is probably some separation into slow attacks (<2 m/s), medium-paced attacks (2-5 m/s) and fast attacks (>5 m/s). Note that some attacks move away from goal, so they end up with a negative speed (technically I’m calculating velocity here, but I’ll leave that for another time); the speeds quoted are towards goal.

Counter-attacks are fast-paced moves that begin in a team’s own half. Again, this is fairly arbitrary from a data point-of-view, but it at least fits with what I think most would consider a counter-attack, and it’s easy to split the data into narrower bands in future.
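Putting the two definitions together, the classification can be sketched in a few lines. The coordinate scheme here is an illustrative assumption (distances in metres from the possessing team’s own goal line, on a 105 m pitch), not the exact form of the underlying data:

```python
def possession_speed(start_x, end_x, duration):
    """Average speed towards goal in m/s; negative if the move went
    away from goal. Distances are metres from the possessing team's
    own goal line."""
    return (end_x - start_x) / duration

def classify_possession(start_x, end_x, duration, pitch_length=105.0):
    speed = possession_speed(start_x, end_x, duration)
    if speed > 5:
        style = "fast"
    elif speed >= 2:
        style = "medium"
    else:
        style = "slow"
    # Counter-attack: a fast move that begins in the team's own half
    is_counter = style == "fast" and start_x < pitch_length / 2
    return style, is_counter

# A move from 20m to 80m out in 10 seconds averages 6 m/s:
print(classify_possession(20, 80, 10))  # ('fast', True)
```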

I should add that Michael Caley has published analysis and data relating to counter-attacking, although he is apparently in the process of revising these.

All of the numbers below are based on my expected goals model using open-play shots only. I don’t include a speed of attack or counter-attacking adjustment in my model.

So, without further ado, here are some graphs…

Top-20 offensive fast-attacking teams


Top 20 teams in terms of fast-attacking expected goals for over the past four seasons.

Champions Elect Leicester City sit atop the pile with a reasonable gap on THAT Liverpool team, with a fairly big drop to the chasing pack behind. Arsenal and Manchester City are quite well represented here illustrating the diversity of their attacks – while both are typically among the slowest teams on average, they can step it up effectively when presented with the opportunity.

Top-20 offensive counter-attacking teams


Top 20 teams in terms of counter-attacking expected goals for over the past four seasons.

Number one isn’t a huge shock, with this year’s Leicester City narrowly ahead of the 12/13 iteration of Liverpool. A lot of the same teams are found in both the fast-attacking and counter-attacking brackets, which isn’t a great surprise perhaps.

Southampton this year are perhaps a little surprising and it is a big shift from previous seasons (0.056-0.075 per game), although I’ll admit I haven’t paid them that much attention this year. Their defence is also the 6th worst in this period on counter-attacks (3rd worst on fast-attacks). When did Southampton become a basketball team?

What is particularly noticeable is the prevalence of teams from the past two seasons in the top-10. A trend towards more transition-orientated play? Something to examine in more detail at another time perhaps.

Top-20 defensive fast-attacking teams


Top 20 teams in terms of fast-attacking expected goals against over the past four seasons.

Most of the best performances on the defensive side are from the 12/13 and 13/14 seasons, which might lend some credence to the idea of a greater emphasis on transitions in recent seasons, along with an inability to cope with them.

The list overall is populated by the relative mainstays of Manchester City, Liverpool and West Brom, along with various fingerprints from Mourinho, Warnock and Pulis.

Top-20 defensive counter-attacking teams


Top 20 teams in terms of counter-attacking expected goals against over the past four seasons.

Interestingly there is a greater diversity between the counter-attacking and fast-attacking metrics on the defensive side of the ball than on the offensive side, which might point to potential strengths and/or weaknesses in certain teams.

Spurs last season rank as the worst defensive side in terms of counter-attacking expected goals against, and are narrowly beaten into second spot for fast-attacks by the truly awful 2012/13 Reading team.

Top-20 fast-attacking players


Top 20 players in terms of fast-attacking expected goals per 90 minutes over the past four seasons. Minimum 2,700 minutes played.

Lastly, we’ll take a quick look at players. For now, I’m just isolating the player who took the shot, rather than those who participated in the build-up to the goal. A lot of this will be tied up in playing style and team effects.

Jamie Vardy is clearly the standout name here, followed by Daniel Sturridge and Danny Ings. Sturridge leads the chart in terms of actual goals with 0.21 goals per 90 minutes, with Vardy third on 0.18.

Vardy’s overall open-play expected goals per 90 minutes stands at 0.26 by my numbers over the past two seasons, so over half of his xG per 90 comes from getting on the end of fast-attacking moves. He sits in 16th place overall for those with over 2,700 minutes played, which is respectable, but he is clearly elite when it comes to faster-paced attacks.

Top-20 counter-attacking players


Top 20 players in terms of counter-attacking expected goals per 90 minutes over the past four seasons. Minimum 2,700 minutes played.

Danny Ings sits on top when it comes to counter-attacking, which bodes well for his future under Jürgen Klopp at Liverpool, providing his injury hasn’t unduly affected him. Again, Sturridge leads the list in terms of actual goals with 0.13 per 90 minutes, with Vardy second on 0.12. The sample sizes are lower here, so we would expect a greater degree of variance in terms of the comparison between reality and expectation.

One of the interesting things when comparing these lists is their divergence from and/or similarity to the overall goalscorer chart. For example, Edin Džeko and Wilfried Bony sit in first and fourth place respectively in the overall table for this period but lie outside the top-20 when it comes to faster-paced attacks. A clear application of this type of work is player profiling to fit the particular style and needs of a prospective team, which Paul Riley has previously shown to be a useful method for evaluating forwards.

Moving forward

I wanted to post these as a starting point for discussion before I drill down further into the details in the future. The data presented here, and that underlying it, are very rich in detail and potential applications, which I have already started to explore. In particular, there is a lot of spatial information encapsulated in the data that can inform how teams attack and defend, which can help to build further descriptive elements of team styles alongside measures of their effectiveness.

I’ll keep you posted.

Fools Gold: xG +/-

Football is a complex game that has many facets that are tough to represent with numbers. As far as public analytics goes, the metrics available are best at assessing team strength, while individual player assessments are strongest for attacking players due to their heavy reliance on counting statistics relating to on-the-ball numbers. This makes assessing defenders and goalkeepers a particular challenge as we miss the off-ball positional adjustments and awareness that marks out the best proponents of the defensive side of the game.

One potential avenue is to examine metrics from a ‘top-down’ perspective i.e. we look at overall results and attempt to untangle how a player contributed to that result. This has the benefit of not relying on the incomplete picture provided by on-ball statistics but we do lose process level information on how a player contributes to overall team performance (although we could use other methods to investigate this).

As far as football is concerned, there are a few methods that aim to do this, with Goalimpact being probably the most well-known. Goalimpact attempts to measure ‘the extent that a player contributes to the goal difference per minute of a team’ via a complex method and impressively broad dataset. Daniel Altman has a metric based on ‘Shapley‘ values that looks at how individual players contribute to the expected goals created and conceded while playing.

Outside of football, one of the most popular statistics to measure player contribution to overall results is the concept of plus-minus (or +/-) statistics, which is commonly used within basketball, as well as ice hockey. The most basic of these metrics simply counts the goals or points scored and conceded while a player is on the pitch and comes up with an overall number to represent their contribution. There are many issues with such an approach, such as who a player is playing alongside, their opponent and the venue of a match; James Grayson memorably illustrated some of these issues within football when WhoScored claimed that Barcelona were a better team without Xavi Hernández.
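For illustration, the most basic form of plus-minus described above might look something like the sketch below, where a match is split into segments by lineup and a player's on-pitch goal difference is scaled to per-90 values (the match data is invented):

```python
def basic_plus_minus(segments, player):
    """Naive plus-minus: goal difference while a player is on the
    pitch, scaled to per 90 minutes.

    segments: list of (players_on_pitch, minutes, goals_for, goals_against)
    """
    minutes = gf = ga = 0
    for on_pitch, mins, goals_for, goals_against in segments:
        if player in on_pitch:
            minutes += mins
            gf += goals_for
            ga += goals_against
    return 90 * (gf - ga) / minutes if minutes else 0.0

# Invented match: both players start, Xavi is substituted after an hour
match = [
    ({"Xavi", "Messi"}, 60, 2, 0),
    ({"Messi"}, 30, 0, 1),
]
print(basic_plus_minus(match, "Xavi"))   # 3.0 per 90
print(basic_plus_minus(match, "Messi"))  # 1.0 per 90
```

The flaws are immediately visible: Xavi's number depends entirely on who else was on the pitch and what happened to go in while he played, which is exactly what the regression-based variants try to control for.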

Several methods exist in other sports to control for these factors (basically they add in a lot more maths) and some of these have found their way to football. Ford Bohrmann and Howard Hamilton had a crack at the problem here and here respectively but found the results unsatisfactory. Martin Eastwood used a Bayesian approach to rate players based on the goal difference of their team while they are playing, which came up with more encouraging results.

Expected goals

One of the potential issues with applying plus-minus to football is the low scoring nature of the sport. A heavily influential player could play a run of games where his side can’t hit the proverbial barn door, whereas another player could be fortunate to play during a hot-streak from one of his fellow players. Goal-scoring is noisy in football, so perhaps we can utilise a measure that irons out some of this noise but still represents a good measure of team performance. Step forward expected goals.

Instead of basing the plus-minus calculation on goals, I’ve used my non-shot expected goal numbers as the input. The method splits each match into separate periods and logs which players are on the pitch at a given time. A new segment starts when a lineup changes, i.e. when a substitution occurs or a player is sent off. The expected goals for each team are then calculated for each period and converted to a value per 90 minutes. Each player is a ‘variable’ in the equation, with the idea being that their contribution to a team’s expected goal difference can be ‘solved’ via the regression equation.

For more details on the maths side of plus-minus, I would recommend checking out Howard Hamilton’s article. I used ridge regression, which is similar to linear regression but the calculated coefficients tend to be pulled towards zero (essentially it increases bias while limiting huge outliers, so there is a tradeoff between bias and variance).
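As a simplified sketch of the setup (not my exact implementation), the design matrix and closed-form ridge solution might look like this, with segment length used both to scale xG differences to per-90 values and to weight the segments; the toy data uses three players where real lineups would have eleven per side:

```python
import numpy as np

def xg_plus_minus(segments, players, alpha=50.0):
    """Ridge-regression plus-minus via the closed form
    beta = (X'WX + alpha*I)^-1 X'Wy, where each row of X flags which
    players were on the pitch in one segment, y is that segment's xG
    difference per 90 minutes and W weights segments by their length.

    segments: list of (players_on_pitch, minutes, xg_for, xg_against)
    """
    idx = {p: i for i, p in enumerate(players)}
    X = np.zeros((len(segments), len(players)))
    y = np.zeros(len(segments))
    w = np.zeros(len(segments))
    for r, (on_pitch, mins, xg_for, xg_against) in enumerate(segments):
        for p in on_pitch:
            X[r, idx[p]] = 1.0
        y[r] = 90.0 * (xg_for - xg_against) / mins
        w[r] = mins
    W = np.diag(w)
    A = X.T @ W @ X + alpha * np.eye(len(players))
    beta = np.linalg.solve(A, X.T @ W @ y)
    return dict(zip(players, beta))

# Toy example: A and B play the first hour, then C replaces A
segments = [
    ({"A", "B"}, 60, 1.2, 0.4),
    ({"B", "C"}, 30, 0.2, 0.5),
]
ratings = xg_plus_minus(segments, ["A", "B", "C"], alpha=5.0)
```

Here player A, who was on the pitch only during the strongly positive xG period, comes out with the highest coefficient, while the regularization pulls all three scores towards zero.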

As a first step, I’ve calculated the plus-minus figures over the previous three English Premier League seasons (2012/13 to 2014/15). Every player that has appeared in the league is included as I didn’t find there was much difference when excluding players under a certain threshold of minutes played (this also avoids having to include such players in some other manner, which is typically done in basketball plus-minus). However, estimates for players with fewer than approximately 900 minutes played are less robust.

The chart below shows the proportion of players with a certain plus-minus score per 90 minutes played. As far as interpretation goes, if we took a team made up of 11 players, each with a plus-minus score of zero, the expected goal difference of the team would add up to zero. If we then replaced one of the players with one with a plus-minus of 0.10, the team’s expected goal difference would be raised to 0.10.


Distribution of xG plus-minus scores.

The range of plus-minus scores is from -0.15 to 0.15, so replacing a player with a plus-minus score of zero with one with a score of 0.15 would equate to an extra 5.7 goals over a Premier League season. Based on this analysis by James Grayson, that would equate to approximately 3.5-4.0 points over a season on average. This is comparable to figures published relating to calculations based on the Goalimpact metric system discussed earlier. That probably seems a little on the low side for what we might generally assume would be the impact of a single player, which could point towards either the method narrowing the distribution too much (my hunch) or an overestimate in our intuition. Validation will have to wait for another day.

Most valuable players

Below is a table of the top 13 players according to the model. Vincent Kompany is ranked the highest by this method; on one hand this is surprising given the often strong criticism that he receives but then on the other, when he is missing, those replacing him in Manchester City’s back-line look far worse and the team overall suffers. According to my non-shots xG model, Manchester City have been comfortably the best team over the previous three seasons and are somewhat accordingly well-represented here.


Top 13 players by xG plus-minus scores for the 2012/13-2014/15 Premier League seasons. Minimum minutes played was 3420 i.e. equivalent to a full 38 match season.

Probably the most surprising name on the list is at number three…step forward Joe Allen! I doubt even Joe’s closest relatives would rate him as the third best player in the league but I think that what the model is trying to say here is that Allen is a very valuable cog who improves the overall performance level of the team. Framed in that way, it is perhaps slightly more believable (if only slightly) that his skill set gets more out of his team mates. When fit, Allen does bring added intelligence to the team and as a Liverpool fan, ‘intelligence’ isn’t usually a word I associate with the side. Highlighting players who don’t typically stand-out is one of the goals of this sort of analysis, so I’ll run with it for now while maintaining a healthy dose of skepticism.

I chose 13 as the cutoff in the table so that the top goalkeeper on the list, Hugo Lloris, is included, meaning an actual team could be put together from it. Note that this doesn’t factor in shot-stopping (I’ve actually excluded rebound shots, which might have been one way for goalkeepers to influence the scores more directly), so the rating for goalkeepers should primarily relate to other aspects of goalkeeping skill. Goalkeepers are probably still quite difficult to pin down with this method as they rarely miss matches, so there is a fairly large caveat attached to their ratings.

As this is just an initial look, I’m going to hold off on publishing a full list, but I definitely will in time once I’ve done some more validation work and ironed out some kinks.

Validation, Repeatability & Errors

Fairly technical section. You’ve been warned.

One of the key facets of using ridge regression is choosing a ‘suitable’ regularization parameter, which is what controls the bias-to-variance tradeoff; essentially larger values will pull the scores closer to zero. Choosing this objectively is difficult and in reality, some level of subjectivity is going to be involved at some stage of the analysis. I did A LOT of cross-validation analysis where I split the match segments into even and odd sets and ran the regression while varying a bunch of parameters (e.g. minutes cutoff, weighting of segment length, the regularization value). I then looked at the error between the regression coefficients (the player plus-minus scores) in the out-of-sample set compared to the in-sample set to choose my parameters. For the regularization parameter, I chose a value of 50 as that was where the error reached a minimum initially with relatively little change for larger values.
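The split-half procedure described above can be sketched as follows. The segment data here is synthetic (a random binary players-on-pitch matrix with noisy per-segment xG differences), standing in for the real dataset purely to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge(X, y, alpha):
    """Closed-form ridge regression coefficients."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Synthetic stand-in for the match-segment data
n_segments, n_players = 400, 40
X = (rng.random((n_segments, n_players)) < 0.3).astype(float)
y = X @ rng.normal(0, 0.1, n_players) + rng.normal(0, 0.5, n_segments)

# Split segments into even/odd halves, fit each separately, and pick
# the alpha where the two sets of coefficients agree best
def split_half_error(alpha):
    b_even = ridge(X[0::2], y[0::2], alpha)
    b_odd = ridge(X[1::2], y[1::2], alpha)
    return np.mean((b_even - b_odd) ** 2)

alphas = [0.1, 1, 10, 50, 100, 500]
errors = {a: split_half_error(a) for a in alphas}
best_alpha = min(errors, key=errors.get)
print("best alpha:", best_alpha)
```

The real analysis also varied the minutes cutoff and segment weighting alongside the regularization value; this sketch only covers the last of those.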

I also did some repeatability testing comparing consecutive seasons. As is common with plus-minus, the repeatability is very limited. That isn’t much of a surprise as the method is data-hungry and a single season doesn’t really cut it for most players. The bias introduced by the regularization doesn’t help either here. I don’t think that this is a death-knell for the method though, given the challenges involved and the limitations of the data.

In the table above, you probably noticed I included a column for errors, specifically the standard error. Typically, this has been where plus-minus has fallen down, particularly in relation to football. Simply put, the errors have been massive and have rendered interpretation practically impossible e.g. the errors for even the most highly rated players have been so large that statistically speaking it has been difficult to evaluate whether a player is even ‘above-average’.

I calculated the errors from the ridge regression via bootstrap resampling. There are some issues with combining ridge regression and bootstrapping (see discussion here and page 18 here) but these errors should give us some handle on the variability in the ratings.
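A bootstrap of this kind can be sketched as below: resample the match segments with replacement, refit the ridge regression each time, and take the spread of each player's coefficient as its standard error. Again the segment data is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge(X, y, alpha=50.0):
    """Closed-form ridge regression coefficients."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Synthetic segment data standing in for the real match segments
n_segments, n_players = 300, 30
X = (rng.random((n_segments, n_players)) < 0.3).astype(float)
y = X @ rng.normal(0, 0.1, n_players) + rng.normal(0, 0.5, n_segments)

# Resample whole segments with replacement and refit each time
boot = []
for _ in range(200):
    rows = rng.integers(0, n_segments, n_segments)
    boot.append(ridge(X[rows], y[rows]))

# One standard error per player, from the bootstrap distribution
std_errors = np.std(boot, axis=0)
print(std_errors.shape)
```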

You can see above that the errors are reasonably large, so the separation between players isn’t as good as you would want. In terms of their magnitude relative to the average scores, the errors are comparable to those I’ve found published for basketball. That provides some level of confidence as they’ve been demonstrated to have genuine utility there. Note that I’ve not cherry-picked the players above in terms of their standard errors either; encouragingly the errors don’t show any relationship with minutes played after approximately 900 minutes.

The gold road’s sure a long road

That is essentially it so far in terms of what I’m ready to share publicly. In terms of next steps, I want to expand this to include other leagues so that the model can keep track of players transferring in and out of a league. For example, Luis Suárez disappears when the model reaches the 2014/15 season, when in reality he was settling in quite nicely at Barcelona. That likely means that his rating isn’t a true reflection of his overall level over the period.

Evaluating performance over time is also a big thing I want to be able to do; a three year average is probably not ideal, so either some weighting for more recent seasons or a moving two season window would be better. This is typically what has been done in basketball and based on initial testing, it doesn’t appear to add more noise to the results.

Validating the ratings in some fashion is going to be a challenge but I have some ideas on how to go about that. One of the advantages of plus-minus style metrics is that they break-down team level performance to the player level, which is great as it means that adding the players back up into a team or squad essentially correlates perfectly with team performance (as represented by expected goals here). However, that does result in a tautology if the validation is based on evaluating team performance unless there are fundamental shifts in team makeup e.g. a large number of transfers in and out of a squad or injuries to key personnel.

This is just a start, so there will be more to come over time. The aim isn’t to provide a perfect representation of player contribution but to add an extra viewpoint to squad and player evaluation. Combining it with other data analysis and scouting would be the longer-term goal.

I’ll leave you with piano carrier extraordinaire, Joe Allen.


Joe Allen on hearing that he is Liverpool’s most important player over the past three years.

Not quite the same old Arsenal

The narrative surrounding Arsenal has been strong this week, with their fall to fourth place in the table coming on Groundhog Day no less. This came despite a strong second half showing against Southampton, with Fraser Forster denying them. Arsenal’s season has been characterised by several excellent performances in terms of expected goals but the scoreline hasn’t always reflected their statistical dominance. Colin Trainor illustrated their travails in front of goal in this tweet.

I wrote in this post on how Arsenal’s patient approach eschews more speculative shots in search of high quality chances and that this was seemingly more pronounced this season. Arsenal are highly rated by expected goal models this season but traditional shot metrics are nowhere near as convinced.

Analytical folk will point to the high quality of Arsenal’s shots this season to explain the difference, where quality is denoted by the average probability that a shot will be scored. For example, a team with an average shot quality of 0.10 would ‘expect’ to score around 10% of their shots taken.

In the chart below, I’ve looked at the full distribution of Arsenal’s shots in open-play this season in terms of ‘shot quality’ and compared them with their previous incarnations and peers from the 2012/13 season through to the present. Looking at shot quality in this manner illustrates that the majority of shots are of relatively low quality (less than 10% chance of being scored) and that the distribution is heavily-skewed.


Proportion of total shots in open-play according to the probability of them being scored (expected goals per shot). Grey lines are non-Arsenal teams from the English Premier League from 2012/13 to the present. Blue lines are previous Arsenal teams, while red is Arsenal from this season. Data via Opta.

In terms of Arsenal, what stands out here is that their current incarnation are taking a smaller proportion of ‘low-quality’ shots (those with an expected goal estimate from 0-0.1) than any previous team by a fairly wide margin. At present, 59% of Arsenal’s shots reside in this bracket, with the next lowest sitting at 64%. Their absolute number of shots in this bracket has also fallen compared to previous seasons.
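For reference, the shot-quality proportions behind these figures come from a simple binning of per-shot expected goal values, which can be sketched as below; the xG values are invented for illustration:

```python
import numpy as np

# Hypothetical per-shot xG values for one team-season
shot_xg = np.array([0.03, 0.05, 0.08, 0.02, 0.12, 0.25, 0.07, 0.35,
                    0.04, 0.09, 0.18, 0.06, 0.45, 0.11, 0.02])

# Bin shots into 0.1-wide shot-quality brackets, convert to shares
counts, edges = np.histogram(shot_xg, bins=np.arange(0, 1.1, 0.1))
proportions = counts / len(shot_xg)

# Share of 'low-quality' shots (xG below 0.1)
print(f"{proportions[0]:.0%}")  # 60%
```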

Moving along the scale, Arsenal reside along the upper edge in terms of these higher quality shots and actually have the largest proportion in the 0.2-0.3 and 0.3-0.4 ranges. As you would expect, they’ve traded lower quality efforts for higher quality shots according to the data.

Arsenal typically post above average shot quality figures but the shift this season appears to be significant. The question is why?

Mesut Özil?

One big change this season is the sustained presence (and excellence) of Mesut Özil; so far this season he has made 22 appearances (playing in 88% of available minutes) compared to 22 appearances last season (54%) and 26 matches in his debut season (63%). According to numbers from the Football in the Clouds website, his contribution to Arsenal’s shots while he is on the pitch is at 40% compared to 30% in 2014/15. Daniel Altman also illustrated Özil’s growing influence in his post in December.

Özil is the star that Arsenal’s band of attacking talent orbits, so it is possible that he is driving this focus on quality via his creative skills. His attacking contribution in terms of shots and shot-assists is among the highest in the league but is heavily-skewed towards assisting others, which is unusual among high-volume contributors.

Looking at the two previous seasons though, there doesn’t appear to be any great shift in Arsenal’s shot quality during the periods when Özil was out of the team through injury. His greater influence and regular presence in the side this season has probably shifted the dial but quantifying how much would require further analysis.


Another potential driver could be that Wenger and his coaching staff have attempted to adjust Arsenal’s tactics/style with a greater focus on quality.

Below is a table of Arsenal’s ‘volume’ shooters over the past few seasons, where I’ve listed their number of shots from outside of the box per 90 minutes and the proportion of their shots from outside the box. Note that these are for all shots, so set-pieces are included but it shouldn’t skew the story too much.

The general trend is that Arsenal’s players have been taking fewer shots from outside of the box this season compared to previous seasons, and that there has also been a proportional decline for most players. Some of that may be driven by changing roles/positions in the team, but there appears to be a clear shift in their shot profiles. Giroud, for example, has taken just 3 shots from outside the box this season, in stark contrast to his previous profile.

Given the data I’ve already outlined, the above isn’t unexpected but then we’re back to the question of why?

Wenger has mentioned expected goals on a few occasions now and has reportedly been working more closely with the analytics team that Arsenal acquired in 2012. Given his history and reputation, we can be relatively sure that Wenger would appreciate the merits of shot quality; could the closer working relationship and trust developed with the analytics team have led to him placing an even greater emphasis on seeking better shooting opportunities?

The above is just a theory but the shift in emphasis does appear to be significant and is an interesting feature to ponder.

Adjusted expectations?

Whatever has driven this shift in Arsenal’s shot profile, the change is quite pronounced. From an opposition strategy perspective, this presents an interesting question: if you’re aware of this shift in emphasis, whether through video analysis or data, do you alter your defensive strategy accordingly?

While Arsenal’s under-performance in terms of goals versus expected goals currently looks like a case of variance biting hard, could this be prolonged if their opponents adjust? It doesn’t look like their opponents have altered tactics thus far based on examining the data but having shifted the goalposts in terms of shot quality, could this be their undoing?

Shooting the breeze

Who will win the Premier League title this season? While Leicester City and Tottenham Hotspur have their merits, the bookmakers and public analytics models point to a two-horse race between Manchester City and Arsenal.

From an analytics perspective, this is where things get interesting, as depending on your metric of choice, the picture painted of each team is quite different.

As discussed on the recent StatsBomb podcast, Manchester City are heavily favoured by ‘traditional’ shot metrics, as well as by combined team ratings composed of multiple shooting statistics (a method pioneered by James Grayson). Of particular concern for Arsenal are their poor shot-on-target numbers.

However, if we look at expected goals based on all shots taken and conceded, then Arsenal lead the way: Michael Caley has them with an expected goal difference per game of 0.98, while City lie second on 0.83. My own figures in open-play have Arsenal ahead but by a narrower margin (0.69 vs 0.65); Arsenal have a significant edge in terms of ‘big chances’, which I don’t include in my model, whereas Michael does include them. Turning to my non-shots based expected goal model, Arsenal’s edge is extended (0.66 vs 0.53). Finally, Paul Riley’s expected goal model favours City over Arsenal (0.88 vs 0.69), although Spurs are actually rated higher than both. Paul’s model considers shots on target only, which largely explains the contrast with other expected goal models.

Overall, City are rated quite strongly across the board, while Arsenal’s level is more mixed. The above isn’t an exhaustive list of models and metrics but the differences in how they rate the two main title contenders are apparent. All of these metrics have demonstrated utility at making in-season predictions but the assumptions they make about the relative strength of these two teams clearly differ.

The question is why? If we look at the two extremes in terms of these methods, you would have total shots difference (or ratio, TSR) at one end and non-shots expected goals at the other i.e. one values all shots equally, while the other doesn’t ‘care’ whether a shot is taken or not.

There likely exists a range of happy mediums in terms of emphasising the taking of shots versus maximising the likelihood of scoring from a given attack. Such a trade-off likely depends on individual players in a team, tactical setup and a whole other host of factors including the current score line and incentives during a match.

However, a team could be accused of shooting too readily, which might mean spurning a better scoring opportunity in favour of a shot from long-range. Perhaps data can pick out those ‘trigger-happy’ teams versus those who adopt a more patient approach.

My non-shots based expected goal model evaluates the likelihood of a goal being scored from an individual chain of possession. If I switch goals for shots in the maths, then I can calculate the probability that a possession will end with a shot. We’ll refer to this as ‘expected shots’.
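As a sketch of that switch (everything below is illustrative – the probabilities are invented rather than taken from the fitted model), a team’s ‘expected shots’ figure is simply the sum of its per-possession shot probabilities:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-possession shot probabilities for one 38-game
# season, as would be produced by a logistic regression refitted
# with a shot (rather than a goal) as the positive outcome.
n_games = 38
shot_probs = rng.uniform(0.0, 0.3, size=n_games * 90)

def expected_shots_per_game(probs, games):
    # Expected shots are the sum of per-possession shot probabilities
    return probs.sum() / games

xshots_pg = expected_shots_per_game(shot_probs, n_games)
```

With these made-up inputs the figure lands in the low teens per game, in line with typical Premier League shot volumes.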

I’ve done this for the 2012/13 to 2014/15 Premier League seasons. Below is the data for the actual versus expected number of shots per game that each team attempted.


Actual shots per game compared with expected shots per game. Black line is the 1:1 line. Data via Opta.

We can see that the model does a reasonable job of capturing shot expectation (r-squared is at 0.77, while the mean absolute error is 0.91 shots per game). There is some bias in the relationship though, with lower shot volume teams being estimated more accurately, while higher shot volume sides typically shoot less than expected (the slope of the linear regression line is 0.79).
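For reference, those diagnostics can be reproduced like this (the data here is synthetic, generated to mimic the reported slope and scatter rather than taken from the model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic actual-vs-expected shots per game for 60 team-seasons,
# built with a sub-unity slope to mimic the bias described above.
expected = rng.uniform(8.0, 18.0, size=60)
actual = 0.79 * expected + 2.5 + rng.normal(0.0, 0.9, size=60)

# Slope/intercept of the linear regression between actual and expected
slope, intercept = np.polyfit(expected, actual, deg=1)

# r-squared and mean absolute error between actual and expected
r = np.corrcoef(expected, actual)[0, 1]
r_squared = r ** 2
mae = float(np.mean(np.abs(actual - expected)))
```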

If we take the model at face value and assume that it is telling a reasonable approximation of the truth, then one interpretation would be that teams with higher expected shot volumes are more patient in their approach. Historically these have been teams that tend to dominate territory and possession such as Manchester City, Arsenal and Chelsea; are these teams maintaining possession in the final third in order to take a higher value shot? It could also be due to defenses denying these teams shooting opportunities but looking at the figures for expected and actual shots conceded, the data doesn’t support that notion.

What is also clear from the graph is that it appears to match our expectations in terms of a team being ‘trigger-happy’ – by far the largest outlier in terms of actual shots minus expected shots is Tottenham Hotspur’s full season under André Villas-Boas, a team that was well known for taking a lot of shots from long-range. We also see a decline as we move into the 2013/14 season, when AVB was fired after 16 matches (42% of the full season), and then the 2014/15 season under Pochettino. Observations such as these that pass the ‘sniff-test’ can give us a little more confidence in the metric/method.

If we move back to the season at hand, then we see some interesting trends emerge. Below I’ve added the data points for this current season and highlighted Arsenal, Manchester City, Liverpool and Tottenham (the solid black outlines are for this season). Throughout the dataset, we see that Arsenal have been consistently below expectations in terms of the number of shots they attempt and that this is particularly true this season. City have also fallen below expectations but to a smaller extent than Arsenal and are almost in line with expectations this year. Liverpool and Tottenham have taken a similar number of shots but with quite different levels of expectation.


Actual shots per game compared with expected shots per game. Black line is the 1:1 line. Markers with solid black outline are for the current season. Data via Opta.

None of the above indicates that there is a better way of attempting to score but I think it does illustrate that team style and tactics are important factors in how we build and assess metrics. Arsenal’s ‘pass it in the net’ approach has been known (and often derided) ever since they last won the league and it is quite possible that models that are more focused on quality in possession will over-rate their chances in the same way that focusing on just shots would over-rate AVB’s Spurs. Manchester City have run the best attack in the league over the past few seasons by combining the intricate passing skills of their attackers with the odd thunder-bastard from Yaya Touré.

The question remains though: who will win the Premier League title this season? Will Manchester City prevail due to their mixed-approach or will Arsenal prove that patience really is a virtue? The boring answer is that time will tell. The obvious answer is Leicester City.

Unexpected goals

A sumptuous passing move ends with the centre forward controlling an exquisite through-ball inside the penalty area before slotting the ball past the goalkeeper.


A sumptuous passing move ends with the centre forward controlling an exquisite through-ball inside the penalty area before the goalkeeper pulls off an incredible save.


A sumptuous passing move ends with the centre forward controlling an exquisite through-ball inside the penalty area before falling on his arse.




Events in football matches can take many turns that will affect the overall outcome, whether it be a single event, a match or season. In the above examples, the centre forward has received the ball in a super-position but what happens next varies drastically.

Were we to assess the striker or his team, traditional analysis would focus on the first example as goals are the currency of football. The second example would appeal to those familiar with football analytics, which has illustrated that the scoring of goals is a noisy endeavour that can be potentially misleading; focusing on shots and/or the likelihood of a shot being scored is the foundation of many a model to assess players and teams. The third example will often be met with a shrug and a plethora of gifs on social media.

This third example is what I want to examine here by building a model that accounts for these missed opportunities to take a shot.

Expected goals

Expected goals are a hugely popular concept within football analytics and are becoming increasingly visible outside of the air-conditioned basements frequented by analysts. The fundamental basis of expected goals is assigning a value to the chances that a team create or concede.

Traditionally, such models have focused on shots, building upon earlier work relating to shots and shots on target. Many models have sprung up over the past few years, with Michael Caley’s and Paul Riley’s probably being the most prominent, particularly in terms of publishing their methods and results.

More recently, Daniel Altman presented a model that went ‘beyond shots’, which aimed to value not just shots but also attacking play that moved the ball into dangerous areas. Various analysts, including myself, have looked at the value of passing in a similar vein e.g. Dustin Ward and Sam Gregory have looked at dangerous passing here and here respectively.

Valuing possession

The model that I have built is essentially a conversion of my dangerous possession model. Each sequence of possession that a team has is classified according to how likely a goal is to be scored.

This is based on a logistic regression that includes various factors that I will outline below. The key thing is that this is based on all possessions, not just those ending with shots. The model is essentially calculating the likelihood of a shot occurring in a given position on the field and then estimating the probability of a potential shot being scored. Consequently, we can put a value on good attacking (or poor defending) that doesn’t result in a shot being taken.
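A minimal sketch of the idea follows; the coefficients below are invented purely for illustration (the real model is fitted on the Opta data and uses more terms), but the structure – a logistic regression over possession features – is the same:

```python
import math

# Illustrative coefficients only -- not the fitted values.
COEFS = {
    "intercept": -1.0,
    "end_dist": -0.08,    # distance (m) from goal where possession ends
    "pass_dist": -0.03,   # location of the final pass or cross
    "start_dist": -0.01,  # where the possession started
    "through_ball": 0.9,  # eliminates defenders
    "dribble": 0.4,
    "cross": -0.5,        # crossing moves convert less often
}

def p_goal(possession):
    """Logistic-regression probability that a possession yields a goal."""
    z = COEFS["intercept"]
    for name, beta in COEFS.items():
        if name != "intercept":
            z += beta * possession.get(name, 0.0)
    return 1.0 / (1.0 + math.exp(-z))

# A through-ball received 12m out, from a move starting 40m from goal
close = p_goal({"end_dist": 12, "pass_dist": 20,
                "start_dist": 40, "through_ball": 1})
# The same move ending 30m from goal
far = p_goal({"end_dist": 30, "pass_dist": 20,
              "start_dist": 40, "through_ball": 1})
```

Summing these probabilities over all of a team’s possessions gives the non-shot expected goal totals used below.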

I’ve focused on open-play possessions here and the data covers the English Premier League from 2012/13 to 2014/15.

Below is a summary of the major location-related drivers of the model.


Probability of a goal being scored based on the end point of possession (top panel) and the location of the final pass or cross during the possession (bottom panel).

By far the biggest factor is where the possession ends; attacks that end closer to goal are valued more highly, which is an intuitive and not at all ground-breaking finding.

The second panel illustrates the value of the final pass or cross in an attacking move. The closer to goal this occurs, the more likely a goal is to be scored. Again this is intuitive and has been illustrated previously by Michael Caley.

Where the possession starts is also factored into the model as I found that this can increase the likelihood of a goal being scored. If a team builds their attack from higher up the pitch, then they have a better chance of scoring. I think this is partly a consequence of simply being closer to goal, so the distance to move the ball into a dangerous position is shortened. The other probable big driver here is that the likelihood of a defence being out of position is increased e.g. a turnover of possession through a high press.

The other factors not related to location include through-ball passes, which boost the chances of a goal being scored (such passes will typically eliminate defenders during an attacking move and present attackers with more time and space for their next move). Similarly, dribbles boost the likelihood of a goal being scored, although not to the same extent as a through-ball. Attacking moves that feature a cross are less likely to result in a goal. These factors are reasonably well established in the public analytics literature, so it isn’t a surprise to see them crop up here.

How does it do?

Below are some plots and a summary table comparing actual goals to expected goals for each team in the dataset. The correlation is stronger for goals for than against, although the bias is larger also as the ‘best’ teams tend to score more than expected and the ‘worst’ teams score fewer than expected. Looking at goal difference, the relationship is very strong over a season.

I also performed several out-of-sample tests of the regressions by splitting the dataset into two sets (2012/13-2013/14 and 2014/15 only) and running cross-validation tests on them. The model performed well out-of-sample, with the summary statistics being broadly similar when compared to the in-sample tests.


Comparison between actual goals and expected goals. Red dots are individual teams in each season. Dashed black line is 1:1 line and solid black line is the line of best fit.


Comparison between actual goals and expected goals. MAE refers to Mean Absolute Error, while slope and intercept are calculated from a linear regression between the actual and expected totals.

I also ran the regression on possessions ending in shots and the results were broadly similar, although I would say the shot-based expected goal model performed slightly better. Overall, the non-shots based expected goals model is very good at explaining past performance and is comparable to more traditional expected goal models.

On the predictive side, I ran a similar test to what Michael Caley did here as a quick check of how well the model did. I looked at each club’s matches in chronological order and calculated how well the expected goal models predicted actual goals in their next 19 matches (half a season in the Premier League) using an increasing number of prior matches to base the prediction on. For example, for a 10 match sample, I started at matches 1-10 and calculated statistics for matches 11-30, followed by matches 2-11 for matches 12-31 and so on.

Note that the ‘wiggles’ in the data are due to the number of teams changing as we move from one season’s worth of games to another i.e. some teams have only 38 matches in the dataset, while others have 114. I also ran the same analysis for the next 38 matches and found similar features to those outlined below. I also did out-of-sample validation tests and found similar results, so I’m just showing the full in-sample tests below.
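The rolling procedure can be sketched as follows, assuming chronologically ordered per-match arrays for a single team (the function and variable names are my own, not from the original analysis):

```python
import numpy as np

def rolling_prediction_mae(xg, goals, n_prior, horizon=19):
    """Slide a window of `n_prior` matches along a team's matches:
    predict total goals over the next `horizon` matches from the mean
    xG rate in the window, and return the mean absolute error."""
    errors = []
    for start in range(len(goals) - n_prior - horizon + 1):
        prior = xg[start:start + n_prior]
        future = goals[start + n_prior:start + n_prior + horizon]
        errors.append(abs(np.mean(prior) * horizon - np.sum(future)))
    return float(np.mean(errors))
```

Repeating this for every team and every window size, and swapping the xG input for shot-based xG or actual goals, gives the comparison curves.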

Capability of non-shot based and shot-based expected goals to predict future goals over the next 19 matches using differing numbers of previous matches as the input. Actual goals are also shown for reference. R-squared is shown on the left, while the mean absolute error is shown on the right.

I’m not massively keen on using r-squared as a diagnostic for predictions, so I also calculated the mean absolute errors for the predictions. The non-shots expected goals model performs very well here and compares very favourably with the shots-based version (the errors and correlations are typically marginally better). After around 20-30 matches, expected goals and actual goals converge in terms of their predictive capability – based on some other diagnostic tests I’ve run, this is around the point where expected goals tends to ‘match’ quite well with actual goals i.e. actual goals regresses to our mean expectation, so this convergence here is not too surprising.

The upshot is that the expected goal models perform very well and are a better predictor of future goals than goals themselves, particularly over small samples. Furthermore, they pick up information about future performance very quickly as the predictive capability tends to flat-line after less than 10 matches. I plan to expand the model to include set-play possessions and perform point projections, where I will do some more extensive investigation of the predictive performance of the model but I would say this is an encouraging start.

Bonus round

Below are the current expected goal difference rankings for the current Premier League season. The numbers are based on the regression I performed on the 2012/13-2014/15 dataset. I’ll start posting more figures as the season continues on my Twitter feed.

Open-play expected goal difference totals after 19 games of the 2015/16 Premier League season.


On single match expected goal totals

It’s been a heady week in analytics-land with expected goals hitting the big time. On Friday, they appeared in the Times courtesy of Rory Smith, Sunday saw them crop up on bastion of proper football men, Sunday Supplement, before again featuring via the Times’ Game Podcast. Jonathan Wilson then highlighted them in the Guardian on Tuesday before dumping them in a river and sorting out an alibi.

The analytics community promptly engaged in much navel-gazing and tedious argument to celebrate.

Expected goals

The majority of work on the utility of expected goals as a metric has focused on the medium-to-long term; see work by Michael Caley detailing his model here for example (see his Twitter timeline for examples of his single match expected goal maps). Work on expected goals over single matches has been sparser, aside from those highlighting the importance of accounting for the differing outcomes when there are significant differences in the quality of chances in a given match; see these excellent articles by Danny Page and Mark Taylor.

As far as expected goals over a single match are concerned, I think there are two overarching questions:

  1. Do expected goal totals reflect performances in a given match?
  2. Do the values reflect the number of goals a team should have scored/conceded?

There are no doubt further questions that we could add to the list but I think these relate most to how these numbers are often used. Indeed, Wilson’s piece in particular covered these aspects including the following statement:

According to the Dutch website 11tegen11, Chelsea should have won 2.22-0.77 on expected goals.

There are lots of reasons why ‘should’ is problematic in that article but, ignoring the probabilistic nature of and uncertainties surrounding these expected goal estimates, let’s look at how well expected goals match up over various numbers of shots.

You’ve gotta pick yourself up by the bootstraps

Below are various figures exploring how well expected goals matches up with actual goals. These are based on an expected goal model that I’ve been working on, the details of which aren’t too relevant here (I’ve tested this on various models with different levels of complexity and the results are pretty consistent). The figures plot the differences between the total number of goals and expected goals when looking at certain numbers of shots. These residuals are calculated via bootstrap resampling, which works by randomly extracting groups of shots from the data-set and calculating actual and expected goal totals and then seeing how large the difference is.
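A sketch of that resampling machinery is below. The shot data here is simulated and perfectly calibrated by construction (goals are drawn from the xG values themselves), which real models won’t be, so it only illustrates the mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated shot-level data: skewed xG values and goal outcomes drawn
# from those probabilities, so goals match xG in expectation.
n_shots = 100_000
xg = rng.beta(1.2, 10.0, size=n_shots)
goals = rng.binomial(1, xg)

def bootstrap_residuals(xg, goals, sample_size, n_boot=2000):
    """Distribution of (goals - xG) over resampled groups of shots."""
    res = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, len(xg), size=sample_size)
        res[i] = goals[idx].sum() - xg[idx].sum()
    return res

res_season = bootstrap_residuals(xg, goals, 500)  # ~a season of shots
res_match = bootstrap_residuals(xg, goals, 13)    # ~a single match
```

Relative to the number of goals involved, the 500-shot residuals are far tighter: the per-shot spread shrinks roughly with the square root of the sample size, which is why single-match totals are so noisy.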

The top plot is for 500 shot samples, which equates to the number of shots that a decent shots team might take over a Premier League season. The residuals show a very narrow distribution, which closely resembles a Gaussian or normal distribution, with the centre of the peak being very close to zero i.e. goal and expected goal values are on average very similar over these shot sample sizes. There is a slight tendency for expected goals to under-predict goals here, although the difference is quite minor over these samples (2.6 goals over 500 shots). The take home from this plot is that we would anticipate expected and actual goals for an average team being approximately equivalent over such a sample (with some level of randomness and bias in the mix).

The middle plot is for samples of 50 shots, which would equate to around 3-6 matches at the team level. The distribution has a similar shape to the one for 500 shots but is quite a lot wider; we would therefore expect random variation to play a larger role over this sample than over the 500 shot sample, which would manifest itself in teams or players over- or under-performing their expected goal numbers. The other factor at play will be aspects not accounted for by the model, which may be more important over smaller samples but even out over larger ones.

One of these things is not like the others

The bottom plot is for samples of 13 shots, which equates to the approximate average number of shots taken by a team in an individual match. This is where expected goals starts having major issues; the distributions are very wide and have multiple local maxima. What that means is that over a single match, expected goal totals can be out by a very large amount (routinely exceeding one goal) and the estimates are pretty poor over these small samples.

Such large residuals aren’t entirely unexpected but the multiple peaks make reporting a ‘best’ estimate extremely troublesome.

I tested these results using some other publicly available expected goal estimates (kudos to American Soccer Analysis and Paul Riley for publishing their numbers) and found very similar results. I also did a similar exercise using whole match totals rather than individual shots and found much the same.

I also checked that this wasn’t a result of differing scorelines when each shot was taken (game state as the analytics community calls it) by only looking at shots when teams were level – the results were the same, so I don’t think you can put this down to differences in game state. I suspect this is just a consequence of elements of football that aren’t accounted for by the model, which are numerous; such things appear to even out over larger samples (over 20 shots, the distributions look more like the 50 and 500 shot samples). As a result, teams/matches where the number of shots is larger will have more reliable estimates (so take figures involving Manchester United with a chip-shop load of salt).

Essentially, expected goal estimates are quite messy over single matches and I would be very wary of saying that a team should have scored or conceded a certain number of goals.


So, is that it for expected goals over a single match? While I think there are a lot of issues based on the results above, they can still shed light on the balance of play in a given match. If you’ve made it this far then I’m assuming you agree that metrics and observations that go beyond the final scoreline are potentially useful.

In the figure below, I’ve averaged actual goal difference from individual matches into expected goal ‘buckets’. I excluded data beyond +/- two expected goals as the sample size was quite small, although the general trend continues. Averaging like this hides a lot of detail (as partially illustrated above) but I think it broadly demonstrates how the two match up.
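The bucketing itself is straightforward. The match data below is synthetic (loosely correlated, in the way real xG and goal differences are) and only demonstrates the method:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic single-match data: an xG difference and a noisy integer
# goal difference built on top of it.
xg_diff = rng.normal(0.0, 1.0, size=1000)
goal_diff = np.round(xg_diff + rng.normal(0.0, 1.3, size=1000))

# Average actual goal difference within 0.5-xG buckets on [-2, 2]
edges = np.arange(-2.0, 2.5, 0.5)
bucket = np.digitize(xg_diff, edges)
bucket_means = [float(goal_diff[bucket == b].mean())
                for b in range(1, len(edges))]
```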

Actual goals compared to expected goals for single matches when binned into 0.5 xG buckets.


The figure also illustrates that ‘winning’ the expected goals (xG difference greater than 1) doesn’t always mean winning the actual goal battle, particularly for the away team. James Yorke found something similar when looking at shot numbers. Home teams ‘scoring’ with a 1-1.5 xG advantage outscore their opponents around 66% of the time based on my numbers but this drops to 53% for away teams; away teams have to earn more credit than home teams in order to translate their performance into points.

What these figures do suggest though is that expected goals are a useful indicator of quality over a single match i.e. they do reflect the balance of play in a match as measured by the volume and quality of chances. Due to the often random nature of football and the many flaws of these models, we wouldn’t expect a perfect match between actual and expected goals but these results suggest that incorporating these numbers with other observations from a match is potentially a useful endeavour.


Don’t say:

Team x should have scored y goals today.

Do say:

Team x’s expected goal numbers would typically have resulted in the following…here are some observations of why that may or may not be the case today.

Liverpool Looking Up? EPL 2015/16 Preview

Originally published on StatsBomb.

After the sordid love affair that culminated in a strong title challenge in 2013/14, Liverpool barely cast a furtive glance at the Champions League places in 2014/15. Their underlying numbers over the whole season provided scant consolation either, with performance levels in line with a decent team lacking the quality usually associated with a top-four contender. Improvements in results and underlying performance will therefore be required to meet the club’s stated aim of Champions League football.

Progress before a fall

Before looking forward to the coming season, let’s start with a look back at Liverpool’s performance over recent seasons. Below is a graphic showing Liverpool’s underlying numbers over the past five seasons, courtesy of Paul Riley’s Expected Goal numbers.

Expected goal rank over the past 5 seasons of the English Premier League. Liverpool seasons highlighted in red.


From 2010/11 to 2012/13, there was steady progress with an impressive jump in 2013/14 to the third highest rating over the past five years. Paul’s model only evaluates shots on target, so Liverpool’s 2013/14 rating is potentially biased a little high given their unusual/unsustainable proportion of shots on target that year. However, the quality was clear, particularly in attack. Not to be outdone, 2014/15 saw another impressive jump but unfortunately the trajectory was in the opposite direction. Other metrics such as total shots ratio and shots on target ratio tell a similar story, although 2013/14 isn’t quite as impressive.

The less charitable among you may ascribe Liverpool’s trajectory to the presence and performance of one Luis Suárez; when he joined in January 2011, Suárez was an erratic yet gifted performer who went on to become a genuine superstar before departing in the summer of 2014. Suárez’s attacking wizardry in 13/14 was remarkable and he served as a vital multiplier in the side’s pinball style of play. Clearly he was a major loss but there were already reasons to suspect that some regression was due with or without him: Andrew Beasley wrote about the major and likely unsustainable role of set piece goals, while James Grayson and Colin Trainor highlighted the unusually favourable proportions of shots on target and blocked shots respectively during their title challenge. I wrote about how Liverpool’s penchant for early goals had led to an incredible amount of time spent winning over the season (a handy circumstance for a team so adept at counter-attacking), which may well have helped to explain some of their unusual numbers and was unlikely to be repeated.

These mitigating and potentially unsustainable factors notwithstanding, the dramatic fall in underlying performance, points (22 in all) and goals scored (an incredible 49 goal decline) is where Liverpool find themselves ahead of the coming season. Such a decline sees Brendan Rodgers go into this season under pressure to justify FSG’s backing of him over the summer, particularly with a fairly nightmarish run of away fixtures to start the season and the spectre of Jürgen Klopp on the horizon.

So, where do Liverpool need to improve this season?

Case for the defence

With the concession of six goals away at Stoke fresh in the memory, the narrative surrounding Liverpool’s defence is strong i.e. the defence is pretty horrible. Numbers paint a somewhat different story with Liverpool’s shots conceded (10.9 per game) standing as the joint-fifth lowest in the league last year according to statistics compiled by the Objective-Football website (rising to fourth lowest in open play). Shots on target were less good (3.8 per game and a rank of joint-seventh) although the margins are fairly small here. By Michael Caley’s and Paul Riley’s expected goal numbers, Liverpool ranked fourth and sixth respectively in expected goals against. Looking at how effective teams were at preventing their opponents from getting the ball into dangerous areas in open-play, my own numbers ranked Liverpool fifth best in the league.

It should be noted that analytics often has something of a blind spot when it comes to analysing defensive performances; metrics which typically work very well on the offensive side often work less well on the defensive side. Liverpool also tend to be a fairly dominant team and their opponents typically favour a deep defence and counter strategy against them, which will limit the number of chances they create.

One area where their numbers (courtesy of Objective-Football again) were noticeably poor was at set-pieces, where they conceded on 11.6% of the shots by their opponents – the third worst rate in the league, compared to a league-average conversion of 8.7%. Set-piece conversion rates are notoriously unsustainable year-on-year though, so some regression towards more normal conversion rates could bring down Liverpool’s goals conceded per game compared to last season.
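As a back-of-the-envelope illustration of what that regression would be worth (the set-piece shot volume below is a hypothetical round number, not from the data):

```python
# Set-piece conversion rates quoted above
liverpool_rate = 0.116   # rate conceded, third worst in the league
league_rate = 0.087      # league-average conversion

# Hypothetical season volume of set-piece shots faced
shots_faced = 120

# Goals that regression to the league-average rate would remove
regression_saving = (liverpool_rate - league_rate) * shots_faced
```

At that volume, conversion luck alone would be worth roughly three and a half goals conceded over a season.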

While Liverpool’s headline numbers were reasonable, their tendency to shoot themselves in the foot and concede some daft goals was impressive in its ineptitude at times. Culprits typically included combinations of Rodgers’ tactics, Dejan Lovren’s ‘whack a mole’ approach to defending and the embers of Steven Gerrard’s Liverpool career. The defensive structure of the team should be improved now that Gerrard no longer needs to be accommodated at the heart of midfield, while Glen Johnson’s prolonged audition for an extra role in the Walking Dead will continue at Stoke. Nathaniel Clyne should be a significant upgrade at full back, with youngsters Ilori and Gomez presently with the squad and aiming to compete for a first team role.

Broadly speaking though, Liverpool’s defensive numbers were reasonable but with room for improvement; they looked ok for a Champions League hopeful rather than a title challenger. A more mobile midfield should enhance the protection afforded to the central defence, however it lines up. Whether the individual errors were a bug rather than a feature of this Liverpool team will likely determine how the narrative around the defence continues this year.

Under-powered attack

Liverpool’s decline in underlying performance in 2014/15 was driven by a significant drop-off in their attacking numbers. The loss of Suárez was compounded by Daniel Sturridge playing just 750 minutes in the league all season; Sturridge isn’t at the same level as Suárez (few are) but he does represent a truly elite forward and the alternatives at the club weren’t able to replace him.

The loss of Suárez and Sturridge meant that Coutinho and Sterling were now the principal conduits for Liverpool’s attack. Both performed admirably and were among the most dangerous attackers in the division. The figure below details Liverpool’s players according to the number of dangerous passes per 90 minutes played, which is related to my pass-danger rating score. In terms of volume, Coutinho and Sterling were way ahead of their teammates and both ranked in the top 15 in the league (minimum of 900 minutes played). James Milner actually ranked seventh by this metric, so he could well provide an additional source of creativity and link well with Liverpool’s forward players.

Dangerous passes per 90 minutes played metric for Liverpool players in 2014/15. Right hand side shows total number of completed passes per 90 minutes.

As good as Coutinho and Sterling were from a creative perspective, they did lag behind the truly elite players in the league by these metrics. As with many of Liverpool’s better players, you’re often left with the caveat of stating how good they are for their age. That’s not a criticism of the players themselves, merely a recognition of their overall standing relative to their peers.

What didn’t help was the lack of attacking contribution from Liverpool’s peak-age attacking players; Lallana’s contribution was decidedly average, Sturridge is obviously capable of making a stellar contribution but injuries curtailed him, while Balotelli certainly provided a high shot volume powered by a predilection for shooting from range but a potential dose of bad luck meant his goal-scoring record was well below expectation.

While there were clearly good elements to Liverpool’s attack, they were often left shooting from long range. According to numbers published by Michael Caley, Liverpool took more shots from outside the box than any other team last year and had the fourth highest proportion of shots from outside the box (48%). Unsurprisingly, they had the third lowest proportion of shots from the central region inside the penalty area (34%), which is the so-called ‘danger zone’ where shots are converted at much greater rates than wide in the box and outside the area. With their shot volumes being pretty good last season (third highest total shots and fourth highest shots on target), shifting the needle towards better quality chances would certainly improve Liverpool’s prospects. The question is where will that quality come from?

Bobby & Ben

With Sturridge not due back until the autumn coupled with his prior injury record, Liverpool moved to sign Christian Benteke as a frontline striker with youngsters Ings and Origi brought in to fill out the forward ranks. Roberto Firmino was added before Sterling’s departure but the expectation is that he will line up in a similar role as the dynamic attacking midfielder/forward.

Firmino brings some impressive statistical pedigree with him: elite dribbler, dangerous passer, a tidy shot profile for a non-striker and stand-out tackling numbers for his position. If he can replicate his Bundesliga form then he should be a more than adequate replacement for Sterling, while also having the scope to develop over coming seasons.

Benteke brings a good but not great goal-scoring record, with his record in open-play being particularly average. Although there have been question marks regarding his stylistic fit within the team, Liverpool have seemingly been pursuing a physical forward to presumably act as a ‘reference point’ in their tactical system over the past few years; Diego Costa was a target in 2013, while Wilfried Bony was linked in 2014. Benteke brings that to the table alongside a more diverse range of skills than he is given credit for, having seemingly been cast as an immobile lump of a centre forward by some.

Whether he has the necessary quality to improve this Liverpool team is the more pertinent question. From open-play, Benteke averages 2.2 shots per 90 minutes and 0.34 goals per 90 minutes over the past three seasons, which is essentially the average rate for a forward in the top European leagues. For comparison, Daniel Sturridge averages 4.0 shots per 90 minutes and 0.65 goals per 90 minutes over the same period. Granted, Sturridge has played for far greater attacking units than Aston Villa over that period but based on some analysis of strikers moving clubs that I’ve done, there is little evidence that shot and goal rates rise when moving to a higher quality team. Benteke does provide a major threat from set-pieces, which has been a productive source of goals for him but I would prefer to view these as an added extra on top of genuine quality in open-play, rather than a fig leaf.

Benteke will need to increase his contribution significantly if he is to cover for Sturridge over the coming season, otherwise Liverpool may find themselves in the good but not great attacking category again.


So where does all of the above leave Liverpool going into the season? Most of the underlying numbers for last season suggested that Chelsea, Manchester City and Arsenal were well ahead of the pack and I don’t see much prospect of one of them dropping out of the top four. Manchester United, Liverpool and Southampton made up the trailing group, with these three plus perhaps Tottenham in a battle to be the ‘best of the rest’ or ‘least crap’ and claim the coveted fourth place trophy.

When framed this way, Liverpool’s prospects look more viable, although fourth place looks like the ceiling at present unless the club procure some adamantium to alleviate Sturridge’s injury woes. While Liverpool currently operate outside the financial Goldilocks zone usually associated with a title challenge, they should have the quality to mount a concerted challenge for that Champions League spot in what could be a tight race. They did put together some impressive numbers during the 3-4-3 phase of last season that was in-line with those expected of a Champions League contender; replicating and sustaining that level of quality should be the aim for the team this coming season.

Prediction: 4-6th, most likely 5th.

P.S. Can Liverpool be more fun this year? If you can’t be great, at least be fun.

Uncertain expectations

In this previous post, I describe a relatively simple version of an expected goals model that I’ve been developing recently. In this post, I want to examine the limitations and uncertainties relating to how well the model predicts goals.

Just to recap, I built the model using data from the Premier League from 2013/14 and 2014/15. For the analysis below, I’m just going to focus on non-penalty shots with the foot, so it includes both open-play and set piece shot situations. Mixing these will introduce some bias but we have to start somewhere. The data amounts to over 16,000 shots.

What follows is a long and technical post. You have been warned.

Putting the boot in

One thing to be aware of is how the model might differ if we used a different set of shots for input; ideally the answer we get shouldn’t change if we only used a subset of the data or if we resample the data. If the answer doesn’t change appreciably, then we can have more confidence that the results are robust.

Below, I’ve used a statistical technique known as ‘bootstrapping’ to assess how robust the regression is for expected goals. Bootstrapping belongs to a class of statistical methods known as resampling. The method works by randomly extracting shots from the dataset and rerunning the regression many times (1000 times in the plot below). Using this, I can estimate a confidence interval for my expected goal model, which should provide a reasonable estimate of goal expectation for a given shot.
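As a rough illustration, the resampling loop can be sketched as below. The shot data and the fit function here are stand-ins (synthetic distances and a simple mean-conversion ‘fit’), since the real dataset and regression aren’t reproduced in this post:

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_xg(distances, goals):
    """Placeholder for the full distance regression: here we just
    return the mean conversion rate of the resampled shots."""
    return goals.mean()

# Synthetic shot data for illustration only
n_shots = 16000
distances = rng.uniform(2, 35, n_shots)
goals = (rng.random(n_shots) < np.exp(-distances / 6.65)).astype(float)

# Bootstrap: resample shots with replacement and refit many times
estimates = []
for _ in range(1000):
    idx = rng.integers(0, n_shots, n_shots)
    estimates.append(fit_xg(distances[idx], goals[idx]))

# 90% confidence interval from the bootstrap distribution
lo, hi = np.percentile(estimates, [5, 95])
print(f"90% CI for mean conversion: {lo:.3f} to {hi:.3f}")
```

In the real version, `fit_xg` would refit the exponential-decay curve each time and the percentiles would be taken for the fitted xG at each location, rather than for a single conversion rate.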

For example, the base model suggests that a shot from the penalty spot has an xG value of 0.19. The bootstrapping suggests that the 90% confidence interval gives an xG range from 0.17 to 0.22. In other words, our best estimate is that Premier League footballers score from that position somewhere between 17-22% of the time.

The plot below shows the goal expectation for a shot taken in the centre of the pitch at varying distances from the goal. Generally speaking, the confidence interval range is around ±1-2%. I also ran the regressions on subsets of the data and found that after around 5000 shots, the central estimate stabilised and the addition of further shots in the regression just narrows the confidence intervals. After about 10,000 shots, the results don’t change too much.


Expected goal curve for shots in the centre of the pitch at varying distances from the goal. Shots with the foot only. The red line is the median expectation, while the blue shaded region denotes the 90% confidence interval.

I can use the above information to construct a confidence interval for the expected goal totals for each team, which is what I have done below. Each point represents a team in each season and I’ve compared their expected goals vs their actual goals. The error bars show the range for the 90% confidence intervals.

Most teams line up with the one-to-one line within their respective confidence intervals when comparing with goals for and against. As I noted in the previous post, the overall tendency is for actual goals to exceed expected goals at the team level.

Expected goals vs actual goals for teams in the 2013/14 and 2014/15 Premier League. Dotted line is the 1:1 line, the solid line is the line of best fit and the error bars denote the 90% confidence intervals based on the xG curve above.


As an example of what the confidence intervals represent, in the 2013/14 season, Manchester City’s expected goal total was 59.8, with a confidence interval ranging from 52.2 to 67.7 expected goals. In reality, they scored 81 non-penalty goals with their feet, which falls outside of their confidence interval here. On the plot above, Manchester City are the red marker on the far right of the expected goals for vs actual goals for plot.

Embracing uncertainty

Another method of testing the model is to look at the model residuals, which are calculated by subtracting the outcome of a shot (either zero or one) from its expected goal value. If you were an omnipotent being who knew every aspect relating to the taking of a shot, you could theoretically predict the outcome of a shot (goal or no goal) perfectly (plus some allowance for random variation). The residuals of such a model would always be zero as the outcome minus the expectation of a goal would equal zero in all cases. In the real world though, we can’t know everything so this isn’t the case. However, we might expect that over a sufficiently large sample, the residual will be close to zero.

In the figure below, I’ve again bootstrapped the data and looked at the model residuals as the number of shots increases. I’ve done this 10,000 times for each number of shots i.e. I extract a random sample from the data and then calculate the residual for that number of shots. The red line is the median residual (goals minus expected goals), while the blue shaded region corresponds to the standard error range (calculated as the 90% confidence interval). The residual is normalised to a per shot basis, so the overall uncertainty value is equal to this value multiplied by the number of shots taken.
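The procedure can be mimicked with simulated shots (again, made-up xG values standing in for the real model output), which shows how the per-shot residual is bootstrapped for a given sample size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-shot xG values and simulated outcomes (illustration only)
n = 16000
xg_vals = rng.uniform(0.02, 0.5, n)
outcomes = (rng.random(n) < xg_vals).astype(float)
residuals = outcomes - xg_vals  # outcome minus expectation, per shot

def residual_interval(sample_size, n_boot=10000):
    """Median and 90% interval of the mean residual per shot,
    bootstrapped for a given number of shots."""
    idx = rng.integers(0, n, size=(n_boot, sample_size))
    means = residuals[idx].mean(axis=1)
    return np.percentile(means, [5, 50, 95])

lo, med, hi = residual_interval(100)
print(f"100 shots: median {med:+.3f}, 90% interval [{lo:+.3f}, {hi:+.3f}]")
```

Sweeping `sample_size` over a range of shot counts traces out curves of the kind shown in the figure below: the median hugs zero while the interval narrows as shots accumulate.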


Goals-Expected Goals versus number of shots calculated via bootstrapping. Inset focusses on the first 100 shots. The red line is the median, while the blue shaded region denotes the 90% confidence interval (standard error).

The inset shows how this evolves up to 100 shots and we see that after about 10 shots, the residual approaches zero but the standard errors are very large at this point. Consequently, our best estimate of expected goals is likely highly uncertain over such a small sample. For example, if we expected to score two goals from 20 shots, the standard error range would span 0.35 to 4.2 goals. To add a further complication, the residuals aren’t normally distributed at that point, which makes interpretations even more challenging.

Clearly there is a significant amount of variation over such small samples, which could be a consequence of both random variation and factors not included in the model. This is an important point when assessing xG estimates for single matches; while the central estimate will likely have a very small residual, the uncertainty range is huge.

As the sample size increases, the uncertainty decreases. After 100 shots, which would equate to a high shot volume for a forward, the uncertainty in goal expectation would amount to approximately ±4 goals. After 400 shots, which is close to the average number of shots a team would take over a single season, the uncertainty would equate to approximately ±9 goals. For a 10% conversion rate, our expected goal value after 100 shots would be 10±4, while after 400 shots, our estimate would be 40±9 (note the percentage uncertainty decreases as the number of shots increases).
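As a back-of-the-envelope check on those figures (this is my own approximation, not the bootstrap itself), a normal approximation to the binomial gives intervals in the same ballpark, with the half-width growing like the square root of the number of shots:

```python
import math

def goal_interval(n_shots, p=0.10, z=1.645):
    """Approximate 90% interval for goals scored from n_shots
    at conversion rate p, via the normal approximation."""
    mean = n_shots * p
    half_width = z * math.sqrt(n_shots * p * (1 - p))
    return mean, half_width

for n in (20, 100, 400):
    mean, hw = goal_interval(n)
    print(f"{n} shots: {mean:.0f} ± {hw:.1f} goals")
```

This reproduces roughly ±5 goals after 100 shots and roughly ±10 after 400, broadly consistent with the bootstrapped values, though the bootstrap also captures the skew in small samples that the normal approximation misses.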


Same as above but with individual teams overlaid.

Above is the same plot but with the residuals shown for each team over the past two seasons (or one season if they only played for a single season). The majority of teams fall within the uncertainty envelope but there are some notable deviations. At the bottom of the plot are Burnley and Norwich, who significantly under-performed their expected goal estimate (they were also both relegated). On the flip side, Manchester City have seemingly consistently outperformed the expected goal estimate. Part of this is a result of the simplicity of the model; if I include additional factors such as how the chance is created, the residuals are smaller.

How well does an xG model predict goals?

Broadly speaking, the central estimates of expected goals appear to be reasonably good; the residuals tend to zero quickly and even though there is some bias, the correlations and errors are encouraging. When the uncertainties in the model are propagated through to the team level, the confidence intervals are on average around ±15% for expected goals for and against.

When we examine the model errors in more detail, they tend to be larger (around ±25% at the team level over a single season). The upshot of all this is that there appears to be a large degree of uncertainty in expected goal values when considering sample sizes relevant at the team and player level. While the simplicity of the model used here may mean that the uncertainty values shown represent a worst-case scenario, it is still something that should be considered when analysts make statements and projections. Having said this, based on some initial tests, adding extra complexity doesn’t appear to reduce the residuals to any great degree.

Uncertainty estimates and confidence intervals aren’t sexy and having spent the last 1500ish words writing about them, I’m well aware they aren’t that accessible either. However, I do think they are useful and important in the real world.

Quantifying these uncertainties can help to provide more honest assessments and recommendations. For example, I would say it is more useful to say that my projections estimate that player X will score 0.6-1.4 goals per 90 minutes next season along with some central value, rather than going with a single value of 1 goal per 90 minutes. Furthermore, it is better to state such caveats in advance – if you just provided the central estimate and the player posted say 0.65 goals per 90 and you then bring up your model’s uncertainty range, you will just sound like you’re making excuses.

This also has implications regarding over and under performance by players and teams relative to expected goals. I frequently see statements about regression to the mean without considering model errors. As George Box wisely noted:

Statisticians, like artists, have the bad habit of falling in love with their models.

This isn’t to say that expected goal models aren’t useful, just that if you want to wade into the world of probability and modelling, you should also illustrate the limitations and uncertainties associated with the analysis.

Perhaps those using expected goal models are well aware of these issues but I don’t see much discussion of it in public. Analytics is increasingly finding a wider public audience, along with being used within clubs. That will often mean that those consuming the results will not be aware of these uncertainties unless you explain them. Speaking as a researcher who is interested in the communication of science, I can give many examples of where not discussing uncertainty upfront can backfire in the long run.

Isn’t uncertainty fun!


Thanks to several people who were kind enough to read an initial draft of this article and the proceeding method piece.

Great Expectations

One of the most popular metrics in football analytics is the concept of ‘expected goals’ or xG for short. There are various flavours of expected goal models but the fundamental objective is to assess the quality of chances created or conceded by a team. The models are also routinely applied to assessing players using various techniques.

Michael Caley wrote a nice explanation of the what and the why of expected goals last month. Alternatively, you could check out this video by Daniel Altman for a summary of some of the potential applications of the metric.

I’ve been building my own expected goals model recently and I’ve been testing out a fundamental question regarding the performance of the model, namely:

How well does it predict goals?

Do expected goal models actually do what they say on the tin? This is a really fundamental and dumb question that hasn’t ever been particularly clear to me in relation to the public expected goal models that are available.

This is a key aspect, particularly if we want to make statements about prior over or under-performance and any anticipated changes in the future. Further to this, I’m going to talk about uncertainty and how that influences the statements that we can make regarding expected goals.

In this post, I’m going to describe the model and make some comparisons with a ‘naive’ baseline. In a second post, I’m going to look at uncertainties relating to expected goal models and how they may impact our interpretations of them.

The model

Before I go further, I should note that the initial development closely resembles the work done by Michael Caley and Martin Eastwood, who detailed their own expected goal methods here and here respectively.

I built the model using data from the Premier League from 2013/14 and 2014/15. For the analysis below, I’m just going to focus on non-penalty shots with the foot, so it includes both open-play and set piece shot situations. Mixing these will introduce some bias but we have to start somewhere. The data amounts to over 16,000 shots.

I’m only including distance from the centre of the goal in the first instance, which I calculated in a similar manner to Michael Caley in the link above as the distance from the goal line divided by the relative angle. I didn’t raise the relative angle to any power though.

I then calculate the probability of a goal being scored with the adjusted distance of each shot as the input; shots are deemed either successful (goal) or unsuccessful (no goal). Similarly to Martin Eastwood, I found that an exponential decay formula represented the data well. However, I found that there was a tendency towards under-predicting goals on average, so I included an offset in the regression. The equation I used is below:

xG = exp(-Distance/α) + β

Based on the dataset, the fit coefficients were 6.65 for α and 0.017 for β. Below is what this looks like graphically when I colour each shot by the probability of a goal being scored; shots from close to the goal line in central positions are far more likely to be scored than long distance shots or shots from narrow angles, which isn’t a new finding.
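In code, the fitted curve amounts to something like the following; note that the adjusted-distance helper is my paraphrase of the description above (the exact relative-angle calculation isn’t spelled out here):

```python
import math

ALPHA, BETA = 6.65, 0.017  # fit coefficients quoted above

def adjusted_distance(dist_to_goal_line, relative_angle):
    """Distance from the goal line divided by the relative angle,
    per the adjustment described in the text (angle calculation
    itself is an assumption here)."""
    return dist_to_goal_line / relative_angle

def xg(adj_distance):
    """Expected goals for a non-penalty shot with the foot."""
    return math.exp(-adj_distance / ALPHA) + BETA

# e.g. a shot with an adjusted distance of 11 metres
print(round(xg(11.0), 3))  # → 0.208
```

The offset β means the curve never quite reaches zero for long-range shots, which is the under-prediction correction mentioned above.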


Expected goals based on shot location using data from the 2013/14 and 2014/15 Premier League seasons. Shots with the foot only.

So, now we have a pretty map and yet another expected goal model to add to the roughly 1,000,001 other models in existence.


In the figure below, I’ve compared the expected goal totals with the actual goals. Most teams are close to the one-to-one line when comparing with goals for and against, although the overall tendency is for actual goals to exceed expected goals at the team level. When looking at goal difference, there is some cancellation for teams, with the correlation being tighter and the line of best fit passing through zero.


Expected goals vs actual goals for teams in the 2013/14 and 2014/15 Premier League. Dotted line is the 1:1 line, the solid line is the line of best fit.

Inspecting the plot more closely, we can see some bias in the expected goal number at the extreme ends; high-scoring teams tend to out-perform their expected goal total, while the reverse is true for low scoring teams. The same is also true for goals against, to some extent, although the general relationship is less strong than for goals for. Michael Caley noted a similar phenomenon here in relation to his xG model. Overall, it looks like just using location does a reasonable job.


The table above includes R² and mean absolute error (MAE) values for each metric and compares them to a ‘naïve’ baseline where just the average conversion rate is used to calculate the xG values i.e. the location of the shot is ignored. The R² value assesses the strength of the relationship between expected goals and goals, with values closer to one indicating a stronger link. Mean absolute error takes an average of the absolute difference between the goals and expected goals; the lower the value the better. In all cases, including location improves the comparison. ‘Naïve’ xG difference is effectively Total Shot Difference as it assumes that all shots are equal.
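Both metrics are straightforward to compute; the sketch below uses made-up team-season totals purely to show the mechanics of the comparison (the real values come from the full dataset):

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total variance."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1 - ss_res / ss_tot

def mae(actual, predicted):
    """Mean absolute error between goals and expected goals."""
    return np.mean(np.abs(actual - predicted))

# Hypothetical team-season goal totals, for illustration only
goals = np.array([45.0, 52.0, 38.0, 70.0, 41.0, 55.0])
xg_location = np.array([47.0, 50.0, 40.0, 64.0, 43.0, 53.0])  # location model
xg_naive = np.full_like(goals, goals.mean())  # average conversion only

print("location:", r_squared(goals, xg_location), mae(goals, xg_location))
print("naive:   ", r_squared(goals, xg_naive), mae(goals, xg_naive))
```

Note that a constant prediction (the naïve baseline applied to identical shot counts) scores an R² of zero by construction in this formulation, so any informative model should beat it on both measures.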

What is interesting is that the correlations are stronger in both cases for goals for than goals against. This could be a fluke of the sample I’m using but the differences are quite large. There is more stratification in goals for than goals against, which likely helps improve the correlations. James Grayson noted here that there is more ‘luck’ or random variation in goals against than goals for.

How well does an xG model predict goals?

Broadly speaking, the central estimates of expected goals appear to be reasonably good. Even though there is some bias, the correlations and errors are encouraging. Adding location into an xG model clearly improves our ability to predict goals compared to a naïve baseline. This obviously isn’t a surprise but it is useful to quantify the improvements.

The model can certainly be improved though and I also want to quantify the uncertainties within the model, which will be the topic of my next post.