Thinking about goalkeepers

Goalkeepers have typically been a tough nut to crack from a data analytics point of view. Randomness is an inherent aspect of goal-scoring, particularly over small samples, which makes drawing robust conclusions at best challenging and at worst foolhardy. Are we identifying skill in our ratings, or are we just being led down the proverbial garden path by variance?

To investigate some of these issues, I’ve built an expected save model that takes into account shot location and angle, whether the shot is a header, and shot placement. So a shot taken centrally in the penalty area and sailing into the top corner will be unlikely to be saved, while a long-range shot straight at the keeper in the centre of the goal should usually prove easier to handle.

The model is built using data from the past four seasons of the English, Spanish, German and Italian top leagues. Penalties are excluded from the analysis.

Similar models have been created in the past by new Roma analytics guru Stephen McCarthy, Colin Trainor & Constantinos Chappas, and Thom Lawrence.

The model thus provides an expected goal value for each shot that a goalkeeper faces, which we can then compare with the actual outcome. In a simpler world, we could easily identify shot-stopping skill by taking the difference between reality and expectation and then ranking goalkeepers by who has the best (or worst) difference.

However, this isn’t a simple world, so we run into problems like those illustrated in the graphic below.

Keeper_Funnel_Plot.png

Shot-stopper-rating (actual save percentage minus expected save percentage) versus number of shots faced. The central black line at approximately zero is the median, while the blue shaded region denotes the 90% confidence interval. Red markers are individual players. Data via Opta.

Each individual red marker is a player’s shot-stopper rating over the past four seasons versus the number of shots they’ve faced. We see that for low shot totals there is a huge range in the shot-stopper rating, but that the spread decreases as the number of shots increases, which is an example of regression to the mean.

To illustrate this further, I used a technique called bootstrapping to re-sample the data and generate confidence intervals for an average goalkeeper. This re-sampling is done 10,000 times to create a probability distribution, built by randomly extracting groups of shots from the dataset, calculating actual and expected save percentages and then seeing how large the difference is. We see a strong narrowing of the blue uncertainty envelope up to around 50 shots, with further narrowing up to about 200 shots. After this, the narrowing is less steep.
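For anyone wanting to reproduce the idea, here’s a minimal sketch of that bootstrap in Python. The array names are illustrative rather than the exact code behind the plot above:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_envelope(saved, exp_save, sample_size, n_boot=10_000, ci=0.90):
    """Resample `sample_size` shots with replacement `n_boot` times and
    return the central `ci` interval of (actual - expected) save
    percentage for an average goalkeeper facing that many shots.

    saved    : 1/0 array, whether each shot on target was saved
    exp_save : the model's save probability for each of those shots
    """
    n = len(saved)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=sample_size)
        diffs[i] = saved[idx].mean() - exp_save[idx].mean()
    return np.quantile(diffs, [(1 - ci) / 2, 1 - (1 - ci) / 2])

# Sweeping sample_size from ~10 shots up to ~1000 traces out the
# blue uncertainty envelope in the funnel plot.
```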

What this effectively means is that there is a large band of possible outcomes that we can’t realistically separate from noise for an average goalkeeper. Over a season, a goalkeeper faces a little over 100 shots on target (119 on average according to the data used here), so randomness has ample opportunity to play a role and it is little surprise that save percentage shows poor year-on-year repeatability.

Things do start to settle down as shot totals increase though. After 200 shots, a goalkeeper would need to be performing at better (or worse) than ±4% on the shot-stopper rating scale for the difference to stand up to a reasonable level of statistical significance. After 400 shots, signal is easier to discern, with a keeper needing to register more than ±2% to emerge from the noise. That is not to say we should be beholden to statistical significance, but it is certainly worth bearing in mind in any assessment; an understanding of the uncertainty inherent in analytics can be a powerful weapon to wield.

What we do see in the graphic above are many goalkeepers outside of the blue uncertainty envelope. This suggests that we might be able to identify keepers who are performing better or worse than the average goalkeeper, which would be pretty handy for player assessment purposes. Luckily, we can employ some more maths courtesy of Pete Owen who presented a binomial method to rank shot-stopping performance in a series of posts available here and here.
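I won’t reproduce Pete’s maths in full here, but the flavour of a binomial approach can be sketched in a few lines. Note that this is a deliberate simplification for illustration (it treats every shot as carrying the keeper’s average expected save probability, whereas each shot really has its own difficulty), so see Pete’s posts for the proper treatment:

```python
from scipy.stats import binom

def shot_stopping_pvalue(saves, shots, exp_save_rate):
    """Probability that an average keeper makes at least `saves` saves
    from `shots` shots, if each shot is saved with probability
    `exp_save_rate`. Small values suggest above-average shot-stopping."""
    return binom.sf(saves - 1, shots, exp_save_rate)  # P(X >= saves)
```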

The table below lists the top-10 goalkeepers who have faced more than 200 shots over the past four seasons by the binomial ranking method.

GK-Top10.png

Top-10 goalkeepers as ranked by their binomial shot-stopper ranking. Post-shot refers to the expected save model that accounts for shot placement. Data via Opta.

I don’t know about you, but that doesn’t look like too shabby a list of top keepers. It may be that some of the names on the list have serious flaws in their game aside from shot-stopping, but that will have to wait for another day and another analysis.

So where does that leave us in terms of goalkeeping analytics? On one hand, we have noisy unrepeatable metrics from season-to-season. On the other, we appear to have some methods available to extract the signal from the noise over larger samples. Even then, we might be being fooled by aspects not included in the model or the simple fact that we expect to observe outliers.

Deficiencies in the model are likely our primary concern, but these should be checked by a skilled eye and video clips, which should already be part of the review process (quit sniggering at the back there). Consequently, the risks ingrained in using an imperfect model can be at least partially mitigated.

Requiring 2-3 seasons of data to get a truly robust view on shot-stopping ability may be too long in some cases. However, perhaps we can afford to take a longer-term view for such an important position, one that doesn’t typically see much turnover of personnel compared to other positions. The level of confidence you want when short-listing might well depend on the situation at hand; perhaps an 80% chance of your target being an above-average shot-stopper would be palatable in some cases?

All this is to say that I think you can assess goalkeepers by the saves they do or do not make. You just need to be willing to embrace a little uncertainty in the process.

Square pegs for square holes: OptaPro Forum Presentation

At the recent OptaPro Forum, I was delighted to be selected to present to an audience of analysts and representatives from the football industry. I presented a technique to identify different player types using their underlying statistical performance. My idea was that this would aid player scouting by helping to find the “right fit” and avoid the “square peg for a round hole” cliché.

In the presentation, I outlined the technique that I used, along with how Dani Alves made things difficult. My vision for this technique is that the output from the analysis can serve as an additional tool for identifying potential transfer signings. Signings can be categorised according to their team role and their performance can then be compared against their peers in that style category based on the important traits of those player types.
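The video has the details of the actual method, but to give a flavour of this kind of player-type analysis, below is a minimal sketch of one common recipe: standardise each player’s per-90 statistics, then cluster. The features, cluster count and random stand-in data are purely illustrative, not the exact approach from the talk:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in: rows are players, columns are per-90 metrics
# (e.g. shots, key passes, dribbles, tackles, interceptions ...).
stats_per90 = np.random.default_rng(1).random((200, 8))

X = StandardScaler().fit_transform(stats_per90)  # put metrics on one scale
player_types = KMeans(n_clusters=6, n_init=10,
                      random_state=0).fit_predict(X)  # one type per player
```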

The video of my presentation is below, so rather than repeating myself, go ahead and watch it! The slides are available here.

Each of the player types is summarised below in the figures. My plan is to build on this initial analysis by including a greater number of leagues and use more in-depth data. This is something I will be pursuing over the coming months, so watch this space.

Some of my work was featured in this article by Ben Lyttleton.

Forward player types.

Midfielder player types.

Defender player types.

Help me rondo

In my previous post, I looked at the relationship between controlling the pitch (territory) and the ball (possession). When looking at the final plot in that post, you might infer that ‘good’ teams are able to control both territory and possession, while ‘bad’ teams are dominated on both counts. There are also teams that dominate only one metric, which likely relates to their specific tactical make-up.

When I calculated the territory metric, I didn’t account for the volume of passes in each area of the pitch as I just wanted to see how things stacked up in a relative sense. Territory on its own has a pretty woeful relationship with things we care about like points (r2=0.27 for the 2013/14 EPL) and goal difference (r2=0.23 for the 2013/14 EPL).

However, maybe we can do better if we combine territory and possession into one metric.

To start with, I’ve plotted some heat maps (sorry) showing pass completion percentage based on the end point of the pass. The completion percentage is calculated by comparing the number of passes aimed at a particular area of the pitch with the number that are successfully received there. I’ve done this for the 2013/14 season for the English Premier League, La Liga and the Bundesliga.
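In code terms, this boils down to binning pass end points and taking a group-by average. A minimal sketch, with illustrative column names and grid size:

```python
import numpy as np
import pandas as pd

def completion_grid(passes, nx=12, ny=8, length=105.0, width=68.0):
    """Pass completion percentage per pitch zone, binned by end point.
    Expects columns end_x, end_y (metres) and a boolean `completed`."""
    gx = np.clip((passes["end_x"] / length * nx).astype(int), 0, nx - 1)
    gy = np.clip((passes["end_y"] / width * ny).astype(int), 0, ny - 1)
    return (passes.assign(gx=gx, gy=gy)
                  .groupby(["gx", "gy"])["completed"]
                  .mean()
                  .unstack() * 100)
```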

As you would expect, passes directed to areas closer to the goal are completed at lower rates, while passes within a team’s own half are completed routinely.

Heat map of pass completion percentage based on the target of all passes in the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

What is interesting in these plots is the contrast between England and Germany; in the attacking half of the pitch, pass completion is 5-10 percentage points lower in the Bundesliga than in the EPL. La Liga sits in between for the most part but is similar to the Bundesliga within the penalty area. My hunch is that this is a result of the contrasting styles in these leagues:

  1. Defences often sit deeper in the EPL, particularly when compared to the Bundesliga, which results in their opponents completing passes more easily as they knock the ball around in front of the defence.
  2. German and Spanish teams tend to press more than their English counterparts, which will make passing more difficult. In Germany, counter-pressing is particularly rife, which will make passing into the attacking midfield zone more challenging.

From the above information, I can construct a model* to judge the difficulty of a pass into each area of the pitch; given the differences between the leagues, I do this for each league separately.

I can then use this pass difficulty rating along with the frequency of passes into that location to put a value on how ‘dangerous’ a pass is, e.g. a completed pass received on the penalty spot in your opponent’s penalty area would be rated more highly than one received by your own goalkeeper in his six-yard box.

Below is the resulting weighting system for each league. Passes received in front of goal within the six-yard box have a rating close to one, while passes within your own half are given very little weighting as they are relatively easy to complete and are frequent.

There are slight differences between each league, with the largest differences residing in the central zone within the penalty area.

Heat map of pass weighting model for the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

Using this pass weighting scheme, I can assign a score to each pass that a team completes, which ‘rewards’ them for completing more dangerous passes themselves and preventing their opponents from moving the ball into more dangerous areas. For example, a team that maintains possession in and around the opposition penalty area will increase their score. Similarly, if they also prevent their opponent from moving the ball into dangerous areas near their own penalty area, this will also be rewarded.

Below is how this Territorial-Possession Dominance (TPD) metric relates to goal difference. It is calculated by comparing the for and against figures as a ratio and I’ve expressed it as a percentage.
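In code, one way to express that ratio-as-percentage (a minimal sketch; the variable names are illustrative):

```python
def tpd(for_score, against_score):
    """Territorial-Possession Dominance: a team's summed pass weightings
    as a share of the combined for-and-against total, as a percentage."""
    return 100.0 * for_score / (for_score + against_score)

# e.g. a team completing twice the weighted passing of its opponents
# scores tpd(2.0, 1.0) = 66.7%
```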

Broadly speaking, teams with a higher TPD have a better goal difference (overall r2=0.59) but this varies across the leagues. Unsurprisingly, Barcelona and Bayern Munich are the stand-out teams on this metric as they pin teams in and also prevent them from possessing the ball close to their own goal. Manchester City (the blue dot next to Real Madrid) had the highest TPD in the Premier League.

In Germany, the relationship is much stronger (r2=0.87), which is actually better than both Total Shot Ratio (TSR, r2=0.74) and Michael Caley’s expected goals figures (xGR, r2=0.80). A major caveat here though is that this is just one season in a league with only 18 teams and Bayern Munich’s domination certainly helps to strengthen the relationship.

The relationship is much weaker in Spain (r2=0.35) and is worse than both TSR (r2=0.54) and xGR (r2=0.77). A lot of this is driven by the almost non-existent explanatory power of TPD when compared with goals conceded (r2=0.06). La Liga warrants further investigation.

England sits in-between (r2=0.69), which is on a par with TSR (r2=0.72). I don’t have xGR numbers for last season but I believe xGR is usually a few points higher than TSR in the Premier League.

Relationship between goal difference per game and territorial-possession dominance for the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

The relationship between TPD and points (overall r2=0.56) is shown below and is broadly similar to goal difference. The main difference is that the strength of the relationship in Germany is weakened.

Relationship between points per game and territorial-possession dominance for the 2013/14 English Premier League, La Liga and Bundesliga. Data via Opta.

Over the summer, I’ll return to these correlations in more detail when I have more data and the relationships are more robust. For now, the metric appears to be useful and I plan to improve it further. I’ll also be investigating what it can tell us about a team’s style when combined with other metrics.

——————————————————————————————————————–

*For those who are interested in the method: I calculated the relative distance of each pass from the centre of the opposition goal using the distance along the x-axis (the length of the pitch) and the angle relative to a centre line along the length of the pitch.

I then used logistic regression to calculate the probability of a pass being completed; passes are deemed either successful or unsuccessful, so logistic regression is ideal and avoids putting the passes into location buckets on the pitch.

I then weighted the resulting probability according to the frequency of passes received relative to the distance from the opposition goal-line. This gave me a ‘score’ for each pass, which I used to calculate the territory weighted possession for each team.
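As a sketch, with illustrative column names and scikit-learn standing in for whatever fitting routine you prefer:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pass_completion_probability(passes):
    """Fit pass completion probability from end-point geometry.
    Expects dist_x (length-wise distance from the opposition goal),
    angle (relative to the centre line) and a 0/1 `completed` column."""
    X = passes[["dist_x", "angle"]].to_numpy()
    y = passes["completed"].to_numpy()
    model = LogisticRegression().fit(X, y)
    return model.predict_proba(X)[:, 1]  # low probability = difficult pass

# The per-pass 'score' then weights this probability by how frequently
# passes are received at that distance from the opposition goal-line.
```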

Scoring ability: the good, the bad and the Messi

Identifying scoring talent is one of the main areas of investigation in analytics circles, with the information provided potentially helping to inform decisions that can cost many, many millions. Players who can consistently put the ball in the net cost a premium; can we separate these players from their peers?

I’m using data from the 2008/09 to 2012/13 seasons across the top divisions in England, Spain, Germany and Italy from ESPN. An example of the data provided is available here for Liverpool in 2012/13. This gives me total shots (including blocked shots) and goals for over 8000 individual player seasons. I’ve also taken out penalties from the shot and goal totals using data from TransferMarkt. This should give us a good baseline for what looks good, bad and extraordinary in terms of scoring talent. Clearly this ignores the now substantial work being done in relation to shot location and different types of shot but the upside here is that the sample size (number of shots) is larger.

Below is a graph of shot conversion (defined as goals divided by total shots) against total shots. All of the metrics I’ll use will have penalties removed from the sample. The average conversion rate across the whole sample is 9.2%. Using this average, we can calculate the bounds of what average looks like in terms of shot conversion; we would expect some level of random variation around the average and for this variation to be larger for players who’ve taken fewer shots.

Shot conversion versus total shots for individual players in the top leagues in England, Italy, Spain and Germany from 2008/09-2012/13. Points are shown in grey with certain players highlighted, with the colours corresponding to the season. The solid black line is the average conversion rate of 9.2%, with the dotted lines above and below this line corresponding to two standard errors above the average. The dashed line corresponds to five standard errors. Click on the image for a larger view.

On the plot I’ve also added some lines to illustrate this. The solid black line is the average shot conversion rate, while the two dotted lines either side of it represent upper and lower confidence limits, calculated as being two standard errors from the mean. These are known as funnel plots and, as far as I’m aware, they were introduced to football analysis by James Grayson in his work on penalties. Paul Riley has also used them when looking at shot conversion from different areas of the pitch. There is a third, dashed line but I’ll talk about that later.
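Constructing the bounds is straightforward: with the sample-average conversion rate of 9.2%, the standard error at a given shot volume follows from the binomial distribution. A minimal sketch:

```python
import numpy as np

p_bar = 0.092                    # sample-average conversion rate
shots = np.arange(10, 801)
se = np.sqrt(p_bar * (1 - p_bar) / shots)   # binomial standard error

lower2, upper2 = p_bar - 2 * se, p_bar + 2 * se  # dotted ~95% envelope
upper5 = p_bar + 5 * se                          # dashed five-SE line
```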

So what does this tell us? Well, we would expect approximately 95% of the points to fall within this envelope around the average conversion rate; the actual number is 97%. From a statistical point of view, we can’t identify whether most of these players are anything other than average at shot conversion. Some players fall below the lower bound, which suggests that they are below average at converting their shots into goals. On the other hand, those players falling above the upper bound are potentially above average.

The Bad

I’m not sure if this is surprising or not, but it is actually quite hard to identify players who fall below the lower bound and qualify as “bad”. A player needs to take about 40 shots without scoring to fall beneath the lower bound, so I suspect “bad” shooters don’t get the opportunity to approach statistical significance. Some do though.

Only 62 player seasons fall below the lower bound, with Alessandro Diamanti, Antonio Candreva, Gökhan Inler and (drum-roll) Stewart Downing having the dubious record of appearing twice. Downing actually holds the record in my data for the most shots (80) without scoring in 2008/09, with his 2011/12 season coming in second with 71 shots without scoring.

The Good

Over a single season of shots, it is somewhat easier to identify “good” players in the sample, with 219 players lying above the two standard error curve. Some of these players are highlighted in the graph above and rather than list all of them, I’ll focus on players that have managed to consistently finish their shooting opportunities at an above average rate.

Only two players appear in each of the five seasons of this sample: Gonzalo Higuaín and Lionel Messi. Higuaín has scored an impressive 94 goals with a shot conversion rate of 25.4% over that sample. I’ll leave Messi’s numbers until a little later. Four players appear on four separate occasions: Álvaro Negredo, Stefan Kießling, Alberto Gilardino and Giampaolo Pazzini. Negredo is interesting here as, while his 15.1% conversion rate over multiple seasons isn’t as exceptional as some other players’, he has done this over a sustained period while taking a decent volume of shots each season (note his current conversion rate at Manchester City is 16.1%).

Eighteen players have appeared on this list three times; notable names include van Persie, Di Natale, Cavani, Agüero, Gómez, Soldado, Benzema, Raúl, Fletcher, Hernández and Agbonlahor (wasn’t expecting that last one). I would say that most of the players mentioned here are more penalty box strikers, which suggests they take more of their shots from closer to the goal, where conversion rates are higher. It would be interesting to cross-check these with analysts who are tracking player shot locations.

The Messi

To some extent, looking at players that lie two standard errors above or below the average shot conversion rate is somewhat arbitrary. The number of standard errors you use to judge a particular property typically depends on your application and how “sure” you want to be that the signal you are observing is “real” rather than due to “chance”. For instance, when scientists at CERN were attempting to establish the existence of the Higgs boson, they used the very stringent requirement that the observed signal be five standard errors above the typical baseline of their instruments; they wanted to be really sure that they had established the existence of a new particle. The tolerance here is that there be much less than a one in a million chance that any observed signal is the result of a statistical fluctuation.

As far as shot conversion is concerned, over the two seasons prior to this, Lionel Messi is the Higgs boson of football. While other players have had shot conversion rates above this five-standard-error level, Messi has done it while taking huge shot volumes, which sets him apart from his peers. Over the five seasons prior to this, Messi took 764 shots, from which an average player would be expected to score between 54 and 86 goals (falling within two standard errors of the average); Messi scored 162! Turns out Messi is good at the football…who knew?

Is shooting accuracy maintained from season to season?

This is a short follow-up to this post using the same dataset. Instead of shot conversion, we’re now looking at shooting accuracy which is defined as the number of shots on target divided by the total number of shots. The short story here is that shooting accuracy regresses more strongly to the mean than shot conversion at the larger shot samples (more than 70 shots) and is very similar below this.

Comparison between shooting accuracy for players in year zero and the following season (year one). Click on the image or here for a larger interactive version.

Minimum shots | Players | Year-to-year r^2 | ‘Luck’ | ‘Skill’
1 | 2301 | 0.045 | 79% | 21%
10 | 1865 | 0.118 | 66% | 34%
20 | 1428 | 0.159 | 60% | 40%
30 | 951 | 0.214 | 54% | 46%
40 | 632 | 0.225 | 53% | 47%
50 | 456 | 0.219 | 53% | 47%
60 | 311 | 0.190 | 56% | 44%
70 | 180 | 0.245 | 51% | 49%
80 | 117 | 0.305 | 45% | 55%
90 | 75 | 0.341 | 42% | 58%
100 | 43 | 0.359 | 40% | 60%

Comparison of the level of ‘skill’ and ‘luck’ attributed to shooting accuracy (measured by shots on target divided by all shots) from one season to the next. The data is filtered by the total number of shots a player takes in consecutive seasons.

Essentially, there is quite a bit of luck involved with getting shots on target and for large-volume shooters, there is more luck involved in getting accurate shots in than in scoring them.

Is scoring ability maintained from season to season? (slight return)

In my previous post (many moons ago), I looked at whether a player’s shot conversion in one season was a good guide to their shot conversion in the next. While there were some interesting features in this, I was wary of being too definitive given the relatively small sample size that was used. Data analysis is a journey with no end, so this is the next step. I collated the last 5 seasons of data across the top divisions in England, Spain, Germany and Italy (I drew the line at collecting France) from ESPN. An example of the data provided is available here for Liverpool in 2012/13. The last 5 seasons on ESPN are Opta-provided data and matched up perfectly when I compared them with English Premier League data from EPL-Index.

Before digging into the results, a few notes on the data. The data is all shots and all goals i.e. penalties are not removed. Ideally, you would strip out penalty shots and goals but that would require player-level data that I don’t have and I’ve already done enough copy and pasting. I doubt including penalties will change the story too much but it would alter the absolute numbers. Shot conversion here is defined as goals divided by total shots, where total shots includes blocked shots. I then compared shot conversion for individual players in year zero with their shot conversion the following year (year one). The initial filter that I applied here was that the player had to have scored at least one goal in both years (so as to exclude players having 0% shot conversion).

Comparison between shot conversion rates for players in year zero and the following season (year one). Click on the image or here for a larger interactive version.

Starting out with the full dataset, we have 2301 data points where a player scored a goal in two consecutive seasons. The R^2 here (a measure of the strength of the relationship) is very low, with a value of 0.061 (where zero would mean no relationship and one would be perfect). Based on the method outlined here by James Grayson, this suggests that shot conversion regresses 75% towards the mean from one season to the next. The implication of this number is that shot conversion is 25% ‘skill’ and 75% is due to random variation, which is often described as ‘luck’.
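For reference, the calculation behind that figure is one minus the year-to-year correlation coefficient (i.e. one minus the square root of R^2), per James Grayson’s method:

```python
import numpy as np

def regression_to_mean(r_squared):
    """The year-to-year correlation is r = sqrt(r^2); a metric regresses
    (1 - r) of the way back to the mean in the following season."""
    return 1.0 - np.sqrt(r_squared)

print(regression_to_mean(0.061))  # ~0.75, i.e. 75% regression to the mean
```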

As I noted in my previous post on this subject, the attribution to skill and luck is dependent on the number of shots taken. As the number of shots increases, we smooth out some of the randomness and skill begins to emerge. A visualisation of the relationship between shot conversion and total shots is available here. Below is a summary table showing how this evolves in 10 shot increments. After around 30 shots, skill and luck are basically equal and this is maintained up to 60 shots. Above 80 shots, we seem to plateau at a 70/30% split between ‘skill’ and ‘luck’ respectively.

Minimum shots | Players | Year-to-year r^2 | ‘Luck’ | ‘Skill’
1 | 2301 | 0.061 | 75% | 25%
10 | 1865 | 0.128 | 64% | 36%
20 | 1428 | 0.174 | 58% | 42%
30 | 951 | 0.234 | 52% | 48%
40 | 632 | 0.261 | 49% | 51%
50 | 456 | 0.262 | 49% | 51%
60 | 311 | 0.261 | 49% | 51%
70 | 180 | 0.375 | 39% | 61%
80 | 117 | 0.489 | 30% | 70%
90 | 75 | 0.472 | 31% | 69%
100 | 43 | 0.465 | 32% | 68%

Comparison of the level of ‘skill’ and ‘luck’ attributed to scoring ability (measured by shot conversion) from one season to the next. The data is filtered by the total number of shots a player takes in consecutive seasons.

The results here are different to my previous post, where the equivalence of luck and skill was hit around 70 shots whereas it lies from 30-60 shots here. I suspect this is driven by the smaller sample size in the previous analysis. The song remains the same though; judging a player on around half a season of shots will be about as good as a coin toss. Really you want to assess a heavy shooter over at least a season with the proviso that there is still plenty of room for random variation in their shot conversion.

What is shot conversion anyway?

The past summer in the football analytics community saw a wonderful catalytic cycle of hypothesis, analysis and discussion. It’s been great to see the community feeding off each other; I would have liked to join in more but the academic conference season and the first UK heatwave in 7 years put paid to that. Much of the focus has been on shots and their outcomes. Increasingly the data is becoming more granular; soon we’ll know how many shots per game are taken within 10 yards of the corner flag at a tied game state by players with brown hair and blue eyes while their manager juggles on the sideline (corrected for strength of opposition of course). This increasing granularity is a fascinating and exciting development. While it was already clear that all shots aren’t created equal from purely watching the football, the past summer has quantified this very clearly. To me, this demonstrates that the traditional view of ‘shot conversion’ as a measure of finishing ability is erroneous.

As an illustrative example, consider two players who both take 66 shots in a season. Player A scores 11 goals, so has a shot conversion of 17%. Player B scores 2 goals, so has a shot conversion of 3%. The traditional view of shot conversion would suggest that Player A is a better finisher than Player B. However, if Player A took all of his shots from a central area within the 18-yard box, he would be bang in line with the Premier League average over the past 3 seasons. If Player B took all of his shots from outside the area, he would also be consistent with the average Premier League player. Both players are average when controlling for shot location. Clearly this is an extreme example but then again it is meant to be an illustration. To me at least, shot conversion seems more indicative of shooting efficiency i.e. taking shots from good positions under less defensive pressure will lead to an increased shot conversion percentage. Worth bearing in mind the next time someone mentions ‘best’ or ‘worst’ in combination with shot conversion.

The remaining question for me is how sustainable the more granular data is from season-to-season, especially given the smaller sample sizes.

Is scoring ability maintained from season to season?

With the football season now over across the major European leagues, analysis and discussion turns to reflection of the who, what and why of the past year. With the transfer window soon to do whatever the opposite of slam shut is, thoughts also turn to how such reflections might inform potential transfer acquisitions. As outlined by Gabriele Marcotti today in the Wall Street Journal, strikers are still the centre of attention when it comes to transfers:

The game’s obsession with centerforwards is not new. After all, it’s the glamour role. Little kids generally dream of being the guy banging in the goals, not the one keeping them out.

On the football analytics front, there has been a lot of discussion surrounding the relative merits of various forward players, with an increasing focus on their goal scoring efficiency (or shot conversion rate) and where players are shooting from. There has been a lot of great work produced but a very simple question has been nagging away at me:

Does being ‘good’ one year suggest that you’ll be ‘good’ next year?

We can all point to examples of forwards shining brightly for a short period during which they plunder a large number of goals, only to then fade away as regression to their (much lower) mean skill level ensues. With this in mind, let’s take a look at some data.

Scoring proficiency

I’ve put together data on players over the past two seasons who have scored at least 10 goals during a single season in the top division in either England, Spain, Germany or Italy from WhoScored. Choosing 10 goals is basically arbitrary but I wanted a reasonable number of goals so that calculated conversion rates didn’t oscillate too wildly and 10 seems like a good target for your budding goalscorer. So for example, Gareth Bale is included as he scored 21 in 2012/13 and 9 goals in 2011/12 but Nikica Jelavić isn’t as he didn’t pass 10 league goals in either season. Collecting the data is painful so a line had to be drawn somewhere. I could have based it on shots per game but that is prone to the wild shooting of the likes of Adel Taarabt and you end up with big outliers. If a player was transferred to or from a league within the WhoScored database (so including France), I retained the player for analysis but if they left the ‘Big 5’ then they were booted out.

In the end, I had 115 players who had scored at least 10 league goals in one of the past two seasons. Only 43 players managed to score 10 league goals in both 2011/12 and 2012/13, with only 6 players not named Lionel Messi or Cristiano Ronaldo able to score 20 or more in both seasons. Below is how they match up when comparing their shot conversion, where goals are divided by total shots, across both seasons. The conversion rates are based on all goals and all shots; ideally you would take out penalties, but that takes time to collate and I doubt it will make much difference to the conclusions.

Comparison between shot conversion rates for players in 2011/12 and 2012/13. Click on the image or here for a larger interactive version.

If we look at the whole dataset, we get a very weak relationship between shot conversion in 2012/13 relative to shot conversion in 2011/12. The R^2 here is 0.11, which suggests that shot conversion by an individual player shows 67% regression to the mean from one season to the next. The upshot is that shot conversion above or below the mean is around two-thirds due to luck and one-third due to skill. Without filtering the data any further, this would suggest that predicting how a player will convert their chances next season based on the last will be very difficult.

A potential issue here is the sample size for the number of shots taken by an individual in a season. Dimitar Berbatov’s conversion rate of 44% in 2011/12 is from only 16 shots; he’s good but not that good. If we filter for the number of shots, we can take out some of the outliers and hopefully retain a representative sample. Up to 50 shots, we’re still seeing a 65% regression to the mean and we’ve reduced our sample to 72 players. It is only when we get up to 70 shots, and down to 44 players, that we see a close to even split between ‘luck’ and ‘skill’ (54% regression to the mean). The problem here is that we’re in danger of ‘over-fitting’ as we rapidly reduce our sample size. If you are happy with a sample of 18 players, then you need to see around 90 shots per season to be able to attribute 80% of shot conversion to ‘skill’.

Born again

So where does that leave us? Perhaps unsurprisingly, the results here for players are similar to what James Grayson found at the team level, with a 61% regression to the mean from season to season. Mark Taylor found that around 45 shots was where skill overtook luck for assessing goal scoring, a little lower than what I found above, although I suspect this is due to Mark’s work being based on a larger sample over 3 seasons in the Premier League.

The above also points to the ongoing importance of sample size when judging players, although I’d want to do some more work on this before being too definitive. Judgements based on around half a season of shots appear rather unwise and are about as good as flipping a coin. Really you want around a season for a fuller judgement, and even then you might be wary of spending too much cash. For something approaching a guarantee, you want some heavy shooting across two seasons, which, allied with a good conversion rate, can bring over 20 league goals in a season. I guess that is why the likes of Van Persie, Falcao, Lewandowski, Cavani and Ibrahimovic go for such hefty transfer fees.

Borussia Dortmund vs Real Madrid: passing network analysis

Borussia Dortmund defeated Real Madrid 4-1.

Below is the passing network for the match. The positions of the players are loosely based on the formations played by the two teams, although some creative license is employed for clarity. It is important to note that these are fixed positions, which will not always be representative of where a player passed/received the ball. Only the starting eleven is shown as the substitutes had little impact in a passing sense.

Passing networks for Borussia Dortmund and Real Madrid from the Champions League match at the Westfalenstadion on the 24th April 2013. Only completed passes are shown. Darker and thicker arrows indicate more passes between each player. The player markers are sized according to their passing influence, the larger the marker, the greater their influence. The size and colour of the markers is relative to the players on their own team i.e. they are on different scales for each team. Only the starting eleven is shown. Click on the image for a larger view.

The most striking difference between the sides respective passing networks was that Real had a greater emphasis down the flanks, with strong links between the wide players and their full backs. Dortmund were quite balanced in their passing approach with much of their play going through the trio of Hummels, Gundogan and Gotze.

Influential potential

Dortmund’s number ‘ten’ (Gotze) had a greater influence on proceedings than Modric did for Real, with Gotze coming second only to Gundogan in terms of passing influence for Dortmund. Ozil was far more influential than Modric, although he rarely combined with Higuain and Ronaldo. Modric was well down the pecking order for Madrid, with the likes of Pepe, Varane and Coentrao ahead of him. On its own, this might not have been a problem, but aside from Ramos and Lopez, the only other Real players with less influence were Higuain and Ronaldo. This contrasts directly with Dortmund, where Reus and Lewandowski played important linking roles.

In summary, Dortmund’s attacking players were among their most influential passing performers; Real Madrid’s were not.

——————————————————————————————————————–

Passing matrices from Uefa.com press kits.

Bayern Munich vs Barcelona: passing network analysis

Bayern Munich defeated Barcelona 4-0 with a dominant performance. The way both teams approached the game in terms of their passing was interesting and worth some additional analysis.

Much of the post-match discussion on TV focussed on Barcelona’s dominance of possession not being reflected in the final scoreline. According to UEFA, Barcelona had 63% possession, while WhoScored/Opta had it at 66%. However, Bayern were well ahead in terms of shots (15-4 in favour of Bayern, with a 7-1 advantage in on-target shots). It seems that whenever Barcelona lose, their possession statistics are trotted out as a stick to beat them with. Given that Barcelona have gone more than 300 games and close to half a decade since they last played a game with less than 50% possession, I very much doubt there is causality between their possession statistics and match results. Barcelona choose to play this way and it has certainly been successful. However, it is worth remembering that not all teams play the same way, and searching for a single holy grail metric that can ‘explain’ winning football matches is probably a fool’s errand. Even if one does exist, it isn’t a match-aggregated possession statistic.

Process, not outcome

In terms of passing, I’ve tried to look more at the process using network analysis to establish how teams pass the ball and which players are the most influential in passing terms in a given match, rather than focussing on a single statistic. Below is the passing network for the match. The positions of the players are loosely based on the formations played by the two teams, although some creative license is employed for clarity. It is important to note that these are fixed positions, which will not always be representative of where a player passed/received the ball. Only the starting eleven is shown as the substitutes had little impact in a passing sense.

Passing network for Bayern Munich and Barcelona from the Champions League match at the Allianz Arena on the 23rd April 2013. Only completed passes are shown. Darker and thicker arrows indicate more passes between each player. The player markers are sized according to their passing influence, the larger the marker, the greater their influence. The size and colour of the markers is relative to the players on their own team i.e. they are on different scales for each team. Only the starting eleven is shown. Click on the image for a larger view.
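As an aside for anyone wanting to build something similar: given a passing matrix, the network and an influence-style measure can be put together in a few lines. PageRank is used below purely as an illustrative proxy; it isn’t necessarily the exact influence measure behind the marker sizes in these figures:

```python
import networkx as nx

# Passer -> receiver counts, e.g. transcribed from the UEFA press-kit
# passing matrix (a tiny illustrative snippet from this match).
pass_counts = {("Xavi", "Messi"): 20,
               ("Alves", "Messi"): 19,
               ("Bartra", "Pique"): 23}

G = nx.DiGraph()
for (passer, receiver), n in pass_counts.items():
    G.add_edge(passer, receiver, weight=n)

influence = nx.pagerank(G, weight="weight")  # one proxy for passing influence
```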

As might be expected, the contrast between the two teams is quite clear. Bayern focussed their passing down the flanks, with Ribery and Robben combining well with their respective full-backs. Neuer, Dante and Boateng fed the full-backs well to begin these passing transitions. Barcelona on the other hand engaged in their familiar multitude of passing triangles, although with a bias towards their right flank. There are a number of strong links although the somewhat uninspiring Bartra-Pique link was the strongest (23 passes).

Sterile domination

The issue for Barcelona was that their possession was largely in deeper areas, away from Bayern’s penalty area. This was neatly summed up in a tweet (including a graphic) by Albert Larcada.

While Barcelona’s passing network showed plenty of combinations in deeper areas, their more attacking players combined much less, with the links between Alexis, Messi and Pedro being relatively weak. In particular, the passes to Messi were low in number: he received just 7 passes combined from Iniesta (3), Pedro (2) and Alexis (2). Messi had much stronger links with Xavi (received 20 passes) and Alves (received 19 passes), although I suspect many of these were in deeper areas. While Barcelona’s midfield three exerted their usual influence, the next most influential players were Pique and Bartra. This is a stark contrast with the home match against AC Milan, where Messi was the most influential player after the midfield trio.

Bayern did a great job of limiting Messi’s influence, although his injury likely contributed also.

Avoid the puddle

Schweinsteiger was the most influential player for Bayern, linking well with Dante, Alaba and Ribery. After the centre-backs, Bayern’s next most influential players were Robben and Ribery who counter-attacked superbly, with excellent support from their full-backs. As discussed by Zonal Marking, Bayern preyed on Barcelona’s weakness on the counter-attack with speedy breaks down the flanks.

Bayern were incredibly effective and deservedly won the match and very likely the tie.

——————————————————————————————————————–

Passing matrices from Uefa.com press kits.

Barcelona vs AC Milan: passing network analysis

Barcelona. Good at the football.

Passing network for Barcelona and AC Milan from the Champions League match at the Camp Nou on the 12th March 2013. Only completed passes are shown. Darker and thicker arrows indicate more passes between each player. The player markers are sized according to their passing influence, the larger the marker, the greater their influence. The size and colour of the markers is relative to the players on their own team i.e. they are on different scales for each team. Only the starting eleven is shown. Click on the image for a larger view.

——————————————————————————————————————–

Passing matrices from Uefa.com press kits.

More information on these passing networks is available here.

I don’t have time for a fuller write-up but this from Zonal Marking is excellent.