At the recent OptaPro Forum, I was delighted to be selected to present to an audience of analysts and representatives from the football industry. I presented a technique to identify different player types using their underlying statistical performance. My idea was that this would aid player scouting by helping to find the “right fit” and avoid the “square peg for a round hole” cliché.

In the presentation, I outlined the technique that I used, along with how Dani Alves made things difficult. My vision for this technique is that the output from the analysis can serve as an additional tool for identifying potential transfer signings. Signings can be categorised according to their team role and their performance can then be compared against their peers in that style category based on the important traits of those player types.

The video of my presentation is below, so rather than repeating myself, go ahead and watch it! The slides are available here.

Each of the player types is summarised below in the figures. My plan is to build on this initial analysis by including a greater number of leagues and using more in-depth data. This is something I will be pursuing over the coming months, so watch this space.

14 thoughts on “Square pegs for square holes: OptaPro Forum Presentation”

Good stuff–exactly what comes to mind when talking about pushing football analytics further.

My one question: aren’t you assuming that player style/cluster is exogenous to their team/system with this? Isn’t player cluster assignment (derived from statistics) greatly influenced by team tactics (as presumably decided by the manager)? It’s easy to compare players in the same cluster, but comparing players in the same cluster is actually just comparing players in the same position. In other words, I’m guessing you could predict player classification given a discrete position. To be clearer, this analysis does not address endogeneity of player style and position.
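One rough way to test the guess above is to ask how often a player's cluster can be predicted from position alone. The sketch below does that with a majority-class rule. The position and cluster labels are hypothetical examples, not data from the presentation, and `majority_cluster_accuracy` is an illustrative helper name.

```python
from collections import Counter, defaultdict

# Hypothetical (position, cluster) pairs; not taken from the presentation's data.
players = [
    ("defensive midfielder", "disruptor"),
    ("defensive midfielder", "controller"),
    ("defensive midfielder", "deep-lying playmaker"),
    ("defensive midfielder", "disruptor"),
    ("forward", "direct attacker"),
    ("forward", "direct attacker"),
    ("forward", "attacking midfielder"),
    ("full-back", "attacking full-back"),
]


def majority_cluster_accuracy(rows):
    """Share of players whose cluster is the most common cluster for their position."""
    by_position = defaultdict(Counter)
    for position, cluster in rows:
        by_position[position][cluster] += 1
    # For each position, the best single guess is its most frequent cluster.
    correct = sum(counts.most_common(1)[0][1] for counts in by_position.values())
    return correct / len(rows)


accuracy = majority_cluster_accuracy(players)
print(f"Cluster predicted from position alone: {accuracy:.1%}")
```

An accuracy close to 100% would support the view that the clusters mostly restate position. A clearly lower value would suggest they carry extra information about style.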

I think some future research could be something along the lines of attempting to explain player statistical variance *conditioned* on position or group cluster.

I feel that my question was not too clear…

First, I do think this is some great work. To follow up…think of this as a graphical-causal model:

Your 1st node is ‘ability, or style, etc.’ and it arcs directionally to your 2nd node ‘position’, which arcs to your 3rd node ‘stats/output’–you are clustering based on stats/output. Here, I think, stats/output is conditionally independent of ‘ability, style’, given position. What is explaining the variance in stats/output after conditioning on position/cluster? More concisely, I think your clusters are just calling out positions granularity, given attack/mid/defense, not ‘style/ability’.
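The question of what remains after conditioning on position can be made concrete with a between-group and within-group variance split (the law of total variance). This is a minimal sketch that assumes a single statistic per player. The numbers are hypothetical and `variance_split` is an illustrative helper, not part of the original analysis.

```python
from collections import defaultdict
from statistics import fmean, pvariance

# Hypothetical key passes per 90 by broad position; illustrative numbers only.
samples = [
    ("defender", 0.2), ("defender", 0.4),
    ("midfielder", 1.0), ("midfielder", 2.0),
    ("attacker", 1.5), ("attacker", 2.5),
]


def variance_split(rows):
    """Return (between-group, within-group) population variance of the values."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    n = len(rows)
    grand_mean = fmean([value for _, value in rows])
    # Between: spread of the group means around the overall mean, weighted by size.
    between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()) / n
    # Within: size-weighted average of each group's own variance.
    within = sum(len(g) * pvariance(g) for g in groups.values()) / n
    return between, within


between, within = variance_split(samples)
total = pvariance([value for _, value in samples])
print(f"explained by position: {between / total:.0%}, left over: {within / total:.0%}")
```

The left-over share is the part that position cannot account for, which is where differences in style or ability would have to show up.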

Really good stuff, mate.

Thanks for your comments.

I’ll repeat your last statement here, so that it is clear what your question was:

“I think your clusters are just calling out positions granularity, given attack/mid/defense, not ‘style/ability’.”

Firstly, I didn’t comment on ‘ability’ and when I say ability, I think of ‘how well does a player do this action or perform in this role’. That is something I am developing – I did have some early work on that done for the forum but I did not have time to go through it.

I see your point on positions and yes there is an element of that, as certain actions are more biased towards certain positions. In fact, the ability of the analysis to distinguish those was a source of initial encouragement!

To some extent this is semantics, but I think the analysis does pull out style within a given position on the field. The best example is probably the defensive midfielders; nominally they occupy the same position ahead of the defence but deeper than their fellow midfielders. The analysis is able to distinguish distinct styles of play within that position (disruptors, controllers, deep-lying playmakers).

Going back to this statement: “What is explaining the variance in stats/output after conditioning on position/cluster?”

There is a lot of variance within clusters, which I think could be developed to link to performance or ‘ability’. As a crude example, for attacking players, the ratio of key passes to shots gives a good discrimination between clusters but the absolute number of key passes or shots varies within clusters. What I started working on was comparing, e.g. the rate of key passes, within clusters and calculating percentiles for players so that we can compare their performance against their peers.

What I like about this is that the comparison is more apples-to-apples; usually, a direct attacker and an attacking midfielder would be compared within the same group, which would likely under-rate the direct attacker. However, with this system, you can distinguish the different style/role of a player and make more appropriate comparisons – your tactical system may call for a direct attacker, so now you can look for players who fulfil that role and compare them.
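The within-cluster percentile idea can be sketched as follows. The player names and key-pass rates are hypothetical, and the percentile definition used here (share of peers a player matches or exceeds) is one simple choice among several, not necessarily the one used in the analysis.

```python
from collections import defaultdict

# Hypothetical players: (name, cluster, key passes per 90). Illustrative only.
players = [
    ("A", "direct attacker", 0.8),
    ("B", "direct attacker", 1.2),
    ("C", "direct attacker", 1.6),
    ("D", "attacking midfielder", 2.0),
    ("E", "attacking midfielder", 3.0),
]


def percentile_within_cluster(rows):
    """Map each player to the percentage of same-cluster peers they match or exceed."""
    by_cluster = defaultdict(list)
    for _, cluster, value in rows:
        by_cluster[cluster].append(value)
    result = {}
    for name, cluster, value in rows:
        peers = by_cluster[cluster]
        result[name] = 100 * sum(v <= value for v in peers) / len(peers)
    return result


# Compare against peers in the same cluster.
percentiles = percentile_within_cluster(players)
# Compare against everyone at once by putting all players in one group.
pooled = percentile_within_cluster([(name, "all", value) for name, _, value in players])

print(f"Player C within cluster: {percentiles['C']:.0f}, pooled: {pooled['C']:.0f}")
```

In this toy data, player C tops the direct attackers but sits mid-table once attacking midfielders are pooled in, which is the under-rating effect described above.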

Finally, your point about player statistical output being greatly influenced by team tactics is a good one. I would say this is an issue for all player and team analysis and it is something we need to be aware of (this issue isn’t just limited to statistical analysis either). I’ve also used a similar system and some other techniques to look at team playing style/tactics, which could be useful here (as have others in the analytics community).

Thanks again.

Thanks for the quick reply and answers, Will!

1. Good point on between-cluster discrimination being the ratio of certain stats, while within-cluster variance is explained by absolutes. You mentioned that you started looking at percentiles within clusters: do you control for club ‘level’ or performance? (e.g. Cesc, being at a ‘highly competitive club’, will naturally be ‘given’ more opportunities for certain offensive statistics than a similar player at a ‘less competitive’ club, even after conditioning on cluster.)

2. W/r/t the tactics influencing statistical output, have you thought about a time-varying latent structure model? I’m thinking something like a latent/hidden Markov model where your transition matrix (from latent state-to-state [read: cluster-to-cluster]) could be influenced by opposition, cluster of team mates on the field, etc. That way, you could, in essence, explain cluster-transition (proxy for tactical change) when an unexpected state transition occurs, conditioned on your included covariates. I’d assume the diagonal of the transition matrix would be close to 1, but it could be a very intriguing avenue to go down. The only downside is this would require game-level, player-level statistics (and some preprocessing/smoothing — think Kalman filter) and I am not very familiar with what data is available out there.
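As a much simpler starting point than the hidden Markov model suggested in point 2, the sketch below treats per-game cluster labels as directly observed and estimates an ordinary Markov transition matrix from them. It ignores covariates such as opposition or team mates, and the game-by-game sequence is hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical game-by-game cluster labels for one player.
sequence = ["controller"] * 6 + ["disruptor"] + ["controller"] * 3


def transition_matrix(states):
    """Estimate P(next state | current state) from consecutive pairs of labels."""
    counts = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


matrix = transition_matrix(sequence)
# Indices of games where the label differs from the previous game.
switches = [i + 1 for i, (a, b) in enumerate(zip(sequence, sequence[1:])) if a != b]

print(matrix["controller"])
print("Games with a change of cluster:", switches)
```

A diagonal entry close to 1 matches the expectation that players mostly stay in the same cluster, and the flagged games are the candidates for a tactical change worth explaining.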

I look forward to reading the rest of your posts!

1. I’m very much at the start of this, so I’ve not looked at normalising or how player stats alter when moving clubs. Obviously something to look at.

2. I’ll have to go away and read up on this! While I’ve read a little about Markov chains, it isn’t something I’ve actually used myself.
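On point 1, one crude way to start normalising for club level is to express a player's output as a share of the team's total. This is only an illustrative assumption about how such a control might look, not something done in the analysis, and the figures are hypothetical.

```python
# Hypothetical per-90 figures; the names and numbers are illustrative only.
players = [
    {"name": "X", "key_passes": 3.0, "team_key_passes": 12.0},
    {"name": "Y", "key_passes": 2.0, "team_key_passes": 6.0},
]


def team_share(player):
    """Fraction of the team's key passes supplied by this player."""
    return player["key_passes"] / player["team_key_passes"]


shares = {p["name"]: team_share(p) for p in players}
for name, share in shares.items():
    print(f"{name}: {share:.0%} of team key passes")
```

Here player X makes more key passes in absolute terms, but player Y supplies a larger share of a less productive team's total.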

Thanks again.
