Laduma Analytics Football League Winners Prediction Challenge
Can you predict the outcome of a football match based on historical data?
2nd place solution
5 Sep 2022, 22:01 · 6

Hi, and thank you to everyone who made this competition possible: Laduma Analytics for funding it and providing the data, Zindi for hosting it, Layer for contributing to it, and all the competitors for making it more challenging for each of us.

The competition duration and the number of submissions worked well for me because I joined during the first days. I focused on a single competition, unlike others. I’m not looking at anyone, @JONNY and @Koleshjr. :-)

My approach:

1. We generate some new features and later compress some of them with a simple neural network (Keras) used as an encoder/decoder, i.e. an autoencoder. We set input = output (this can sound a bit strange the first time you use it), and once training is done we take the values at the middle layer of the network and use them as features. To decide which features to compress, we do this:

  • cut=1.05
  • mean_diff_goal_vs_no_goal=abs(X_train[y_train==True].describe().loc["mean"]/X_train[y_train==False].describe().loc["mean"])
  • cols_to_compress=mean_diff_goal_vs_no_goal[mean_diff_goal_vs_no_goal<cut].index
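As a rough illustration of step 1, here is a minimal sketch on synthetic data. The column names and data are invented for the example, and scikit-learn's MLPRegressor is swapped in for the Keras network purely to keep the snippet small; it is not the code used in the competition.

```python
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (hypothetical column names).
rng = np.random.default_rng(0)
n = 200
y_train = pd.Series(rng.random(n) < 0.3)
X_train = pd.DataFrame({
    "informative": np.where(y_train, 5.0, 1.0) + rng.normal(0, 0.1, n),
    "weak_a": 10.0 + rng.normal(0, 0.1, n),
    "weak_b": 20.0 + rng.normal(0, 0.1, n),
    "weak_c": 30.0 + rng.normal(0, 0.1, n),
})

# Same selection rule as in the post: compress columns whose mean barely
# changes between goal and no-goal rows.
cut = 1.05
mean_ratio = (X_train[y_train].mean() / X_train[~y_train].mean()).abs()
cols_to_compress = mean_ratio[mean_ratio < cut].index

# Autoencoder stand-in: train the network to reproduce its own input
# through a 2-unit hidden layer (the bottleneck).
scaled = StandardScaler().fit_transform(X_train[cols_to_compress])
autoencoder = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(scaled, scaled)

# The bottleneck activations are the compressed features.
encoded = np.tanh(scaled @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])
print(list(cols_to_compress), encoded.shape)
```

Note that with this rule a ratio far below 1 also passes the `< cut` test, so a column whose mean is much lower in goal rows would be compressed as well.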

2. Later we also "compress" the rows to keep only the relevant ones, using the following method: we run a prediction (supervised learning) but use its results more like the output of a clustering algorithm (unsupervised learning). We do this to get rid of what we do not want to analyze, so we can concentrate our effort and computing power on the most relevant data.

I thought the above was a nice approach until Marzoc shared a far simpler and less compute-intensive one. To make sure we are all on the same page: all of the above gets to the same point that Marzoc reached in his first step with just 2 lines of code.

3. To try to predict which team scored a goal based on the above data we can:

a. Predict the ones that have a play pattern easy to predict. (We don’t need it now but here we also want to learn to differentiate field attacking players from goalkeepers.)

b. For the difficult ones we can get some help if we rely on identifying if the involved players are field attacking players or goalkeepers taking into account the following:

  • What we learned when predicting the easy ones.
  • The identifications that we can easily generate from train data. E.g. true_goalkeepers=pd.DataFrame(train_stats[train_stats["Action"]=="Goals conceded"]["Player_ID"])
  • During this whole process we keep learning new field attacking players and goalkeepers, which is why we start iterating from the players we identified with the most confidence and add players identified with less confidence at each iteration. E.g. in the first round we only accept players scoring or conceding 50 goals, in the second round 40, and so on downwards, so each round requires fewer goals to identify a player while we have also processed more data.
  • A goalkeeper can score a goal, for example at minute 93 or in other situations. We have to take care of this; otherwise a player with 30 goals conceded (clearly a goalkeeper) who also scores one goal would be identified as a field player too because of that uncommon goal. My model is currently unable to predict such uncommon situations.
  • A field player may end up playing as goalkeeper if the goalkeeper gets a red card and the manager has run out of substitutions. If a goal is conceded this way, we also do not want to use it to identify a striker as a goalkeeper. My model is currently unable to predict such uncommon situations.
  • Own goals are quite difficult because a player suddenly appears to be playing for the opposite team, and if the goalkeeper is the one scoring the own goal it gets even weirder.
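The decreasing-threshold idea from the bullets above could be sketched roughly as follows. The "Goal" action name, the `dominance` rule for ignoring uncommon events, and the tiny thresholds are assumptions made for the example; only "Goals conceded" and "Player_ID" come from the data described in this post.

```python
import pandas as pd

CONCEDED = "Goals conceded"   # action name used in the post
SCORED = "Goal"               # hypothetical action name for a scored goal

def label_roles(stats, thresholds, dominance=3):
    """Label players round by round, strictest threshold first.

    `dominance` is an assumed guard: a role is accepted only if its count
    clearly outweighs the opposite one, so a goalkeeper who scored once is
    not also labelled a field player.
    """
    counts = (stats.groupby(["Player_ID", "Action"]).size()
              .unstack(fill_value=0)
              .reindex(columns=[CONCEDED, SCORED], fill_value=0))
    roles = {}
    for threshold in thresholds:            # e.g. 50, 40, ... in the post
        for player, row in counts.iterrows():
            if player in roles:
                continue                    # keep the more confident label
            conceded, scored = int(row[CONCEDED]), int(row[SCORED])
            if conceded >= threshold and conceded > dominance * scored:
                roles[player] = ("goalkeeper", threshold)
            elif scored >= threshold and scored > dominance * conceded:
                roles[player] = ("field", threshold)
    return roles

# Toy data: gk1 concedes 5 and scores 1, st1 scores 4, st2 scores 2.
train_stats = pd.DataFrame({
    "Player_ID": ["gk1"] * 5 + ["st1"] * 4 + ["gk1"] + ["st2"] * 2,
    "Action": [CONCEDED] * 5 + [SCORED] * 4 + [SCORED] + [SCORED] * 2,
})
roles = label_roles(train_stats, thresholds=[5, 4, 3, 2])
print(roles)
```

The threshold stored with each label records in which round the player was accepted, which can serve as a crude confidence value in later steps.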

4. Final prediction. Once we have predicted the goals we can get to the final prediction of the match outcomes. I used a Support Vector Machine, in one of the submissions mixed with a RandomForest. If you think that SVMs do not natively provide probability estimates you are 100% right, although you can get them by just enabling probability=True.
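A minimal sketch of this last step with scikit-learn is shown below. The features (noisy predicted goals per side), the labels, and the soft-voting blend are assumptions for the example; the post does not say how the SVM and the RandomForest were combined.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical inputs: predicted goals for the home and away team.
rng = np.random.default_rng(0)
goals = rng.poisson(1.4, size=(300, 2)).astype(float)
X = goals + rng.normal(0, 0.3, size=goals.shape)
# Outcome label: 1 = home win, 0 = draw, -1 = away win.
y = np.sign(goals[:, 0] - goals[:, 1]).astype(int)

# probability=True makes SVC fit an extra calibration step so that
# predict_proba becomes available.
svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Soft voting averages the class probabilities of both models.
blend = VotingClassifier([("svm", svm), ("rf", forest)], voting="soft")
blend.fit(X, y)

proba = blend.predict_proba(X[:5])
print(blend.classes_)   # column order of the probabilities
print(proba.round(3))
```

Enabling probability=True slows down training because the calibration relies on internal cross-validation, which is worth keeping in mind on larger data.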

Data cleaning notes:

I removed part of the rows with Game_ID ID_HPYKEW7R from the train data because they looked duplicated.

I kept the following flag for missing data because I wanted to keep an eye on it during the whole process. Removing the missing data ruined the model, but dealing with it wasn't that easy: caution=test_stats[(test_stats["Passes"].isna())|(test_stats["Half"].isna())].index

This didn’t help but I wanted to share. :-) test_stats["Manager"]=test_stats["Manager"].fillna("Shy_Manager")

It looks like Player_X982WR9W is the token that the AI which processed the match videos assigned to unknown players, so let’s change the name for 2 reasons:

  • So it’s more human readable
  • And even more important, every time we will check whether we’ve learned something about the “Unknown” player, because it is noise for the model: sometimes the Unknown player is a field player, sometimes it’s a goalkeeper, and it can belong to any team. So bye bye Unknown player, we don’t want misleading information. To rename it (using .loc with a boolean mask, which avoids chained assignment): relevant_actions_test.loc[relevant_actions_test["Player_ID"]=="Player_X982WR9W","Player_ID"]="Unknown"
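The cleaning steps from these notes can be tried together on a toy frame. The rows below are invented, and a single frame stands in for both test_stats and relevant_actions_test to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up).
test_stats = pd.DataFrame({
    "Player_ID": ["Player_A", "Player_X982WR9W", "Player_B", "Player_X982WR9W"],
    "Passes": [3.0, np.nan, 1.0, 2.0],
    "Half": [1.0, 2.0, np.nan, 2.0],
    "Manager": ["Manager_1", None, "Manager_2", None],
})

# Flag rows with missing Passes or Half instead of dropping them.
caution = test_stats[test_stats["Passes"].isna() | test_stats["Half"].isna()].index

# Give missing managers an explicit placeholder name.
test_stats["Manager"] = test_stats["Manager"].fillna("Shy_Manager")

# Rename the placeholder player with a boolean mask and .loc, which
# writes to the frame itself rather than to a temporary copy.
is_unknown = test_stats["Player_ID"] == "Player_X982WR9W"
test_stats.loc[is_unknown, "Player_ID"] = "Unknown"

print(list(caution))
print(test_stats)
```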

What didn't work for me:

I tried to identify the Unknown player using different approaches but I didn’t succeed. For example, if I had a play by the Unknown player at minute 60, I looked for plays by the same team's goalkeeper before and after minute 60. If the goalkeeper's name was xxxxxx both before and after, I could conclude that the Unknown player wasn’t that team's goalkeeper at minute 60, so he was a field player, which would unlock some further processing. It didn’t work a single time, although with different data I understand that this has the potential to be useful.

Discussion 6 answers

Wow, this is very good. The real challenge must have come from identifying players and their positions, but the test dataset didn't have an actions column, so how did you deal with that, and also with teams with new players? Were you predicting goals per match and then classifying the match as win, lose or draw according to the goals?

6 Sep 2022, 00:00

Answering your second question: you are right, first I predict goals per match and then I run an SVM classifier (alone or stacked with a RandomForestClassifier) to get the probabilities of win, lose or draw.

To answer your first question, imagine you have the following plays in test, you expect to have 1 striker and 1 goalkeeper in each play, and you want to know who is who:

  • Play 1: player A, player B
  • Play 2: player B, player C

You start the 1st iteration:

  1. You try to process play 1, but if right now you have no idea about players A and B you have to skip play 1. Do nothing now, we’ll come back later.
  2. You get to play 2 and you still have no idea about player B, but if from the train data you concluded that player C looks like a striker, then you can conclude that player B is a goalkeeper. You’ve learnt what you wanted to learn from play 2, so you remove it from the pending actions to be reviewed.

You start the second iteration, so you go back to play 1:

  1. This time you know that player B is a goalkeeper, so you know that player A is a striker. By iterating again and again you slowly start identifying the test players which initially were not possible to identify.

To make the above less prone to errors, you only modify what you consider a striker and a goalkeeper at the beginning of each iteration, but not during the iteration. Additionally, you will want to:

  • Clear the Unknown player: you do not want that noise to propagate and start a chain of misclassified players.
  • Set a threshold for being considered a goalkeeper or a field player and reduce it at each iteration (50 goals, 40 goals, …). It’s important to require several goals. Consider e.g.:
      • Play 10: player D, player E
      • Play 11: player E, player Unknown
    1. If we’ve just learnt that player D conceded a goal and we categorize the player as a goalkeeper, we may be doing it wrong (e.g. the goalkeeper could get a red card at minute 85 with all substitutions used, so player D, a defender, plays the last minutes as goalkeeper and can concede one goal).
    2. If, because of a single goal, we wrongly consider player D a goalkeeper, then when we get to play 10 we will conclude that E is a striker, while the play could be the opposite: for example, a defender (D) scoring from a corner kick, with E being the goalkeeper.
    3. In turn, we could end up using E to misclassify play 11, and so on. That’s why it’s important to start from what we believe is safest to use and go down from there, to reduce the risk of error.
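The iteration described in this answer can be sketched in a few lines. The function name, the role labels, and the `max_rounds` cap are illustrative assumptions; the example reuses plays 1 and 2 from above plus one play involving the Unknown player.

```python
UNKNOWN = "Unknown"
OPPOSITE = {"striker": "goalkeeper", "goalkeeper": "striker"}

def propagate_roles(plays, known_roles, max_rounds=10):
    """Infer roles from plays assumed to contain 1 striker and 1 goalkeeper."""
    roles = dict(known_roles)
    # Drop plays with the Unknown player so that noise cannot propagate.
    pending = [play for play in plays if UNKNOWN not in play]
    for _ in range(max_rounds):
        snapshot = dict(roles)   # roles are frozen during the iteration
        learned = {}
        still_pending = []
        for first, second in pending:
            role_first, role_second = snapshot.get(first), snapshot.get(second)
            if role_first and not role_second:
                learned.setdefault(second, OPPOSITE[role_first])
            elif role_second and not role_first:
                learned.setdefault(first, OPPOSITE[role_second])
            elif not role_first and not role_second:
                still_pending.append((first, second))  # come back later
            # If both roles are known there is nothing left to learn.
        if not learned:
            break                # no progress, stop iterating
        roles.update(learned)    # apply updates only between iterations
        pending = still_pending
    return roles

plays = [("A", "B"), ("B", "C"), ("B", UNKNOWN)]
roles = propagate_roles(plays, known_roles={"C": "striker"})
print(roles)
```

In the first round only play 2 can be resolved (B becomes a goalkeeper), and in the second round play 1 yields A as a striker, matching the walkthrough above. The decreasing goal threshold would decide which players enter `known_roles` before each run.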

Pls, share your notebook.

6 Sep 2022, 05:33

Thanks for sharing such a detailed approach. The corner cases you mentioned, like a goalkeeper's own goal or the chance of the goalkeeper being sent off, are the real problems of the dataset, nicely noted👌. The cleaning part is also nice. I think some kind of manual postprocessing of the model predictions might improve the performance.

Amazing approach👏 Congrats🥳🥳🥳

6 Sep 2022, 08:45

Awesome solution + write-up. Really motivating to see that you went the extra mile with many of the features you describe, like identifying attacking players and goalkeepers. I had stuff like that in mind too, but I guess I was just too lazy to dig deeper. I will remember this for the next competition ;)

6 Sep 2022, 11:12