Hi, and thank you to everyone who made this competition possible: Laduma Analytics for funding it and providing the data, Zindi for hosting it, Layer for contributing to it, and all the competitors for making it more challenging for each of us.
The competition duration and the number of submissions worked well for me because I joined the competition during the first days. I focused on a single competition, unlike some others; I'm not looking at anyone, @JONNY and @Koleshjr. :-)
1. We generate some new features and later compress some of them with a simple neural network (Keras) used as an autoencoder: we set input = output (which can sound a bit strange the first time you use it), and once training is done we are interested in the values at the middle (bottleneck) layer of the network, because we'll use them as features. To decide which features to compress we do this:
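A minimal sketch of the autoencoder idea. Assumptions: the original used Keras, but the same trick can be shown with scikit-learn's MLPRegressor; the data, layer sizes and feature count here are invented for illustration. We fit the network with input = output and then keep the activations of the middle layer as compressed features.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # stand-in for the features to compress
X = StandardScaler().fit_transform(X)

# 10 -> 6 -> 3 -> 6 -> 10: the 3-unit middle layer is the bottleneck we keep
ae = MLPRegressor(hidden_layer_sizes=(6, 3, 6), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)                            # input = output

def encode(X):
    """Forward pass up to the bottleneck layer, reusing the fitted weights."""
    h = X
    for W, b in zip(ae.coefs_[:2], ae.intercepts_[:2]):
        h = np.maximum(h @ W + b, 0)    # relu
    return h

compressed = encode(X)
print(compressed.shape)                 # 3 new features per row
```

In Keras you would build the same shape with Dense layers and read the bottleneck with a second Model that shares the encoder layers; the principle is identical.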
2. We will later also "compress" the rows to keep only the relevant ones, using the following method: we run a prediction (supervised learning) but use its results more like the output of a clustering algorithm (unsupervised learning). We do this to get rid of what we do not want to analyze, so we can concentrate our effort and our computing power on the most relevant data.
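A hedged sketch of using a supervised model as a row filter. Assumptions: the data, the "relevance" label and the 0.5 threshold are invented for illustration; the idea is just to train a quick classifier and keep only the rows it scores as relevant, treating the scores like cluster assignments.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
# proxy label for "is this row relevant?" (illustrative only)
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

keep = proba > 0.5                      # use the scores like cluster labels
X_relevant = X[keep]
print(len(X_relevant), "of", len(X), "rows kept")
```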
I thought the above was a nice approach until Marzoc shared a far simpler and less compute-intensive one. To make sure we are all on the same page: everything above gets to the same point where Marzoc got on his first step in just 2 lines of code.
3. To try to predict which team scored a goal based on the above data we can:
a. Predict the ones that have a play pattern that is easy to predict. (We don't need it now, but here we also want to learn to differentiate attacking field players from goalkeepers.)
b. For the difficult ones we can get some help by identifying whether the involved players are attacking field players or goalkeepers, taking into account the following:
4. Final prediction. Once we have predicted the goals we can get to the final prediction of the match outcomes. I used a Support Vector Machine, in one of the submissions mixed with a RandomForest. If you think that SVMs do not natively provide probability estimates you are 100% right, although you can still get (calibrated) estimates by just enabling probability=True.
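A minimal sketch of this final step. Assumptions: the features and targets are random placeholders, and "mixed with a RandomForest" is shown here as simple probability averaging (soft voting), which is one plausible way to combine the two models; the write-up does not specify the exact stacking.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)        # 0 = lose, 1 = draw, 2 = win

# probability=True makes SVC fit an internal calibration so that
# predict_proba is available (it is not native to the SVM itself)
svm = SVC(probability=True, random_state=2).fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)

# average the two probability estimates (both share the same class order)
proba = (svm.predict_proba(X) + rf.predict_proba(X)) / 2
print(proba[0], proba[0].sum())         # each row sums to 1
```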
I removed part of Game_ID ID_HPYKEW7R from the train data because it looked duplicated.
I kept the following flag for missing data because I wanted to keep an eye on it during the whole process. If I simply removed the missing data the model was ruined, but dealing with it wasn't that easy: caution = test_stats[(test_stats["Passes"].isna()) | (test_stats["Half"].isna())].index
This didn’t help but I wanted to share. :-) test_stats["Manager"]=test_stats["Manager"].fillna("Shy_Manager")
It looks like Player_X982WR9W is the token that the AI which processed the match videos assigned to unknown players, so let's change the name for 2 reasons:
I tried to identify the Unknown player using different approaches, but I didn't succeed. For example, if I have a play by the Unknown player at minute 60, I tried to identify the same team's goalkeeper plays before and after minute 60. If the goalkeeper's name was xxxxxx both before and after, I could conclude that the Unknown player wasn't that team's goalkeeper at minute 60, so he was a field player, which would unlock some further processing. It didn't work a single time here, but with different data I believe this check has the potential to be useful.
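The before/after goalkeeper check can be sketched like this. Assumptions: the dataframe layout, column names and toy values are invented for illustration; only the logic (same named goalkeeper on both sides of the minute implies the Unknown player is a field player) comes from the description above.

```python
import pandas as pd

plays = pd.DataFrame({
    "Team":   ["A", "A", "A", "A"],
    "Player": ["GK_Smith", "Unknown", "GK_Smith", "F_Jones"],
    "Role":   ["goalkeeper", "?", "goalkeeper", "field"],
    "Minute": [40, 60, 75, 80],
})

def unknown_is_field_player(plays, team, minute):
    """True if the same named goalkeeper appears both before and after
    `minute` for `team`, so the Unknown player at `minute` can't be the GK."""
    gk = plays[(plays["Team"] == team) & (plays["Role"] == "goalkeeper")]
    before = set(gk.loc[gk["Minute"] < minute, "Player"])
    after = set(gk.loc[gk["Minute"] > minute, "Player"])
    return len(before & after) > 0      # same keeper on both sides

print(unknown_is_field_player(plays, "A", 60))
```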
Wow, this is very good. The real challenge must have come from identifying players and their positions, but the test dataset didn't have an actions column, so how did you deal with that, and also with teams that had new players? Were you predicting goals per match and then classifying it as win, lose or draw according to the goals?
Answering your second question: you are right. First I predict goals per match, and then I run an SVM classifier (alone or stacked with a RandomForestClassifier) to get the probabilities of win, lose or draw.
To answer your first question, imagine you have the following plays in test, you expect to have 1 striker and 1 goalkeeper in each play, and you want to know who is who:
You start 1st iteration:
You start second iteration so you go again to Play 1:
To make the above less prone to errors, you only modify who you consider a striker or a goalkeeper at the beginning of each iteration, not during the iteration. Additionally, you will want to:
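The iterative idea can be sketched as follows. Assumptions: the play structure, player names and the "known role" seed are invented, and each play is reduced to a (playerA, playerB) pair with exactly one striker and one goalkeeper; the point being illustrated is that role updates are collected during an iteration and only applied between iterations.

```python
plays = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]  # pairs seen in plays
roles = {"P1": "striker"}               # seed: one player with a known role

for _ in range(5):
    updates = {}                        # collect, don't apply mid-iteration
    for a, b in plays:
        # with one striker and one goalkeeper per play, knowing one
        # player's role fixes the other's
        if a in roles and b not in roles:
            updates[b] = "goalkeeper" if roles[a] == "striker" else "striker"
        if b in roles and a not in roles:
            updates[a] = "goalkeeper" if roles[b] == "striker" else "striker"
    if not updates:
        break                           # nothing new learned: stop iterating
    roles.update(updates)               # apply only between iterations

print(roles)
```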
Pls, share your notebook.
Thanks for sharing such a detailed approach. The corner cases you mentioned, like a goalkeeper's own goal or the probability of him being sent off, are the real problems of the dataset, nicely noted 👌. The cleaning part is also nice. I think some kind of manual postprocessing of the model predictions might improve the performance.
Amazing approach👏 Congrats🥳🥳🥳
Awesome solution + write-up. Really motivating to see that you went the extra mile with many of the features you describe, like identifying attacking players and goalkeepers. I had stuff like that in mind too, but I guess I was just too lazy to dig deeper. I will remember this for the next competition ;)