First of all, I’d like to thank Zindi for hosting this interesting yet quite challenging competition.
My Honest Point Of View
During this competition, as I understand it, the main goal was to test the relevance of GeoFM foundation models. But to me this purpose was, and still is, a bottleneck, because no proper minimal training data was provided. That added a data-centric dimension to the challenge: someone with an average model and a good dataset could beat a great model trained on datasets that are not aligned with the test dataset.
It was even more challenging because the potential datasets I found did not share the same temporality or the same scale as the provided data. To me, this is what discouraged so many participants.
Datasets I Used
I used two datasets:
CIV Data: the dataset from the Côte d’Ivoire Byte-Sized Agriculture Challenge; my feature engineering for that challenge can be found here: https://www.kaggle.com/code/ulrich07/civ-data/notebook.
Probed Data: a dataset derived by LB probing (explained in more detail below).
PIPELINE
· Step 1: LB Probing
o At the beginning of the challenge, my first attempts could not break the 0.85 barrier. So I decided to at least submit the MLE estimates, which are equivalent to the public-set mean by class. This can be done with only 3 submissions (by solving a simple linear system). This gave me the public-set class means: cocoa (0.75261324), oil (0.09946151), rubber (0.14792524). At this stage you should score Public = 0.72614346 and Private = 0.742866516.
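The probing step above can be sketched as follows. This is a minimal illustration, not the author's actual submissions: it assumes the leaderboard metric behaves like a multiclass log loss, so each constant-probability submission yields one linear equation in the unknown class proportions; the probe vectors are hypothetical.

```python
import numpy as np

# Assumed (hypothetical) true public-set class proportions -- the values
# the post reports recovering: cocoa, oil palm, rubber.
p_true = np.array([0.75261324, 0.09946151, 0.14792524])

# Three hypothetical constant-probability submissions (one row each).
probes = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# Under a log-loss metric, score = -(p1*log(q1) + p2*log(q2) + p3*log(q3)),
# which is linear in the unknown proportions p.
A = np.log(probes)
scores = -A @ p_true          # the three LB scores we would observe

# Solve the 3x3 linear system to recover the class proportions.
p_hat = np.linalg.solve(A, -scores)
```

With an exact metric, `p_hat` reproduces the class proportions to machine precision; on a real leaderboard, rounding of the displayed score limits the recoverable digits.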
· Step 2: Generation of GeoFM Embeddings for CIV Data and Test Data
o The GeoFM embeddings are generated per quarter, based on the public notebook shared by the hosts. But to do so, I needed to solve two problems. The first was to line up the scale of CIV Data with the provided dataset: I simply multiplied every common variable by the mean ratio so that the means line up. The second was to predict the variables of the provided dataset that are missing from CIV Data: I did this with a simple Ridge regression. The notebook is available here (https://www.kaggle.com/code/ulrich07/hf-geofm3/notebook).
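The two alignment tricks in Step 2 can be sketched like this. Everything here is synthetic: the column names, shapes, and Ridge hyperparameters are assumptions for illustration, not the notebook's actual variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins: "provided" has columns a, b, c; "civ" only has a, b,
# and on a different scale.
provided = pd.DataFrame(rng.normal(5, 1, size=(100, 3)), columns=["a", "b", "c"])
civ = pd.DataFrame(rng.normal(10, 2, size=(100, 2)), columns=["a", "b"])

# Trick 1: rescale each common variable by the ratio of means,
# so the CIV means line up with the provided-data means.
common = ["a", "b"]
ratio = provided[common].mean() / civ[common].mean()
civ_aligned = civ[common] * ratio

# Trick 2: impute the missing variable "c" with a simple Ridge regression
# fitted on the provided data and applied to the rescaled CIV data.
model = Ridge(alpha=1.0).fit(provided[common], provided["c"])
civ_aligned["c"] = model.predict(civ_aligned[common])
```

Note the mean-ratio rescaling matches first moments only; variances can still differ between the two datasets.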
· Step 3: Feature Engineering
o I combined classical feature engineering (averaging variables over time) with a PCA reduction of the quarterly GeoFM embeddings.
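A minimal sketch of that feature step, assuming quarterly embeddings arranged as a `(samples, quarters, dims)` array; the shapes and component count are illustrative, not the write-up's actual values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical quarterly GeoFM embeddings: 200 samples, 4 quarters, 64 dims.
emb = rng.normal(size=(200, 4, 64))

# Classical feature engineering: average each embedding dimension over time.
time_avg = emb.mean(axis=1)                 # shape (200, 64)

# PCA reduction of the flattened quarterly embeddings.
flat = emb.reshape(len(emb), -1)            # shape (200, 256)
emb_pca = PCA(n_components=16).fit_transform(flat)

# Final tabular feature block: temporal averages + compact PCA features.
features = np.hstack([time_avg, emb_pca])   # shape (200, 80)
```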
· Step 4: Final Model
o The final model was a simple blend of a LightGBM and an XGBoost for each dataset I used. The CIV Data model scored Public = 0.732999723, Private = 0.752169045, which is even below the static MLE performance. (The model code is here: https://www.kaggle.com/code/ulrich07/geofm-civ-final/notebook )
FINAL THOUGHTS
At this stage of the competition, it seems that my engineered tabular features are more predictive than the GeoFM embeddings. That doesn’t mean the GeoFM embeddings are useless; I think the models need to be fine-tuned to produce more relevant embeddings. Unfortunately, I, like most participants, was not able to do that during the challenge.
It was quite frustrating that we couldn’t properly beat simple static MLE estimates, but I am almost sure this is due to the lack of a good, readily available dataset.
All the data I used has been made public on Kaggle as well, if you want to play around with it...
Ah, thanks for this. This is the same reason I got discouraged from competing.
I saw the following code in your solution: https://www.kaggle.com/code/ulrich07/geofm-civ-final/notebook
tr["wgt"] = tr["Target"].map({"Cocoa":0.75261324, "Palm":0.09946151, "Rubber":0.14792524}) / tr["nb"]
It appears that you manually labeled the test set (which is different from pseudo-labeling, where labels are generated from the model's predictions), concatenated it with the CIV training data, then trained the model and used it to re-predict the test set. This causes the model to almost always predict the test samples as the three values you manually assigned. I'm quite certain this violates the rules, as we are not allowed to interfere with the prediction results through hand-labeling.
NO, THIS IS NOT WHAT I DID. I think you MISREAD the code. There is no manual test-set labelling. If you read the code carefully, these are just sample weights used to train the models, not hand-labelling.
Maybe you are trying to get me out of the LB 😂.
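For anyone else reading the quoted line, here is one plausible interpretation as a sample-weighting scheme (an assumption on my part, since the notebook's `nb` column is not shown here): each training class's total weight is scaled to its public-LB proportion by dividing the class proportion by the number of training rows in that class. No test labels are involved.

```python
import pandas as pd

# Tiny synthetic training frame; labels are TRAINING labels, not test labels.
tr = pd.DataFrame({"Target": ["Cocoa", "Cocoa", "Palm", "Rubber"]})

# Assumption: "nb" is the per-class row count in the training data.
tr["nb"] = tr.groupby("Target")["Target"].transform("size")

# The quoted line: class proportion from LB probing divided by class size,
# so each class's weights sum to its public-set proportion.
tr["wgt"] = tr["Target"].map(
    {"Cocoa": 0.75261324, "Palm": 0.09946151, "Rubber": 0.14792524}
) / tr["nb"]
```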
Sorry, it seems I misunderstood the above code.
No, I don't want to get you out of the leaderboard 😂
Thank you @marching_learning for the write up and code. Congratulations once again!