
Amini GeoFM Decoding the Field Challenge

Helping Africa
$8 500 USD
Completed (8 months ago)
Classification
798 joined
153 active
Start: Jun 10, 25
Close: Jul 20, 25
Reveal: Jul 21, 25
marching_learning
Nostalgic Mathematics
My Not So Happy Journey To 1st Place : Honest Insights and Point Of View
Notebooks · 4 Aug 2025, 06:22 · 6

First of all, I’d like to thank Zindi for hosting this interesting but quite challenging competition.

My Honest Point Of View

During this competition, as I understand it, the main goal was to test the relevance of GeoFM foundation models. But to me, this purpose was, and still is, a bottleneck, because proper minimal training data was not provided. This added a data-centric dimension to the challenge: someone with an average model and a good dataset could beat a great model trained on datasets that are not aligned with the test dataset.

And it was even more challenging because the potential datasets I found did not share the same temporality or the same scale as the provided data. To me, this is what discouraged so many participants.

Datasets I Used

I used two datasets:

CIV Data: the dataset from the Côte d’Ivoire Byte-Sized Agriculture Challenge. My feature engineering for that challenge can be found here: https://www.kaggle.com/code/ulrich07/civ-data/notebook.

Probed Data: the dataset derived by LB probing (explained in more detail below).

PIPELINE

· Step 1: LB Probing

o At the beginning of the challenge, my first attempts could not break the 0.85 barrier. So I decided to at least submit the MLE estimates, which are equivalent to the public-dataset means by class. This can be done with only 3 submissions (solving a simple linear system). This gave me the public-dataset means: cocoa (0.75261324), oil (0.09946151), rubber (0.14792524). At this stage you should score Public = 0.72614346 and Private = 0.742866516.
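A minimal sketch of how the 3-submission probing could work, assuming (not stated explicitly above) that the metric is multiclass log loss and each probe is a constant probability row repeated for every test sample. The score of a constant submission P is then s = -Σⱼ mⱼ·log(Pⱼ), where mⱼ are the unknown class means, so three distinct probes give a 3x3 linear system. The probe vectors below are hypothetical choices:

```python
import numpy as np

# True public-set class means (cocoa, oil, rubber), as reported in the post.
true_means = np.array([0.75261324, 0.09946151, 0.14792524])

# Three hypothetical constant probability submissions (each row sums to 1).
probes = np.array([
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
])

# Log loss of a constant submission P is -sum_j m_j * log(P_j),
# so the leaderboard scores are a linear function of the means.
A = -np.log(probes)          # 3x3 system matrix
scores = A @ true_means      # scores the LB would return (simulated here)

# Solving the linear system recovers the class means.
recovered_means = np.linalg.solve(A, scores)
```

Submitting the recovered means as a constant prediction is exactly the static MLE baseline mentioned above.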

· Step 2: Generation of GeoFM Embeddings for CIV Data and Test Data

o The GeoFM embeddings are generated by quarter, based on the public notebook shared by the hosts. But to do so, I needed to solve two challenges. The first was to line up the scale of the CIV Data with the provided dataset: I simply multiplied every common variable by the mean ratio so that the means line up. The second was to predict the variables of the provided dataset that are not in the CIV Data, which I did with a simple Ridge Regression. It is available here: https://www.kaggle.com/code/ulrich07/hf-geofm3/notebook.
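The two fixes above can be sketched as follows. This is a toy illustration with synthetic columns, not the actual notebook code; the closed-form ridge solve stands in for whatever Ridge implementation was used:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Fix 1: mean-ratio alignment of a variable shared by both datasets ---
civ_col = rng.gamma(2.0, 50.0, size=1000)    # CIV Data column (different scale)
host_col = rng.gamma(2.0, 10.0, size=1000)   # provided-data column
civ_aligned = civ_col * (host_col.mean() / civ_col.mean())  # means now match

# --- Fix 2: ridge regression to impute a variable missing from CIV Data ---
# Predict the missing column y from 4 shared columns X (synthetic here).
X = rng.normal(size=(500, 4))
w_true = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=500)

lam = 1.0  # regularisation strength (hypothetical value)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
civ_imputed = X @ w_ridge  # imputed values for the missing variable
```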

· Step 3: Feature Engineering

o I combined classical feature engineering, averaging things over time, with a PCA reduction of the quarterly GeoFM embeddings.
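A sketch of this step with hypothetical shapes (200 fields, 4 quarters, 128-dim embeddings); the PCA is done via SVD on the centred matrix rather than any particular library:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical quarterly GeoFM embeddings: (fields, quarters, embedding dim).
emb = rng.normal(size=(200, 4, 128))

# Classical aggregation: average each embedding dimension over time.
emb_mean = emb.mean(axis=1)                    # (200, 128)

# PCA reduction of the flattened quarterly embeddings via SVD.
flat = emb.reshape(200, -1)                    # (200, 4*128)
flat_c = flat - flat.mean(axis=0)              # centre before PCA
_, _, Vt = np.linalg.svd(flat_c, full_matrices=False)
k = 32                                         # hypothetical component count
pca_feats = flat_c @ Vt[:k].T                  # (200, 32)

# Final tabular feature matrix.
features = np.hstack([emb_mean, pca_feats])    # (200, 160)
```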

· Step 4: Final Model

o The final models were a simple blend of a LightGBM and an XGBoost for every dataset I used. The CIV Data model scored LB = 0.732999723, Private = 0.752169045, which is even below the static MLE performance. (The model part is here: https://www.kaggle.com/code/ulrich07/geofm-civ-final/notebook.)
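The blend itself is a simple average of the two boosters' class probabilities. A sketch with made-up prediction rows (the actual blend weights are in the linked notebook; equal weights are an assumption here):

```python
import numpy as np

# Hypothetical predicted class probabilities (cocoa, oil, rubber) from
# the two boosters for two test samples.
p_lgb = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.60, 0.30]])
p_xgb = np.array([[0.80, 0.10, 0.10],
                  [0.20, 0.50, 0.30]])

# Equal-weight blend, renormalised so each row stays a valid distribution.
blend = 0.5 * p_lgb + 0.5 * p_xgb
blend /= blend.sum(axis=1, keepdims=True)
```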

FINAL THOUGHTS

At this point in the competition, it seems that my tabular engineered features are more predictive than the GeoFM embeddings. Yet that doesn’t mean the GeoFM embeddings are useless; I think we need to fine-tune the models to get more relevant embeddings. Unfortunately, I, like most participants, was not able to do so during the challenge.

And it was quite frustrating that we couldn't properly beat simple static MLE estimates. But I am almost sure this is due to the lack of a good, readily available dataset.

Discussion 6 answers
marching_learning
Nostalgic Mathematics

All the data used has been made public on Kaggle as well, if you want to play around...

4 Aug 2025, 06:52
Upvotes 0
zero_shot_assassin
Laikipia university

Ah, thanks for this. The same reason I got discouraged from competing.

4 Aug 2025, 07:20
Upvotes 1
3B

I saw the code in your solution as follows: https://www.kaggle.com/code/ulrich07/geofm-civ-final/notebook

tr["wgt"] = tr["Target"].map({"Cocoa":0.75261324, "Palm":0.09946151, "Rubber":0.14792524}) / tr["nb"]

It appears that you manually labeled the test set (which is different from pseudo-labeling, where labels are generated from the model's predictions), concatenated it with the CIV training data, then trained the model and used it to re-predict the test set. This causes the model to almost always predict the test samples as the three values you manually assigned. I'm quite certain this violates the rules, as we are not allowed to interfere with the prediction results through hand-labeling.

4 Aug 2025, 08:11
Upvotes 0
marching_learning
Nostalgic Mathematics

NO, THIS IS NOT WHAT I DID. I think you MISREAD the code. There is no manual test-set labelling. If you read the code carefully, those are just the sample weights used to train the models, not hand-labelling.

Maybe you are trying to get me out of the LB 😂.

3B

Sorry, it seems I misunderstood the above code.

No, I don't want to get you out of the leaderboard 😂

CodeJoe

Thank you @marching_learning for the write up and code. Congratulations once again!

4 Aug 2025, 11:18
Upvotes 0