First of all, I’d like to thank Zindi for hosting this interesting yet quite challenging competition.
My Honest Point Of View
During this competition, as I understand it, the main goal was to test the relevance of GeoFM foundation models. But to me this purpose was, and still is, a bottleneck, because no proper minimal training data was provided. That added a data-centric dimension to the challenge: someone with an average model and a good dataset could beat a great model trained on datasets that are not aligned with the test dataset.
It was even more challenging because the potential datasets I found did not share the same temporality or the same scale as the provided data. To me, this is what discouraged so many participants.
Datasets I Used
I used two datasets:
CIV Data: the dataset from the Côte d’Ivoire Byte-Sized Agriculture Challenge; my feature engineering for that challenge can be found here: https://www.kaggle.com/code/ulrich07/civ-data/notebook.
Probed Data: a dataset derived by LB probing (explained in more detail below).
PIPELINE
· Step 1: LB Probing
o At the beginning of the challenge, my first attempts could not break the 0.85 barrier. So I decided to at least submit the MLE estimates, which are equivalent to the public-set mean by class. This can be done with only 3 submissions (by solving a simple linear system). This gave me the public-set class means: cocoa (0.75261324), oil (0.09946151), rubber (0.14792524). At this stage you should score Public = 0.72614346 and Private = 0.742866516.
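The probing step above can be sketched as follows. This is a minimal illustration, not the author's actual submissions: it assumes the leaderboard metric behaves like a multiclass log loss, so each constant-probability submission yields one linear equation in the unknown class proportions; the probe vectors are hypothetical.

```python
import numpy as np

# Assumed (hypothetical) true public-set class proportions -- the values
# the post reports recovering: cocoa, oil palm, rubber.
p_true = np.array([0.75261324, 0.09946151, 0.14792524])

# Three hypothetical constant-probability submissions (one row each).
probes = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

# Under a log-loss metric, score = -(p1*log(q1) + p2*log(q2) + p3*log(q3)),
# which is linear in the unknown proportions p.
A = np.log(probes)
scores = -A @ p_true          # the three LB scores we would observe

# Solve the 3x3 linear system to recover the class proportions.
p_hat = np.linalg.solve(A, -scores)
```

With an exact metric, `p_hat` reproduces the class proportions to machine precision; on a real leaderboard, rounding of the displayed score limits the recoverable digits.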
· Step 2: Generation of GeoFM Embeddings for CIV Data and Test Data
o The GeoFM embeddings are generated per quarter, based on the public notebook shared by the hosts. But to do so, I needed to solve two problems. The first was to line up the scale of CIV Data with the provided dataset: I simply multiplied every common variable by the mean ratio so that the means line up. The second was to predict the variables of the provided dataset that are missing from CIV Data: I did this with a simple Ridge regression. The notebook is available here (https://www.kaggle.com/code/ulrich07/hf-geofm3/notebook).
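The two alignment tricks in Step 2 can be sketched like this. Everything here is synthetic: the column names, shapes, and Ridge hyperparameters are assumptions for illustration, not the notebook's actual variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic stand-ins: "provided" has columns a, b, c; "civ" only has a, b,
# and on a different scale.
provided = pd.DataFrame(rng.normal(5, 1, size=(100, 3)), columns=["a", "b", "c"])
civ = pd.DataFrame(rng.normal(10, 2, size=(100, 2)), columns=["a", "b"])

# Trick 1: rescale each common variable by the ratio of means,
# so the CIV means line up with the provided-data means.
common = ["a", "b"]
ratio = provided[common].mean() / civ[common].mean()
civ_aligned = civ[common] * ratio

# Trick 2: impute the missing variable "c" with a simple Ridge regression
# fitted on the provided data and applied to the rescaled CIV data.
model = Ridge(alpha=1.0).fit(provided[common], provided["c"])
civ_aligned["c"] = model.predict(civ_aligned[common])
```

Note the mean-ratio rescaling matches first moments only; variances can still differ between the two datasets.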
· Step 3: Feature Engineering
o I combined classical feature engineering (averaging variables over time) with a PCA reduction of the quarterly GeoFM embeddings.
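A minimal sketch of that feature step, assuming quarterly embeddings arranged as a `(samples, quarters, dims)` array; the shapes and component count are illustrative, not the write-up's actual values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical quarterly GeoFM embeddings: 200 samples, 4 quarters, 64 dims.
emb = rng.normal(size=(200, 4, 64))

# Classical feature engineering: average each embedding dimension over time.
time_avg = emb.mean(axis=1)                 # shape (200, 64)

# PCA reduction of the flattened quarterly embeddings.
flat = emb.reshape(len(emb), -1)            # shape (200, 256)
emb_pca = PCA(n_components=16).fit_transform(flat)

# Final tabular feature block: temporal averages + compact PCA features.
features = np.hstack([time_avg, emb_pca])   # shape (200, 80)
```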
· Step 4: Final Model
o The final model was a simple blend of a LightGBM and an XGBoost for each dataset I used. The CIV Data model scored Public = 0.732999723, Private = 0.752169045, which is even below the static MLE performance. (The model code is here: https://www.kaggle.com/code/ulrich07/geofm-civ-final/notebook )
FINAL THOUGHTS
At this stage of the competition, it seems that my engineered tabular features are more predictive than the GeoFM embeddings. That doesn’t mean the GeoFM embeddings are useless; I think the models need to be fine-tuned to produce more relevant embeddings. Unfortunately, I, like most participants, was not able to do that during the challenge.
It was quite frustrating that we couldn’t properly beat simple static MLE estimates, but I am almost sure this is due to the lack of a good, readily available dataset.
All the data I used has been made public on Kaggle as well, if you want to play around with it...
Ah, thanks for this. This is the same reason I got discouraged from competing.
I saw the following code in your solution: https://www.kaggle.com/code/ulrich07/geofm-civ-final/notebook
tr["wgt"] = tr["Target"].map({"Cocoa":0.75261324, "Palm":0.09946151, "Rubber":0.14792524}) / tr["nb"]
It appears that you manually labeled the test set (which is different from pseudo-labeling, where labels are generated from the model's predictions), concatenated it with the CIV training data, then trained the model and used it to re-predict the test set. This causes the model to almost always predict the test samples as the three values you manually assigned. I'm quite certain this violates the rules, as we are not allowed to interfere with the prediction results through hand-labeling.
NO, THIS IS NOT WHAT I DID. I think you MISREAD the code. There is no manual test-set labelling. If you read the code carefully, these are just sample weights used to train the models, not hand-labelling.
Maybe you are trying to get me out of the LB 😂.
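For anyone else reading the quoted line, here is one plausible interpretation as a sample-weighting scheme (an assumption on my part, since the notebook's `nb` column is not shown here): each training class's total weight is scaled to its public-LB proportion by dividing the class proportion by the number of training rows in that class. No test labels are involved.

```python
import pandas as pd

# Tiny synthetic training frame; labels are TRAINING labels, not test labels.
tr = pd.DataFrame({"Target": ["Cocoa", "Cocoa", "Palm", "Rubber"]})

# Assumption: "nb" is the per-class row count in the training data.
tr["nb"] = tr.groupby("Target")["Target"].transform("size")

# The quoted line: class proportion from LB probing divided by class size,
# so each class's weights sum to its public-set proportion.
tr["wgt"] = tr["Target"].map(
    {"Cocoa": 0.75261324, "Palm": 0.09946151, "Rubber": 0.14792524}
) / tr["nb"]
```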
Sorry, it seems I misunderstood the above code.
No, I don't want to get you out of the leaderboard 😂
Thank you @marching_learning for the write up and code. Congratulations once again!