
CGIAR Root Volume Estimation Challenge

Helping Africa
$15 000 USD
Completed (~1 year ago)
Computer Vision
Prediction
1063 joined
257 active
Start: Jan 24, 25
Close: Mar 09, 25
Reveal: Mar 10, 25
You can get top 10 with 6 lines of code!! LOL
Notebooks · 10 Mar 2025, 01:13 · 19

Congrats to the winners! When I reviewed my subs, I found one submitted on day 1 that was really good. Let me know if it works for you. Public: 1.159004623 Private: 1.268172086

import pandas as pd

train = pd.read_csv('/raid/ml/root/Train.csv')
test = pd.read_csv('/raid/ml/root/Test.csv')

# Predict each test plant's RootVolume as the mean RootVolume of its genotype in train
dg = train.groupby('Genotype')['RootVolume'].mean().reset_index()
test = test.merge(dg, on='Genotype', how='left')
test[['ID','RootVolume']].to_csv('Genotype_mean.csv', index=False)

Discussion 19 answers
marching_learning
Nostalgic Mathematics

Yes, when the data is small, just use the smallest model.

10 Mar 2025, 01:14
Upvotes 1

yep, O(1) parameter model. :)

marching_learning
Nostalgic Mathematics

Yes, I think so. It is a bit frustrating given the size of the competition: with so little data, the leaderboard rewards not the truly well-performing models but the most parsimonious ones. I also have several minimal models of fewer than 10 features that could have made top 10.

CodeJoe

Everybody does. It just happens.

Muhamed_Tuo
Inveniam

@CodeJoe @marching_learning We should make it a thing to share our best private LB scores, to feel a bit better about competitions like these haha

CodeJoe

True 🤣🤣

It seems a genotype-mean adjustment can give a huge boost on the private LB. Try your sub with this:

import pandas as pd

train = pd.read_csv('/raid/ml/root/Train.csv')
test = pd.read_csv('/raid/ml/root/Test.csv')
sub = pd.read_csv('../scripts/sub.csv')  # private lb 1.32

# Rescale each prediction so the per-genotype mean of the submission
# matches the per-genotype mean observed in train
test = test.merge(sub, on='ID', how='left')
test['test_gmean'] = test.groupby('Genotype')['RootVolume'].transform('mean')
dg = train.groupby('Genotype')['RootVolume'].mean().reset_index().rename(columns={'RootVolume': 'train_gmean'})
test = test.merge(dg, on='Genotype', how='left')
test['RootVolume'] = test['RootVolume'] / test['test_gmean'] * test['train_gmean']
test[['ID','RootVolume']].to_csv('sub_adjust.csv', index=False)  # private lb 1.18

10 Mar 2025, 01:25
Upvotes 5
CodeJoe

Oh, I see. Genuinely, this competition was not fair to those who put in the work. It just favoured those of us who did little feature engineering and used a small model. No k-folds or anything, and we could hit below 1.3. I guess there will really be more to it during the code review. With just these features:

PlantNumber, Side, Start, End, Genotype, Stage

you could get a score below 1.25. With your trick, I personally got a score below 1.22. I guess the public board didn't tell the truth either 😂. Images weren't even necessary in the first place, and I think only a rigorous code review could lead to disqualifications, which in a way would not be fair to the participants who didn't use them.

I guess luck won this time. And yes, one must trust the CV.

I got your point, but I think it is fair to everyone. It is a real-world lesson in statistics, especially about the minimum sufficient sample size needed to "learn" patterns.
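That sample-size lesson can be illustrated with a toy simulation (everything below is synthetic; a random forest stands in for a "big" model, and the feature count and noise level are made up): with only 20 training rows and a mostly-noise target, a flexible model tends to do no better than simply predicting the training mean.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

def experiment(n_train, n_test=200):
    """Compare a random forest against a constant-mean baseline on tiny data."""
    X = rng.normal(size=(n_train + n_test, 5))
    y = 0.3 * X[:, 0] + rng.normal(0, 1.0, n_train + n_test)  # weak signal, strong noise
    Xtr, ytr = X[:n_train], y[:n_train]
    Xte, yte = X[n_train:], y[n_train:]
    rf = RandomForestRegressor(random_state=0).fit(Xtr, ytr)
    rmse_rf = mean_squared_error(yte, rf.predict(Xte)) ** 0.5
    rmse_mean = mean_squared_error(yte, np.full(n_test, ytr.mean())) ** 0.5
    return rmse_rf, rmse_mean

rmse_rf, rmse_mean = experiment(20)
print(f"forest RMSE: {rmse_rf:.3f}  mean-baseline RMSE: {rmse_mean:.3f}")
```

With so few rows, the forest mostly fits noise, which is exactly why the genotype-mean baseline above was competitive.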

Muhamed_Tuo
Inveniam

@snow Yeah, true. It's hard to accept but I bet they still learned a lot from this.

I'm very curious to know what magic @Mohamed_abdelrazik was pulling off, with his impressive public score

marching_learning
Nostalgic Mathematics

Yes @Muhamed_Tuo, I'm curious to see what @Mohamed_abdelrazik did!!!

Mohamed_abdelrazik

A simple approach: split the train set into 3 groups.

# First DataFrame: RootVolume between 0.2 and 1.6
train_df_1 = train_df[(train_df['RootVolume'] > 0.2) & (train_df['RootVolume'] <= 1.6)].reset_index(drop=True)

# Second DataFrame: RootVolume between 1.6 and 2.6
train_df_2 = train_df[(train_df['RootVolume'] > 1.6) & (train_df['RootVolume'] <= 2.6)].reset_index(drop=True)

# Third DataFrame: RootVolume between 2.6 and 11
train_df_3 = train_df[(train_df['RootVolume'] > 2.6) & (train_df['RootVolume'] <= 11)].reset_index(drop=True)

Then have a separate model classify each test sample into group 0, 1, or 2, and feed its image to the model trained on the corresponding group.
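The two-stage idea described above might be sketched like this (purely illustrative: the toy features, toy target, and random-forest models are stand-ins, not Mohamed_abdelrazik's actual pipeline; only the band cut points come from the post):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in for the real training features
train_df = pd.DataFrame({"f1": rng.normal(size=300), "f2": rng.normal(size=300)})
train_df["RootVolume"] = 1.5 + train_df["f1"] + 0.5 * train_df["f2"] ** 2

# Band each row by its RootVolume, using the cut points from the post
train_df["band"] = pd.cut(train_df["RootVolume"],
                          bins=[-np.inf, 1.6, 2.6, np.inf], labels=[0, 1, 2])

features = ["f1", "f2"]
# Stage 1: classifier that predicts which band a sample belongs to
clf = RandomForestClassifier(random_state=0).fit(train_df[features], train_df["band"])
# Stage 2: one regressor per band, trained only on that band's rows
regs = {b: RandomForestRegressor(random_state=0).fit(g[features], g["RootVolume"])
        for b, g in train_df.groupby("band", observed=True)}

def predict(df):
    """Route each row to the regressor for its predicted band."""
    band = clf.predict(df[features])
    out = np.empty(len(df))
    for b in np.unique(band):
        mask = band == b
        out[mask] = regs[b].predict(df.loc[mask, features])
    return out

preds = predict(train_df)
```

Constraining each regressor to a narrow target range is what keeps the predictions from drifting toward the global mean.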

marching_learning
Nostalgic Mathematics

That is awesome!!! I had a similar idea but didn't experiment with it. Thanks for sharing.

Muhamed_Tuo
Inveniam

Very nice constraints to reduce the prediction noise 👏. Thanks for sharing.

MICADEE
LAHASCOM

@marching_learning @snow, Yes, I completely agree that the data is small. However, I believe the organizers still need to evaluate how participants arrived at their final private LB scores by reviewing the entire process (data analysis, preprocessing, feature engineering, modeling strategies, and more), not just a few lines of code. This is why they carefully review our individual code submissions before finalizing the winning solutions.

My candid take anyway !!!

Cheers !!!

10 Mar 2025, 01:50
Upvotes 4

Absolutely, my post is not meant to devalue anyone's work. I just wanted to share my findings.

CodeJoe

Seriously🤣🤣😭😭😭

10 Mar 2025, 02:16
Upvotes 0
3B

I gave up on this competition after testing and realizing that images contributed almost nothing to accuracy. My LGBM model, which only used metadata, achieved accuracy comparable to image-based models. It feels like the models are learning the mean volume value of the root rather than the volume value for each individual plant.

10 Mar 2025, 03:16
Upvotes 2

In my opinion, the biggest mistake here is how the data and images were split between Train and Test. It meant the metadata alone could perform better, or that images without the metadata could generalize better.

Also, any underfit model could do better: "shallow depth, shallow estimators" achieves 1.237x.

Providing the full train folders and full test folders would have been better.

10 Mar 2025, 09:59
Upvotes 0