Congrats to the winners! When I reviewed my subs, I found one submitted on day 1 that is really good. Let me know if it works for you. Public: 1.159004623 Private: 1.268172086
import pandas as pd
train = pd.read_csv('/raid/ml/root/Train.csv')
test = pd.read_csv('/raid/ml/root/Test.csv')
# Per-genotype mean of the train RootVolume, used directly as the prediction
dg = train.groupby('Genotype')['RootVolume'].mean().reset_index()
test = test.merge(dg, on='Genotype', how='left')
test[['ID','RootVolume']].to_csv('Genotype_mean.csv', index=False)
Yes, when the data is small, just use the smallest model.
yep, O(1) parameter model. :)
Yes, I think that too. It is a bit frustrating given the size of the competition: the small size of the data doesn't reward the "true well-performing models" but the more parsimonious ones. I also have many minimal models with fewer than 10 features that could have made the top 10.
Everybody does. It just happens.
@CodeJoe @marching_learning We should make it a thing to share our best private LB score, so we feel a bit better about competitions like these haha
True 🤣🤣
It seems a genotype-mean adjustment can give a huge boost on the private LB. Try your sub with this:
import pandas as pd
train = pd.read_csv('/raid/ml/root/Train.csv')
test = pd.read_csv('/raid/ml/root/Test.csv')
sub = pd.read_csv('../scripts/sub.csv')  # private lb 1.32
test = test.merge(sub, on='ID', how='left')
# Mean predicted volume per genotype in the submission
test['test_gmean'] = test.groupby('Genotype')['RootVolume'].transform('mean')
# Mean observed volume per genotype in the train set
dg = train.groupby('Genotype')['RootVolume'].mean().reset_index().rename(columns={'RootVolume':'train_gmean'})
test = test.merge(dg, on='Genotype', how='left')
# Rescale each prediction so the per-genotype means match the train genotype means
test['RootVolume'] = test['RootVolume']/test['test_gmean']*test['train_gmean']
test[['ID','RootVolume']].to_csv('sub_adjust.csv', index=False)  # private lb 1.18
Oh I see. Genuinely, this competition was not fair to those who put in the work. It just favoured those of us who did little feature engineering and used a small model. No k-folds or anything, and we could hit below 1.3. I guess there will really be more to it during the code review. With just these features:
PlantNumber, Side, Start, End, Genotype, Stage
you could get a score below 1.25. Personally, with your trick, I got a score below 1.22. I guess the public board didn't tell the truth either 😂. Images weren't even necessary in the first place, and I think only a rigorous code review could lead to disqualifications, which in a way would not be fair to the participants who didn't use them.
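For context, here is a minimal sketch of such a metadata-only model, assuming the Train/Test paths from the snippets above and that the listed columns exist under exactly those names; the LightGBM parameters are illustrative, not the poster's actual setup:

import pandas as pd
import lightgbm as lgb

train = pd.read_csv('/raid/ml/root/Train.csv')
test = pd.read_csv('/raid/ml/root/Test.csv')

features = ['PlantNumber', 'Side', 'Start', 'End', 'Genotype', 'Stage']
for col in features:
    # Encode string columns as pandas categories so LightGBM handles them natively
    if train[col].dtype == object:
        train[col] = train[col].astype('category')
        test[col] = test[col].astype('category').cat.set_categories(train[col].cat.categories)

model = lgb.LGBMRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(train[features], train['RootVolume'])

test['RootVolume'] = model.predict(test[features])
test[['ID', 'RootVolume']].to_csv('metadata_only.csv', index=False)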
I guess luck won this time. And yes, one must trust the CV.
I got your point, but I think it is fair to everyone. It is a real-world lesson in statistics, especially on the minimum sufficient sample size to "learn" patterns.
@snow Yeah, true. It's hard to accept but I bet they still learned a lot from this.
I'm very curious to know what magic @Mohamed_abdelrazik was pulling off, with his impressive public score
Yes @Muhamed_Tuo, I'm also curious to see what @Mohamed_abdelrazik did !!!
A simple approach: split the train set into 3 groups
# First DataFrame: RootVolume between 0.2 and 1.6
train_df_1 = train_df[(train_df['RootVolume'] > 0.2) & (train_df['RootVolume'] <= 1.6)].reset_index(drop=True)
# Second DataFrame: RootVolume between 1.6 and 2.6
train_df_2 = train_df[(train_df['RootVolume'] > 1.6) & (train_df['RootVolume'] <= 2.6)].reset_index(drop=True)
# Third DataFrame: RootVolume between 3 and 11
train_df_3 = train_df[(train_df['RootVolume'] > 3) & (train_df['RootVolume'] <= 11)].reset_index(drop=True)
Then have a separate model classify whether a test sample belongs to group 0, 1, or 2, and feed its image to the corresponding model trained on that group.
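Here is a rough sketch of that two-stage routing using the metadata columns (the original idea routes images to per-group image models, but the logic is the same). The group edges follow the thresholds above, extended so every row falls in a bin, and the file paths, column names and estimators are placeholders:

import numpy as np
import pandas as pd
import lightgbm as lgb

train_df = pd.read_csv('/raid/ml/root/Train.csv')
test_df = pd.read_csv('/raid/ml/root/Test.csv')

features = ['PlantNumber', 'Side', 'Start', 'End', 'Genotype', 'Stage']
for col in features:
    if train_df[col].dtype == object:
        train_df[col] = train_df[col].astype('category')
        test_df[col] = test_df[col].astype('category').cat.set_categories(train_df[col].cat.categories)

# Stage 1: label each train row with its volume group and learn to predict the group
train_df['group'] = pd.cut(train_df['RootVolume'], bins=[-np.inf, 1.6, 2.6, np.inf], labels=[0, 1, 2]).astype(int)
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(train_df[features], train_df['group'])
test_df['group'] = clf.predict(test_df[features])

# Stage 2: one regressor per group, applied only to the test rows routed to that group
test_df['RootVolume'] = np.nan
for g in [0, 1, 2]:
    mask = test_df['group'] == g
    if not mask.any():
        continue
    part = train_df[train_df['group'] == g]
    reg = lgb.LGBMRegressor(n_estimators=200, max_depth=3)
    reg.fit(part[features], part['RootVolume'])
    test_df.loc[mask, 'RootVolume'] = reg.predict(test_df.loc[mask, features])

test_df[['ID', 'RootVolume']].to_csv('grouped_models.csv', index=False)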
That is awesome !!! I had a similar idea but didn't try it out. Thanks for sharing
Very nice constraints to reduce prediction noise 👏. Thanks for sharing
@marching_learning @snow, Yes, I completely agree that the data is small. However, I believe the organizers still need to evaluate how participants arrive at their final private LB scores by reviewing the entire process—data analysis, preprocessing, feature engineering, modeling strategies, and more—not just a few lines of code. This is why they carefully review our individual code submissions before finalizing the winning solutions.
My candid take anyway !!!
Cheers !!!
Absolutely, my post is not meant to devalue anyone's work. I just want to share my findings.
Seriously🤣🤣😭😭😭
I gave up on this competition after testing and realizing that images contributed almost nothing to accuracy. My LGBM model, which only used metadata, achieved accuracy comparable to image-based models. It feels like the models are learning the mean root volume rather than the volume of each individual plant.
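One quick way to sanity-check that intuition is to compare the overall variance of RootVolume with the residual variance once group-level means (per genotype, as in the baselines above) are subtracted. A minimal sketch, assuming the same Train.csv path used earlier in this thread:

import pandas as pd

train = pd.read_csv('/raid/ml/root/Train.csv')

overall_var = train['RootVolume'].var()
# Residual variance after removing each genotype's mean volume
residual = train['RootVolume'] - train.groupby('Genotype')['RootVolume'].transform('mean')
within_var = residual.var()

print(f"Share of variance explained by genotype means alone: {1 - within_var / overall_var:.2%}")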
In my opinion, the biggest mistake here was how the data and images were split between Train and Test. It meant the metadata (the tabular data alone) could perform better than the images, or that images without the metadata could generalize better. Also, almost any underfit model did better: shallow depth and few estimators achieved 1.237x. Providing the full folders for Train and the full folders for Test would have been better.