You can try creating a new column (pseudo-location) with the first N characters of the filename, then splitting the training and validation sets so that items from the same pseudo-location never appear in both.
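A minimal sketch of that idea, assuming a pandas DataFrame with a `filename` column (the filenames, the column names, and the prefix length N here are all illustrative, not from the actual competition data):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame standing in for the real train csv (filenames are made up).
df = pd.DataFrame({
    "filename": [
        "field_A_2019_img_001.jpg",
        "field_A_2019_img_002.jpg",
        "field_B_2020_img_001.jpg",
        "field_B_2020_img_002.jpg",
        "field_C_2021_img_001.jpg",
        "field_C_2021_img_002.jpg",
    ]
})

N = 12  # prefix length; tune so near-duplicate shots share a prefix
df["pseudo_location"] = df["filename"].str[:N]

# Group-aware split: a pseudo_location lands in train OR valid, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, valid_idx = next(splitter.split(df, groups=df["pseudo_location"]))

train_groups = set(df.loc[train_idx, "pseudo_location"])
valid_groups = set(df.loc[valid_idx, "pseudo_location"])
assert train_groups.isdisjoint(valid_groups)
```

For k-fold CV instead of a single split, `GroupKFold` with the same `groups=` argument does the same thing fold by fold.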
Thanks. I tried your method with N=25 and found that a lot of files share the same filename up to N characters but are not actually the same file. Is there a correlation between these images if they have similar names? @Lemin
I'm not seeing quite as large a gap, but there's still a gap:
CV: 13.3 (quite high :c )
LB: 15.1
What is your CV method? Splitting randomly probably isn't the way to go - note that the competition mentions this:
"The hold-out is grouped by season and by growth_stage, damage type and extent."
A potential issue here is dataset drift / correlated images, i.e. images from the same season / damage type will have similar-looking damage.
I started with stratification using extent as the key. The dataset has a lot of similar-looking images, and my CV score is low because those similar images end up spread across the train and validation sets. I'm not sure how to prevent this. As @Lemin mentioned above, grouping images with similar names before splitting might be one strategy.
You could try using stratified CV splits based on season, growth_stage, etc.
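One way to stratify on several columns at once is to concatenate them into a single key. A rough sketch, assuming the train csv has columns named `season`, `growth_stage` and `damage` (the column names and values below are illustrative placeholders):

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy frame; the real csv would have one row per image.
df = pd.DataFrame({
    "season": ["SR2020", "SR2020", "LR2021", "LR2021"] * 3,
    "growth_stage": ["S", "V", "S", "V"] * 3,
    "damage": ["DR", "G", "DR", "G"] * 3,
})

# Combine the grouping columns into one stratification key.
df["strat_key"] = df["season"] + "_" + df["growth_stage"] + "_" + df["damage"]

# Each fold's validation set then mirrors the overall key distribution.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(skf.split(df, df["strat_key"])):
    print(fold, df.loc[va, "strat_key"].value_counts().to_dict())
```

Note that `StratifiedKFold` needs every key to occur at least `n_splits` times, so very rare combinations may need to be merged into a catch-all bucket first.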
You may be making the same mistake as me: MAKE SURE you're using the train csv downloaded manually from this competition page, NOT the other train data you get when you download all the images.
An easy check is to run train_df.groupby("damage_type").extent.describe(). Every damage type other than DR should have extent 0.
If you're using the wrong data, you don't need to reload the correct file: just set the non-DR extent values in your current dataframe to 0 manually and rerun, and your LB score will drop by a very satisfying margin.
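The check and the in-place fix might look like this (the non-DR damage labels in the toy frame are made up for illustration; only `damage_type`, `extent`, and the `DR` label come from the thread):

```python
import pandas as pd

# Toy frame mimicking the wrong csv: non-DR rows carrying non-zero extent.
train_df = pd.DataFrame({
    "damage_type": ["DR", "DR", "G", "WD"],  # "G"/"WD" are placeholder labels
    "extent": [30, 70, 10, 5],
})

# The sanity check: only DR should show non-zero extent stats here.
print(train_df.groupby("damage_type").extent.describe())

# Manual fix: zero out extent for every non-DR row.
train_df.loc[train_df["damage_type"] != "DR", "extent"] = 0
```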