🛡️ Join the Buzz: Label leakage

CGIAR Eyes on the Ground Challenge

Helping Africa

$10 000 USD

Completed (over 2 years ago)

Skills you will learn

Prediction

872 joined

137 active


Start

Jul 21, 23

Nov 03, 23

Reveal

Nov 03, 23

pmwaniki

Kemri wellcome trust research programme

Label leakage

Data · 25 Jul 2023, 13:23 · 18

Can the "damage" column in the test dataset be considered as label leakage? One can tell that the value of "extent" is 0 whenever "damage" is not equal to "DR".

Discussion 18 answers

alka

On the description page, it is stated that the competition hosts are interested in predicting the extent of drought-related damage (DR). I'm not even sure why they added data for other types of damage (which, as you said, are automatically set to 0); perhaps the extra data could be useful for self-supervised pretraining approaches ¯\_(ツ)_/¯

29 Jul 2023, 20:06

Upvotes 2

doItLikeThis99

I also do not understand why this column is included. It seems the public leaderboard should not be trusted for this competition

22 Aug 2023, 22:36

Upvotes 1

berber-kramer

As someone from the team that was involved in collecting these datasets, a bit more clarification here, I hope this is useful:

The "extent of drought damage" is 0 whenever "damage" is not equal to "DR", as noted here. Because the solutions will be used to determine insurance claims, it is important to not only predict the "extent of damage" conditional on seeing drought damage; but also to predict zero drought damage, since those cases should not get any insurance payout, even if other types of damage have occurred. That is why "extent" is set to 0 for those cases.

You could ignore the column "type of damage", unless you can improve your solution by controlling for whether a crop suffered other types of (non-drought) damage, to explain why a crop looks poor but still has zero extent of damage.

12 Sep 2023, 09:17

Upvotes 1

pmwaniki

Kemri wellcome trust research programme

Thank you for the response. The issue I have with the column is that the test dataset should not contain information that will be unavailable when the model is deployed. It's OK to have the damage column in the training data, but in the test data it lets someone improve their leaderboard position without fitting a better model, simply by setting all test predictions to zero whenever damage is not "DR".

replied to berber-kramer · 12 Sep 2023, 13:35

Upvotes 3
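
The shortcut pmwaniki describes can be illustrated with a small sketch. Everything here is made up for illustration: the toy rows, the predictions, and the use of RMSE as the metric are assumptions, not data or scoring rules taken from the competition.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical test rows: (damage type, true extent, image-model prediction).
rows = [
    ("DR", 40, 30.0),
    ("DR", 0, 10.0),
    ("WD", 0, 20.0),
    ("G", 0, 15.0),
    ("ND", 0, 5.0),
    ("DR", 80, 60.0),
]

y_true = [extent for _, extent, _ in rows]
model_pred = [pred for _, _, pred in rows]

# The shortcut: overwrite predictions with 0 whenever the leaked column
# says the damage is not drought. The model itself is unchanged.
leaky_pred = [pred if damage == "DR" else 0.0 for damage, _, pred in rows]

print(f"RMSE, model only:         {rmse(y_true, model_pred):.2f}")
print(f"RMSE, with leaked column: {rmse(y_true, leaky_pred):.2f}")
```

The second score is lower even though nothing about the model improved, which is why a leaderboard built on such a test file can be misleading.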

triducnguyentang

The concern is: why is the "type of damage" column available in the test data? Will it be available in the private test data too?

replied to berber-kramer · 13 Sep 2023, 03:53

Upvotes 0

Dahn14

I am new to Zindi but my understanding is that the private leaderboard is just a subset of the test data they've given us? So that would mean "type of damage" is available at least until they test the solutions themselves. However, I assume the organisers don't want that. It might be useful to know if they will now assess the submissions any differently to originally planned; it says they would assess the top 10, but the top 10 may be full of solutions with data leakage issues.

replied to triducnguyentang · 13 Sep 2023, 16:34

Upvotes 0

berber-kramer

Thanks for the comments here. We have asked the organizers to take the "type of damage" column out, since it is not supposed to be used in the solution. We had not realized that it would cause confusion; hopefully taking that column out will resolve the issue!

replied to Dahn14 · 14 Sep 2023, 12:05

Upvotes 0

triducnguyentang

Sorry, but the action is still unclear to me. Will we get a new test set (without the "type of damage" column), or will we continue with the current test set (which has already leaked)? Continuing to compete on the current test set seems like a bad idea, since many competitors can exploit the leak to get high scores that are useless in real-world applications.

replied to berber-kramer · 15 Sep 2023, 03:59

Upvotes 0

berber-kramer

Solutions that use the "type of damage" column will not be accepted. You can continue, but please don't use that column in your solution. Thanks!

replied to triducnguyentang · 15 Sep 2023, 07:54

Upvotes 0

masawdah

@berber-kramer Thank you for this clarification. However, it is still challenging for us to monitor our progress because we are unsure if the Leaderboard (LB) is reliable or not. This uncertainty might lead us to prematurely abandon our efforts, fearing that we are performing poorly on the Leaderboard, while the top-ranked participants might achieve their scores due to factors related to the 'damage' column.

If it's possible to replace the current datasets with versions that exclude the 'damage' column and to assign new IDs to the images, it would help establish a more trustworthy Leaderboard. This way, we can better assess the effectiveness of our approaches in comparison to others' scores.

replied to berber-kramer · 17 Sep 2023, 10:59

Upvotes 0

triducnguyentang

"Solutions that use the "type of damage" column will not be accepted". But we can use this column when training the model, right?

replied to berber-kramer · 18 Sep 2023, 02:23

Upvotes 1

berber-kramer

This is something for the organizers to comment on; I'll forward your question, but please also reach out to them directly.

replied to masawdah · 18 Sep 2023, 04:20

Upvotes 0

berber-kramer

Yes, just don't use these data from the test dataset :)

replied to triducnguyentang · 18 Sep 2023, 04:21

Upvotes 1
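
One way to follow this rule in practice is to strip the column from the test frame before any inference code sees it. The sketch below is only an illustration: the helper `inference_features`, the `LEAKY_COLUMNS` list, and the toy columns `ID` and `filename` are assumptions, and the real Test.csv may be laid out differently.

```python
import pandas as pd

# Assumed column name, as used in this thread.
LEAKY_COLUMNS = ["damage"]


def inference_features(test_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the test frame without label-derived columns."""
    # errors="ignore" keeps this safe if the column has already been removed.
    return test_df.drop(columns=LEAKY_COLUMNS, errors="ignore")


# Hypothetical stand-in for the test file.
test_df = pd.DataFrame(
    {
        "ID": ["id_1", "id_2"],
        "filename": ["a.jpg", "b.jpg"],
        "damage": ["DR", "WD"],
    }
)

features = inference_features(test_df)
print(list(features.columns))
```

The training frame is left untouched, so the column stays available there, for example as an auxiliary target.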

Bartek

Why do you say that `extent` for non-DR is set to zero?

Below I displayed a histogram for WD (weed), and it clearly shows that there are non-zero values, so I'm confused.

1. Does it mean that a "WD" example with extent > 0 should really be "DR"?

2. Or should I change the values for `WD` to zero?

But as I can't use the `damage` column, the second option is not permitted.

I just want to understand whether we should predict `extent` for non-DR damage as well (so that DR would have values > 0 and non-DR would always be 0).

replied to berber-kramer · 10 Oct 2023, 21:12

Upvotes 0

pmwaniki

Kemri wellcome trust research programme

Running your command creates a histogram where all values are zero. You might be missing something, e.g. `plt.show()`.

replied to Bartek · 11 Oct 2023, 07:54

Upvotes 1

Bartek

Thanks for checking this out. It looks like I checked the file that was downloaded together with the images from the S3 bucket. In that file, the values for "WD" are not zeroed out.

replied to pmwaniki · 11 Oct 2023, 08:31

Upvotes 0
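
If someone wants to train from a raw label file like the one Bartek describes, the extents could be brought in line with the rule stated earlier in the thread (extent is 0 whenever damage is not "DR"). This is a minimal sketch on a made-up frame; the helper name `zero_non_drought` is hypothetical, and it is an assumption that Train.csv was derived from the raw file this way.

```python
import pandas as pd


def zero_non_drought(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with extent set to 0 wherever damage is not 'DR'."""
    out = df.copy()
    out.loc[out["damage"] != "DR", "extent"] = 0
    return out


# Hypothetical raw labels where a weed (WD) row still carries an extent.
raw = pd.DataFrame(
    {
        "damage": ["DR", "WD", "G", "DR"],
        "extent": [40, 30, 0, 0],
    }
)

aligned = zero_non_drought(raw)
print(aligned)
```

This only touches training labels, which the hosts said may use the column; the test file should not be processed this way.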

lyumax

Just to make sure:

I ran this on Train.csv:

import pandas as pd

# Assign the result; the loop below reads from df
df = pd.read_csv("csv_files/Train.csv")

for damage_type in list(df["damage"].unique()):

    extent_per_cat = df[df["damage"] == damage_type]["extent"].value_counts().index

    print(f"Unique extents for {damage_type} damage type: {extent_per_cat}")

and get:

Unique extents for WD damage type: Index([0], dtype='int64', name='extent')

Unique extents for G damage type: Index([0], dtype='int64', name='extent')

Unique extents for DR damage type: Index([10, 30, 40, 20, 50, 60, 90, 80, 70, 100, 0], dtype='int64', name='extent')

Unique extents for ND damage type: Index([0], dtype='int64', name='extent')

Unique extents for DS damage type: Index([0], dtype='int64', name='extent')

Unique extents for PS damage type: Index([0], dtype='int64', name='extent')

Unique extents for WN damage type: Index([0], dtype='int64', name='extent')

Unique extents for FD damage type: Index([0], dtype='int64', name='extent')

So a non-zero extent is set only for the DR damage type.

replied to Bartek · 25 Oct 2023, 12:24

Upvotes 0
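
The same check can be written more compactly with a group-by, plus a helper that returns any offending rows so that two label files can be compared. The frame below is a toy stand-in, not the competition data, and `non_drought_violations` is a hypothetical helper name.

```python
import pandas as pd


def non_drought_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Rows where extent is non-zero although damage is not 'DR'."""
    return df[(df["damage"] != "DR") & (df["extent"] != 0)]


# Toy stand-in with the same two columns as the file discussed above.
train = pd.DataFrame(
    {
        "damage": ["DR", "DR", "DR", "WD", "G", "ND"],
        "extent": [0, 30, 100, 0, 0, 0],
    }
)

# One row per damage type: smallest extent, largest extent, row count.
summary = train.groupby("damage")["extent"].agg(["min", "max", "count"])
print(summary)

violations = non_drought_violations(train)
print(f"{len(violations)} non-DR rows with a non-zero extent")
```

An empty result means the file follows the rule; a non-empty one points at the rows that differ, as in the file Bartek downloaded.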

Reacher

In my understanding, it is not label leakage. The objective of this competition is to predict the effect of drought on crops, even though the crops can also suffer other types of damage, such as disease or flood. This is why the Train.csv provided by Zindi has an extent of 0 for non-drought (non-DR) damage types. Therefore, it would make more sense if we were provided with only DR images in both the train and test datasets.

26 Sep 2023, 07:02

Upvotes 0
