
CGIAR Eyes on the Ground Challenge

Helping Africa
$10 000 USD
Completed (over 2 years ago)
Prediction
869 joined
137 active
Start
Jul 21, 23
Close
Nov 03, 23
Reveal
Nov 03, 23
pmwaniki
Kemri wellcome trust research programme
Label leakage
Data · 25 Jul 2023, 13:23 · 18

Can the "damage" column in the test dataset be considered as label leakage? One can tell that the value of "extent" is 0 whenever "damage" is not equal to "DR".

Discussion 18 answers
alka

On the description page, it is stated that the competition hosts are interested in predicting the extent of drought-related damage (DR). I'm not even sure why they added data related to other types of damage (which, as you said, are automatically set to 0); perhaps that extra data can be of benefit in self-supervised pretraining approaches ¯\_(ツ)_/¯

29 Jul 2023, 20:06
Upvotes 2

I also do not understand why this column is included. It seems the public leaderboard should not be trusted for this competition.

22 Aug 2023, 22:36
Upvotes 1

As someone from the team that was involved in collecting these datasets, a bit more clarification here, I hope this is useful:

The "extent of drought damage" is 0 whenever "damage" is not equal to "DR", as noted here. Because the solutions will be used to determine insurance claims, it is important to not only predict the "extent of damage" conditional on seeing drought damage; but also to predict zero drought damage, since those cases should not get any insurance payout, even if other types of damage have occurred. That is why "extent" is set to 0 for those cases.

You could ignore the column "type of damage", unless you can improve your solution by controlling for whether a crop suffered other types of (non-drought) damage, to explain why a crop looks poor but still has zero extent of damage.
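One way to picture the "predict zero when there is no drought damage" requirement is a two-stage (hurdle-style) rule. This is only a sketch under assumptions: `p_drought` and `extent_if_drought` are hypothetical outputs of two image models that are not part of the competition materials, and the 0.5 threshold is arbitrary.

```python
def combine_predictions(p_drought, extent_if_drought, threshold=0.5):
    """Hurdle-style rule: report zero unless the image looks drought-damaged.

    p_drought          -- hypothetical classifier probability of drought damage
    extent_if_drought  -- hypothetical regressor output, assuming drought is present
    threshold          -- arbitrary cut-off chosen for illustration
    """
    if p_drought < threshold:
        return 0.0
    # Clamp to the 0-100 range that the extent labels take in Train.csv.
    return min(100.0, max(0.0, extent_if_drought))


print(combine_predictions(0.1, 35.0))   # crop looks poor, but not from drought
print(combine_predictions(0.9, 35.0))   # drought damage detected
print(combine_predictions(0.9, 120.0))  # out-of-range estimate is clamped
```

Both inputs come from the image alone, so a rule like this does not depend on the "type of damage" column at prediction time.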

12 Sep 2023, 09:17
Upvotes 1
pmwaniki
Kemri wellcome trust research programme

Thank you for the response. The issue I have with the column is that the test dataset should not contain information that will be unavailable when the model is deployed. It's OK to have the damage column in the training data, but in the test data anyone can use it to improve their leaderboard position without fitting a better model, simply by setting all test predictions to zero whenever damage is not "DR".
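A toy calculation shows the size of the effect. This is a minimal sketch with made-up rows, and it assumes the leaderboard metric is RMSE, which this thread does not state.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error over two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy test rows: (damage, true_extent, model_prediction). All values are invented.
rows = [
    ("DR", 40, 35.0),
    ("DR", 0, 5.0),
    ("WD", 0, 20.0),
    ("G", 0, 10.0),
    ("ND", 0, 15.0),
]

y_true = [true_extent for _, true_extent, _ in rows]
model_pred = [pred for _, _, pred in rows]
# Same model, but predictions are overwritten with 0 wherever damage is not "DR".
leaky_pred = [pred if damage == "DR" else 0.0 for damage, _, pred in rows]

baseline = rmse(y_true, model_pred)
leaky = rmse(y_true, leaky_pred)
print(f"RMSE using the image model only: {baseline:.2f}")
print(f"RMSE after zeroing non-DR rows:  {leaky:.2f}")
```

Because the true extent is 0 on every non-DR row, the overwrite can only remove error, so the score improves even though the model itself is unchanged.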

The concern is: why is the column "type of damage" available in the test data? Will it be available in the private test data too?

I am new to Zindi, but my understanding is that the private leaderboard is just a subset of the test data they've given us. That would mean "type of damage" is available at least until they test the solutions themselves. However, I assume the organisers don't want that. It would be useful to know whether they will now assess the submissions any differently from what was originally planned; it says they would assess the top 10, but the top 10 may be full of solutions with data leakage issues.

Thanks for the comments here. We have asked the organizers to take the "type of damage" column out, since it is not supposed to be used in the solution. We had not realized that it would be a cause for confusion; hopefully taking that column out will resolve the issue!

Sorry, but the action is still unclear to me. Will we get a new test set (without the "type of damage" column), or will we continue with the current test set (which has already leaked)? Continuing to compete on the current test set seems like a bad idea, since many competitors can exploit the leak to get high scores that are useless in a real-world application.

Solutions that use the "type of damage" column will not be accepted. You can continue, but please don't use that column in your solution. Thanks!

@berber-kramer Thank you for this clarification. However, it is still challenging for us to monitor our progress because we are unsure if the Leaderboard (LB) is reliable or not. This uncertainty might lead us to prematurely abandon our efforts, fearing that we are performing poorly on the Leaderboard, while the top-ranked participants might achieve their scores due to factors related to the 'damage' column.

If it's possible to replace the current datasets with versions that exclude the 'damage' column and assign new IDs to the images, it would help establish a more trustworthy Leaderboard. That way, we could better assess the effectiveness of our approaches against others' scores.

"Solutions that use the "type of damage" column will not be accepted". But we can use this column when training the model, right?

This is something for the organizers to comment on; I'll forward your question, but please also reach out to them directly.

Yes, just don't use these data from the test dataset :)
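In practice, this can be enforced at load time: keep `damage` for training rows and strip it from test rows before anything reaches the model. The sketch below is only illustrative; the `load_rows` helper, the in-memory CSV contents, and the `ID` and `filename` column names are assumptions, and only `damage` and `extent` are taken from this thread.

```python
import csv
import io

# Columns that must never reach the model at prediction time.
LEAKY_COLUMNS = {"damage"}


def load_rows(handle, is_test):
    """Read CSV rows as dicts; for test data, drop the leaky columns."""
    rows = []
    for row in csv.DictReader(handle):
        if is_test:
            row = {key: value for key, value in row.items() if key not in LEAKY_COLUMNS}
        rows.append(row)
    return rows


# Tiny in-memory stand-ins for Train.csv and Test.csv (contents are invented).
train_csv = io.StringIO("ID,filename,damage,extent\na1,a1.jpg,DR,40\na2,a2.jpg,WD,0\n")
test_csv = io.StringIO("ID,filename,damage\nb1,b1.jpg,G\n")

train_rows = load_rows(train_csv, is_test=False)
test_rows = load_rows(test_csv, is_test=True)

print(train_rows[0])  # still contains "damage" for use during training
print(test_rows[0])   # "damage" has been removed
```

With real files, the same helper could be given handles from `open(...)` instead of `io.StringIO`.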

Why do you say that `extent` for non-DR is set to zero?

Below I plotted a histogram for WD (weed), and it clearly shows non-zero values, which confuses me.

1. Does it mean that a "WD" example with an extent > 0 should be "DR"?

2. Or should I change the values for `WD` to zero?

But as I can't use the `damage` column, the second option is not permitted.

I just want to understand whether we should predict `extent` for `non-DR` damage (so that DR would have values > 0 and non-DR would always be 0).

pmwaniki
Kemri wellcome trust research programme

Running your command creates a histogram where all values are zero. You might be missing something, e.g. plt.show().

Thanks for checking this out. It looks like I had checked the file that was downloaded together with the images from the S3 bucket. In that file, the values for "WD" are not zeroed out.

Just to make sure, I ran this on Train.csv:

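import pandas as pd

# Load the Zindi-provided training labels (the original snippet never assigned df).
df = pd.read_csv("csv_files/Train.csv")

# For each damage type, list the distinct extent values that occur.
for damage_type in list(df["damage"].unique()):
    extent_per_cat = df[df["damage"] == damage_type]["extent"].value_counts().index
    print(f"Unique extents for {damage_type} damage type: {extent_per_cat}")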

and get:

Unique extents for WD damage type: Index([0], dtype='int64', name='extent')
Unique extents for G damage type: Index([0], dtype='int64', name='extent')
Unique extents for DR damage type: Index([10, 30, 40, 20, 50, 60, 90, 80, 70, 100, 0], dtype='int64', name='extent')
Unique extents for ND damage type: Index([0], dtype='int64', name='extent')
Unique extents for DS damage type: Index([0], dtype='int64', name='extent')
Unique extents for PS damage type: Index([0], dtype='int64', name='extent')
Unique extents for WN damage type: Index([0], dtype='int64', name='extent')
Unique extents for FD damage type: Index([0], dtype='int64', name='extent')

So a non-zero extent is set only for the DR damage type.

In my understanding, it is not label leakage. The objective of this competition is to predict the effect of drought on crops, even though the crops can also suffer other types of damage, such as disease or flood. This is why the Train.csv provided by Zindi has an extent of 0 for all non-drought (non-DR) damage types. That said, it would make more sense if we were provided with only DR images in both the train and test datasets.

26 Sep 2023, 07:02
Upvotes 0