
GeoAI Agricultural Plastic Cover Mapping with Satellite Imagery by ITU

Helping Italy
1 000 CHF
Challenge completed ~1 year ago
Prediction
365 joined
78 active
Start
Jun 12, 24
Close
Nov 11, 24
Reveal
Nov 11, 24
Data Leakage in Test set
Data · 5 Nov 2024, 06:01 · 14

@Zindi There appears to be a massive data leakage in the test set. The ID feature in the test set is sequential and not shuffled. As such, with the right partition for each location, you can estimate exactly which samples are class 1 or 2, hence my perfect score on the leaderboard :).
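For readers wondering what the ordering leak looks like in practice, here is a minimal sketch of how one would detect it. The `ID` and `location` column names and the toy values are hypothetical stand-ins; the real competition files may differ:

```python
import pandas as pd

# Toy stand-in for the test CSV; "ID" and "location" are hypothetical names.
test = pd.DataFrame({
    "ID": list(range(100, 110)),
    "location": ["Spain"] * 5 + ["Vietnam"] * 5,
})

def ids_sequential(df, id_col="ID", group_col="location"):
    """True per group if the IDs form an unbroken consecutive run --
    the tell-tale sign that rows were never shuffled."""
    return df.groupby(group_col)[id_col].apply(
        lambda s: s.sort_values().diff().dropna().eq(1).all()
    )

print(ids_sequential(test))  # True for every location => ordering leak
```

If every location comes back `True`, row order alone carries label information, which is what would allow a near-perfect leaderboard score without any modelling.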

I have found other issues in the dataset: e.g., there is no cloud mask on some locations in Vietnam, so some values in the train (and test) CSVs are contaminated by clouds; the entire dataset is from 2023 and not 2024 as described in the data section; and finally, there are several mislabeled samples.

At this point, I give up! :(

Discussion 14 answers
MICADEE
LAHASCOM

@da_ Really... a leak again? Anyway, I believe @Zindi will do justice to that, because it's too late now. The approach, the preprocessing method, and the data extraction method employed will definitely justify each participant's work here. My take...

5 Nov 2024, 07:33
Upvotes 0
Koleshjr
Multimedia university of kenya

[deleted]

MICADEE
LAHASCOM

@Koleshjr Smiles 😁... This is really serious, because if this is the case, then there was no need for the organisers to put this project forward as a problem to solve in the first place. But let's see how it goes anyway. Notwithstanding, they knew better.

Koleshjr
Multimedia university of kenya

Yeah, they won't accept the leaked solutions anyway, but the problem with these leaks is that the leaderboard is now corrupted. In these last crucial moments, we can't see a reliable score to beat. My suggestion is that they come up with a reliable fix that would restore some confidence in the leaderboard as we head toward the end, so that we know how much effort to put in. @Zindi

MICADEE
LAHASCOM

@Koleshjr Well, I agree with that, because it will be very tedious work for @Zindi if they don't. The majority of participants using this leaked-data method will constitute the larger part of the top 10 on the LB, and @Zindi will request only the top 10 participants' final notebooks in the end, because requesting the notebooks of the top 50 would be tiring.

It appears too late to implement any reasonable fix at this point.

My intuition is that a reliable score to beat will be circa 0.989, estimating about 15-20 labelling errors and getting 50% of the cloud-corrupted samples right. If you get all the cloud-corrupted samples right, you're looking at 0.992 and above. Anyway, these are the personal benchmarks I was working towards; I hope they help anyone looking to optimize for the LB.
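To make that kind of estimate concrete, here is a back-of-envelope sketch of an accuracy ceiling. All the numbers below (test-set size, error counts) are hypothetical placeholders, since the thread does not state them:

```python
def ceiling_accuracy(n_test, n_label_errors, n_cloud, cloud_hit_rate):
    """Best plausible accuracy given unavoidable wrong rows:
    mislabeled samples always score 'wrong' against the given labels,
    and only a fraction of cloud-corrupted samples are guessed right."""
    wrong = n_label_errors + n_cloud * (1 - cloud_hit_rate)
    return 1 - wrong / n_test

# Illustrative only: 2000 test rows, 18 label errors, 8 cloudy rows, 50% guessed.
print(round(ceiling_accuracy(2000, 18, 8, 0.5), 3))  # → 0.989
```

The point of the sketch is only the shape of the reasoning: labelling errors put a hard cap on the achievable score, and cloud-corrupted rows shave off a further fraction depending on how many you happen to guess right.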

Koleshjr
Multimedia university of kenya

Thanks

Amy_Bray
Zindi

Dear Zindians,

We apologise for what we hope is the last leak of the year. We have raised this with the host of GeoAI.

As mentioned in the description of this challenge:

By participating in this challenge, researchers and practitioners can contribute to the advancement of global cropland mapping, enabling a more precise and comprehensive understanding of agricultural landscapes worldwide.

The objective is to contribute to this field instead of exploiting the leaks.

Please keep to the spirit of Zindi in this regard.

Our apologies and thank you for your patience.

5 Nov 2024, 11:27
Upvotes 1
MICADEE
LAHASCOM

@Amy_Bray Great. I love this statement: "The objective is to contribute to this field instead of exploiting the leaks." Even though mistakes are bound to happen at times, personally I don't like hearing any news about leaked data, because it weighs some dedicated participants down and discourages them from continuing to research a project. However, all I did was extract the provided datasets' date range from GEE to ascertain and resolve the previous date-range issue, then implement my data extraction scripts, and lastly do the modelling.

Apologies accepted.

Cheers guys. We move !!!

Well said. I can relate to this. I feel completely demotivated at this point. While I've really enjoyed working on this project, the limited information provided and the non-response to questions and requests for clarification are not encouraging. This data leak was the straw that broke the camel's back.

Regarding the dataset time range, for example: I also downloaded the data from GEE and recreated the provided CSVs, without the VV and VH variables. It is hard to reconcile which year the labels are from, because whichever time range you go with, 2023 or 2024, you get labelling errors and cloud contamination. If you go with 2024, you get a lot of cloud cover in Spain. I realize I'm obsessing over the quality of the dataset at this point, but some clarity would have helped prevent going down this rabbit hole.

MICADEE
LAHASCOM

@da_ Yeah. Anyway, I automatically extracted the date range corresponding to each of these datasets from GEE, and they all fall within 2023. Also, I made use of the VH and VV variables in my own case. Can I ask why you didn't use the VV and VH variables?
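For anyone wanting to repeat that check: GEE exposes each image's acquisition timestamp as the `system:time_start` property, in Unix epoch milliseconds (retrievable e.g. via `ImageCollection.aggregate_array('system:time_start')`). Converting those values shows which year the imagery actually falls in. The timestamp below is illustrative, not taken from the actual competition assets:

```python
from datetime import datetime, timezone

def epoch_ms_to_date(ms):
    """Convert a GEE 'system:time_start' value (epoch milliseconds) to a UTC date."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()

# Illustrative timestamp only:
print(epoch_ms_to_date(1688169600000))  # → 2023-07-01
```

Applying this to every timestamp in a collection makes it easy to verify claims like "all the data fall within 2023" directly from the asset metadata.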

Yeah. I also have download scripts to extract the dataset, both as CSVs and as image tiles, for specified time ranges and at given scales.

I excluded VV and VH due to mental laziness :-). During my initial trials, these features, and those engineered from them, improved my model only slightly. So when I decided to download the data from source, I just didn't feel like adding those extra lines of code to get them. 🤦🏿‍♀️
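For context on the "engineered from them" remark: with Sentinel-1 backscatter expressed in dB, typical derived features are simple combinations of the two polarisations. A minimal sketch with hypothetical column names (`vv`, `vh`) and made-up values:

```python
import pandas as pd

# Toy frame; "vv"/"vh" names and values (dB backscatter) are hypothetical.
df = pd.DataFrame({"vv": [-10.0, -8.5], "vh": [-17.0, -15.0]})

# In dB space, the cross/co-pol "ratio" becomes a simple difference:
df["vh_minus_vv"] = df["vh"] - df["vv"]
# A combined-intensity style feature:
df["vv_vh_mean"] = (df["vv"] + df["vh"]) / 2

print(df)
```

These are the kind of one-line extras being referred to; whether they help depends on the model and the other features already present.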

Koleshjr
Multimedia university of kenya

Looks like some people still submitted solutions exploiting the data leakage.

12 Nov 2024, 03:40
Upvotes 0