
GeoAI Agricultural Plastic Cover Mapping with Satellite Imagery by ITU

Helping Italy
1 000 CHF
Challenge completed ~1 year ago
Prediction
365 joined
78 active
Start
Jun 12, 24
Close
Nov 11, 24
Reveal
Nov 11, 24
Data Leakage in Test set
Data · 5 Nov 2024, 06:01 · 14

@Zindi There appears to be a massive data leakage in the test set. The ID feature in the test set is sequential and not shuffled. As such, with the right partition for each location, you can estimate exactly which samples are class 1 or 2, hence my perfect score on the leaderboard :).
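For readers wondering what the ordering leak looks like in practice, here is a minimal sketch of how one would detect it. The `ID` and `location` column names and the toy values are hypothetical stand-ins; the real competition files may differ:

```python
import pandas as pd

# Toy stand-in for the test CSV; "ID" and "location" are hypothetical names.
test = pd.DataFrame({
    "ID": list(range(100, 110)),
    "location": ["Spain"] * 5 + ["Vietnam"] * 5,
})

def ids_sequential(df, id_col="ID", group_col="location"):
    """True per group if the IDs form an unbroken consecutive run --
    the tell-tale sign that rows were never shuffled."""
    return df.groupby(group_col)[id_col].apply(
        lambda s: s.sort_values().diff().dropna().eq(1).all()
    )

print(ids_sequential(test))  # True for every location => ordering leak
```

If every location comes back `True`, row order alone carries label information, which is what would allow a near-perfect leaderboard score without any modelling.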

I have found other issues in the dataset: e.g., there is no cloud mask on some locations in Vietnam, so some values in the train (and test) CSVs are contaminated by clouds; the entire dataset is from 2023 and not 2024 as described in the data section; and finally, there are several mislabeled samples.

At this point, I give up! :(

Discussion 14 answers
MICADEE
LAHASCOM

@da_ Really... a leak again? Anyway, I believe @Zindi will do justice to that, because it's too late now. The approach, the preprocessing method, and the data extraction method employed will definitely justify each participant's work here. My take...

5 Nov 2024, 07:33
Upvotes 0
Koleshjr
Multimedia university of kenya

[deleted]

MICADEE
LAHASCOM

@Koleshjr Smiles 😁... This is really serious, because if this is the case, then there was no need for the organisers to put this project forward as a problem to solve in the first place. But let's see how it goes anyway. Notwithstanding, they knew better.

Koleshjr
Multimedia university of kenya

Yeah, they won't accept the leaked solutions anyway, but the problem with these leaks is that the leaderboard is now corrupted. In these last crucial moments, we can't see a reliable score to beat. My suggestion is that they come up with a reliable fix that would restore some confidence in the leaderboard as we head toward the end, so that we know how much effort to put in. @Zindi

MICADEE
LAHASCOM

@Koleshjr Well, I agree with that, because it will be very tedious work for @Zindi if they don't. The majority of participants using this leaked-data method will constitute the larger part of the top 10 on the LB, and @Zindi will request only the top 10 participants' final notebooks in the end, because requesting the notebooks of the top 50 would be tiring.

It appears too late to implement any reasonable fix at this point.

My intuition is that a reliable score to beat will be circa 0.989, estimating about 15-20 labelling errors and getting 50% of the cloud-corrupted samples right. If you get all the cloud-corrupted samples right, you're looking at 0.992 and above. Anyway, these are the personal benchmarks I was working towards; I hope they help anyone looking to optimize for the LB.
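To make that kind of estimate concrete, here is a back-of-envelope sketch of an accuracy ceiling. All the numbers below (test-set size, error counts) are hypothetical placeholders, since the thread does not state them:

```python
def ceiling_accuracy(n_test, n_label_errors, n_cloud, cloud_hit_rate):
    """Best plausible accuracy given unavoidable wrong rows:
    mislabeled samples always score 'wrong' against the given labels,
    and only a fraction of cloud-corrupted samples are guessed right."""
    wrong = n_label_errors + n_cloud * (1 - cloud_hit_rate)
    return 1 - wrong / n_test

# Illustrative only: 2000 test rows, 18 label errors, 8 cloudy rows, 50% guessed.
print(round(ceiling_accuracy(2000, 18, 8, 0.5), 3))  # → 0.989
```

The point of the sketch is only the shape of the reasoning: labelling errors put a hard cap on the achievable score, and cloud-corrupted rows shave off a further fraction depending on how many you happen to guess right.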

Koleshjr
Multimedia university of kenya

Thanks

Amy_Bray
Zindi

Dear Zindians,

We apologise for what we hope is the last leak of the year. We have raised this with the host of GeoAI.

As mentioned in the description of this challenge:

By participating in this challenge, researchers and practitioners can contribute to the advancement of global cropland mapping, enabling a more precise and comprehensive understanding of agricultural landscapes worldwide.

The objective is to contribute to this field instead of exploiting the leaks.

Please keep to the spirit of Zindi in this regard.

Our apologies and thank you for your patience.

5 Nov 2024, 11:27
Upvotes 1
MICADEE
LAHASCOM

@Amy_Bray Great. I love this statement: "The objective is to contribute to this field instead of exploiting the leaks." Even though mistakes are bound to happen at times, personally I don't like hearing any news about leaked data, because it weighs some dedicated participants down and discourages them from continuing to research a project. However, all I did was extract the provided datasets' date range from GEE to ascertain and resolve the previous date-range issue, then implement my data extraction scripts, and lastly do the modelling.

Apologies accepted.

Cheers guys. We move !!!

Well said. I can relate to this. I feel completely demotivated at this point. While I've really enjoyed working on this project, the limited information provided and the non-response to questions and requests for clarification are not encouraging. This data leak was the straw that broke the camel's back.

Regarding the dataset time range, for example: I also downloaded the data from GEE and recreated the provided CSVs, without the VV and VH variables. It is hard to reconcile which year the labels are from, because whichever time range you go with, 2023 or 2024, you get labelling errors and cloud contamination. If you go with 2024, you get a lot of cloud cover in Spain. I realize I'm obsessing over the quality of the dataset at this point, but some clarity would have helped prevent going down this rabbit hole.

MICADEE
LAHASCOM

@da_ Yeah. Anyway, I automatically extracted the date range corresponding to each of these datasets from GEE, and they all fall within 2023. Also, I made use of the VH and VV variables in my own case. Can I ask why you didn't use the VV and VH variables?
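For anyone wanting to repeat that check: GEE exposes each image's acquisition timestamp as the `system:time_start` property, in Unix epoch milliseconds (retrievable e.g. via `ImageCollection.aggregate_array('system:time_start')`). Converting those values shows which year the imagery actually falls in. The timestamp below is illustrative, not taken from the actual competition assets:

```python
from datetime import datetime, timezone

def epoch_ms_to_date(ms):
    """Convert a GEE 'system:time_start' value (epoch milliseconds) to a UTC date."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).date()

# Illustrative timestamp only:
print(epoch_ms_to_date(1688169600000))  # → 2023-07-01
```

Applying this to every timestamp in a collection makes it easy to verify claims like "all the data fall within 2023" directly from the asset metadata.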

Yeah. I also have download scripts to extract the dataset, both as CSVs and as image tiles, for specified time ranges and at given scales.

I excluded VV and VH due to mental laziness :-). During my initial trials, these features, and those engineered from them, improved my model only slightly. So when I decided to download the data from source, I just didn't feel like adding those extra lines of code to get them. 🤦🏿‍♀️
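For context on the "engineered from them" remark: with Sentinel-1 backscatter expressed in dB, typical derived features are simple combinations of the two polarisations. A minimal sketch with hypothetical column names (`vv`, `vh`) and made-up values:

```python
import pandas as pd

# Toy frame; "vv"/"vh" names and values (dB backscatter) are hypothetical.
df = pd.DataFrame({"vv": [-10.0, -8.5], "vh": [-17.0, -15.0]})

# In dB space, the cross/co-pol "ratio" becomes a simple difference:
df["vh_minus_vv"] = df["vh"] - df["vv"]
# A combined-intensity style feature:
df["vv_vh_mean"] = (df["vv"] + df["vh"]) / 2

print(df)
```

These are the kind of one-line extras being referred to; whether they help depends on the model and the other features already present.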

Koleshjr
Multimedia university of kenya

Looks like some people still submitted solutions exploiting the data leakage.

12 Nov 2024, 03:40
Upvotes 0