[Edit]: I made a video to explain it: https://youtu.be/GDZ8-ta-P_A and yeah I used an AI voice LOL
[Edit 2]: I want to apologize for my previous baseless claim that the top 10 teams are knowingly using the leak. Your modeling skills are amazing.
[Edit 3]: My thoughts on the leak:
1. If you apply the leak and the score improves by about 0.0003, congrats! Your model is confirmed to be leak-free!
2. If you apply the leak and the score doesn't change, congrats! Your model has already learned to perfectly isolate flood locations vs non-flood locations, which is amazing! All you need to do is justify that the model learns this without the leak.
3. If you apply the leak and the score improves by about 0.0001~0.0002, congrats! Your model does a fairly good job of isolating flood locations vs non-flood locations. This is what is supposed to happen in a leak-free world, and what the images are intended for.
[Edit 4]: Upon further analysis, most of the gain comes from normalizing probabilities for flood locations rather than setting the probabilities to 0 for non-flood locations. Normalizing flood location probabilities is a legitimate step and it should be allowed. What the admins should look at is how the flood locations are found.
[Original Post is below]
It is really unfortunate that I joined late and only found the leak just now. It hasn't been discussed yet, has it? It is too late to reboot the competition.
My definition of leak: anything in the dataset that is not supposed to have predictive power.
Pardon my rudeness, but I am confident that everyone below 0.23 is knowingly exploiting the leak. I know leaks are inevitable. On Kaggle people generally accept them.
I am also aware that top 10 code will be reviewed, so could using a leak lead to disqualification?
In my opinion, the least damaging action we can take now is to acknowledge the leak and permit participants to use it. Top 10 teams please be honest in your solution about the leak and don’t try to hide it. We will still have a useful model to some degree in the end. This is not a complete disaster.
Since you're under 0.23 and in the top 10, I can conclude that you are using the leak. We are not using any leak ourselves. So I would be happy for you to share.
yeah, of course, otherwise I would not know.
I am happy to share too. What's the vibe on Zindi if I do a super-late sharing of a leak? On Kaggle I would be downvoted to hell.
This is an example of a recent data leak discovered a few days before competition closed. The sentiments of participants are in the comments.
There are no downvotes on Zindi. It's good to share. The Zindi team is also very fair during evaluation of the top solutions.
https://zindi.africa/competitions/microsoft-learn-location-mention-recognition-challenge/discussions/22497
Thank you so much!
No, you're not going to be downvoted on Kaggle. I think on Kaggle it would be appreciated, and they are quick to take action when a leak is confirmed. Generally, they extend the competition deadline. I've seen it in more than 3 challenges.
Sorry to disappoint you @snow, but we are not even aware of any leak in the dataset. I'm speaking for my team.
Yeah, it is possible. I realize I might need to eat my words about "knowingly exploit".
I'm not aware of any leak (I haven't even made a submission, probably won't have time to). I'm assuming it involves using the satellite images to find the exact locations and then introducing additional features based on that, which completely negates the anonymization of the event_id strings.
no it is not in the image. Ok I'll share it later today.
Okay...
Not sure I want to pardon your rudeness.
You claimed this with no proof and no way of knowing. I really wish I had this kind of confidence.
Just like the others, we're not aware nor using any leaks
I apologize for what I said. But there is a way of knowing. I'll publish a function that boosts any submission. If a submission is not boosted, the leak is already in.
Hi @snow, I have watched the video and tried to implement the function, but I keep getting errors. Could you please share the link to the Colab notebook used in the YouTube video?
Hi all, I published the leak. Please find it in the edited main message.
I tested it, and it works, and it's an easy way to score! But I think it must not be allowed.
I think what should not be allowed is the post-processing step that results from the leak, no?
Btw what was your score before the leak? I have not tested mine yet as we are out of subs for the day
My bad
I've seen the video, I didn't even realize there was a pattern. Thank you, I think what you shared is very valuable as input for the competition organizers.
No worries bro :)
The score before the leak was 0.0024 and the score after was 0.0020. But it was not the best score that was tested. And I understand that the goal of this competition must be to give the raw probability of the model without any post-processing.
Thank you for sharing this information.
Most of the top big wigs, I am sure, did not intentionally use the leak and are surprised their models could pick that pattern up... Now that you have brought up this discussion, I am not sure how it can be handled (whether to close one eye and permit it or not), but the wisdom of @Amy_Bray and the whole @Zindi team will come in handy.
For now we could be guided by the competition intention in the data section:
"If there is a flood for that event it can happen on any ONE of those 730 days. We have done this to ensure that you do not always select the middle day as the flood day."
Once again, maybe the data provided was not prepared well enough to meet the competition's intention perfectly.
I have a different opinion. I don't agree that the model picked up the pattern because there is no explicit encoding of location order in any column. If the model were truly capturing a pattern, it would require some form of sequential or positional information of the location, which isn’t present in our features.
Additionally, once you apply stratified K-fold cross-validation, any inherent order in the data is effectively removed, making it even less likely for the model to learn a spatial pattern unintentionally.
I don’t consider this a data leak unless:
1. Post-processing is applied to exploit the pattern.
2. The location order is explicitly encoded in one of the features.
This is similar to the case discussed here:
https://zindi.africa/competitions/geoai-challenge-for-agricultural-plastic-cover-mapping-with-satellite-imagery/discussions/23270, where participants used an ID column to boost scores. That was a clear example of the model leveraging an unintended pattern.
However, in our case, if I speak for our team, we are not encoding any location order. So, the key question remains: what exactly constitutes a leak? In my view, a leak only occurs if post-processing exploits a discovered pattern or if location order is directly embedded in the features.
Food for thought:
If the model had indeed learnt the sequential order of the locations, wouldn't it perform worse on the test set, since the test set is reversed?
Correct position
Machine learning models will not pick up the leak unless the index is used as a feature, which I think no experienced ML engineer would do; it is against common sense. And it goes against the philosophy of "If you can measure it, do not predict it", which is not the case here: we need to predict the flood yearday from the precipitation and visual data.
True!!! The model picking up the leak by itself doesn't make sense to me unless you explicitly introduce it as a feature. And, just like I said before:
If the model had indeed learnt the sequential order of the locations, wouldn't it perform worse on the test set, since the test set is reversed?
@Koleshjr Exactly, the argument contradicts itself
Kudos for finding the pattern and sharing it. Hopefully @Zindi will be proactive and quickly make an anouncement on this soon.
There are a few arguments I don't agree with, but I'll go with this one.
The counter arguments are that:
I agree with you @Muhamed_Tuo
I agree with your logic @Muhamed_Tuo , the exploit is in post-processing, not in model training, I think most would agree with that
Really???
Thank you all for the good points! I want to apologize for my previous baseless claim that the top 10 teams are knowingly using the leak. I was too excited. Your modeling skills are amazing.
My thoughts on the leak:
1. If you apply the leak and the score improves by about 0.0003 or even more, congrats! Your model is confirmed to be leak-free!
2. If you apply the leak and the score doesn't change, congrats! Your model has already learned to perfectly isolate flood locations vs non-flood locations, which is amazing! All you need to do is justify that the model learns this without the leak.
3. If you apply the leak and the score improves by about 0.0001~0.0002, congrats! Your model does a fairly good job of isolating flood locations vs non-flood locations. This is what is supposed to happen in a leak-free world, and what the images are intended for.
Upon further analysis, most of the gain comes from normalizing probabilities for flood locations rather than setting the probabilities to 0 for non-flood locations. Normalizing flood location probabilities is a legitimate step and it should be allowed. What the admins should look at is how the flood locations are found.
That's what I wanted to say. Normalizing probabilities systematically lowers the log loss; it is just the mathematics of it. The true issue is the identification of flood locations.
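To illustrate the point numerically, here is a toy sketch in Python. The `log_loss` helper and every number in it are made up for illustration (this is not the competition's metric code or the published leak function); it only assumes a per-cell binary log loss and that each `event_id` has exactly one flood day, so its probabilities can be rescaled to sum to 1.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Binary cross-entropy averaged over all (day) cells of one event.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy event: 730 candidate days, exactly one true flood day.
rng = np.random.default_rng(0)
y_true = np.zeros(730)
y_true[400] = 1.0

# An under-confident model: low mass everywhere, small peak on the true day,
# so the event's total probability is well below 1.
y_pred = rng.uniform(0.0001, 0.001, size=730)
y_pred[400] = 0.01

# Rescale so the event's probabilities sum to 1 (one flood per event_id).
y_norm = y_pred / y_pred.sum()

print("raw loss:       ", log_loss(y_true, y_pred))
print("normalized loss:", log_loss(y_true, y_norm))
```

With these toy numbers the normalized loss comes out lower than the raw one: scaling up an under-confident distribution helps the single true day more than it hurts the 729 near-zero days. If the model were already well calibrated (total mass near 1), normalization would change almost nothing.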
Now, may the jumping game begin :). The LB is going to feel different.
It's going to be a wild 3 days. Hats off to you @snow for finding and disclosing the leak.
I feel like exploiting this leak just to gain some positions on the leaderboard is wrong, as it goes against the ultimate goal of the competition, which is to have a strong model that generalizes as well as possible without relying on any bias in the data. Plus, in the description the organizers explicitly ask to leave the predicted probabilities as they are, without clipping or thresholding. So I believe this should be taken into account when reviewing the top solutions. Just my two cents.
While there's not much time left in the challenge, I'm quite interested in @Zindi's position on normalizing the probabilities. The competition forbids thresholding or rounding probabilities, but says nothing about normalization, which mathematically can lower the log loss of a model, given that we predict flood locations correctly. Is ANY kind of post-processing forbidden in this challenge?
I know it's late to reply, but I don't see anything wrong with doing that, taking advantage of the fact that there is never more than one flood per 'event_id'. However, I'm very skeptical that it would actually improve your score; you could try it and see for yourself. Intuitively speaking, it's good to overshoot (giving a total probability of more than 1) for 'events' where you're confident you can pinpoint a narrow window that must contain the flood, and to just give up (giving all 0) for 'events' where you think any day is as likely as any other to contain the flood. A model would implicitly learn this automatically.
Also, it's weird to forbid thresholding. In practice, it never works anyway (speaking from my experience with this competition and the recently-ended sepsis detection competition on Kaggle, both of which deal with extremely imbalanced labels), so why bother forbidding it? Again, a model would be smart enough to adjust its scores to minimize the loss; a naive improvement strategy such as thresholding its scores can only mess things up.
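To show why thresholding backfires under log loss with rare positives, here is a toy sketch in Python. All the numbers and the `log_loss` helper are invented for illustration, not taken from either competition: rounding probabilities to 0/1 zeroes out a true positive, and the clipped `-log(eps)` penalty for that one miss dwarfs the tiny savings on the negatives.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Binary cross-entropy with clipping, averaged over all cells.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Imbalanced toy data: 1 positive among 1000 cells.
y_true = np.zeros(1000)
y_true[3] = 1.0

y_pred = np.full(1000, 0.001)   # roughly the base rate everywhere
y_pred[3] = 0.3                 # model is somewhat confident on the positive

# Thresholding at 0.5 rounds everything to 0, including the true positive.
y_thr = (y_pred >= 0.5).astype(float)

print("raw loss:        ", log_loss(y_true, y_pred))
print("thresholded loss:", log_loss(y_true, y_thr))
```

With these numbers the thresholded loss is an order of magnitude worse than the raw one: the single missed positive costs about `-log(1e-15) / 1000` on its own, far more than the negligible gain from pushing the negatives to exactly 0.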
💯💯💯
It's not a leak; you simply uncovered a regularity in the test set. One would have to have a good enough predictor to detect the pattern (for example, if you simply use the BenchmarkSubmission.csv produced by the StartedNotebook.ipynb provided by DeepMind, you wouldn't see such a pattern).
Good luck everyone. Last update: I'm not selecting any leak submissions. My best final sub is 0.00245 on the public LB, which uses image models to classify flood vs non-flood locations.
Same as you, I just tried the apply_leak function you shared in submission, and my score increased significantly. But of course for the final submission I did not choose the submission that contained the leak. There might be a shakeup in the private leaderboard.
Thank you for sharing, @snow.