[Edit]: I made a video to explain it: https://youtu.be/GDZ8-ta-P_A and yeah I used an AI voice LOL
[Edit 2]: I want to apologize for my previous baseless claim that the top 10 teams are knowingly using the leak. Your modeling skills are amazing.
[Edit 3]: My thoughts on the leak:
1. If you apply the leak and the score improves by about 0.0003, congrats! Your model is confirmed to be leak-free!
2. If you apply the leak and the score doesn't change, congrats! Your model has already learned to perfectly isolate flood locations vs non-flood locations, which is amazing! All you need to do is justify that the model learns this without the leak.
3. If you apply the leak and the score improves by about 0.0001~0.0002, congrats! Your model does a fairly good job of isolating flood locations vs non-flood locations. This is what is supposed to happen in a leak-free world, and what the images are intended for.
[Edit 4]: Upon further analysis, most of the gain comes from normalizing probabilities for flood locations rather than setting the probabilities to 0 for non-flood locations. Normalizing flood location probabilities is a legitimate step and it should be allowed. What the admins should look at is how the flood locations are found.
[Original Post is below]
It is really unfortunate that I joined late and only found the leak just now. It hasn't been discussed yet, has it? It is too late to reboot the competition.
My definition of leak: anything in the dataset that is not supposed to have predictive power.
Pardon my rudeness, but I am confident that everyone below 0.23 is knowingly exploiting the leak. I know leaks are inevitable. On Kaggle people generally accept them.
I am also aware that top 10 code will be reviewed, so could using a leak lead to disqualification?
In my opinion, the least damaging action we can take now is to acknowledge the leak and permit participants to use it. Top 10 teams please be honest in your solution about the leak and don’t try to hide it. We will still have a useful model to some degree in the end. This is not a complete disaster.
Since you're under 0.23 and in the top 10, I can conclude that you are using the leak. We are not using any leak ourselves. So I would be happy for you to share.
yeah, of course, otherwise I would not know.
I am happy to share too. What's the vibe on Zindi if I do a super-late sharing of a leak? On Kaggle I would be downvoted to hell.
This is an example of a recent data leak discovered a few days before competition closed. The sentiments of participants are in the comments.
There are no downvotes on Zindi. It's good to share. The Zindi team is also very fair during evaluation of the top solutions.
https://zindi.africa/competitions/microsoft-learn-location-mention-recognition-challenge/discussions/22497
Thank you so much!
No, you're not going to be downvoted on Kaggle. I think on Kaggle it would be appreciated, and they are quick to take action when a leak is confirmed. Generally, they extend the competition deadline. I've seen it in more than 3 challenges.
Sorry to disappoint you @snow, but we are not even aware of any leak in the dataset. I'm speaking for my team.
Yeah, it is possible. I realize I might need to eat my words about "knowingly exploit".
I'm not aware of any leak (I haven't even made a submission, probably won't have time to). I'm assuming it involves using the satellite images to find the exact locations and then introducing additional features based on that, which completely negates the anonymization of the event_id strings.
no it is not in the image. Ok I'll share it later today.
Okay...
Not sure I want to pardon your rudeness.
You claimed this with no proof and no way of knowing. I really wish I had this kind of confidence.
Just like the others, we're not aware nor using any leaks
I apologize for what I said. But there is a way of knowing. I'll publish a function that boosts any submission. If a submission is not boosted, the leak is already in.
Hi @snow, I have watched the video and tried to implement the function, but I keep getting errors. Could you please share the link to the Colab notebook used in the YouTube video?
Hi all, I published the leak. Please find it in the edited main message.
I tested it, and it works, and it's an easy way to score! But I think it must not be allowed.
I think what should not be allowed is the post-processing step that results from the leak, no?
Btw what was your score before the leak? I have not tested mine yet as we are out of subs for the day
My bad
I've seen the video, I didn't even realize there was a pattern. Thank you, I think what you shared is very valuable as input for the competition organizers.
No worries bro :)
The score before the leak was 0.0024 and the score after was 0.0020. But it was not the best score that was tested. And I understand that the goal of this competition must be to give the raw probability of the model without any post-processing.
Thank you for sharing this information.
Most of the top big wigs, I am sure, did not intentionally use the leak and are surprised their models could pick that pattern up... Now that you have brought up this discussion, I am not sure how it can be handled (whether to close one eye and permit it or not), but the wisdom of @Amy_Bray and the whole @Zindi team will come in handy.
For now we could be guided by the competition intention in the data section:
"If there is a flood for that event it can happen on any ONE of those 730 days. We have done this to ensure that you do not always select the middle day as the flood day."
Once again, maybe the data provided was not prepared well enough to meet the competition's intention perfectly.
I have a different opinion. I don't agree that the model picked up the pattern because there is no explicit encoding of location order in any column. If the model were truly capturing a pattern, it would require some form of sequential or positional information of the location, which isn’t present in our features.
Additionally, once you apply stratified K-fold cross-validation, any inherent order in the data is effectively removed, making it even less likely for the model to learn a spatial pattern unintentionally.
I don’t consider this a data leak unless:
1. Post-processing is applied to exploit the pattern.
2. The location order is explicitly encoded in one of the features.
This is similar to the case discussed here:
https://zindi.africa/competitions/geoai-challenge-for-agricultural-plastic-cover-mapping-with-satellite-imagery/discussions/23270, where participants used an ID column to boost scores. That was a clear example of the model leveraging an unintended pattern.
However, in our case, if I speak for our team, we are not encoding any location order. So, the key question remains: what exactly constitutes a leak? In my view, a leak only occurs if post-processing exploits a discovered pattern or if location order is directly embedded in the features.
Food for thought:
If the model had indeed learnt the sequential order of the locations, wouldn't it perform worse on the test set, since the test set is reversed?
Correct position
Machine learning models will not pick up the leak unless the index is used as a feature, which I think no experienced ML engineer would do; it is against common sense. And it goes against the philosophy of "If you can measure it, do not predict it", which is not the case here: we need to predict the flood yearday from the precipitation and visual data.
True!!! The model picking up the leak by itself doesn't make sense to me unless you explicitly introduce it as a feature. And, just like I said before:
If the model had indeed learnt the sequential order of the locations, wouldn't it perform worse on the test set, since the test set is reversed?
@Koleshjr Exactly, the argument contradicts itself
Kudos for finding the pattern and sharing it. Hopefully @Zindi will be proactive and quickly make an anouncement on this soon.
There are a few arguments I don't agree with, but I'll go with this one.
The counter arguments are that:
I agree with you @Muhamed_Tuo
I agree with your logic @Muhamed_Tuo , the exploit is in post-processing, not in model training, I think most would agree with that
Really???
Thank you all for the good points! I want to apologize for my previous baseless claim that the top 10 teams are knowingly using the leak. I was too excited. Your modeling skills are amazing.
My thoughts on the leak:
1. If you apply the leak and the score improves by about 0.0003 or even more, congrats! Your model is confirmed to be leak-free!
2. If you apply the leak and the score doesn't change, congrats! Your model has already learned to perfectly isolate flood locations vs non-flood locations, which is amazing! All you need to do is justify that the model learns this without the leak.
3. If you apply the leak and the score improves by about 0.0001~0.0002, congrats! Your model does a fairly good job of isolating flood locations vs non-flood locations. This is what is supposed to happen in a leak-free world, and what the images are intended for.
Upon further analysis, most of the gain comes from normalizing probabilities for flood locations rather than setting the probabilities to 0 for non-flood locations. Normalizing flood location probabilities is a legitimate step and it should be allowed. What the admins should look at is how the flood locations are found.
That's what I wanted to say. Normalizing probabilities systematically lowers the log loss; it is just the mathematics of it. The true issue is the identification of flood locations.
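To illustrate the point numerically, here is a toy sketch in Python. The `log_loss` helper and every number in it are made up for illustration (this is not the competition's metric code or the published leak function); it only assumes a per-cell binary log loss and that each `event_id` has exactly one flood day, so its probabilities can be rescaled to sum to 1.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Binary cross-entropy averaged over all (day) cells of one event.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy event: 730 candidate days, exactly one true flood day.
rng = np.random.default_rng(0)
y_true = np.zeros(730)
y_true[400] = 1.0

# An under-confident model: low mass everywhere, small peak on the true day,
# so the event's total probability is well below 1.
y_pred = rng.uniform(0.0001, 0.001, size=730)
y_pred[400] = 0.01

# Rescale so the event's probabilities sum to 1 (one flood per event_id).
y_norm = y_pred / y_pred.sum()

print("raw loss:       ", log_loss(y_true, y_pred))
print("normalized loss:", log_loss(y_true, y_norm))
```

With these toy numbers the normalized loss comes out lower than the raw one: scaling up an under-confident distribution helps the single true day more than it hurts the 729 near-zero days. If the model were already well calibrated (total mass near 1), normalization would change almost nothing.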
Now, may the jumping game begin :). The LB is going to feel different.
It's going to be a wild 3 days. Hats off to you @snow for finding and disclosing the leak.
I feel like exploiting this leak just to gain some positions on the leaderboard is wrong, as it goes against the ultimate goal of the competition, which is to have a strong model that generalizes as well as possible without relying on any bias in the data. Plus, in the description the organizers explicitly ask to leave the predicted probabilities as they are, without clipping or thresholding. So I believe this should be taken into account when reviewing the top solutions. Just my two cents.
While there's not much time left in the challenge, I'm quite interested in @Zindi's position on normalizing the probabilities. The competition forbids thresholding or rounding probabilities, but says nothing about normalization, which mathematically can lower the log loss of a model, given that we predict flood locations correctly. Is ANY kind of post-processing forbidden in this challenge?
I know it's late to reply, but I don't see anything wrong with doing that, taking advantage of the fact that there is never more than one flood per 'event_id'. However, I'm very skeptical that it would actually improve your score; you could try it and see for yourself. Intuitively speaking, it's good to overshoot (giving a total probability of more than 1) for 'events' where you're confident you can pinpoint a narrow window that must contain the flood, and to just give up (giving all 0) for 'events' where you think any day is as likely as any other to contain the flood. A model would implicitly learn this automatically.
Also, it's weird to forbid thresholding. In practice, it never works anyway (speaking from my experience with this competition and the recently-ended sepsis detection competition on Kaggle, both of which deal with extremely imbalanced labels), so why bother forbidding it? Again, a model would be smart enough to adjust its scores to minimize the loss; a naive improvement strategy such as thresholding its scores can only mess things up.
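To show why thresholding backfires under log loss with rare positives, here is a toy sketch in Python. All the numbers and the `log_loss` helper are invented for illustration, not taken from either competition: rounding probabilities to 0/1 zeroes out a true positive, and the clipped `-log(eps)` penalty for that one miss dwarfs the tiny savings on the negatives.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    # Binary cross-entropy with clipping, averaged over all cells.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Imbalanced toy data: 1 positive among 1000 cells.
y_true = np.zeros(1000)
y_true[3] = 1.0

y_pred = np.full(1000, 0.001)   # roughly the base rate everywhere
y_pred[3] = 0.3                 # model is somewhat confident on the positive

# Thresholding at 0.5 rounds everything to 0, including the true positive.
y_thr = (y_pred >= 0.5).astype(float)

print("raw loss:        ", log_loss(y_true, y_pred))
print("thresholded loss:", log_loss(y_true, y_thr))
```

With these numbers the thresholded loss is an order of magnitude worse than the raw one: the single missed positive costs about `-log(1e-15) / 1000` on its own, far more than the negligible gain from pushing the negatives to exactly 0.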
💯💯💯
It's not a leak; you simply uncovered a regularity in the test set. One would have to have a good enough predictor to detect the pattern (for example, if you simply use the BenchmarkSubmission.csv produced by the StartedNotebook.ipynb provided by DeepMind, you wouldn't see such a pattern).
Good luck everyone. Last update: I'm not selecting any leak submissions. My best final sub is 0.00245 on the public LB, which uses image models to classify flood vs non-flood locations.
Same as you, I just tried the apply_leak function you shared in submission, and my score increased significantly. But of course for the final submission I did not choose the submission that contained the leak. There might be a shakeup in the private leaderboard.
Thank you for sharing, @snow.