
Microsoft Learn Location Mention Recognition Challenge

$5,000 USD
Completed (over 1 year ago)
Natural Language Processing
Generative AI
1246 joined
365 active
Start: May 16, 2024
Close: Oct 13, 2024
Reveal: Oct 13, 2024
Koleshjr
Multimedia University of Kenya
THIS COMPETITION IS HONESTLY FLAWED!!!!!!!!!!!!
Platform · 17 Sep 2024, 07:38 · 35

Okay, for a very long time I have struggled to break the 0.10 barrier, and I have watched other teams beat this score and perform very well, wondering how on earth are they doing this????

I have tried many different models and techniques, but I was not able to get scores below 0.10. Today, just as I was about to give up, I remembered a discussion by @Nkosana_Daniel, linked here:

Competition data - Zindi

That discussion hinted that part of the test data is present in the other time-based JSON data, so I decided to test the hypothesis, and my oh my, I was astonished. 2,384 test rows appear in the gold time-based data and only 558 do not. I simply replaced the locations of those 2,384 with the actual gold locations, predicted only the remaining 558, and behold: a near-perfect score, up from 0.1338!
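
For anyone who wants to verify the overlap themselves, the check described above can be sketched in a few lines of Python. This is only a sketch: the `tweet_id` key and the row structure are my assumptions, not the competition's actual schema.

```python
def split_by_leak(test_rows, leaked_rows, key="tweet_id"):
    """Partition test rows into those whose key also appears in the
    leaked time-based data (gold label recoverable) and those that
    genuinely need a model prediction.

    `key` is a hypothetical identifier field; adapt it to however the
    test tweets and the time-based JSON records can be matched.
    """
    leaked_index = {row[key]: row for row in leaked_rows}
    covered = [r for r in test_rows if r[key] in leaked_index]
    uncovered = [r for r in test_rows if r[key] not in leaked_index]
    return covered, uncovered
```

With the numbers reported above, `covered` would hold the 2,384 rows whose gold locations are visible and `uncovered` the 558 that still require a model.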

My main concern is: how could the people at the top have known this for a whole week or more and not tell the organizers? Okay, I might give them a bit of grace; maybe they did not know. But if they knew, how mean of you guys! Now only one day of the competition remains. Is it too late? And what will the organizers do, now that people can simply look up the actual labels for almost the whole test set?

Update: It turns out most of the top participants genuinely did not know about this leak, so I am sorry if I was mean to them.

@Zindi @Amy_Bray

Discussion · 35 answers
Nayal_17

@Koleshjr Wow, buddy, what a bombshell 💪🤌🥶. I gave up on this competition a long time ago, but I didn't know about this flaw. Great digging!

17 Sep 2024, 07:52
Upvotes 0
Koleshjr
Multimedia University of Kenya

Those scores were demotivating me, man. Anyway, maybe they actually did not use the time-based data, and if that is the case then it's impressive.

marching_learning
Nostalgic Mathematics

When we check the training data, there are some inconsistencies in the labels (not related to order this time, though there are a few like that). Because of this, and based on my CV experience, I suspected that models can't go much below 0.11xx, and I assumed that to get under 0.10xx a model would probably have to memorize even the inconsistencies. But I was not sure there was a leak.

Koleshjr
Multimedia University of Kenya

I also doubt that scores below 0.10 are free of leakage. But then again, people might be creative enough to get below 0.10 without it; we never know. The only easy way to find out is for everyone with a score below 0.10 to tell us whether they used the time-based data, but I doubt they will.

Nayal_17

Every time I tried for a comeback in the last week, after the test data issue was resolved, I felt so demotivated because of how little time remained and the overwhelming LB scores.

marching_learning
Nostalgic Mathematics

Thank you @Koleshjr. Normally, when people find a leak in a competition, they must declare it. If that is the case for some of the people topping the leaderboard, then we have an issue. But in any case the code will be checked, and if they used these data for training, they should be removed from the LB.

I suspected this but didn't go any further: with part of the datasets, I had already identified ~7,000 tweets of the training data present in the GitHub datasets.

So thank you for your honesty @Koleshjr

17 Sep 2024, 08:01
Upvotes 1

Oh man, that explains :D

I was only using train data and hitting a wall...

Thanks for posting! Otherwise I would have gone crazy searching for improvements.

17 Sep 2024, 08:02
Upvotes 1
hark99
Self-employed

Preprocessing also does not work on this dataset: there are actual names in the locations, and if you omit them, you are penalized at inference for those names and end up with a far worse score. I tried different ways of preprocessing, all in vain.

17 Sep 2024, 08:08
Upvotes 1

I think it is generally against the rules to use test rows in submissions :) I hope Zindi clarifies.

17 Sep 2024, 08:10
Upvotes 1
Muhamed_Tuo
Inveniam

@Koleshjr I had my doubts, but this explains so much. Thanks for the transparency!!!

17 Sep 2024, 08:15
Upvotes 0
Koleshjr
Multimedia University of Kenya

The real hero here is @Nkosana_Daniel: he was the first to hint at this discrepancy, and he was kind enough to let the organizers know, but it seems his discussion was ignored. If it were not for him, I would not have tested the hypothesis, so thanks @Nkosana_Daniel.

Okay, that explains a lot. So what is going to happen next?

17 Sep 2024, 08:15
Upvotes 0
yassin104

This issue has serious implications for the fairness of the competition, as participants who used this data to train or adjust their models have an unfair advantage over those who adhered to the intended constraints.

To maintain the credibility of the competition, I believe that all participants who trained their solutions using this leaked data should have their submissions disqualified. Allowing such data to influence the competition results undermines the principles of fairness and merit-based evaluation.

I urge Zindi to act as soon as possible to address this issue and ensure a fair and transparent outcome for all participants. Taking swift action will reinforce the integrity of the competition and protect its reputation. @Zindi @Amy_Bray

17 Sep 2024, 08:24
Upvotes 2
Koleshjr
Multimedia University of Kenya

Not only that: given that the actual labels can be seen, people can use them to post-process their results. So training on the time-based data, and post-processing too, should not be allowed!!

marching_learning
Nostalgic Mathematics

Post-processing in general is not an issue, but it is problematic if it is leak-oriented.

Koleshjr
Multimedia University of Kenya

Yes, it is not an issue in other competitions, but in this one it should not be allowed. If, for example, I use post-processing, how will I argue that I have not tailored it to the leaked test data???

Muhamed_Tuo
Inveniam

Agree with @marching_learning on this one. Post-processing shouldn't be disallowed outright unless there are clear signs of using the leak (I believe this can easily be spotted, given how inconsistent the labels are).

Koleshjr
Multimedia University of Kenya

The question is: how will you argue that you have not post-processed based on the leak? General post-processing techniques grounded in the rules, for example sorting, should indeed be allowed; but one where you map your predictions onto the leaked test results should not be.

marching_learning
Nostalgic Mathematics

As @Muhamed_Tuo said, "illegal post-processing can easily be spotted". I just used simple co-occurrence probabilities (from the training samples) to post-process my sequences.
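
For illustration, a leak-free post-processing step of this kind might look as follows. This is only a sketch of the idea, assuming predictions are lists of location tokens per tweet; the function names and structure are mine, not marching_learning's actual code.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(train_label_seqs):
    """Count, over the training labels only, how often each pair of
    location tokens appears in the same tweet."""
    pair_counts = Counter()
    for tokens in train_label_seqs:
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[frozenset((a, b))] += 1
    return pair_counts

def reorder(pred_tokens, pair_counts):
    """Order a tweet's predicted locations so that tokens with the
    strongest training-time co-occurrence with the rest of the
    prediction come first. No test labels are consulted."""
    def support(tok):
        return sum(pair_counts[frozenset((tok, other))]
                   for other in pred_tokens if other != tok)
    return sorted(pred_tokens, key=support, reverse=True)
```

Because the statistics come exclusively from the training set, a step like this is verifiable as leak-free during code review.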

Juliuss
Freelance

Interesting find, @Koleshjr! The ball is back with @Amy_Bray and the @Zindi team! Remember, the challenge dates were already revised from the initial deadline to a shorter one; maybe it's time to extend them and sort out this issue.

17 Sep 2024, 09:07
Upvotes 2

I agree time should be extended.

Furthermore, the timeline section hints towards the competition closing on the 13th of October 2024.

You can find the conflicting information on the following page: https://zindi.africa/competitions/microsoft-learn-location-mention-recognition-challenge

We would greatly appreciate it if you could clarify whether the 13th of October is the correct date.

As this could significantly impact our preparation and submission timeline, we kindly request your immediate attention to this matter.

@zindi @marching_learning

About your remark: it's the first time I'm hearing something like that. I worked with the dataset to train a model and got the score that put me at the top, nothing more and nothing less. I didn't know about this, but thanks, I see it now.

17 Sep 2024, 09:26
Upvotes 1
Koleshjr
Multimedia University of Kenya

Oh sorry about that

HungryLearner

@Koleshjr, the possibility of a leak never occurred to me or any of my team members. We remained the only team to hold the top position for a long time, until recently, when new results started popping up here and there. And no form of update or post-processing based on randomly checking our predictions improved our score. I was shocked by all these new scores on the LB, which led me and my teammates to sign off from this competition two days ago. So I think you did the right thing in pointing out the leak as you did, but pointing at the people in the top positions as having been aware of it is a bit unfair, to me for instance. Yes, we trained our models on the GitHub JSON data rather than the given CSVs, but it never occurred to me that we were really only competing on 558 test instances.

17 Sep 2024, 10:46
Upvotes 0
Koleshjr
Multimedia University of Kenya

Oh sorry about that

MICADEE
LAHASCOM

@HungryLearner In fact, I am just short of words at this new finding 🤔 after putting so much effort into this. I don't even know what more to say about it.

So is 13 October the up-to-date submission deadline (as written on the main page)?

I believe clarifying this would benefit everyone. Please also update it in the toolbar so we can know for sure:

"Competition closes on 13 October 2024.

Final submissions must be received by 11:59 PM GMT."

@zindi

17 Sep 2024, 11:26
Upvotes 0
Koleshjr
Multimedia University of Kenya

@Zindi @Amy_Bray @meganomaly Can we at least kindly get a response from you so that we know what to do? Time is running out.

18 Sep 2024, 07:02
Upvotes 3
yassin104

Now what? Zindi is not responding. Should we use the gold time-based data to achieve a near-perfect score like the others, or what?

18 Sep 2024, 08:41
Upvotes 0

I am going to select one leaky submission and one that makes sense and uses only the provided train data.

Unless Zindi clarifies...

Muhamed_Tuo
Inveniam

Unfortunately, in the event of no response, I'll do the same in the final hours.

yassin104

I think this is the most logical move for now, but I will wait a bit longer in case Zindi responds.

Sodiq_Babawale_
University of Ibadan

The issue with this is that your solution using leaked data will probably be the best-ranked one after the competition ends, and the only one Zindi asks you to submit. If they decide to penalize those who used the leaked data, you will unfortunately be penalized too. The best thing is for @Zindi to clarify the way forward.

It looks like we finally have some clarification...