Microsoft Learn Location Mention Recognition Challenge

$5 000 USD
Challenge completed ~1 year ago
Natural Language Processing
Generative AI
1219 joined
365 active
Start: May 16, 24
Close: Oct 13, 24
Reveal: Oct 13, 24
ZINDI
Update: dataset and rules clarification, plus challenge timeline
Platform · 18 Sep 2024, 11:46 · 23

Hello everyone,

Please accept the apologies of the whole Zindi team for the slow response time and timeline confusion in this challenge.

Potential data leak

Following recent discussions regarding data from the IDRISI repo, we've taken some time to review the potential for a leak in this challenge.

Please note that data for this competition was prepared using only the json files in the repo, with the test set curated by the team in charge of the repository. It is possible that some of the test set data may have been sourced from some of the other resources in the repo. However, please bear in mind that access to the repository is only granted for learning purposes. The data in the repo does not form part of the approved datasets usable in this challenge.

As with all Zindi challenges, your submissions in this challenge are subject to the challenge rules and regulations; in this case, the following rule is specifically relevant:

Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution.

We encourage you to do your best to adhere to the challenge rules as you build out your solutions. As always, we will conduct a detailed review of top-performing solutions, and any submission found to be contravening these rules will be disqualified.

Updated timeline

As raised by several participants, the timelines are currently contradictory. The official close date of this challenge is 13 October, and the platform has been updated to reflect that.

We wish you the best of luck for the remainder of the challenge.

Happy coding!

Discussion 23 answers

Thanks for clarifying the close date. This gives me more time to tackle the competition challenge.

18 Sep 2024, 11:52
Upvotes 2

Hmm, I already started to write my solution... I guess I have to wait for another few weeks then.

18 Sep 2024, 11:55
Upvotes 1
"However, please bear in mind that access to the repository is only granted for learning purposes. The data in the repo does not form part of the approved datasets usable in this challenge."

So correct me if I'm wrong but I assume that Train_1.csv and Test.csv from Data are the only approved datasets usable in this challenge, am I right?

18 Sep 2024, 12:07
Upvotes 4
MICADEE
LAHASCOM

@davidreifferscheidt Yeah, you're right. We need to know whether Train_1.csv and Test.csv are the only approved datasets usable for this challenge, because the data in this repository is the main cause of the leak.

Koleshjr
Multimedia university of kenya

The dataset in IDRISI that has the leak is the one in the Gold Time-based sub-directory, while the data approved for this challenge is both the CSVs and the JSON data in Gold Random, as explained on the Data Info page, and I quote:

"The data is available in JSONL format in the GitHub repository. (Full example). The full training, dev and test files, can be downloaded from here: https://github.com/rsuwaileh/IDRISI/blob/main/LMR/data/EN/gold-random-json/"

So the contentious data is the time-based set, which has the leak. I think @ZINDI can clarify this too.

MICADEE
LAHASCOM

@Koleshjr Okay. In that case, they should reaffirm this by pointing us to the datasets that must be used for this challenge inside this repository then.

Koleshjr
Multimedia university of kenya

I second this @MICADEE

Muhamed_Tuo
Inveniam

It might even be simpler to forbid all use of the repository, to avoid any further confusion, even though a lot of us (me included) relied on the JSON data.

@Zindi @Amy_Bray

Koleshjr
Multimedia university of kenya

Same, I relied on the JSON data, and I think forbidding it will be the simpler way to avoid confusion.

ZINDI

Hello, you are correct that the only datasets usable in this challenge are Train_1.csv and Test.csv from the Data page. These correspond to the data in gold-random-json, but we recommend not using datasets from the repo at all. The Data page has been updated to reflect this. Our apologies once again for any confusion caused.

Muhamed_Tuo
Inveniam

Thanks !!!

Thank you for your response. Regarding the leaderboard, I assume it is no longer relevant to the competition and does not reflect the current standings?

18 Sep 2024, 12:19
Upvotes 2

@Zindi Thank you for the clarification. There is also a question around the certification credits. Will they also only be presented after the close date or after the winner is announced?

18 Sep 2024, 12:19
Upvotes 0
ZINDI

Hi @MakalaMabotja, we will share certification credits with those who have made a valid submission by the closing date of the challenge.

Thank you for the response

@Zindi, I think the leaderboard needs to be reset. Comment please!

18 Sep 2024, 12:25
Upvotes 4
Muhamed_Tuo
Inveniam

Yes, I agree, it should. Because at present, we have "no real idea" of what the actual peak performance is.

@Zindi

This makes things a bit more understandable. I also want to clarify one point with @ZINDI:

From the Data page of the competition, we have the following statement: "The data is available in JSONL format in the GitHub repository. (Full example). The full training, dev and test files, can be downloaded from here: https://github.com/rsuwaileh/IDRISI/blob/main/LMR/data/EN/gold-random-json/ ... The datasets have also been provided as CSV files if you would prefer to use CSV files. The choice is yours."

So we are allowed to use only the EN-Gold-Random-BILOU-JSON dataset from IDRISI or the CSV files provided, right?

However, this clarification says: "Following recent discussions regarding data from the IDRISI repo, ... The data in the repo does not form part of the approved datasets usable in this challenge."

Does this mean that we may now use only the CSV files provided, or is the EN-Gold-Random-BILOU-JSON dataset, which the Data page permitted us to use, an exception?

This is still quite confusing to me.

18 Sep 2024, 12:31
Upvotes 2
MacGee

I think every Zindi challenge has private test data. During the challenge, we are all scored on one part of that test data, which gives our public score. On completion of the challenge, our models are tested on the larger, private part of the test data.

I think this should be it: use either the CSV files to build your model or the GitHub link announced.

ZINDI

Hi @Ezino, we have updated the Data page to eliminate the confusion around this point. You should only use the Test.csv and Train_1.csv on the Data page, and no data from the repository can be used in your models.

@Zindi, what about the reset of the leaderboard?

Kamenialexnea
Ecole nationale superieure polytechnique yaounde

What about the reset of the leaderboard, please? @ZINDI

@ZINDI have the certification credits been shared with the participants?

31 Oct 2024, 09:54
Upvotes 0