Microsoft Learn Location Mention Recognition Challenge

$5 000 USD

Completed (almost 2 years ago)

Skills you will learn

Natural Language Processing

Generative AI

1271 joined

365 active

Start

May 16, 24

Oct 13, 24

Reveal

Oct 13, 24

ZINDI

Update: dataset and rules clarification, plus challenge timeline

Platform · 18 Sep 2024, 11:46 · 23

Hello everyone,

Please accept the apologies of the whole Zindi team for the slow response time and timeline confusion in this challenge.

Potential data leak

Following recent discussions regarding data from the IDRISI repo, we've taken some time to review the potential for a leak in this challenge.

Please note that data for this competition was prepared using only the json files in the repo, with the test set curated by the team in charge of the repository. It is possible that some of the test set data may have been sourced from some of the other resources in the repo. However, please bear in mind that access to the repository is only granted for learning purposes. The data in the repo does not form part of the approved datasets usable in this challenge.

As with all Zindi challenges, your submissions in this challenge are subject to the challenge rules and regulations; in this case, the following rule is specifically relevant:

Zindi is committed to providing solutions of value to our clients and partners. To this end, we reserve the right to disqualify your submission on the grounds of usability or value. This includes but is not limited to the use of data leaks or any other practices that we deem to compromise the inherent value of your solution.

We encourage you to adhere to the challenge rules as you build out your solutions. As always, we will conduct a detailed review of top-performing solutions, and any submission found to contravene these rules will be disqualified.

Updated timeline

As raised by several participants, the timelines are currently contradictory. The official close date of this challenge is 13 October, and the platform has been updated to reflect that.

We wish you the best of luck for the remainder of the challenge.

Happy coding!

Discussion 23 answers

MTH_Thee_Algorithmist

Thanks for clarifying the close date. This gives me more time to tackle the challenge.

18 Sep 2024, 11:52

Upvotes 2

beluga

Hmm, I already started to write my solution... I guess I have to wait for another few weeks then.

18 Sep 2024, 11:55

Upvotes 1

mdsDR

However, please bear in mind that access to the repository is only granted for learning purposes. The data in the repo does not form part of the approved datasets usable in this challenge.

So correct me if I'm wrong but I assume that Train_1.csv and Test.csv from Data are the only approved datasets usable in this challenge, am I right?

18 Sep 2024, 12:07

Upvotes 4

MICADEE

LAHASCOM (Freelance)

@davidreifferscheidt Yeah, you're right. We need to know whether Train_1.csv and Test.csv are the only approved datasets usable for this challenge, because the data in this repository is the main cause of the leak.

replied to mdsDR · 18 Sep 2024, 12:10

Upvotes 0

Koleshjr

Multimedia university of kenya

The dataset in IDRISI that has the leak is the one in the Gold Time-based sub-directory, while the data approved for this challenge is both the CSVs and the JSON data in Gold Random, as explained on the Data Info page, and I quote:

"""The data is available in JSONL format in the GitHub repository. (Full example). The full training, dev and test files, can be downloaded from here: https://github.com/rsuwaileh/IDRISI/blob/main/LMR/data/EN/gold-random-json/"""

So the contentious data here is the time-based set, which has the leak. I think @ZINDI can clarify this too.

replied to MICADEE · 18 Sep 2024, 12:13

Upvotes 2

MICADEE

LAHASCOM (Freelance)

@Koleshjr Okay. In that case, they should reaffirm this by pointing us to the datasets that must be used for this challenge inside this repository then.

replied to Koleshjr · 18 Sep 2024, 12:18

Upvotes 2

Koleshjr

Multimedia university of kenya

I second this @MICADEE

replied to MICADEE · 18 Sep 2024, 12:21

Upvotes 3

Muhamed_Tuo

Inveniam

It might even be simpler to forbid all use of the repository, to avoid any further confusion, even though a lot of us (me included) relied on the JSON data.

@Zindi @Amy_Bray

replied to MICADEE · 18 Sep 2024, 12:49

Upvotes 1

Koleshjr

Multimedia university of kenya

Same, I relied on the JSON data, and I think forbidding it will be the simpler way to avoid confusion.

replied to Muhamed_Tuo · 18 Sep 2024, 12:51

Upvotes 1

ZINDI

Hello, you are correct that the only datasets usable in this challenge are Train_1.csv and Test.csv from the Data page. These correspond to the data in gold-random-json, but we recommend not using datasets from the repo at all. The Data page has been updated to reflect this. Our apologies once again for any confusion caused.

replied to mdsDR · 18 Sep 2024, 13:38

Upvotes 5

Muhamed_Tuo

Inveniam

Thanks !!!

replied to ZINDI · 18 Sep 2024, 14:35

Upvotes 0

Papito

Thank you for your response. Regarding the leaderboard, I assume it is no longer relevant to the competition and does not reflect the current standings?

18 Sep 2024, 12:19

Upvotes 2

MakalaMabotja

@Zindi Thank you for the clarification. There is also a question around the certification credits. Will they also only be presented after the close date or after the winner is announced?

18 Sep 2024, 12:19

Upvotes 0

ZINDI

Hi @MakalaMabotja, we will share certification credits with those who have made a valid submission by the closing date of the challenge.

replied to MakalaMabotja · 18 Sep 2024, 13:42

Upvotes 0

MakalaMabotja

Thank you for the response

replied to ZINDI · 18 Sep 2024, 14:54

Upvotes 0

Joel99

@Zindi, I think the leaderboard needs to be reset. Comment please!

18 Sep 2024, 12:25

Upvotes 4

Muhamed_Tuo

Inveniam

Yes, I agree, it should. Because at present, we have "no real idea" of what the actual peak performance is.

@Zindi

replied to Joel99 · 18 Sep 2024, 12:41

Upvotes 3

Ezino

This makes things a bit more understandable. I would also like @ZINDI to clarify the following:

The Data page of the competition has the following statement: "The data is available in JSONL format in the GitHub repository. (Full example). The full training, dev and test files, can be downloaded from here: https://github.com/rsuwaileh/IDRISI/blob/main/LMR/data/EN/gold-random-json/ ... The datasets have also been provided as CSV files if you would prefer to use CSV files. The choice is yours."

So we are allowed to use only the EN-Gold-Random-BILOU-JSON dataset from IDRISI or the CSV files provided, right?

However, this clarification says: "Following recent discussions regarding data from the IDRISI repo, ... The data in the repo does not form part of the approved datasets usable in this challenge."

Does this mean that we may now use only the CSV files provided, or is the EN-Gold-Random-BILOU-JSON dataset, which the Data page permitted, an exception?

This is still quite confusing to me.

18 Sep 2024, 12:31

Upvotes 2

MacGee

I think every Zindi challenge has a private test set. Our public score is computed on one part of that test data, and on completion of the challenge our models are scored on the larger private portion.

I think this should be it. Either the CSV files to build your model or the github link announced.

replied to Ezino · 18 Sep 2024, 12:52

Upvotes 0

ZINDI

Hi @Ezino, we have updated the Data page to eliminate the confusion around this point. You should only use the Test.csv and Train_1.csv on the Data page, and no data from the repository can be used in your models.

replied to Ezino · 18 Sep 2024, 13:44

Upvotes 2
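
For anyone who wants a guard against accidentally mixing in repository files, here is a minimal sketch of an allow-list check based on the two file names given above. The function name `check_data_sources` and the constant `APPROVED_FILES` are hypothetical, and the check looks only at file names, so it is a reminder rather than proof of compliance.

```python
from pathlib import Path

# File names approved in this thread (Train_1.csv and Test.csv from the Data page).
# The constant and function names below are illustrative assumptions.
APPROVED_FILES = {"Train_1.csv", "Test.csv"}


def check_data_sources(paths):
    """Return the inputs as Path objects, or raise if any file name is not approved."""
    rejected = [str(p) for p in paths if Path(p).name not in APPROVED_FILES]
    if rejected:
        raise ValueError(f"Unapproved data sources: {rejected}")
    return [Path(p) for p in paths]
```

Calling `check_data_sources(["data/Train_1.csv", "data/Test.csv"])` at the top of a training script would pass, while any path ending in a different file name raises `ValueError` before the data is loaded.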

Joel99

@Zindi, what about the reset of the leaderboard?

replied to ZINDI · 18 Sep 2024, 14:08

Upvotes 1