
Microsoft Learn Location Mention Recognition Challenge

$5 000 USD
Completed (over 1 year ago)
Natural Language Processing
Generative AI
1246 joined
365 active
Start: May 16, 24
Close: Oct 13, 24
Reveal: Oct 13, 24
Nkosana_Daniel
Competition data
Data · 7 Sep 2024, 00:02 · 2

Hi @Amy_Bray :)

It seems there are some discrepancies in the test data (csv). A significant number of the tweets are not in gold-random-json; rather, they are training examples from other versions of the dataset. For example, the tweet with ID ID_1001154804658286592 (What is happening to the infrastructure in New England...) is actually a training example with labels, as shown here

I haven't looked at the training data, but judging from the number of tweets (~76k), it looks like it spans all the different versions of IDRISI (approx. 77.5k in total) and not just gold-random-json (20k). Could you confirm which dataset we're required to use? I hope you'll look into the test set as well. I suggest we use test_unlabeled.jsonl, for which we don't have access to the true labels.

Discussion 2 answers
Koleshjr
Multimedia University of Kenya

They are all there; the 76k is because most of them have nulls.

7 Sep 2024, 04:47
Upvotes 0
Nkosana_Daniel

Yeah. My concern is that by using all the tweets from all the datasets, they ended up with some tweets in test.csv that are actually part of the training data and have labels.
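
For anyone who wants to check this on their own copy, here is a rough sketch of how the overlap could be counted. It is only an illustration under some assumptions: the column/key name `tweet_id`, the `location_mentions` field, and the idea that the labeled files store the raw numeric ID without the `ID_` prefix seen in test.csv are all guesses, so adjust them to whatever the real files contain. The tiny in-memory data stands in for the actual test.csv and labeled .jsonl files.

```python
import csv
import io
import json


def load_csv_ids(csv_file, id_column="tweet_id"):
    """Return the set of IDs found in one column of a CSV file object.

    The column name "tweet_id" is an assumption, not taken from the real file.
    """
    return {row[id_column] for row in csv.DictReader(csv_file)}


def load_jsonl_ids(jsonl_file, id_key="tweet_id"):
    """Return the set of IDs found in a JSON Lines file object.

    The key name "tweet_id" is an assumption, not taken from the real file.
    """
    ids = set()
    for line in jsonl_file:
        line = line.strip()
        if line:  # skip blank lines
            ids.add(str(json.loads(line)[id_key]))
    return ids


def find_overlap(test_ids, labeled_ids, prefix="ID_"):
    """Return test IDs whose prefix-stripped form appears in labeled_ids."""
    overlap = []
    for test_id in test_ids:
        # Assumed: test.csv prepends "ID_" to the raw tweet ID.
        raw_id = test_id[len(prefix):] if test_id.startswith(prefix) else test_id
        if raw_id in labeled_ids:
            overlap.append(test_id)
    return sorted(overlap)


# Tiny in-memory stand-ins for test.csv and one labeled .jsonl file.
test_csv = io.StringIO(
    "tweet_id,text\nID_1001154804658286592,example\nID_42,other\n"
)
labeled_jsonl = io.StringIO(
    '{"tweet_id": "1001154804658286592", "location_mentions": []}\n'
)

leaked = find_overlap(load_csv_ids(test_csv), load_jsonl_ids(labeled_jsonl))
print(f"{len(leaked)} test tweet(s) also appear in the labeled data: {leaked}")
```

With the real files, the `io.StringIO` objects would be replaced by `open(...)` calls, and the labeled IDs from every dataset version would be merged into one set before comparing.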