Hi @Amy_Bray :)
It seems there are some discrepancies in the test data (csv). A significant number of the tweets are not in gold-random-json; rather, they are training examples from different versions of the dataset. For example, the tweet with id ID_1001154804658286592 ("What is happening to the infrastructure in New England...") is actually a training example with labels, as shown here
I haven't looked at the training data, but judging from the number of tweets (~76k), it does look like the test set spans all the different versions of IDRIS (approx. 77.5k in total) rather than just gold-random-json (20k). Could you confirm which dataset we're required to use? I hope you'll look into the test set as well. I suggest we use test_unlabeled.jsonl, for which we don't have access to the true labels.
They are all there; the 76k is because most of them have nulls.
Yeah. My concern is that by pooling tweets from all the datasets, they ended up with some tweets in test.csv that are actually part of the training data and have labels.
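For anyone who wants to verify the overlap themselves, here's a rough sketch of the check. The column name "id" for test.csv and the field name "tweet_id" for the training JSONL are assumptions; adjust them to the actual schema of the files you downloaded.

```python
# Sketch of a leakage check: which test-set tweet IDs also appear in the
# training data? The key names ("id", "tweet_id") are guesses at the schema.
import csv
import json

def leaked_ids(test_csv_path, train_jsonl_path, test_key="id", train_key="tweet_id"):
    """Return the set of test-set IDs that also appear in the training file."""
    with open(test_csv_path, newline="", encoding="utf-8") as f:
        test_ids = {row[test_key] for row in csv.DictReader(f)}
    with open(train_jsonl_path, encoding="utf-8") as f:
        train_ids = {json.loads(line)[train_key] for line in f if line.strip()}
    return test_ids & train_ids
```

If this returns a non-empty set (e.g. containing ID_1001154804658286592), that would confirm the leakage.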