In the training dataset, the first example has the hashtag #EllicottCity, which matches the extracted location. However, this is not the case with the second example, where #EllicottCity is not among the extracted locations.
- ID_1001172243605610496: "National Guardsman swept away by flash floods in Maryland after trying to rescue others: ▶ #EllicottCity" → Extracted location: EllicottCity, Maryland
- ID_1001172851687378944: "News conference in #EllicottCity, Maryland, as public officials give an update on the flash floods and search and rescue efforts both last night and today" → Extracted location: Maryland
Should we ignore locations containing hashtags?
Also, why do we have rows like this?
- ID_1001172460446867456,,Maryland
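To get a sense of how widespread these two cases are (empty text fields, and locations that only appear in the tweet as hashtags), a quick script like the following can count them. The rows below just mimic the three examples from this thread; in practice you would read them from the train CSV, and the column layout (id, text, location) is an assumption on my part:

```python
# Hypothetical rows mimicking the examples above; in practice read the train CSV.
rows = [
    ("ID_1001172243605610496",
     "National Guardsman swept away by flash floods in Maryland after trying to rescue others: #EllicottCity",
     "EllicottCity, Maryland"),
    ("ID_1001172851687378944",
     "News conference in #EllicottCity, Maryland, as public officials give an update on the flash floods",
     "Maryland"),
    ("ID_1001172460446867456", "", "Maryland"),
]

empty_text = 0     # rows like ID_1001172460446867456,,Maryland (no tweet text)
hashtag_only = 0   # a labelled location that occurs in the text only as a hashtag

for tweet_id, text, location in rows:
    if not text.strip():
        empty_text += 1
        continue
    for loc in (p.strip() for p in location.split(",")):
        # Location counts as "hashtag only" if "#loc" is in the text
        # and loc never appears outside that hashtag.
        if loc and "#" + loc in text and loc not in text.replace("#" + loc, ""):
            hashtag_only += 1
            break

print(empty_text, hashtag_only)  # → 1 1
```

Running this over the full train set would tell us whether these are a handful of edge cases or a systematic labelling pattern, which seems worth knowing before deciding whether to ignore hashtag locations.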
Yes, there are too many inconsistencies in the labels. Unfortunately I didn't collect them all in one place. That's why I think the top performers will be decided by a lottery. @Amy_Bray
If the test dataset is clean of these inconsistencies, I don't think this will be decided by a lottery. It's okay for the train set to be dirty, and that's why there is a cleaning phase, but if the test set is dirty then there is nothing we competitors can do.
Hopefully the test set is as Amy said in a previous discussion: "the same order of appearance and the same casing".
Thank you @Koleshjr. Can you point to the discussion where Amy said that the test set has no order inconsistencies? I'm not so sure myself. The question is whether the same people labelled both the train and test datasets; if so, the same inconsistencies will have spread to the test set. Even then we can't do anything about it, but it would be frustrating if a lottery squandered the efforts of otherwise valuable solutions. IF THAT IS THE CASE, I THINK WE SHOULD CHANGE THE METRIC. FOR INSTANCE, ON KAGGLE, WHEN PARTICIPANTS POINT OUT GENUINE INCONSISTENCIES, THE METRIC IS OFTEN CHANGED. WHY DON'T WE USE WEIGHTED RECALL (my proposal)? @Amy_Bray @Zindi.
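To make the proposal concrete, here is one possible reading of "weighted recall" for this task: score each submission by the fraction of the labelled locations it recovers, matching case-insensitively and ignoring order, so that order/casing noise in the labels cannot decide the leaderboard. This is only a sketch of the idea, not Zindi's official metric:

```python
def weighted_recall(true_locs, pred_locs):
    """Fraction of the true locations recovered, ignoring order and casing.

    A sketch of the proposed metric, not the competition's official one.
    Per-example scores would then be averaged over the test set.
    """
    true_set = {t.strip().lower() for t in true_locs if t.strip()}
    if not true_set:
        return 1.0  # nothing to recover, full credit
    pred_set = {p.strip().lower() for p in pred_locs if p.strip()}
    return len(true_set & pred_set) / len(true_set)

# Order and casing differ from the label, but the content matches:
print(weighted_recall(["EllicottCity", "Maryland"], ["maryland", "ellicottcity"]))  # → 1.0
# Partial credit when only one of two locations is found:
print(weighted_recall(["EllicottCity", "Maryland"], ["Maryland"]))  # → 0.5
```

Under a metric like this, the hashtag and ordering inconsistencies discussed above would cost far less than under exact string matching.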
I'm also asking the same question. Does the test set have the same inconsistencies? If it does, then the Zindi team has to do something about it, but if it doesn't, then I honestly don't think it's a problem.