Greetings all,
This post is to help elaborate on some of the issues raised in the discussions regarding the data for this challenge. Firstly, please note that the objective of this challenge is to build a model that scans each text sample, identifies all locations mentioned in it, and generates each mention exactly as it appeared in the sample, including its capitalization.
This challenge is therefore perhaps best approached with a generative AI model; however, you are welcome to use any technique you believe to be most effective. As for the labelling technique used, the data was prepared using the JSON files in the repo, some of which list multiple locations under "location_mentions".
As such, for any unique entry with more than one location mentioned in the sample text, such as tweet_id ID_1001172243605610496, the individual locations were aggregated and separated by a single space character. This, along with the capitalization of the mentioned locations, is why WER is used as the metric for this competition: your model must not only generate the right locations as identified by the human annotators, but also be case-sensitive and generate the mentions as they appeared in the text.
Hope this helps clarify the competition, and best of luck as you work through the challenge!
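For concreteness, here is a minimal sketch of standard word-level WER (edit distance over word sequences, case-sensitive). This is only an illustration of how the metric behaves; the competition's actual scoring script may differ in details.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words. Comparison is case-sensitive."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Ecuador Africa SouthAfrica", "Ecuador Africa SouthAfrica"))  # 0.0
print(wer("Ecuador Africa SouthAfrica", "ecuador Africa SouthAfrica"))  # ~0.33, casing counts
```

Note that a single lowercased word already costs one substitution, which is why capitalization matters under this metric.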
@Amy_Bray Thank you so much for the explanation. However, I would like to comment on the order of location extraction. Using these two tweets from the training set as examples, the extracted locations follow neither the order of their appearance in the text nor alphabetical order.
Even if a participant extracts the locations correctly, the incorrect word order will negatively affect the Word Error Rate (WER) score. This issue is not limited to these two examples; it appears in almost every target with more than one location.
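To see the scale of the problem, a quick position-by-position check on the first example below (a rough proxy only; real WER uses edit distance, so the exact penalty differs):

```python
ref = "Kashmir Mirpur Jatla AJK".split()   # order in the provided CSV label
hyp = "Kashmir AJK Mirpur Jatla".split()   # order of appearance in the text
# The set of locations is identical, but with a word-level metric every
# misplaced word is penalised:
mismatches = sum(r != h for r, h in zip(ref, hyp))
print(mismatches)  # 3 of the 4 positions differ
```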
--------------------------------------------------------------------------------
Tweet ID: ID_1176494137648013312
Text: Prime Minister @ ImranKhanPTI directs all the concerned departments to provide immediate assistance of all kinds to carry out relief activities in quake-hit areas. #APPNews #earthquake #Kashmir #AJK #Mirpur #Jatla
Locations: Kashmir Mirpur Jatla AJK
--------------------------------------------------------------------------------
Tweet ID: ID_721751771870326784
Text: Earthquake kills 233 in Ecuador, devastates coast zone #Africa #SouthAfrica
Locations: Ecuador SouthAfrica Africa
--------------------------------------------------------------------------------
Could you please verify the order of location mentions in the train and test sets?
To add on top of this, @Amy_Bray: when we build a CSV from the JSON data, we do not have this ordering issue. The main concern here is the CSV you provided us; it was not created correctly, and we fear that the same mistakes have been carried over to the test set we are being evaluated on. To put it into context, for the same texts quoted above by @salim-benhamadi, when we build our data from the JSON on GitHub we get this:
Text: Prime Minister @ImranKhanPTI directs all the concerned departments to provide immediate assistance of all kinds to carry out relief activities in quake-hit areas. #APPNews #earthquake #Kashmir #AJK #Mirpur #Jatla
Locations: Kashmir AJK Mirpur Jatla
Text: Earthquake kills 233 in Ecuador, devastates coast zone #Africa #SouthAfrica
Locations: Ecuador Africa SouthAfrica
Which is the correct ordering. So please look into this, @Amy_Bray. The CSVs provided are corrupted and do not follow the same order as the JSONs on GitHub, and based on my evaluations this has been carried over to the test set as well.
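For reference, this is roughly how we rebuild the label from a JSON record so that mentions keep their order of appearance. The field names here ("location_mentions", "text", "start_offset") and the offsets are assumptions for illustration; check the actual schema in the repo.

```python
import json

# Hypothetical record shaped like the repo's JSON files (field names assumed).
raw = '''{
  "tweet_id": "ID_1176494137648013312",
  "location_mentions": [
    {"text": "Kashmir", "start_offset": 150},
    {"text": "AJK", "start_offset": 159},
    {"text": "Mirpur", "start_offset": 164},
    {"text": "Jatla", "start_offset": 172}
  ]
}'''
record = json.loads(raw)

# Sorting by character offset keeps the label in order of appearance;
# the mentions are then joined by single spaces, as described above.
mentions = sorted(record["location_mentions"], key=lambda m: m["start_offset"])
label = " ".join(m["text"] for m in mentions)
print(label)  # Kashmir AJK Mirpur Jatla
```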
The effect of this:
I have just used the golden JSON data on GitHub with the same strategy that got me 0.16 on the LB with the CSV data, and I got 0.28. So this is a valid point. Most probably the test dataset labelling followed the format of the train CSV provided, not the JSON data on GitHub.
So should we use the JSON data instead of the CSV?
Thank you for the clarifications @Amy_Bray
@Shamso It doesn't clarify anything; the problem still persists. The ideal solution would be to replace the test and train labels with the BILOU or JSON labels.
I think the labels in the CSV file follow some sort of location hierarchy, i.e. city, county, state, country, and if more than one country appears, it kind of rearranges those as well, to help identify locations easily. The issue raised by @Salim-benhamadi and @koleshjr is that if the objective is to get the locations as they appear in the text, without any hierarchical considerations, then the JSON labels will work perfectly with the given metric, depending on the model used. So I kinda see your point. Cheers!
I had not thought about this concept of hierarchy. It puts the problem in a whole new light.
Exactly, that's what I got from it
@Nayal_17 Can you confirm if these locations have been ordered hierarchically?
Kashmir Mirpur Jatla AJK
This one as well: Rajiv Gandhi Stadium Kadavanthra
@Koleshjr Nope, it's not
What would the correct hierarchy be?
Hello everyone, I haven't received the credentials to access the virtual environment. Could you please help me? Kind regards, my email is: papaseydou.wane@unchk.edu.sn
Thank you for the clarifications @Amy_Bray, but I see no difference in the WER score with different capitalization. Please confirm that capitalization is relevant. Thank you!
Thank you everyone for highlighting this matter. Our team has rectified the issue and the locations are now ordered alphabetically in the reference files. We will commence rescoring the leaderboard and we will post on this thread once rescoring is completed.
For example, if the text was "South Africa is larger than South Wales but neither is bigger than South East Asia." You would need to return "South Africa South East Asia South Wales".
An updated train file has been uploaded with the locations in alphabetical order.
The impacted samples only constituted around 15% of the dataset and some of the updated samples include ID_1001208625895899136, ID_914820745024430080, and ID_1001218367443828736 in the train set.
Please find the updated datasets in the data section and do let us know if there are any more concerns.
Best!
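Given the updated rule, building a target seems to come down to a plain lexicographic sort of the mention strings, with multi-word locations kept intact. A minimal sketch (whether the sort in the reference files is case-sensitive is an assumption here):

```python
locations = ["South Wales", "South Africa", "South East Asia"]
# Multi-word locations stay intact; sorted() uses Python's default
# lexicographic, case-sensitive string ordering.
target = " ".join(sorted(locations))
print(target)  # South Africa South East Asia South Wales
```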
So the first priority is given to locations with a larger number of words, followed by alphabetical order as the second priority?
You're onto something, but I don't think it's merely about the number of words; more like geographical area. See how South East Asia, which comprises many countries, is larger than South Africa, which is a single country, while South Wales is part of a country.
@yassin104 Good point. And @Amy_Bray, what about locations that occur twice in a text: should we include both occurrences, or only unique locations? And if a location occurs twice with different casing, which one do we pick?
I think that's a mistake. It should be
South Africa South East Asia South Wales
if I am not wrong, given the updated train set.
And to answer your question, Nayal: I think all locations should be returned even if they are repeated. I am saying this because the updated train set has repeated locations. You should probably check it out.
@Koleshjr Thanks for the info, I will check it real quick.
Wait, so alphabetic?
Yes. I'll have to redownload the csv again
I think it's safe to not have duplicates @Nayal_17
@onyinye I explained the reason for keeping the duplicates (to keep the WER metric valid) in my previous discussion post. Kindly give it a look. And even in the new train CSV, some duplicates are kept and some are not. This whole competition is about to end but the data problem still exists; it's very disappointing.
The leaderboard has been rescored with the new reference files.
My apologies for my previous typo, here is the correct alphabetical order:
For example, if the text was "South Africa is larger than South Wales but neither is bigger than South East Asia." You would need to return "South Africa South East Asia South Wales".
@Amy_Bray In many samples, repeated locations are present in the BILOU data but not in the CSV.
Ex1:
Text: 'Aluva need urgent help. Pregnant lady in delivery condition. 8075806064 Location neerkode,near Alangad, Aluva. Pls everyone try to find how to inform authorities nd make sure authorities note dis. #KeralaFloods #KeralaFloodRelief #KeralaSOS @Democratrodrigu @Forumkeralam1'
Label: 'Alangad Aluva'
But Aluva appears twice in this text, and is labelled twice in the BILOU data but not in the latest CSV data.
Edit:
Ex2:
Text: 'NOTE: THE AMERICAN CAJUN NAVY IS NOT THE SAME AS THE LOUISIANA CAJUN NAVY!! ONLY THE LOUISIANA CAJUN NAVY IS RECOGNIZED BY FEMA AND THE WHITE HOUSE! PLEASE #DONATE SO WE CAN #HELP LA. CAJUN NAVY HELP ALL IN THE EFFECTED AREAS.'
Label: "LOUISIANA"
LOUISIANA occurs twice here too, and is also marked twice in the BILOU data.
Ex3:
Text: 'Felt earthquake in Lahore, Alhamdulillah we are safe. Shocked to see news of damage due to earthquake in Mirpur as u never know when you are going to die. #earthquake #Lahore'
Label: Lahore Mirpur
Again the same issue: Lahore occurs twice in the text and is marked twice in the BILOU data, but not in the CSV.
There are many such examples.
The data for this competition was prepared using location mentions labelled by human annotators. Duplicate mentions, if any, should be treated as one: your model need only generate every unique location mentioned in a sample text. Only the distinct locations mentioned in a text, as picked up by your model, should be grouped and ordered alphabetically.
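Putting the final rule together, target construction reduces to deduplicate-then-sort. A minimal sketch (treating duplicates as exact, case-sensitive matches, which is an assumption consistent with the thread's point that casing matters):

```python
def make_target(mentions):
    """Keep each distinct location once (exact, case-sensitive match),
    then order alphabetically and join with single spaces."""
    return " ".join(sorted(set(mentions)))

print(make_target(["Aluva", "Alangad", "Aluva"]))  # Alangad Aluva
```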