Greetings all,
This post is to help elaborate on some of the issues raised in the discussions regarding the data for this challenge. Firstly, please note that the objective of this challenge is to build a model that scans each text sample, identifies all locations mentioned in it, and generates each mention exactly as it appeared in the sample, including its capitalization.
This challenge is therefore perhaps best approached with a generative AI model; however, you are welcome to use any technique you believe to be most effective. As for the labelling technique used, the data was prepared using the JSON files in the repo, some of which list multiple locations under "location_mentions".
As such, for any unique entry with more than one location mentioned in the sample text, such as tweet_id ID_1001172243605610496, the individual locations were aggregated and separated by a single space character. This, along with the capitalization of the mentioned locations, is why WER is used as the metric for this competition: your model must not only generate the right locations as identified by the human annotators, but also be case-sensitive and generate the mentions as they appeared in the text.
Hope this helps clarify the competition, and best of luck as you work through the challenge!
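For concreteness, here is a minimal sketch of standard word-level WER (edit distance over word sequences, case-sensitive). This is only an illustration of how the metric behaves; the competition's actual scoring script may differ in details.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words. Comparison is case-sensitive."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Ecuador Africa SouthAfrica", "Ecuador Africa SouthAfrica"))  # 0.0
print(wer("Ecuador Africa SouthAfrica", "ecuador Africa SouthAfrica"))  # ~0.33, casing counts
```

Note that a single lowercased word already costs one substitution, which is why capitalization matters under this metric.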
@Amy_Bray Thank you so much for the explanation. However, I would like to comment on the order of location extraction. Using these two tweets from the training set as examples, the extracted locations follow neither the order of their appearance in the text nor alphabetical order.
Even if a participant extracts the locations correctly, the incorrect word order will negatively affect the Word Error Rate (WER) score. This issue is not limited to these two examples; it appears in almost every target with more than one location.
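To see the scale of the problem, a quick position-by-position check on the first example below (a rough proxy only; real WER uses edit distance, so the exact penalty differs):

```python
ref = "Kashmir Mirpur Jatla AJK".split()   # order in the provided CSV label
hyp = "Kashmir AJK Mirpur Jatla".split()   # order of appearance in the text
# The set of locations is identical, but with a word-level metric every
# misplaced word is penalised:
mismatches = sum(r != h for r, h in zip(ref, hyp))
print(mismatches)  # 3 of the 4 positions differ
```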
--------------------------------------------------------------------------------
Tweet ID: ID_1176494137648013312
Text: Prime Minister @ ImranKhanPTI directs all the concerned departments to provide immediate assistance of all kinds to carry out relief activities in quake-hit areas. #APPNews #earthquake #Kashmir #AJK #Mirpur #Jatla
Locations: Kashmir Mirpur Jatla AJK
--------------------------------------------------------------------------------
Tweet ID: ID_721751771870326784
Text: Earthquake kills 233 in Ecuador, devastates coast zone #Africa #SouthAfrica
Locations: Ecuador SouthAfrica Africa
--------------------------------------------------------------------------------
Could you please verify the order of location mentions in the train and test sets?
To add on top of this, @Amy_Bray: when we build a CSV from the JSON data, we do not have this ordering issue. The main concern here is the CSV you provided us; it was not created correctly, and we fear that the same mistakes have been carried over to the test set we are being evaluated on. To put it into context, for the same texts quoted above by @salim-benhamadi, when we build our data from the JSON on GitHub we get this:
Text: Prime Minister @ImranKhanPTI directs all the concerned departments to provide immediate assistance of all kinds to carry out relief activities in quake-hit areas. #APPNews #earthquake #Kashmir #AJK #Mirpur #Jatla
Locations: Kashmir AJK Mirpur Jatla
Text: Earthquake kills 233 in Ecuador, devastates coast zone #Africa #SouthAfrica
Locations: Ecuador Africa SouthAfrica
Which is the correct ordering. So please look into this, @Amy_Bray. The CSVs provided are corrupted and do not follow the same order as the JSONs on GitHub, and based on my evaluations this has been carried over to the test set as well.
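For reference, this is roughly how we rebuild the label from a JSON record so that mentions keep their order of appearance. The field names here ("location_mentions", "text", "start_offset") and the offsets are assumptions for illustration; check the actual schema in the repo.

```python
import json

# Hypothetical record shaped like the repo's JSON files (field names assumed).
raw = '''{
  "tweet_id": "ID_1176494137648013312",
  "location_mentions": [
    {"text": "Kashmir", "start_offset": 150},
    {"text": "AJK", "start_offset": 159},
    {"text": "Mirpur", "start_offset": 164},
    {"text": "Jatla", "start_offset": 172}
  ]
}'''
record = json.loads(raw)

# Sorting by character offset keeps the label in order of appearance;
# the mentions are then joined by single spaces, as described above.
mentions = sorted(record["location_mentions"], key=lambda m: m["start_offset"])
label = " ".join(m["text"] for m in mentions)
print(label)  # Kashmir AJK Mirpur Jatla
```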
The effect of this:
I have just used the golden JSON data on GitHub with the same strategy that got me 0.16 on the LB with the CSV data, and I got 0.28. So this is a valid point. Most probably the test dataset labelling followed the format of the train CSV provided, not the JSON data on GitHub.
So should we use the JSON data instead of the CSV?
Thank you for the clarifications @Amy_Bray
@Shamso It doesn't clarify anything; the problem still persists. The ideal solution would be to replace the test and train labels with the BILOU or JSON labels.
I think the labels in the CSV file follow some sort of location hierarchy, i.e. city, county, state, country, and if more than one country appears, it kind of rearranges those as well, to help identify locations easily. The issue raised by @Salim-benhamadi and @koleshjr is that if the objective is to get the locations as they appear in the text, without any hierarchical considerations, then the JSON labels will work perfectly with the given metric, depending on the model used. So I kinda see your point. Cheers!
I had not thought about this concept of hierarchy. It puts the problem in a whole new light.
Exactly, that's what I got from it
@Nayal_17 Can you confirm if these locations have been ordered hierarchically?
Kashmir Mirpur Jatla AJK
This one as well: Rajiv Gandhi Stadium Kadavanthra
@Koleshjr Nope, it's not
What would the correct hierarchy be?
Hello everyone, I haven't received the credentials to access the virtual environment. Could you please help me? Kind regards, my email is: papaseydou.wane@unchk.edu.sn
Thank you for the clarifications @Amy_Bray, but I see no difference in the WER score with different capitalization. Please confirm that capitalization is relevant. Thank you!
Thank you everyone for highlighting this matter. Our team has rectified the issue and the locations are now ordered alphabetically in the reference files. We will commence rescoring the leaderboard and we will post on this thread once rescoring is completed.
For example, if the text was "South Africa is larger than South Wales but neither is bigger than South East Asia." You would need to return "South Africa South East Asia South Wales".
An updated train file has been uploaded with the locations in alphabetical order.
The impacted samples only constituted around 15% of the dataset and some of the updated samples include ID_1001208625895899136, ID_914820745024430080, and ID_1001218367443828736 in the train set.
Please find the updated datasets in the data section and do let us know if there are any more concerns.
Best!
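Given the updated rule, building a target seems to come down to a plain lexicographic sort of the mention strings, with multi-word locations kept intact. A minimal sketch (whether the sort in the reference files is case-sensitive is an assumption here):

```python
locations = ["South Wales", "South Africa", "South East Asia"]
# Multi-word locations stay intact; sorted() uses Python's default
# lexicographic, case-sensitive string ordering.
target = " ".join(sorted(locations))
print(target)  # South Africa South East Asia South Wales
```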
So the first priority is given to locations with a larger number of words, followed by alphabetical order as the second priority?
You're onto something, but I don't think it's merely about the number of words; more like geographical area. See how South East Asia, which comprises many countries, is larger than South Africa, which is a single country, while South Wales is part of a country.
@yassin104 Good point. And @Amy_Bray, what about locations that occur twice in a text: should we include both occurrences, or only unique locations? And if a location occurs twice with different casing, which one do we pick?
I think that's a mistake. It should be
South Africa South East Asia South Wales
if I am not wrong, given the updated train set.
And to answer your question, Nayal: I think all locations should be returned even if they are repeated. I am saying this because the updated train set has repeated locations. You should probably check it out.
@Koleshjr Thanks for the info, I will check it real quick.
Wait, so alphabetic?
Yes. I'll have to redownload the csv again
I think it's safe to not have duplicates @Nayal_17
@onyinye I explained the reason for keeping the duplicates (to keep the WER metric valid) in my previous discussion post. Kindly give it a look. And even in the new train CSV, some duplicates are kept and some are not. This whole competition is about to end but the data problem still exists; it's very disappointing.
The leaderboard has been rescored with the new reference files.
My apologies for my previous typo, here is the correct alphabetical order:
For example, if the text was "South Africa is larger than South Wales but neither is bigger than South East Asia." You would need to return "South Africa South East Asia South Wales".
@Amy_Bray In many samples, repeated locations are present in the BILOU data but not in the CSV.
Ex1:
Text: 'Aluva need urgent help. Pregnant lady in delivery condition. 8075806064 Location neerkode,near Alangad, Aluva. Pls everyone try to find how to inform authorities nd make sure authorities note dis. #KeralaFloods #KeralaFloodRelief #KeralaSOS @Democratrodrigu @Forumkeralam1'
Label: 'Alangad Aluva'
But Aluva appears twice in this text, and is labelled twice in the BILOU data but not in the latest CSV data.
Edit:
Ex2:
Text: 'NOTE: THE AMERICAN CAJUN NAVY IS NOT THE SAME AS THE LOUISIANA CAJUN NAVY!! ONLY THE LOUISIANA CAJUN NAVY IS RECOGNIZED BY FEMA AND THE WHITE HOUSE! PLEASE #DONATE SO WE CAN #HELP LA. CAJUN NAVY HELP ALL IN THE EFFECTED AREAS.'
Label: "LOUISIANA"
LOUISIANA occurs twice here too, and is also marked twice in the BILOU data.
Ex3:
Text: 'Felt earthquake in Lahore, Alhamdulillah we are safe. Shocked to see news of damage due to earthquake in Mirpur as u never know when you are going to die. #earthquake #Lahore'
Label: Lahore Mirpur
Again the same issue: Lahore occurs twice in the text and is marked twice in the BILOU data, but not in the CSV.
There are many such examples.
The data for this competition was prepared using location mentions labelled by human annotators. Duplicate mentions, if any, should be treated as one: your model need only generate every unique location mentioned in a sample text. Only the distinct locations mentioned in a text, as picked up by your model, should be grouped and ordered alphabetically.
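Putting the final rule together, target construction reduces to deduplicate-then-sort. A minimal sketch (treating duplicates as exact, case-sensitive matches, which is an assumption consistent with the thread's point that casing matters):

```python
def make_target(mentions):
    """Keep each distinct location once (exact, case-sensitive match),
    then order alphabetically and join with single spaces."""
    return " ".join(sorted(set(mentions)))

print(make_target(["Aluva", "Alangad", "Aluva"]))  # Alangad Aluva
```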