I have noticed that the total number of tweet rows across the test_unlabeled.json files for all events is about 4,066. If we assume up to 17 locations per tweet, that gives 4,066 × 17 = 69,122 rows. This contradicts the submission file, which has about 100,028 rows and therefore implies 100,028 / 17 = 5,884 tweets. This mismatch causes missing IDs when making the submission.
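As a quick sanity check of those counts (plain Python; the 17-slot assumption is taken from the format description below):

```python
# Sanity-checking the counts above (assuming 17 LM slots per tweet)
n_test_tweets = 4066                  # rows across all test_unlabeled.json files
lm_slots = 17                         # maximum Location Mentions per tweet
print(n_test_tweets * lm_slots)       # 69122 -- rows I would expect
print(100028 / lm_slots)              # 5884.0 -- tweets implied by the sample file
```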
I feel the competition needs clarification, especially about the location data:
A) For example, there are california, harvey, and many other folders, but the submission file only has "loc1", "loc2", etc. How can we know which location is which? Is loc1 California?
B) The test data has 4,066 rows, but the submission file has 100,028 rows?
Could the organizers clarify, please?
The sample submission file contains the predictions for all tweets in the 19 disaster events (all test data), in the order they appear on GitHub.
For every tweet, you can predict up to 17 Location Mentions (LM). For every LM, you must provide the start and end offsets.
The sample submission file contains only an example of the submission format. We will update it to contain all 138,244 rows (4,066 tweets × 17 LMs × 2 offsets) to avoid confusion.
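For anyone building the file programmatically, here is a minimal sketch of that shape. The id pattern, column names, and -1 placeholder are my assumptions for illustration, not the official spec; only the row count 4,066 × 17 × 2 = 138,244 follows from the numbers above.

```python
import csv

def write_submission_skeleton(tweet_ids, path="submission.csv", lm_slots=17):
    """Emit one row per (tweet, LM slot, offset) triple.

    For 4,066 test tweets this gives 4,066 * 17 * 2 = 138,244 rows,
    matching the updated sample submission file.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "offset"])  # assumed header names
        for tweet_id in tweet_ids:
            for k in range(1, lm_slots + 1):      # LM slots 1..17
                for which in ("start", "end"):    # two offsets per LM
                    # -1 as a placeholder for LM slots left unpredicted
                    writer.writerow([f"{tweet_id}_lm{k}_{which}", -1])
```

Once the updated sample file is released, swap in its real id pattern; the point is only that every test tweet must contribute 34 rows, so no IDs go missing.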