Dear Host,
The following important observations need to be addressed:
1. The number of available tweet_ids in the sample_submission is 2942, while the number of test tweets is 4066.
2. Only 1861 of the 2942 sample_submission tweets exist in the test set. The remaining 1081 tweets are nowhere to be found.
3. 2205 of the 4066 test tweets do not appear in the sample_submission.
I strongly recommend double-checking this.
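A check of this kind can be reproduced with plain set operations. The following is only a minimal sketch: the function name `compare_ids` and the toy IDs are made up for illustration, and in practice the two inputs would be the tweet_id values read from the sample submission and from the test file.

```python
def compare_ids(submission_ids, test_ids):
    """Summarise how two collections of tweet IDs overlap."""
    submission = set(submission_ids)
    test = set(test_ids)
    return {
        "n_submission": len(submission),
        "n_test": len(test),
        "in_both": len(submission & test),
        "only_in_submission": len(submission - test),
        "only_in_test": len(test - submission),
    }


# Toy IDs for illustration only; not taken from the competition data.
report = compare_ids(["t1", "t2", "t3", "t9"], ["t1", "t2", "t4", "t5", "t6"])
print(report)
```

By construction, `in_both + only_in_submission` equals `n_submission` and `in_both + only_in_test` equals `n_test`, so the counts in such a report can be cross-checked against each other.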
@Zindi, it has been 20 days since I made the above comment, and there has been no host response or data update addressing my query.
I really don't know why some competitions are simply dumped like this, without any adequate monitoring of competitors' queries.
As stated in the competition, there can be up to 17 LM per post. But after analysing the data, I can only find 12 LM (['Neighborhood', 'Other locations', 'State', 'County', 'Continent', 'Human-made Point-of-Interest', 'Island', 'Natural Point-of-Interest', 'Road/street', 'City/town', 'District', 'Country']) in the training data.
@HungryLearner @Zindi Can you please let me know if I am missing something here?
@Milind, the 17 LM does not refer to the LM types but rather the number of possibilities per sentence/statement.
During my EDA, I found that there is a particular training ID where there are exactly 17 LM annotations.
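For anyone who wants to repeat this kind of check, a rough sketch follows. It assumes the training annotations have been flattened into (tweet_id, lm_type) pairs, one per annotated LM; that layout, the variable names, and the toy rows are all hypothetical, and the real training file may be organised differently.

```python
from collections import Counter

# Hypothetical flattened annotations: one (tweet_id, lm_type) pair per LM.
# Toy rows for illustration only; not taken from the competition data.
annotations = [
    ("id_a", "City/town"),
    ("id_a", "City/town"),
    ("id_a", "Country"),
    ("id_b", "State"),
]

# Distinct LM types seen anywhere in the data.
lm_types = {lm_type for _, lm_type in annotations}

# Number of LM annotations attached to each tweet ID.
per_tweet = Counter(tweet_id for tweet_id, _ in annotations)
max_tweet, max_count = per_tweet.most_common(1)[0]

print(len(lm_types))         # number of distinct LM types
print(max_tweet, max_count)  # tweet ID with the most LM annotations
```

The two numbers answer different questions: the first counts LM types across the whole dataset, the second counts LM annotations within a single tweet.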
Hope that clarifies your query.
Oh okay, got it. Thanks for the clarification!
@HungryLearner, I didn't understand what you meant by 'the 17 LM does not refer to the LM types but rather the number of possibilities per sentence/statement.'
The number of LM types as mentioned by @Milind is 12.
However, the maximum number of LM to be predicted is 17 per tweet. In fact, it is mentioned that the remaining slots should be filled with zeros if our model finds fewer than 17 LM in a tweet.
That being said, the 17 LM mentioned is not the number of LM types. It is actually the maximum number of LM to be expected in a single tweet.
The need for 17 can, however, be attributed to a particular tweet ID in the training set that has 17 LM annotations. These 17 LMs are just the names of different cities or so. But since a list of cities is not a single location, it was annotated as a list of LMs with class type "City".
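The zero-filling rule described above can be sketched as follows. This is only an illustration under assumptions: the helper name `pad_predictions`, the toy predictions, and the idea of one flat list of 17 slots per tweet are hypothetical, and the actual submission layout should be taken from the competition page.

```python
MAX_LM = 17  # maximum number of LM per tweet, as stated in this thread


def pad_predictions(predicted, max_lm=MAX_LM, fill=0):
    """Return exactly max_lm slots: the predictions first, then zero fill."""
    if len(predicted) > max_lm:
        raise ValueError(f"got {len(predicted)} LM, expected at most {max_lm}")
    return list(predicted) + [fill] * (max_lm - len(predicted))


# Toy predictions for illustration only.
row = pad_predictions(["Paris", "Lyon"])
print(len(row), row[:3])
```

A tweet with no predicted LM would come out as 17 zeros, and a tweet with exactly 17 predictions would be returned unchanged.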
I know this point, but in the training phase, we should define the 17 LM. I found those 12 locations. If we don't have the location for this tweet, we assign the value of 0. You said that the 17 LM is the number of possibilities per sentence/statement.