I think it is common to assume that the test dataset will be harder than the data our model is trained on. But what if it is the opposite?
If we look at the cities in train, we have:
In the test dataset, we also have four cities, but only two are capitals. Accra seems to fit the description of the capitals in the training dataset, but Yaoundé is different. Not only is it not the biggest city in Cameroon, but most of Yaoundé's economy is also centered on the administrative structure of the civil service and the diplomatic services, which makes it quite different from the previous cities.
The irony is that Accra and Yaoundé have so little data in the test set that even if the PM2.5 emissions of these two cities are close to the capitals in the training set, it does not matter much.
The real problem seems to be the cities for which we have considerable data in the test set. Kisumu (the 3rd largest city in Kenya) and Gulu (which seems to be a small city in Uganda) are both very different from the capitals we have seen so far.
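To put numbers on "considerable data", one quick check is the share of test rows per city. This is only a minimal sketch: the column name `city` and the helper `city_share` are assumptions, and the row counts in the toy frame are invented purely for illustration, not taken from the competition files.

```python
import pandas as pd


def city_share(df: pd.DataFrame, city_col: str = "city") -> pd.Series:
    """Return the fraction of rows belonging to each city, largest first."""
    return df[city_col].value_counts(normalize=True)


# Toy stand-in for the test set; the counts are made up for illustration.
test = pd.DataFrame(
    {"city": ["Kisumu"] * 5 + ["Gulu"] * 3 + ["Accra"] + ["Yaoundé"]}
)

shares = city_share(test)
print(shares)
```

Running the same function on train and test side by side shows which cities dominate each split.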
Any thoughts on this?
I think test is easier than train, in the sense that there are fewer outliers in test than in train. Most outliers in the train set come from two sites in Lagos. Another thing that hints at this is that my CV is always higher than LB.
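One rough way to check where outliers concentrate is to flag values outside the usual 1.5 × IQR fences and look at the flagged share per site. This is a sketch under assumptions: the column names `site` and `pm2_5`, the helper name, and the toy values are all hypothetical, and the IQR rule is just one possible definition of an outlier.

```python
import pandas as pd


def outlier_share_by_group(
    df: pd.DataFrame, value_col: str, group_col: str, k: float = 1.5
) -> pd.Series:
    """Fraction of rows per group lying outside the global IQR fences."""
    q1 = df[value_col].quantile(0.25)
    q3 = df[value_col].quantile(0.75)
    iqr = q3 - q1
    # Flag values below Q1 - k*IQR or above Q3 + k*IQR.
    is_outlier = (df[value_col] < q1 - k * iqr) | (df[value_col] > q3 + k * iqr)
    # The mean of a boolean mask per group is the share of flagged rows.
    return is_outlier.groupby(df[group_col]).mean().sort_values(ascending=False)


# Toy data (invented): site "b" contains two extreme readings.
train = pd.DataFrame(
    {
        "site": ["a"] * 4 + ["b"] * 4,
        "pm2_5": [10, 11, 12, 13, 10, 12, 200, 300],
    }
)

result = outlier_share_by_group(train, "pm2_5", "site")
print(result)
```

Applying the same function to a feature column that exists in both train and test would give a comparable outlier share for each split.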
My CV is also higher than LB.