Hello fellow Zindians,
I would like to know how you all are dealing with these two features. I have a feeling that it is not correct to treat them as numerical features because GBDTs split one feature at a time. This univariate splitting can miss the complex interaction between latitude and longitude that represents true geographic proximity. However, at the same time, I have not yet found a strong reason not to use them. So far, I have combined the two into one categorical feature.
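For anyone wondering what "combining the two into one categorical feature" could look like, here is a minimal sketch that snaps each point to a square grid cell and uses the cell ID as the category. This is only one possible approach, not necessarily what was done above: the function name `grid_cell`, the 0.5 degree cell size, and the sample coordinates are all made up for illustration.

```python
import math

def grid_cell(lat, lon, cell_deg=0.5):
    """Return a string ID for the grid cell containing (lat, lon).

    cell_deg is the cell size in degrees (an arbitrary choice here);
    larger cells group more distant points into the same category.
    """
    row = math.floor(lat / cell_deg)
    col = math.floor(lon / cell_deg)
    return f"{row}_{col}"

# Made-up coordinates, for illustration only.
points = [(6.52, 3.37), (6.74, 3.26), (-1.29, 36.82)]
cells = [grid_cell(lat, lon) for lat, lon in points]
# The first two points fall in the same cell, the third in a different one.
print(cells)
```

Keep in mind that a cell ID is only useful if the same cells show up at prediction time. If the test set covers different areas, every test row gets an unseen category.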
I have tried countless transformations and new features, but none have convinced me.
I think the coordinates cause the model to memorize city-specific patterns, and those patterns then fail on the test set because it contains different cities.
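One way to check this locally is to validate on held-out cities, so that no city appears in both the training and validation parts of a fold. Below is a rough sketch in plain Python, assuming each row has a city label. The function name `group_kfold_indices` and the `cities` list are hypothetical, and scikit-learn's `GroupKFold` does the same job if you already use that library.

```python
from collections import defaultdict

def group_kfold_indices(groups, n_splits=3):
    """Yield (train_idx, val_idx) pairs where no group is on both sides."""
    # Collect the row indices belonging to each group (e.g. each city).
    rows_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        rows_by_group[g].append(i)
    if n_splits > len(rows_by_group):
        raise ValueError("n_splits cannot exceed the number of groups")

    # Assign groups to folds, largest group first, always into the
    # fold that currently holds the fewest rows.
    fold_rows = [[] for _ in range(n_splits)]
    ordered = sorted(rows_by_group, key=lambda g: (-len(rows_by_group[g]), str(g)))
    for g in ordered:
        smallest = min(range(n_splits), key=lambda k: len(fold_rows[k]))
        fold_rows[smallest].extend(rows_by_group[g])

    # Each fold's rows form the validation set; the rest is training.
    for k in range(n_splits):
        val_idx = sorted(fold_rows[k])
        val_set = set(val_idx)
        train_idx = [i for i in range(len(groups)) if i not in val_set]
        yield train_idx, val_idx

# Made-up city labels, one per row, for illustration only.
cities = ["Lagos", "Lagos", "Nairobi", "Nairobi", "Kampala", "Kampala", "Lagos"]
folds = list(group_kfold_indices(cities, n_splits=3))
for train_idx, val_idx in folds:
    print("validate on:", sorted({cities[i] for i in val_idx}))
```

If the score with coordinate features drops sharply under this kind of split compared with a random split, that would support the memorization explanation.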
Good answer. I think the same applies to other features as well.
Given that the model will be applied to other locations at inference time, it generally doesn't make sense to train with any location-based features, even though the way the data was curated seems to encourage it.
On the other hand, there isn't enough pollutant data to train a useful model that predicts pm2_5 concentrations from pollutant features alone, so training with latitude- and longitude-based features is what yields better scores for me.
Funny to see that other participants are also experiencing this dilemma. Super interesting competition so far!
Yes, I agree with your point. I also think we need other countries in the train set.