I have a quick question regarding the usability of the model we're trying to build. The task is to train and predict on the Census 2011 data; however, the model would presumably be used to manage policies between census events.
Do you know whether the test dataset for the private leaderboard contains data from a later census, and should we be aware of time-related leakage in the geo data features?
The test set for the private leaderboard uses the same 2011 census data. In fact, as far as I can see, the train/test split is based on provinces, with 7 provinces in train and the other 2 in test.
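If the split really is by province, local validation could mimic it by holding out whole provinces rather than random rows. Below is a minimal sketch of that idea in plain Python. The function name `leave_one_group_out` and the province labels are hypothetical, made up for illustration; they are not taken from the competition data.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, valid_idx), holding out one whole group per fold."""
    # Collect the row indices that belong to each group (e.g. each province).
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)

    for held_out in sorted(by_group):
        valid_idx = by_group[held_out]
        # Training rows are all rows from every other group.
        train_idx = sorted(
            i for g, idx in by_group.items() if g != held_out for i in idx
        )
        yield held_out, train_idx, valid_idx


# Hypothetical province label for each training row (assumption, not real data).
provinces = ["A", "A", "B", "C", "B", "C", "A"]

folds = list(leave_one_group_out(provinces))
for held_out, train_idx, valid_idx in folds:
    print(held_out, train_idx, valid_idx)
```

Each fold then scores the model on a province it never saw during training, which is closer to the leaderboard setup than a random row split. scikit-learn's `LeaveOneGroupOut` and `GroupKFold` provide the same kind of grouped splitting.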
Yep, I also see that. I just question it from a practical point of view. What's the point of predicting on a geographic split if the goal mentioned in the description is to predict between census events? It seems to be purely an ML skills exercise. It is highly unlikely that a census would be run in one part of the country but not the other.
Absolutely agree. A model in this arrangement is not useful unless the modelling process yields some radical new insights. It is, however, a very interesting dataset, and easy to overfit, methinks. I also feel it sheds light on an important issue.