Womxn in Big Data South Africa: Female-Headed Households in South Africa

Helping South Africa
$5 000 USD
Challenge completed over 5 years ago
Prediction
1161 joined
204 active
Start: Nov 25, 2019
Close: Feb 23, 2020
Reveal: Feb 24, 2020
Model implementation
Help · 11 Feb 2020, 09:27 · 3

Hi all,

I have a quick question about the usability of the model we're trying to build. The task is to train and predict on the Census 2011 data. However, the model would presumably be used to guide policy in the years between census events.

Do you know whether the test dataset for the private leaderboard contains data from a later census? And should we be aware of time-related leakage in the geo data features?

Many thanks

Discussion 3 answers

Hi davarix,

The test set for the private leaderboard comes from the same 2011 census data. In fact, as far as I can see, the train/test split is based on provinces, with 7 provinces in train and the other 2 in test.
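If the split really is by province, local validation would probably be more honest if it held out whole provinces too, rather than sampling rows at random. Below is a minimal sketch of a leave-one-province-out split in plain Python. The `provinces` list and the `leave_one_group_out` helper are made up for illustration; they are not part of the competition data or any starter code.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, valid_indices) per distinct group."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    for held_out in sorted(by_group):
        valid_idx = by_group[held_out]
        # Every row whose group differs from the held-out one goes to training.
        train_idx = sorted(
            i for g, idxs in by_group.items() if g != held_out for i in idxs
        )
        yield held_out, train_idx, valid_idx


# Hypothetical province label for each training row (placeholder values).
provinces = ["A", "A", "B", "C", "B", "C", "A"]

folds = list(leave_one_group_out(provinces))
for held_out, train_idx, valid_idx in folds:
    print(held_out, train_idx, valid_idx)
```

Each fold validates on a province the model never saw during training, which mimics the train/test arrangement described above. scikit-learn's `LeaveOneGroupOut` and `GroupKFold` provide the same idea if you would rather not roll your own.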

11 Feb 2020, 10:00

Yep, I see that too. I'm just questioning it from a practical point of view. What's the point of predicting on a geographic split if the goal stated in the description is to predict between census events? It seems to be purely an ML skills exercise. It is highly unlikely that a census would be run in one part of the country but not the other.

Absolutely agree. A model in this arrangement is not useful unless the modelling process yields some radical new insights. It is, however, a very interesting dataset, and easy to overfit, methinks. I also feel it sheds light on an important issue.