South African COVID-19 Vulnerability Map by #ZindiWeekendz
$300 USD
Can we infer important COVID-19 public health risk factors from outdated data?
341 data scientists enrolled, 179 on the leaderboard
3 April—5 April
60 hours
Adversarial Validation
published 5 Apr 2020, 16:40

I'm guessing I'm not the only one who found that the train and test sets are very separable.
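For anyone who hasn't tried it, here's roughly what checking that separability looks like (a minimal sketch on synthetic data, not the competition data): label train rows 0 and test rows 1, then see how well a classifier can tell them apart. An AUC near 0.5 means the sets look alike; an AUC near 1.0 means they're very separable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 5))  # stand-in for the real train set
X_test = rng.normal(0.5, 1.0, size=(500, 5))   # shifted, so the sets are separable

# Adversarial labels: 0 = train, 1 = test
X_adv = np.vstack([X_train, X_test])
y_adv = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X_adv, y_adv, cv=5, scoring="roc_auc").mean()
print(f"adversarial AUC: {auc:.3f}")
```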

So I wanted to hear what you all think could be a good way to use that information?

So far, I've tried weighting the samples in the training set by the probability of a sample being in the training set (LGBM, XGBoost, etc. have functionality for this), but it doesn't seem to make much of an impact.
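In case it helps the discussion, this is the kind of importance weighting I mean (a hedged sketch on synthetic data; the models here are sklearn stand-ins, not the LGBM/XGBoost setup I actually used): score each training row with an adversarial classifier and pass the resulting probabilities as per-sample weights, so test-like rows count more during fitting.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(400, 5))
y_train = X_train[:, 0] + rng.normal(0.0, 0.1, size=400)
X_test = rng.normal(0.5, 1.0, size=(200, 5))

# Adversarial classifier: 0 = train, 1 = test
X_adv = np.vstack([X_train, X_test])
y_adv = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
adv = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-fold P(row is test-like) for each training row
p_test = cross_val_predict(adv, X_adv, y_adv, cv=5,
                           method="predict_proba")[: len(X_train), 1]

# Use those probabilities as per-sample weights in the downstream model
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train, sample_weight=p_test)
print(f"mean weight: {p_test.mean():.3f}")
```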

I've also tried selecting the train samples with the lowest probability of belonging to the train set and using those for validation, but that also doesn't help with generalizing to the leaderboard.
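Concretely, that validation trick looks something like this (again a sketch on synthetic data; the 20% holdout fraction is an arbitrary illustrative choice): hold out the training rows the adversarial classifier scores as most test-like, on the theory that they best mimic the leaderboard distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(400, 5))
X_test = rng.normal(0.5, 1.0, size=(200, 5))

# Adversarial classifier: 0 = train, 1 = test
X_adv = np.vstack([X_train, X_test])
y_adv = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
adv = RandomForestClassifier(n_estimators=100, random_state=0)
p_test = cross_val_predict(adv, X_adv, y_adv, cv=5,
                           method="predict_proba")[: len(X_train), 1]

# Hold out the 20% of training rows that look most like test rows
n_val = len(X_train) // 5
order = np.argsort(p_test)
val_idx = order[-n_val:]   # highest P(test): validation fold
trn_idx = order[:-n_val]   # the rest: training fold
print(len(trn_idx), len(val_idx))
```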

So... I'm stuck >_< - I can't think of a way to use this information other than those.

Any ideas?

Would be great to revisit this after the close if you guys are reluctant to share your tactics.

Count me in for revisiting

Tactics and code will surely be open sourced on this one.

Like I said, I have researched the most common tactics and found they did not work as expected on this dataset. Hence the open "discussion".

Adversarial validation didn't work for me either. I then tried two approaches, neither of which worked as I expected. First, for each single model I did a bit of hyperparameter tuning and applied 5-fold cross-validation, training the first-layer models on the whole set of features. An analysis of pairwise correlations between the out-of-fold predictions generated by the first-layer models showed that LGBM added some diversity, being the least correlated with the other models. The final layer consisted of LGBM and XGBoost, which took the predictions of the first-level models as their input features. After tuning their parameters, I took their weighted average as the final predictor. This gives an LB score of 3.98.
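The two-layer stack above can be sketched roughly like this (synthetic data, and sklearn models standing in for the LGBM/XGBoost ones I actually used): layer 1 produces out-of-fold predictions, whose pairwise correlation hints at model diversity, and layer 2 fits on those predictions as features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

# Layer 1: out-of-fold predictions from two base models (5 folds)
base_models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
]
oof = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

# Pairwise correlation of OOF predictions: lower correlation suggests
# the base models contribute more diverse information to the stack
corr = np.corrcoef(oof.T)[0, 1]

# Layer 2: a meta-model trained on the OOF predictions as features
meta = Ridge()
meta.fit(oof, y)
print(f"base-model OOF correlation: {corr:.3f}")
```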

Second approach: trained a single LGBM, AdaBoost and CatBoost, then blended their scores, which gives an LB of 3.93.
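The blend itself is just an average of the models' predictions; a minimal sketch (synthetic data, sklearn placeholders for my actual LGBM/AdaBoost/CatBoost models):

```python
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    GradientBoostingRegressor(random_state=0),
    AdaBoostRegressor(random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
]
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in models])

# Equal-weight blend; in practice the weights can be tuned on a holdout set
blend = preds.mean(axis=1)
print(blend.shape)
```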

Will be happy to hear what the rest of you are doing.

A single CatBoost model after manually tuning some parameters gave my best score so far, 3.87. For me it's quite difficult to improve the LB score. Very anxious to hear from the top guys when the hackathon is over.

Will the winning solutions be posted online this time?

Hey, I think most of the top solutions have been shared in the "My Solution" discussion thread and the top 3 were shared in their own separate threads.