Yes, almost no rows in training data but a lot in test data. The same goes for some places and locations.
In other words, for some rows we are supposed to make predictions for an unknown disease and an unknown place! I'm puzzled about the train/test split.
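For anyone who wants to check this on their own copy of the data, here is a minimal sketch. It assumes the rows are loaded as dictionaries (for example via `csv.DictReader`); the column names `disease` and `location`, the helper `rare_in_train`, and the toy rows are all made up for illustration and are not the competition's schema or data.

```python
from collections import Counter

def rare_in_train(train_rows, test_rows, column, max_train_count=0):
    """Return {value: test_row_count} for values of `column` that appear in
    test but occur at most `max_train_count` times in train."""
    train_counts = Counter(row[column] for row in train_rows)
    test_counts = Counter(row[column] for row in test_rows)
    return {
        value: n
        for value, n in test_counts.items()
        if train_counts[value] <= max_train_count  # missing keys count as 0
    }

# Toy rows for illustration only (hypothetical, not the real dataset).
train = [
    {"disease": "Malaria", "location": "A"},
    {"disease": "Typhoid", "location": "A"},
    {"disease": "Malaria", "location": "B"},
]
test = [
    {"disease": "Cholera", "location": "A"},
    {"disease": "Cholera", "location": "C"},
    {"disease": "Malaria", "location": "B"},
]

unseen_disease = rare_in_train(train, test, "disease")
unseen_location = rare_in_train(train, test, "location")
print(unseen_disease)   # {'Cholera': 2}
print(unseen_location)  # {'C': 1}
```

Raising `max_train_count` to a small number would also flag categories that have "almost no rows" in train rather than none at all.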
Thank you for highlighting this. I think either there was an outbreak of cholera, or the reporting regulations changed and every facility was required to include cholera in its monthly reports.
Yeah, that's very confusing. How are we going to train a model to predict cholera if we can't train it on cholera instances?
I think the test set gives you hints for handling the cholera class.
Thanks for the insight
You can get a hint from the test set. The tricky part is that the public leaderboard does not seem to consider the cholera rows.
You mean to say the public leaderboard is not evaluating the correctness of the cholera instances?
Yes
I wouldn't worry too much about cholera, as no model will be able to predict out-of-distribution values very well.
Yeah, that's true, the dataset is heavily skewed.