Hi there. Since the data is imbalanced, the metric used should not be accuracy or error rate, but F1-macro or ROC-AUC. That way we would be competing to build better models.
Zindi, please look into this. ROC-AUC would be a better evaluation metric for judging how good our models are.
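To illustrate the point about accuracy on imbalanced data, here is a minimal sketch in plain Python. The 90/10 class split and the "always predict 0" model are made-up assumptions for illustration, not the competition data. It shows how a model that never finds a single '1' can still look good on accuracy while macro F1 exposes it.

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    # F1 for one class treated as the "positive" class.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    # Unweighted mean of the per-class F1 scores.
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical imbalanced dataset: 90 zeros, 10 ones.
y_true = [0] * 90 + [1] * 10
# Hypothetical model that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {mf1:.3f}")
```

On this toy data the accuracy is 0.9 while macro F1 is below 0.5, because the F1 for class '1' is zero.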
True, ROC-AUC would be a better choice if this were a model that had to be put to real-life use, but isn't it a bit late for a metric change?
We can balance the data, no?
Balancing the data by oversampling/undersampling, or just weighting the classes properly, will give a worse score on the error rate metric. If we were being scored on F1, balancing would be a good idea.
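A rough sketch of that trade-off, in plain Python. The labels and scores below are hand-picked toy numbers, not competition data, and lowering the decision threshold is used here only as a stand-in for the effect that oversampling or class weights tend to have (more points predicted as '1').

```python
def evaluate(y_true, scores, threshold):
    # Turn scores into hard predictions, then compute error rate and F1 for class 1.
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    error_rate = (fp + fn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return error_rate, f1


# Hypothetical scores for 16 zeros and 4 ones; the classes overlap on purpose.
zero_scores = [0.05, 0.05, 0.1, 0.1, 0.1, 0.15, 0.15, 0.2,
               0.2, 0.2, 0.25, 0.25, 0.3, 0.35, 0.4, 0.45]
one_scores = [0.7, 0.45, 0.4, 0.35]

y_true = [0] * len(zero_scores) + [1] * len(one_scores)
scores = zero_scores + one_scores

# Default threshold: roughly what an unweighted model does.
err_default, f1_default = evaluate(y_true, scores, threshold=0.5)
# Lower threshold: roughly what balancing or class weighting pushes toward.
err_shifted, f1_shifted = evaluate(y_true, scores, threshold=0.28)

print(f"threshold 0.50: error rate = {err_default:.2f}, F1 = {f1_default:.2f}")
print(f"threshold 0.28: error rate = {err_shifted:.2f}, F1 = {f1_shifted:.2f}")
```

With these toy numbers the shifted threshold catches every '1' and improves F1, but the extra false positives push the error rate up, which is the behaviour described above.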
So what should we do then?
Classify the data points as they are now. You will miss a lot of '1's relative to their total number in the dataset, and a smaller percentage of '0's. Whether to balance or weight the data is a business decision, not a data science one. If the cost of misclassifying ones as zeros is bearable for the business, then a model like the ones on the leaderboard right now is viable for production. I hope that's clear enough. If this were a cancer detection challenge, modelling this way would be wrong.
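One way to make the business argument concrete is to attach a cost to each kind of mistake and compare models on total cost. This is only a sketch: the two confusion-matrix summaries and the cost values are hypothetical, chosen to contrast a low-stakes case with a cancer-detection-like case.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    # Total misclassification cost: false positives and false negatives priced separately.
    return fp * cost_fp + fn * cost_fn


# Hypothetical error counts for two models on the same test set.
models = {
    "unbalanced": {"fp": 0, "fn": 3},  # misses ones, raises no false alarms
    "balanced": {"fp": 4, "fn": 0},    # catches every one, raises more false alarms
}


def cheapest(cost_fp, cost_fn):
    # Name of the model with the lowest total cost under the given prices.
    return min(
        models,
        key=lambda name: total_cost(
            models[name]["fp"], models[name]["fn"], cost_fp, cost_fn
        ),
    )


# Low stakes: a missed '1' costs the same as a false alarm.
low_stakes = cheapest(cost_fp=1, cost_fn=1)
# High stakes (assumed, cancer-like): a missed '1' costs 50 times a false alarm.
high_stakes = cheapest(cost_fp=1, cost_fn=50)

print("low stakes  ->", low_stakes)
print("high stakes ->", high_stakes)
```

Under equal costs the unbalanced model wins, and once a missed '1' becomes expensive the balanced model wins, so the choice depends on the costs rather than on the modelling alone.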