Is the 0.995 achieved with F1 or ROC AUC? If it's F1, that's massive. I think the shake-up will be MAD.
It is achieved through F1.
Thanks for your answer. Is it proper cross-validation or just one fold? I'm trying to understand. If it's proper CV, I would trust it if I were you.
It was proper CV with 5 folds, stratified with StratifiedKFold.
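For anyone curious, a stratified 5-fold setup like the one described could be sketched as follows. The dataset and model here are placeholders (not the actual pipeline), just to show the splitting and scoring pattern:

```python
# Minimal sketch of 5-fold StratifiedKFold CV scored with F1.
# Data and model are illustrative stand-ins, not the poster's setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy dataset (~90/10 class split)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in skf.split(X, y):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    scores.append(f1_score(y[val_idx], preds))

print(f"CV F1: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```

StratifiedKFold keeps the class ratio roughly the same in every fold, which is why it's a sensible default for imbalanced targets.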
Perhaps investigate the features and confirm there's no unintentional leakage into your val set?
Thanks for your reply! I had already investigated it thoroughly; there is no unintentional leakage.
The test data has Ghana data which is not in your training data. I suspect the large difference comes from the introduction of the Ghana dataset in the test set.
Thanks for replying!! Yeah, you're right, the test data has a lot of irregularity because of Ghana. Let's see what happens! Will have to wait till the deadline!!
I think it's because the public leaderboard that we have now only has 30% of the total dataset. The test dataset also has a different distribution from the train and val distributions. These may result in your score being significantly lower on the public LB but a lot higher on the private LB. Looking at your CV score, I'd suggest you stick to your results and trust your CV. There will definitely be a massive shake-up in the private LB and you'll most likely be at the top.
Thanks for replying! I understand it now. Let's hope for the best!! :))
Did you try feature selection, or go without it?
I tried feature selection!
I'm getting a high score (71) without CV, but low scores with GroupKFold and KFold. For those who are at 80 on the LB, is it feature engineering or model tuning that you are using?
Model tuning won't impact much if you haven't done feature engineering.
Did you address the imbalance with SMOTE?
When I used SMOTE (and feature engineering) I was getting a CV score of 0.996 and a val score of 0.9983.
But when I addressed the imbalance with a different method, the CV changed; I was getting a score of around 0.83.
How did your LB score change with both methods? I always get worse scores with SMOTE so I don't even bother using it.
I tried SMOTE and addressed the imbalance too!
With and without SMOTE the impact on the LB wasn't much, just a 0.01 difference.
Fair enough, when I set sample weights I got a much better score than with SMOTE.
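The sample-weight approach can be sketched like this; the model and data are placeholders, and `compute_sample_weight` with `"balanced"` is one common way to derive the weights:

```python
# Sketch: handling imbalance with sample weights instead of resampling.
# compute_sample_weight("balanced", ...) weights each row inversely to its
# class frequency. Model and dataset are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
w = compute_sample_weight(class_weight="balanced", y=y)

model = GradientBoostingClassifier(random_state=0)
model.fit(X, y, sample_weight=w)  # minority rows count more in the loss
```

Unlike SMOTE, this changes the loss rather than the data, so no synthetic samples are created and the evaluation data stays untouched.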
Have you tried folding or grouping techniques?
For cross-validation?
You're using the balanced data for testing. After you balance the data and train the model on it, try to evaluate the model on data that is not balanced; keep it close to the original distribution.
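A minimal sketch of that advice, using simple random oversampling as the balancing step (the data and model are placeholders): balance the training split only, and keep the hold-out at its original class distribution.

```python
# Sketch: balance the TRAINING split only (random oversampling of the
# minority class here) and evaluate on a hold-out that keeps the original
# imbalanced distribution. Data/model are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class in the training split only
rng = np.random.default_rng(0)
minority = np.flatnonzero(y_tr == 1)
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
# y_val still has the original ~90/10 split, so this score is honest
print(f"val F1: {f1_score(y_val, model.predict(X_val)):.3f}")
```

Scoring on a balanced hold-out makes the metric look better than it will be on the real (imbalanced) test data, which is exactly the trap this comment warns about.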