My CV score has no relation at all to the LB score: sometimes an increase in CV leads to an LB improvement, but sometimes it is the opposite. I am using 5-fold StratifiedKFold, and I am also using SMOTE and ADASYN. How can I solve this issue?
Hello! A disconnect between cross-validation (CV) scores and the test score (leaderboard score) is not uncommon.
One reason is randomness. Both CV and SMOTE, for example, have an element of randomness. This can lead to variations in model performance because you are not always training on exactly the same splits, so you may sometimes get a split that the model happens to perform better on.
One way to circumvent this is to use `random_state`, as explained in the documentation below:
https://scikit-learn.org/stable/modules/cross_validation.html
https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
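For example, here is a minimal sketch (using a synthetic dataset and a placeholder classifier, not your actual setup) of pinning `random_state` on both the CV splitter and SMOTE, so every run uses identical folds and identical synthetic samples:

```python
# Sketch: fix random_state on both the CV splitter and SMOTE so every run
# splits and resamples the data identically. Dataset and classifier are
# placeholders, not the competition setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # applies SMOTE only to the training folds

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),           # fixed seed -> same synthetic samples
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # fixed folds
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```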
Hope this helps.
Thank you very much!
Another problem is the metric, AUC - it can be a bit jumpy, or "noisy" as Wikipedia puts it: https://en.wikipedia.org/wiki/Area_under_the_curve_(receiver_operating_characteristic). There can also be trouble in your model itself. Here is a real story: I've been playing this one very hard, and at some stage, while trying to improve my model, I realised I had left a few features / columns out of the training set completely! So if all else fails, find those bugs - carefully check your code and logic.
Then, remember we are scored on only 20% of the test set, so on a relatively small sample. Things may change quite a bit when the final scores (on the full sample) get revealed.
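To get a feel for how much a 20% subsample can move the metric, here is a rough sketch (on synthetic data, not the competition data) that repeatedly scores the same predictions on random 20% slices of a hold-out set:

```python
# Sketch: how much AUC can wobble when scored on repeated 20% subsamples of a
# hold-out set (a stand-in for a public LB split). Data and model are synthetic
# placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

rng = np.random.default_rng(0)
aucs = []
for _ in range(200):
    idx = rng.choice(len(y_te), size=len(y_te) // 5, replace=False)  # ~20% "public LB"
    aucs.append(roc_auc_score(y_te[idx], proba[idx]))

print(f"full hold-out AUC: {roc_auc_score(y_te, proba):.4f}")
print(f"20% subsample AUC: {np.mean(aucs):.4f} +/- {np.std(aucs):.4f}")
```

The spread of those subsample AUCs gives a rough idea of the wobble you can expect between your local CV and a public LB computed on a small fraction of the test set.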
Thank you very much!
That is not the case for me at all. I'm doing cv=5 and my CV is accurate to the 0.00X digit. That's usually not the case in other challenges, but for this one CV and LB are very close.
Thanks for your reply. I am also using 5 folds of StratifiedKFold, but I see no correlation at all with the LB. I have fixed the seed for all random processes, but the problem is still there.