I have tried all the popular machine learning algorithms: XGBoost, CatBoost, LightGBM, SVM, logistic regression, and Naive Bayes. They all score above 0.99 under 5-fold cross-validation on the training set, and above 0.70 on the test set. The incredible thing is that every individual fold on the training set scores above 0.99. Keep in mind I haven't done any hyperparameter tuning on any of those algorithms; the right feature engineering alone was sufficient. But also remember the data is imbalanced, which is why accuracy looks that high.
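To illustrate the imbalance point (this is a minimal sketch with made-up numbers, not the poster's actual data or models): on a heavily skewed label set, a "model" that always predicts the majority class already gets very high accuracy without learning anything.

```python
# Hypothetical label distribution: 99% negatives, 1% positives (assumed for illustration).
labels = [0] * 990 + [1] * 10

# A trivial baseline that always predicts the majority class.
predictions = [0] * len(labels)

# Plain accuracy: fraction of predictions matching the true label.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.99 with zero actual skill
```

This is why metrics like F1 or AUC are usually more informative than raw accuracy on imbalanced data.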
What were their corresponding scores on the leaderboard?
It said 0.70 in his post.
Maybe because the test set doesn't follow the training set's class distribution. Not sure.
If you look at the rules of this hackathon, the leaderboard is 20% public and 80% private. So you never know whether those of us at the top are overfitting.
I'm talking about the class distribution in the test set versus the training set, not the public/private leaderboard percentages. That's a different thing.
And even then, if we split a dataset randomly, we can lose the class distribution.
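The usual fix is a stratified split: partition each class separately so train and test keep the same class ratio. A minimal pure-Python sketch (the dataset and helper below are hypothetical, for illustration only):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical imbalanced dataset: 95 rows of class 0, 5 rows of class 1.
data = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

def stratified_split(rows, test_frac=0.2):
    """Split each class separately so the test set keeps the class ratio."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    train, test = [], []
    for members in by_class.values():
        random.shuffle(members)          # shuffle within each class
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

train, test = stratified_split(data)
print(Counter(y for _, y in train))  # Counter({0: 76, 1: 4})
print(Counter(y for _, y in test))   # Counter({0: 19, 1: 1})
```

Both splits keep roughly the 95:5 ratio; a plain random 20% sample could easily leave zero or several positives in the test set. In practice `sklearn.model_selection.train_test_split(..., stratify=y)` does the same thing.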
That's what I'm saying: if the distributions differ and you manage to overfit your way to a higher score on the public 20%, you might end up at the bottom of the leaderboard after the competition ends.