Hey there, I am using the code below to get the AUC measure:
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, probs)
print(metrics.auc(fpr, tpr))
But I am getting a huge difference in AUC between predicting probabilities on y_test (the 30% of the training data held out for measuring AUC) and on the test.csv file when I upload the submission: around 0.9 when predicting on the held-out training split, versus 0.67 on test.csv for the submitted solution. What does the problem seem to be?
Hi there,
The discrepancy between your local validation score and the actual submission score means your model is not generalising properly, and it could be due to several factors.
One major reason could be overfitting. If your model is overfitting the training data, it will score well on a held-out subset of that data while still doing poorly on genuinely unseen data.
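One quick check for this (a minimal sketch, using synthetic data and a RandomForestClassifier as stand-ins for your actual data and model) is to compare the AUC measured on the same rows the model was fit on against a cross-validated AUC; a large gap between the two is the classic sign of overfitting:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for your training file
X, y = make_classification(n_samples=600, n_features=20, n_informative=4,
                           random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# AUC measured on the rows the model was fit on (optimistic)
train_auc = roc_auc_score(y, model.fit(X, y).predict_proba(X)[:, 1])

# AUC measured on held-out folds (closer to what a submission sees)
cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

print(f"train AUC: {train_auc:.3f}, cross-validated AUC: {cv_auc:.3f}")
```

If the train AUC is near 1.0 while the cross-validated AUC is much lower, the model is memorising rather than generalising, and a simpler or more regularised model is worth trying.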
Also ensure there is no data leakage between your training and test splits; that is, none of the data you evaluate on should ever have been seen by the model during fitting. I suggest the train_test_split function from sklearn for clean splits.
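As a minimal sketch of a clean split plus AUC evaluation (assuming a generic LogisticRegression and synthetic data in place of your actual model and training file):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for your training data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 30%, stratified so both splits keep the same class balance;
# the model never sees these rows during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probabilities for the positive class, then AUC on the held-out split
probs = model.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, probs))
```

The key point is that any preprocessing fitted on the data (scalers, encoders, imputers) must also be fitted on X_train only and merely applied to X_test, otherwise information leaks into the evaluation.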
I have used train_test_split to split the dataset into train and test sets, but it seems the overfitting might be due to some other reason.