Hi Buchi,

Score differences like this are normal. The public leaderboard is scored on only part of the test set, while the private leaderboard uses the rest. Also, some models involve randomness, so results may vary slightly even with the same code.

The core discrepancies occur due to:

Validation vs. Test Data: the code uses an 80/20 train-validation split, so the validation F1 score reflects only the 20% hold-out of the training data, not the full test set used for leaderboard scoring (sketched just after this list).
Data Distribution Differences: the test set may have a different feature distribution than the validation split.
Overfitting: the model may perform well on validation but generalize poorly to unseen test data.
Random Seeds: different random states during splitting and training shift the measured performance (see the seed-variance sketch below).
Feature Engineering: preprocessing and feature-selection choices tuned only on the validation split may not be optimal for the test data (see the pipeline sketch below).
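To make the first point concrete, here is a minimal sketch of why the two numbers measure different data. It uses synthetic data and a RandomForestClassifier purely as stand-ins (your actual features, files, and model will differ): the local F1 comes from the 20% hold-out, while the leaderboard F1 is computed on a separate test sample you never score locally.

```python
# Minimal sketch: local validation F1 vs. "leaderboard" F1 on unseen data.
# Synthetic data is a hypothetical stand-in for the competition's train/test files.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Pretend the first 8000 rows are the labeled training file and the rest are
# the test file whose labels only the leaderboard sees.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train_full, y_train_full = X[:8000], y[:8000]
X_test, y_test_hidden = X[8000:], y[8000:]

# The 80/20 split only ever touches the labeled training data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=42, stratify=y_train_full
)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Local score: the 20% validation slice of the training data.
print("validation F1   :", f1_score(y_val, model.predict(X_val)))
# Leaderboard score: a different, unseen sample, so it can legitimately differ.
print("'leaderboard' F1:", f1_score(y_test_hidden, model.predict(X_test)))
```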
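To see how much of the gap is just split and seed noise (the overfitting and random-seed points), a common check is to repeat the evaluation over several random states. This is only a sketch under the same synthetic-data assumption as above; a wide spread means a single validation F1 is a noisy estimate, and k-fold cross-validation would give a steadier one.

```python
# Sketch: how much does the validation F1 move with the random seed alone?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the labeled training data.
X_train_full, y_train_full = make_classification(n_samples=8000, n_features=20, random_state=0)

scores = []
for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train_full, y_train_full, test_size=0.2,
        random_state=seed, stratify=y_train_full,
    )
    model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(f1_score(y_val, model.predict(X_val)))

print("F1 per seed :", np.round(scores, 4))
print("mean +/- std:", round(np.mean(scores), 4), "+/-", round(np.std(scores), 4))
```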
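For the feature-engineering point, the usual safeguard is to keep preprocessing and feature selection inside a pipeline so they are fit only on the training portion of each split and then applied unchanged to validation and test data. This is a generic sketch (StandardScaler and SelectKBest are placeholder steps, not your actual preprocessing):

```python
# Sketch: wrap preprocessing/feature selection in a Pipeline so nothing is
# fit on the held-out data, and the same transform is applied at test time.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the labeled training data.
X, y = make_classification(n_samples=8000, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),               # fit on train folds only
    ("select", SelectKBest(f_classif, k=10)),  # feature selection learned on train folds
    ("model", RandomForestClassifier(random_state=42)),
])

# cross_val_score refits the whole pipeline per fold, so the reported F1
# is computed without information leaking from the held-out fold.
print(cross_val_score(pipe, X, y, cv=5, scoring="f1"))
```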