I tried to ensemble n models that each achieved a decent AUC on CV (> 0.96), but the ensemble severely overfit on the LB (LB ~0.94 vs CV > 0.96). This happened even though I implemented several measures to reduce overfitting.
Has anyone experienced a similar issue?
I also faced LB overfitting with ensembling. The key is diversity and validation alignment: I used models with different architectures (e.g., XGBoost, CatBoost, NN) trained on varied features/folds, and weighted them by their out-of-fold CV AUC rather than by LB probing. I also capped the ensemble size, since beyond 3–4 diverse models the gains were negligible and the overfitting risk increased. TPUs on Kaggle helped me run fast experiments. My CV and LB eventually aligned once I ensured my temporal split matched the data distribution.
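A minimal sketch of the OOF-weighted blending idea above. This is an illustration, not the exact setup: I'm using sklearn models (logistic regression, random forest, gradient boosting) as stand-ins for XGBoost/CatBoost/NN, synthetic data in place of the competition data, and a plain KFold where a temporal split may be more appropriate.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the competition data (assumption)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Diverse base models; swap in XGBoost/CatBoost/NN in a real pipeline
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Collect out-of-fold predictions so every row is predicted by a model
# that never saw it during training
kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof = {name: np.zeros(len(y)) for name in models}
for train_idx, val_idx in kf.split(X):
    for name, model in models.items():
        m = clone(model)
        m.fit(X[train_idx], y[train_idx])
        oof[name][val_idx] = m.predict_proba(X[val_idx])[:, 1]

# Weight each model by its OOF AUC (no LB probing involved)
aucs = {name: roc_auc_score(y, preds) for name, preds in oof.items()}
total = sum(aucs.values())
weights = {name: a / total for name, a in aucs.items()}
blend = sum(w * oof[name] for name, w in weights.items())

print("per-model OOF AUC:", {k: round(v, 4) for k, v in aucs.items()})
print("blend OOF AUC:", round(roc_auc_score(y, blend), 4))
```

The same OOF weights are then applied to each model's test-set predictions; since they come from held-out data only, the blend's CV estimate stays honest instead of quietly fitting the LB.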