Hi everyone,
I’d like to share a couple of important points that could improve the fairness of this competition and help reduce potential leakage:
1. **Cross-validation split method**
Since we are working with sequential **weather data**, shuffling in cross-validation introduces leakage across time.
Instead of:
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # scatters time-adjacent rows across folds
```
we should use:
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=False)  # keeps rows in their original order
```
This respects the time order and avoids information from the future leaking into training folds.
2. **Evaluation metric**
The target is highly imbalanced (**≈88% NORAIN**). With this imbalance, **micro** and **weighted F1** tend to overstate performance because they are dominated by the majority class.
Using **macro F1** would provide a fairer assessment, as it balances the importance of all classes (including MEDIUMRAIN, HEAVYRAIN, and SMALLRAIN).
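As a rough illustration, here is a minimal sketch with made-up label counts mirroring the ≈88% NORAIN share: a trivial model that only ever predicts the majority class already scores high on micro F1, while macro F1 exposes the ignored rain classes.
```python
from sklearn.metrics import f1_score

# Hypothetical toy labels reflecting the ~88% NORAIN imbalance
y_true = (["NORAIN"] * 88 + ["SMALLRAIN"] * 6
          + ["MEDIUMRAIN"] * 4 + ["HEAVYRAIN"] * 2)
y_pred = ["NORAIN"] * 100  # majority-class-only predictor

print(f1_score(y_true, y_pred, average="micro"))                   # ~0.88, inflated
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.23, exposes the failure
```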
---
I hope the organizers will take these points into consideration to ensure more robust and fair evaluation.
Thanks!
Hi everyone,
Thanks for sharing these insights. I completely agree that respecting the time order with `shuffle=False` in `StratifiedKFold` is important to avoid leakage with sequential weather data.
Also, using macro F1 makes sense for this imbalanced dataset, as it gives a fair evaluation across all classes, not just the majority class.
I will try to apply these suggestions in my model; hopefully they improve fairness and robustness.
Thanks again.
What about `tscv = TimeSeriesSplit(n_splits=5)`?
Yes, that's a good option.
I get your point about shuffling causing leakage, but using `StratifiedKFold(n_splits=5, shuffle=False)` still mixes past and future data across folds. Since this is sequential weather data, the safer option is `TimeSeriesSplit`, which makes sure validation always comes after training.
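To make that concrete, here is a minimal sketch (the sample count and split count are illustrative) showing that `TimeSeriesSplit` only ever validates on samples that come after the training window:
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples, oldest first

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # every validation index is strictly later than every training index
    print("train:", train_idx, "val:", val_idx)
# train: [0 1 2 3] val: [4 5]
# train: [0 1 2 3 4 5] val: [6 7]
# train: [0 1 2 3 4 5 6 7] val: [8 9]
```
One trade-off to keep in mind: unlike `StratifiedKFold`, `TimeSeriesSplit` does not preserve class proportions, so rare classes like HEAVYRAIN may be absent from some validation windows.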