
Ghana’s Indigenous Intel Challenge [BEGINNERS ONLY]

Helping Ghana, Algeria and 53 other countries
$2 500 USD
Challenge completed ~2 months ago
Prediction
910 joined
565 active
Start: Aug 14, 25
Close: Oct 12, 25
Reveal: Oct 12, 25
EL_YOUNES
Recommendations: Preventing Leakage & Handling Class Imbalance
Connect · 1 Sep 2025, 18:31 · 4

Hi everyone,

I’d like to share a couple of important points that could improve the fairness of this competition and help reduce potential leakage:

1. **Cross-validation split method**

Since we are working with sequential **weather data**, shuffling in cross-validation introduces leakage across time.

Instead of:

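```python
from sklearn.model_selection import StratifiedKFold

# Rows are shuffled before splitting, so every fold mixes time periods
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```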

we should use:

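```python
from sklearn.model_selection import StratifiedKFold

# No shuffling: folds are assigned in the original row order
skf = StratifiedKFold(n_splits=5, shuffle=False)
```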

This respects the time order and avoids information from the future leaking into training folds.
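To make the shuffling concern concrete, here is a rough sketch on synthetic data (not the competition dataset). It assumes the rows are already sorted by time and uses the row index as a stand-in timestamp; the class proportions and the helper `neighbour_in_train_rate` are made up for illustration. It measures how often a validation row has an immediately adjacent row sitting in the training set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the competition data: 1,000 time-ordered rows,
# roughly 88% majority class (labels and proportions are assumptions).
rng = np.random.default_rng(0)
n_rows = 1000
y = rng.choice([0, 1, 2, 3], size=n_rows, p=[0.88, 0.05, 0.04, 0.03])
X = np.arange(n_rows).reshape(-1, 1)  # the row index doubles as the timestamp


def neighbour_in_train_rate(splitter):
    """Share of validation rows with an adjacent-in-time row in the training set."""
    hits, total = 0, 0
    for train_idx, val_idx in splitter.split(X, y):
        train_set = set(train_idx.tolist())
        for i in val_idx.tolist():
            hits += (i - 1 in train_set) or (i + 1 in train_set)
            total += 1
    return hits / total


shuffled = neighbour_in_train_rate(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
)
ordered = neighbour_in_train_rate(StratifiedKFold(n_splits=5, shuffle=False))
print(f"shuffle=True : {shuffled:.2f}")
print(f"shuffle=False: {ordered:.2f}")
```

On data like this, the shuffled rate should come out far higher than the unshuffled one, which is the situation where a model can effectively interpolate between neighbouring time steps it has already seen.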

2. **Evaluation metric**

The target is highly imbalanced (**≈88% NORAIN**). With this imbalance, **micro** and **weighted F1** tend to overstate performance because they are dominated by the majority class.

Using **macro F1** would provide a fairer assessment, as it balances the importance of all classes (including MEDIUMRAIN, HEAVYRAIN, and SMALLRAIN).
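As a quick illustration of the gap between the averaging modes, the sketch below scores a trivial model that always predicts NORAIN. The 100-row validation set and the minority-class counts are hypothetical; only the 88% NORAIN share comes from the competition data.

```python
import numpy as np
from sklearn.metrics import f1_score

labels = ["NORAIN", "SMALLRAIN", "MEDIUMRAIN", "HEAVYRAIN"]

# Hypothetical validation set: 88% NORAIN, minority counts are made up
y_true = np.array(
    ["NORAIN"] * 88 + ["SMALLRAIN"] * 6 + ["MEDIUMRAIN"] * 4 + ["HEAVYRAIN"] * 2
)
# Baseline that ignores the features and always predicts the majority class
y_pred = np.array(["NORAIN"] * len(y_true))

scores = {
    avg: f1_score(y_true, y_pred, labels=labels, average=avg, zero_division=0)
    for avg in ("micro", "weighted", "macro")
}
for avg, value in scores.items():
    print(f"{avg:>8} F1: {value:.3f}")
```

For this baseline, micro F1 is about 0.88 and weighted F1 about 0.82, while macro F1 drops to about 0.23, because the three rain classes each contribute an F1 of zero to the unweighted mean.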

---

I hope the organizers will take these points into consideration to ensure more robust and fair evaluation.

Thanks!

Discussion 4 answers
hafsahassan23

Hi everyone,

Thanks for sharing these insights. I completely agree that respecting the time order with shuffle=False in StratifiedKFold is important to avoid leakage with sequential weather data.

Also, using macro F1 makes sense for this imbalanced dataset, as it gives fair evaluation across all classes, not just the majority class.

I will try to apply these suggestions in my model; hopefully they help improve fairness and robustness.

Thanks again.

2 Sep 2025, 12:43
Upvotes 2

What about tscv = TimeSeriesSplit(n_splits=5)?

24 Sep 2025, 21:54
Upvotes 1
EL_YOUNES

That's good.

I get your point about shuffling causing leakage, but using StratifiedKFold(n_splits=5, shuffle=False) still mixes past and future data across folds. Since this is sequential weather data, the safer option is TimeSeriesSplit, which makes sure validation always comes after training.
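A small check of this, sketched on synthetic data rather than the competition set: it assumes rows are sorted by time, uses the row index as the timestamp, and counts the folds in which the training set contains rows later than the earliest validation row. The helper `folds_training_on_future` and the class proportions are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Synthetic, time-ordered stand-in data (proportions are assumptions)
rng = np.random.default_rng(0)
n_rows = 1000
y = rng.choice([0, 1, 2, 3], size=n_rows, p=[0.88, 0.05, 0.04, 0.03])
X = np.arange(n_rows).reshape(-1, 1)  # the row index doubles as the timestamp


def folds_training_on_future(splits):
    """Count folds whose training rows extend past the earliest validation row."""
    return sum(
        int(train_idx.max() > val_idx.min()) for train_idx, val_idx in splits
    )


skf_leaky = folds_training_on_future(
    StratifiedKFold(n_splits=5, shuffle=False).split(X, y)
)
tscv_leaky = folds_training_on_future(TimeSeriesSplit(n_splits=5).split(X))

print(f"StratifiedKFold(shuffle=False): {skf_leaky} of 5 folds train on later rows")
print(f"TimeSeriesSplit:                {tscv_leaky} of 5 folds train on later rows")
```

TimeSeriesSplit yields a count of zero by construction, while the unshuffled StratifiedKFold trains on later rows in at least four of the five folds. One trade-off to keep in mind: TimeSeriesSplit does not stratify, so with a target this imbalanced some validation folds may contain very few rows of the rare classes.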

26 Sep 2025, 15:55
Upvotes 1