Hi everyone,
I’d like to share a couple of important points that could improve the fairness of this competition and help reduce potential leakage:
1. **Cross-validation split method**
Since we are working with sequential **weather data**, shuffling in cross-validation introduces leakage across time.
Instead of:
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # scatters time-adjacent rows across folds
```
we should use:
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=False)  # keeps rows in their original order
```
This respects the time order and avoids information from the future leaking into training folds.
2. **Evaluation metric**
The target is highly imbalanced (**≈88% NORAIN**). With this imbalance, **micro** and **weighted F1** tend to overstate performance because they are dominated by the majority class.
Using **macro F1** would provide a fairer assessment, as it balances the importance of all classes (including MEDIUMRAIN, HEAVYRAIN, and SMALLRAIN).
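As a rough illustration, here is a minimal sketch with made-up label counts mirroring the ≈88% NORAIN share: a trivial model that only ever predicts the majority class already scores high on micro F1, while macro F1 exposes the ignored rain classes.
```python
from sklearn.metrics import f1_score

# Hypothetical toy labels reflecting the ~88% NORAIN imbalance
y_true = (["NORAIN"] * 88 + ["SMALLRAIN"] * 6
          + ["MEDIUMRAIN"] * 4 + ["HEAVYRAIN"] * 2)
y_pred = ["NORAIN"] * 100  # majority-class-only predictor

print(f1_score(y_true, y_pred, average="micro"))                   # ~0.88, inflated
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.23, exposes the failure
```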
---
I hope the organizers will take these points into consideration to ensure more robust and fair evaluation.
Thanks!
Hi everyone,
Thanks for sharing these insights. I completely agree that respecting the time order with `shuffle=False` in `StratifiedKFold` is important to avoid leakage with sequential weather data.
Also, using macro F1 makes sense for this imbalanced dataset, as it gives a fair evaluation across all classes, not just the majority class.
I will try to apply these suggestions in my model; hopefully they improve fairness and robustness.
Thanks again.
What about `tscv = TimeSeriesSplit(n_splits=5)`?
Yes, that's a good option.
I get your point about shuffling causing leakage, but using `StratifiedKFold(n_splits=5, shuffle=False)` still mixes past and future data across folds. Since this is sequential weather data, the safer option is `TimeSeriesSplit`, which makes sure validation always comes after training.
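To make that concrete, here is a minimal sketch (the sample count and split count are illustrative) showing that `TimeSeriesSplit` only ever validates on samples that come after the training window:
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples, oldest first

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # every validation index is strictly later than every training index
    print("train:", train_idx, "val:", val_idx)
# train: [0 1 2 3] val: [4 5]
# train: [0 1 2 3 4 5] val: [6 7]
# train: [0 1 2 3 4 5 6 7] val: [8 9]
```
One trade-off to keep in mind: unlike `StratifiedKFold`, `TimeSeriesSplit` does not preserve class proportions, so rare classes like HEAVYRAIN may be absent from some validation windows.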