
Inundata: Mapping Floods in South Africa

Helping South Africa
$10 000 USD
Completed (~1 year ago)
Classification
1340 joined
315 active
Start
Nov 22, 24
Close
Feb 16, 25
Reveal
Feb 17, 25
enigmatic
What worked?
Help · 17 Feb 2025, 10:57 · 11

This competition was very interesting and challenging at the same time. I tried heavy feature engineering, undersampling, oversampling, applying the images, and several boosting models, and I adjusted the data (especially rows where precipitation was 0 but the train label was flood (1)), but my result just wasn't improving.

Discussion · 11 answers
MuhammadQasimShabbeer
@enigmatic

Same for me as well. Can the top leaderboard position holders share their insights?

17 Feb 2025, 11:32
Upvotes 0

Summary of the 15th place solution from the private leaderboard:

1. Data Processing

  • Processed multi-band satellite imagery (Blue, Green, Red, NIR, SWIR, Slope)
  • Image resizing and flattening

2. Feature Engineering

  • Time-based features (year, month, day)
  • Precipitation shifts (-300 to +300)
  • Hierarchical statistical aggregations

3. Modeling

  • XGBoost Classifier
  • GPU acceleration
  • StratifiedGroupKFold (5 splits)
  • Group-based prediction normalization
  • Log loss optimization

4. Key Parameters

  • Tree method: hist
  • Device: cuda
  • Scale positive weight: 2
  • Max depth: 5
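The precipitation-shift features from section 2 can be sketched with pandas. This is a minimal illustration, not the author's code: the column names `location_id`, `date`, and `precipitation`, and the small set of offsets, are assumptions (the write-up uses shifts from -300 to +300).

```python
import pandas as pd

def add_shift_features(df, col="precipitation", shifts=(-3, -1, 1, 3)):
    # Lead/lag copies of `col` within each location; the write-up
    # uses shifts from -300 to +300, trimmed here to keep the sketch small.
    out = df.sort_values(["location_id", "date"]).copy()
    g = out.groupby("location_id")[col]
    for s in shifts:
        # s > 0 looks back s days (lag); s < 0 looks ahead s days (lead)
        out[f"{col}_shift_{s}"] = g.shift(s)
    return out

# Hypothetical toy frame: two locations, five days each
df = pd.DataFrame({
    "location_id": ["a"] * 5 + ["b"] * 5,
    "date": list(pd.date_range("2023-01-01", periods=5)) * 2,
    "precipitation": [0, 2, 5, 1, 0, 3, 0, 0, 4, 2],
})
feat = add_shift_features(df)
```

Grouping by location before shifting matters: without it, one location's last days would leak into the next location's first days.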
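The modeling setup in sections 3 and 4 can be sketched as follows. The parameter dict mirrors the values listed above and would be passed to `xgboost.XGBClassifier`; xgboost itself is left out of this sketch, and the toy data, group layout, and random seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

# Parameters as listed in the write-up (for xgboost.XGBClassifier)
params = {
    "tree_method": "hist",
    "device": "cuda",          # GPU acceleration, per the write-up
    "scale_pos_weight": 2,
    "max_depth": 5,
    "eval_metric": "logloss",  # log-loss optimisation
}

# Toy data: 100 rows across 10 locations; grouping keeps all of a
# location's days in one fold so temporal neighbours cannot leak.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(10), 10)

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(cv.split(X, y, groups))
```

`StratifiedGroupKFold` balances the flood/non-flood ratio across folds while guaranteeing that no location appears in both the train and validation split of the same fold.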

Kaggle link:

https://www.kaggle.com/code/onurkoc83/floods-study

17 Feb 2025, 13:20
Upvotes 6

Don't forget to upvote the Kaggle notebook :)) , and feel free to ask if you have any questions.

enigmatic

Thank you, great approach! I didn't use many lag and lead features, just the previous 20 days and the next 21 days.

CodeJoe

Will do that🔥. Thanks for sharing.

I will make several videos recapping my solutions and what I learned.

Here is the first one: https://www.youtube.com/playlist?list=PLTTjhaP30APfgB-hqzw85olc6w6h8TO43

17 Feb 2025, 15:21
Upvotes 2
CodeJoe

Thank you for sharing. We will be very glad to receive the rest. Big ups!

CodeJoe

I was affected by the leak and got a huge shake-up, but anyway, that's all part of the game😅. This is my solution:

I made 730 lag features (one per day, covering all the days),

A little additional feature engineering.

The images didn't help me, in my opinion.

GroupKFold with 10 folds,

XGBoost (mostly default parameters, with early stopping and n_estimators of 1000).

Ensemble methods didn't work very well for me.
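Building 730 lag columns one by one is slow and fragments the DataFrame, so a sketch like the following collects them in a dict and concatenates once. The column names and the `location_id`/`date` layout are assumptions, not the author's code:

```python
import pandas as pd

def make_daily_lags(df, col="precipitation", n_lags=730):
    # One lag column per day; building the columns in a dict and
    # concatenating once avoids the fragmentation warnings that
    # 730 individual inserts would trigger.
    df = df.sort_values(["location_id", "date"])
    g = df.groupby("location_id")[col]
    lags = {f"{col}_lag_{k}": g.shift(k) for k in range(1, n_lags + 1)}
    return pd.concat([df, pd.DataFrame(lags, index=df.index)], axis=1)

# Toy frame with a handful of lags instead of the full 730
df = pd.DataFrame({
    "location_id": ["a"] * 4,
    "date": pd.date_range("2023-01-01", periods=4),
    "precipitation": [1.0, 2.0, 3.0, 4.0],
})
feat = make_daily_lags(df, n_lags=3)
```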

This was the score before using the leak:

Public: 0.002575628

Private: 0.002630546

https://www.kaggle.com/code/dukekojokongo/zindi-inundata-floods

18 Feb 2025, 00:11
Upvotes 1

Summary of my Progression and Results:

  1. Rule-based baseline (LB: 0.004810): An initial baseline based on flood days and locations only.
  2. Simple XGBoost baseline (LB: 0.004218): A basic XGBoost model is trained, using the initial features in the dataset without feature engineering.
  3. Add precipitation rolling mean feature (LB: 0.00318): This significant improvement comes from adding rolling mean features of precipitation. The fe function (cv.py, lines 21-45) calculates rolling means for various window sizes (w): for w in list(range(2,100,2)) + list(range(100, 250, 10)): This creates rolling mean features with windows from 2 to 98 (step 2) and 100 to 240 (step 10). These are stored as columns named rm_{w}_precipitation. This captures trends in precipitation over different time scales.
  4. Rolling mean center=True (LB: 0.00296): The center=True argument in the rolling() function within fe (cv.py, line 23) is crucial. Centering the rolling window means that the average at a given time point considers both past and future values. This reduces lag and provides a smoother representation of the precipitation trend.
  5. Add diff and rolling mean of diff (LB: 0.00271): This step adds features related to the change in precipitation. The fe function (cv.py, lines 27-41) calculates lagged differences and their rolling means: df[f'rainfall_lag_{lag}_{col}'] = ... .diff(lag).fillna(0) calculates the difference in precipitation between the current day and lag days prior, with fillna(0) handling the initial NaN values. Nested loops then create lag_rm_{w}_{lag}_precipitation features: rolling means (windows w) of the lagged differences (lag). This captures the trend of precipitation changes over different time scales. It is done for lags of 2, 8, 14, and 28 days, and a variety of window sizes.
  6. Use Gaussian smooth label regression as base margin (LB: 0.00252): This step employs a Gaussian smoothing technique to transform the binary flood labels (0 or 1) into a continuous, "soft" target variable. This smoothed representation is then used as the target variable for an XGBoost regression model. This approach is beneficial because it provides a more nuanced representation of flood risk and helps the model learn a smoother decision boundary. The predictions of this regression model are then used as the base margin for the final classification model.
  7. Train image model to classify flood vs non-flood locations and normalize probability (LB: 0.00245): A YOLO image classification model (train_cls.py) is trained to distinguish flood-prone locations. Training uses 128x128 images, data augmentation, and 10 epochs. The model outputs a probability flood_a1 for each location, which is used to select flood locations and normalize the probabilities.
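Steps 3-5 above can be sketched together in pandas. This is a reduced illustration of the described fe function, not the actual cv.py: the column names, the trimmed window lists, and the toy data are assumptions (the write-up uses windows 2..98 step 2 plus 100..240 step 10, and lags of 2, 8, 14, and 28).

```python
import pandas as pd

def add_rolling_features(df, col="precipitation"):
    # Reduced version of the windows described in the write-up
    out = df.sort_values(["location_id", "date"]).copy()
    g = out.groupby("location_id")[col]
    for w in (2, 4, 8):
        # center=True averages past *and* future days, reducing lag
        out[f"rm_{w}_{col}"] = g.transform(
            lambda s: s.rolling(w, center=True, min_periods=1).mean()
        )
    for lag in (2, 8):
        # change in precipitation vs `lag` days earlier, per location
        diff = g.diff(lag).fillna(0)
        out[f"rainfall_lag_{lag}_{col}"] = diff
        for w in (2, 4):
            # rolling mean of the lagged differences
            out[f"lag_rm_{w}_{lag}_{col}"] = diff.groupby(
                out["location_id"]
            ).transform(lambda s: s.rolling(w, center=True, min_periods=1).mean())
    return out

# Toy frame: one location with constant rain, one with a single spike
df = pd.DataFrame({
    "location_id": ["a"] * 5 + ["b"] * 5,
    "date": list(pd.date_range("2023-01-01", periods=5)) * 2,
    "precipitation": [3.0] * 5 + [0.0, 0.0, 10.0, 0.0, 0.0],
})
feat = add_rolling_features(df)
```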
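The label smoothing in step 6 can be illustrated with scipy. The choice of sigma, the series length, and the event position are assumptions; the write-up only says the binary labels are Gaussian-smoothed into a soft regression target.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hard 0/1 flood labels for one location: a three-day event
y = np.zeros(30)
y[14:17] = 1.0

# Smooth the labels into a soft target; sigma=3 is an assumed value
soft = gaussian_filter1d(y, sigma=3)

# `soft` peaks at the event and decays smoothly on both sides. An
# XGBoost regressor fit on `soft` would then supply its predictions
# as the `base_margin` of the final classifier (per step 6 above).
```

The soft target spreads risk onto the days around a flood, which is what lets the classifier learn a smoother decision boundary than the raw 0/1 labels allow.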

https://youtu.be/fzPJHU3KYfU?si=mIRRU9JrEqegmerV

18 Feb 2025, 03:01
Upvotes 5
enigmatic

Thank you @snow

CodeJoe

Well documented! You deserve a thumbs up. I am definitely subscribing to your channel.