Hi all and thank you to Zindi and everyone. My solution to the challenge was built on a simple but important insight: traditional rainfall prediction is not just about meteorology—it’s about people, patterns, and place. By treating the problem as one of behavioural modelling, I built a solution that didn’t just predict rainfall—it understood the cultural logic behind those predictions.
Approach

The dataset presented a severe class imbalance: 88% of all entries were “NORAIN”, with the remaining three categories making up just 12%. Rather than treating this imbalance as noise, I treated it as signal: it reflected real-world behaviour, since farmers naturally make more “no rain” predictions. This shaped my modelling strategy from the outset.
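One common way to turn that imbalance into a modelling signal is inverse-frequency class weighting. A minimal sketch (the label distribution below is illustrative, mirroring the ~88% “NORAIN” split; the resulting per-sample weights could be passed to a classifier's `fit(..., sample_weight=...)`):

```python
import numpy as np

# Hypothetical label distribution mirroring the challenge: ~88% "NORAIN"
y = np.array(["NORAIN"] * 88 + ["LIGHT"] * 6 + ["MODERATE"] * 4 + ["HEAVY"] * 2)

# Inverse-frequency weights: rarer classes receive proportionally larger weights
classes, counts = np.unique(y, return_counts=True)
class_weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}

# Per-sample weights, one entry per training row
sample_weights = np.array([class_weights[label] for label in y])
```

With this scheme the weighted contribution of every class to the loss is equalised, so the minority rain classes are not drowned out by “NORAIN”.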
I focused on features that captured who was predicting, where, and when.
This approach allowed the model to learn from behavioural and spatial patterns embedded in traditional forecasting practices.
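As a sketch of what “who, where, and when” features can look like in practice (the column names here are hypothetical, not the actual dataset schema), date fields yield seasonal signals and categorical identifiers can be integer-encoded for tree models:

```python
import pandas as pd

# Hypothetical slice of the data; column names are illustrative only
df = pd.DataFrame({
    "farmer_id": ["f1", "f2", "f1", "f3"],
    "region": ["Ashanti", "Volta", "Ashanti", "Northern"],
    "date": pd.to_datetime(["2023-04-01", "2023-04-02",
                            "2023-07-15", "2023-11-30"]),
})

# "When": seasonal signals derived from the prediction date
df["month"] = df["date"].dt.month
df["day_of_year"] = df["date"].dt.dayofyear

# "Who" / "where": integer-encode categorical identifiers for tree models
for col in ["farmer_id", "region"]:
    df[col + "_enc"] = df[col].astype("category").cat.codes
```

Tree ensembles split directly on these integer codes, so no one-hot expansion is needed.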
Model / Code

I compared three tree-based ensemble classifiers.
All models were evaluated using stratified cross-validation and macro/weighted F1 scores. XGBoost emerged as the most stable and generalisable.
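The evaluation protocol can be sketched as follows. This is a minimal stand-in, not the competition pipeline: it uses scikit-learn's `RandomForestClassifier` on synthetic imbalanced data, but the stratified-fold and macro-F1 machinery is the same:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the rainfall labels (~85/10/5 split)
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.85, 0.10, 0.05], random_state=42)

# Stratified folds preserve the class ratios in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Macro F1 averages per-class F1 equally, so minority classes count fully
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=cv, scoring="f1_macro")
```

Swapping `scoring="f1_weighted"` gives the frequency-weighted variant mentioned above.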
Final training setup:
```python
from xgboost import XGBClassifier

# Encode target labels
y_full_numeric = label_encoder.transform(y)
X_features = X

# Final model configuration
final_model = XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    random_state=42,
    n_jobs=-1,
    eval_metric='mlogloss',
    use_label_encoder=False,
    scale_pos_weight=scale_weights,
)

# Train on full dataset
final_model.fit(X_features, y_full_numeric)
```
Evaluation

The model performed reliably in identifying non-rain events but struggled with light rain. A modest 2.6% gap between training and cross-validation F1 scores indicated strong generalisation and minimal overfitting.
Using SHAP and LIME, I analysed what the model had truly learned.
This confirmed the model wasn’t just forecasting rain—it was decoding the behavioural logic of Ghanaian farmers.
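SHAP itself requires the `shap` package (typically `shap.TreeExplainer(final_model)` for tree ensembles). As a dependency-light sketch of the same idea, namely attributing predictions to features, here is permutation importance on a synthetic gradient-boosting model; note this is a stand-in technique, not the SHAP/LIME analysis from the writeup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in for the trained model and feature matrix
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Shuffle each feature in turn and measure the drop in score:
# large drops mean the model genuinely relies on that feature
result = permutation_importance(model, X, y, n_repeats=5, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]
```

Inspecting the top-ranked features this way gives a quick sanity check before reaching for SHAP's per-prediction attributions.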
This was an initial model designed to establish a clean, interpretable baseline. I did not apply any enhancements such as hyperparameter tuning or model stacking.
Thanks again to Zindi and the community. Hope this helps.
Thank you for sharing and congratulations once again!
Congratulations 🎊. Thank you for sharing. Please, how did you handle the missing values?
No need to, if you are using gradient boosting.
I dropped the 'time_observed' and 'indicator_description' columns because they had too many missing values. For the 'indicator' column, I filled missing values with a constant ('unknown'): since the challenge is to predict rain using traditional methods, and indicators like clouds are strong predictors of rain, an 'unknown' fill leans towards the majority no-rain cases.
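The cleaning steps described above can be sketched with pandas (the sample values here are made up; only the column names come from the discussion):

```python
import pandas as pd

# Hypothetical slice of the data; values are illustrative
df = pd.DataFrame({
    "time_observed": [None, None, "06:00", None],
    "indicator_description": [None, "dark clouds", None, None],
    "indicator": ["clouds", None, "wind", None],
})

# Drop the columns dominated by missing values
df = df.drop(columns=["time_observed", "indicator_description"])

# Fill the remaining categorical gaps with a constant marker
df["indicator"] = df["indicator"].fillna("unknown")
```

The 'unknown' marker then behaves as its own category when the column is encoded for the tree model.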
Thank you for sharing and congratulations.
thumbs up