Hello everyone, I hope you're enjoying this challenge. I've been scratching my head with both neural nets and boosting models, and I still can't break below 0.003. I'd appreciate it if you could share some tips.
Same here, I struggled to break below this range too, regardless of the approach.
I'm here for the tips too 👍
At least you got 0.003, mine is a disaster
Looks like boosting models are the successful approach here
I've got 0.0027 on the leaderboard with a LightGBM model, averaging the predictions of a 10-fold CV and stratifying the time series according to whether or not they contain a flood
For each day I simply used as features the day number (0-729), that day's precipitation value, and all the other days' precipitation values (729 lags)
I'm sure with some parameter tuning and better featurization the score can improve
I'm curious whether any numeric feature calculated from the images can help; in all my experiments the images were of no use
Thank you for sharing. So for a given day, say day t, you are using lags from day t-1, day t-2, ..., back to day 1?
I also wrap around: for day t I use the previous t-1 precipitation values and the following 730 - t ones, as well as the precipitation of day t itself, obviously
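For anyone confused by the wrap-around, here's a small sketch with made-up precipitation values. `np.roll` does the cycling, so row t starts with the day number, then day t's own value, then the other 729 values in wrapped order:

```python
import numpy as np

# One series of daily precipitation (stand-in values; real data differs).
n_days = 730
precip = np.arange(n_days, dtype=float)

# Row t = features for day t: day number, then the 730 precipitation values
# in wrapped order starting at day t (so t's own value, then t+1, ..., t-1).
rows = []
for t in range(n_days):
    wrapped = np.roll(precip, -t)          # wrapped[0] is day t itself
    rows.append(np.concatenate(([t], wrapped)))
X = np.array(rows)                         # shape (730, 731)
```

So every row sees all 730 days, just rotated so the "current" day is always in the same column position.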
Very interesting. Mine also works very well with LightGBM; in general I see classifiers work better than regressors. So far I'm still trying to combine LightGBM and MLPClassifier, since both models seem to work quite well.
Sounds great. Are you applying anything to the data? Any tips?
I'm using boosting models and I've tried adding lags as you said. I've tried winsorization, removing outliers, GroupKFold, StratifiedKFold, and extensive feature engineering, and still no significant boost. Am I missing something here?
Sorry to hear that, but the basic setup is really simple: just create lags for each day of each time series (I used cyclic wrap-around, but padding yields the same results) and binary-classify each day. No particular preprocessing, since tree models are not sensitive to data scale.
As for the split: for each time series ID, assign 1 if it contains a flood and 0 otherwise, then split the dataset so that training and validation have the same percentage of flood-containing series. Nothing else
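Here's a rough sketch of that split with toy data (the array names and where I place the flood are made up). The point is that you stratify at the *series* level, then expand back to per-day rows, so whole series stay together and both sides keep the same flood ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# One row per (series, day); y marks flood days (toy placement, real data differs).
n_series, n_days = 12, 730
series_id = np.repeat(np.arange(n_series), n_days)
y = np.zeros(n_series * n_days, dtype=int)
# Toy floods: even-numbered series flood on their day 100.
y[(series_id % 2 == 0) & (np.arange(n_series * n_days) % n_days == 100)] = 1

# Series-level label: 1 if the series contains any flood day.
has_flood = np.array([y[series_id == s].max() for s in range(n_series)])

# Split series IDs stratified on the flood label, then expand back to rows.
train_series, val_series = train_test_split(
    np.arange(n_series), test_size=0.2, stratify=has_flood, random_state=0)
train_rows = np.isin(series_id, train_series)
val_rows = np.isin(series_id, val_series)
```

Since the split happens on series IDs, no series ever ends up in both sides, which is what a plain row-level StratifiedKFold would get wrong here.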
Worked like magic! I'm really grateful.