I think they'd have to wait till the end of code review
Okay. Thanks
I will gladly share my solutions once all have sent their code for review
Thank you data_style_bender and congratulations on the win 🏆. Well done on your dedication and hard work, and to the rest of the winners. You guys really tried.
I didn't top the private LB, but here's my approach.
My Solution:
I am also curious to learn about the winning ideas, especially how they approached cross-validation and how they aggregated their data/labels.
Data
I took a slightly different approach from the starter notebook.
I rounded the 5-minute interval readings to the nearest hour (made no difference?), then aggregated the data by Datetime (or Date - made no difference?) and Source.
I dropped all duplicates in the full train set, keeping only the first/last data points, then merged with the weather data. I experimented with both first and last and stuck with first because it gave my best public LB score.
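As a rough sketch of that aggregation step - toy readings and a hypothetical weather frame; only the Datetime/Source/kwh column names come from the post:

```python
import pandas as pd

# Toy 5-minute readings (values made up for illustration)
train = pd.DataFrame({
    'Datetime': pd.to_datetime(['2021-01-01 09:02', '2021-01-01 09:07',
                                '2021-01-01 09:57', '2021-01-01 10:02']),
    'Source': ['A', 'A', 'A', 'A'],
    'kwh': [1.0, 2.0, 3.0, 4.0],
})

# Round each 5-minute reading to the nearest hour...
train['Datetime'] = train['Datetime'].dt.round('h')

# ...then keep only the first reading per (Datetime, Source) pair
train = train.drop_duplicates(subset=['Datetime', 'Source'], keep='first')

# Merge with (hypothetical) hourly weather data on the rounded timestamp
weather = pd.DataFrame({
    'Datetime': pd.to_datetime(['2021-01-01 09:00', '2021-01-01 10:00']),
    'temp': [20.0, 21.0],
})
merged = train.merge(weather, on='Datetime', how='left')
```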
I engineered a new set of features from both the weather data and the full train data - cyclic time features, statistical features, lag features (dropped due to overfitting), season, wind speed, etc.
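The cyclic time features can be sketched like this - encoding hour-of-day as sine/cosine on the unit circle so 23:00 and 00:00 end up close together (the column names here are illustrative, not the post's):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hour': [0, 6, 12, 18]})

# Map hour-of-day onto the unit circle: 24h period
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
```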
I ran two parallel streams of experiments: one on data with capped weather variables and one without.
The categorical features - Season and Source - were label encoded.
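A minimal sketch of the label encoding, with made-up category values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values; only the column names come from the post
df = pd.DataFrame({'Source': ['grid', 'solar', 'grid'],
                   'season': ['dry', 'wet', 'dry']})

# Replace each category with an integer code (sorted alphabetically by sklearn)
for col in ['Source', 'season']:
    df[col] = LabelEncoder().fit_transform(df[col])
```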
Total # of features: 69
Cross Validation Strategy
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Stratify on the season-month group combined with the binned target
all_data['season_month_group'] = all_data['season'].astype(str) + '_' + all_data['month'].astype(str)
all_data['bins'] = pd.cut(all_data['kwh'], bins=num_bins, labels=False)
all_data['bins'].hist()
all_data['fold'] = -1
stratify = all_data['season_month_group'].astype(str) + all_data['bins'].astype(str)
strat_kfold = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
for i, (_, val_index) in enumerate(strat_kfold.split(all_data, stratify)):
    all_data.iloc[val_index, -1] = i
Modeling
Models:
- XGB (dropped the categoricals Source and Season)
- LGBM (categorical features = ['Source', 'season', 'month', 'data_user', 'mday', 'consumer_device'])
- ENSEMBLE MODEL on all features: 12 models in total using random forest and extra-trees regressors, 6 each, where each model was built with a different number of trees, from 50 to 175 in steps of 25
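The ensemble model could be built roughly like this - synthetic data below; the post only specifies the two regressor families and the tree counts:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic regression data as a stand-in for the real features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# 6 random forests + 6 extra-trees, one per tree count 50, 75, ..., 175
tree_counts = range(50, 200, 25)
models = [cls(n_estimators=n, random_state=0)
          for cls in (RandomForestRegressor, ExtraTreesRegressor)
          for n in tree_counts]
for m in models:
    m.fit(X, y)

# With 6 models per family, a plain mean equals 0.5*RF + 0.5*ET
preds = np.mean([m.predict(X) for m in models], axis=0)
```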
Experiments
I will just describe my selected private-score submission.
I eventually experimented only with data where the weather variables were not capped, because all experiments that used the raw data scored better on the public LB than those with capped data.
Final submission = 2 XGB (dropped Source and Season) + ENSEMBLE MODEL (0.5*RANDOM FOREST + 0.5*EXTRA TREES)
Each model was trained on the same data to generate out-of-fold predictions.
Ensemble
Used scipy.optimize to search for the best blending weights from the OOF predictions:
import scipy.optimize

# min_func evaluates a candidate weight vector against the OOF predictions
res = scipy.optimize.minimize(min_func, [1/3]*3, method='Nelder-Mead', tol=1e-6)
ypredtest = res.x[0]*modelA['kwh'] + res.x[1]*modelB['kwh'] + res.x[2]*modelC['kwh']
forecast['kwh'] = ypredtest*0.1 (I didn't tune this scaling factor; I sensed the predictions were just too large, perhaps due to the un-normalised weights suggested by scipy.optimize)
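min_func isn't shown above; a plausible version, assuming it returns the RMSE of the weighted OOF blend against the true target (stand-in data below - the real OOF arrays come from the trained models):

```python
import numpy as np
import scipy.optimize

# Stand-in target and three noisy OOF prediction vectors
rng = np.random.default_rng(42)
y_true = rng.normal(size=100)
oof = [y_true + rng.normal(scale=s, size=100) for s in (0.1, 0.2, 0.3)]

def min_func(weights):
    """RMSE of the weighted blend of the OOF predictions."""
    blend = sum(w * p for w, p in zip(weights, oof))
    return np.sqrt(np.mean((y_true - blend) ** 2))

res = scipy.optimize.minimize(min_func, [1/3]*3, method='Nelder-Mead', tol=1e-6)
```

Note that Nelder-Mead places no constraint on the weights, so they need not sum to 1 - consistent with the un-normalised weights mentioned above.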
Selected submissions:
- Best manual ensemble: 6.34/5.36
- Scipy optimize ensemble: 6.68/5.16
Key learnings:
- I spent so much of my time thinking about cross-validation and target representation that I even forgot what I did to get my best LB score (always find a way to log/keep track of good runs)
- A few hours before the end of the comp, I had to dump every other experiment and dig through my Colab version history to find my best LB notebook
- I regret not adding more models that I had worked on - CatBoost, LightGBM and an NN - to the final ensemble
- Build a grounded intuition and trust it
- You never know if something works until you try it
I appreciate this. "You never know if something works until you try it" - I love this statement. Thank you for sharing.
100i never disappoints. Well done big man.
Thank you CodeJoe. Congrats on your win! I learnt a lot from your notebooks. Keep doing more bro!