Hey, I hope you are all doing well. I don't know if my approach will end up in the prize range, but I am sharing part of it before the final scores are out.
During this competition I tried several approaches that I expected to work (at least a little) but they didn't work at all (the score was around 200000) :').
My final approach was based on a CatBoost model, since it was doing well during cross-validation and the score was stable across the test sets.
-------
- If an item had no data for November but two different values for October, I assumed that one of them actually belonged to November.
- I then replaced those values with the mean of the actual value and the surrounding months, to make the series look like other years (see the imputation sketch after this list).
- Lag features of almost all numeric variables, for each item.
- Central-tendency measures of the sellin variable and some of the others (sellout, sellout_ch1, etc.), like the mean for each item over the last 3/6/9/12 months.
- Aggregates by date, like the average sellin across all items for each date (lagged as well; see the lag/rolling sketch after this list).
- One-hot encoding of the month feature.
- Features based on the price (binary variables for whether the price increased/decreased).
- Skewness & kurtosis of the last 6/12 sellin periods.
- A binary flag for new items (no sellin history before).
- The mean sellin of the same calendar month in previous years, for each item (see the last sketch after this list).
-------
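For the November fix, here is a rough sketch of what I did. The dataframe layout (columns `item`, `date`, `sellin`), the year, and the centred 3-month window are simplifications of my actual code:

```python
import pandas as pd

def fix_missing_november(df: pd.DataFrame, year: int = 2020) -> pd.DataFrame:
    """Reassign a duplicated October row to the missing November, then
    smooth both values with the surrounding months. The year and the
    column names here are assumptions, not the exact schema."""
    oct_, nov = pd.Timestamp(year, 10, 1), pd.Timestamp(year, 11, 1)
    df = df.sort_values(["item", "date"]).copy()
    for item, g in df.groupby("item"):
        octs = g.index[g["date"] == oct_]
        # Two different October values and no November row: assume the
        # second October value actually belongs to November.
        if (len(octs) == 2 and not (g["date"] == nov).any()
                and g.loc[octs, "sellin"].nunique() == 2):
            df.loc[octs[1], "date"] = nov
    df = df.sort_values(["item", "date"])
    # Replace the Oct/Nov values with a centred 3-month rolling mean so
    # this period looks like the same months in other years.
    smoothed = df.groupby("item")["sellin"].transform(
        lambda s: s.rolling(3, center=True, min_periods=1).mean())
    mask = df["date"].isin([oct_, nov])
    df.loc[mask, "sellin"] = smoothed[mask]
    return df
```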
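A minimal sketch of the lag / rolling-mean / date-aggregate features. The exact set of lags, windows, and column names here is illustrative, not my full list:

```python
import pandas as pd

def add_lag_and_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per-item lags, rolling means, and a lagged date-level aggregate.
    Column names (item, date, sellin, sellout, ...) are placeholders."""
    df = df.sort_values(["item", "date"]).copy()
    g = df.groupby("item")

    # Lag features of (almost) all numeric variables, per item.
    for col in ["sellin", "sellout", "sellout_ch1", "price"]:
        for lag in (1, 2, 3, 12):
            df[f"{col}_lag{lag}"] = g[col].shift(lag)

    # Mean of each item's sellin over the last 3/6/9/12 months,
    # shifted by one so the current month is never leaked.
    for w in (3, 6, 9, 12):
        df[f"sellin_mean_{w}m"] = g["sellin"].transform(
            lambda s, w=w: s.shift(1).rolling(w).mean())

    # Date-level aggregate: average sellin across all items per month,
    # lagged by one month (assumes contiguous monthly dates).
    date_mean = df.groupby("date")["sellin"].mean()
    df["date_sellin_mean_lag1"] = df["date"].map(date_mean.shift(1))
    return df
```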
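And a sketch of the remaining features from the list, again with placeholder column names:

```python
import pandas as pd

def add_misc_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["item", "date"]).copy()
    g = df.groupby("item")

    # Binary price-movement flags.
    price_diff = g["price"].diff()
    df["price_up"] = (price_diff > 0).astype(int)
    df["price_down"] = (price_diff < 0).astype(int)

    # Skewness & kurtosis of the last 6/12 sellin periods,
    # shifted so the current month is excluded.
    for w in (6, 12):
        df[f"sellin_skew_{w}m"] = g["sellin"].transform(
            lambda s, w=w: s.shift(1).rolling(w).skew())
        df[f"sellin_kurt_{w}m"] = g["sellin"].transform(
            lambda s, w=w: s.shift(1).rolling(w).kurt())

    # New-item flag: the row has no sellin history before it.
    df["is_new_item"] = g.cumcount().eq(0).astype(int)

    # Mean sellin of the same calendar month in previous years.
    df["cal_month"] = df["date"].dt.month
    df["same_month_mean"] = (
        df.groupby(["item", "cal_month"])["sellin"]
          .transform(lambda s: s.shift(1).expanding().mean()))

    # One-hot encoding of the month.
    return pd.get_dummies(df, columns=["cal_month"], prefix="month")
```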
During the competition I also tried to include (as features) predictions made by ARIMA or linear models to help with extrapolation, but it didn't work, so I removed them.
I think I didn't extract all the information from the channel features, since I didn't join this competition early and I lost a lot of time trying and improving approaches that didn't work at all lol. I also made some mistakes at the beginning that cost me time and motivation hahaha :)
I hope it was clear. I know it's not complete (and my English isn't that good), but I will try to answer any questions you have :)
PS: I had a lot of fun during the competition, thank you Zindi!
Great approach, thanks for sharing!
Have you checked the feature importance?
Yep. I looked at the feature importance during cross-validation and removed the features without any impact on the model.
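Roughly like this, per fold (simplified sketch; the names and the zero threshold are illustrative):

```python
from catboost import CatBoostRegressor

def select_features(X_train, y_train, X_valid, y_valid):
    """One fold of the idea: fit, read the built-in importances,
    and keep only the features with a non-zero impact."""
    model = CatBoostRegressor(iterations=1000, verbose=0)
    model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
    importances = model.get_feature_importance()
    return [col for col, imp in zip(X_train.columns, importances) if imp > 0]
```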
Thanks for sharing, great breakdown! I think that's the pitfall of this test set though: you could train models for 1-month, 2-month, etc. predictions to boost the LB score, when I thought the goal was 4-month forecasting... unless I'm wrong here.
Did you do any feature selection by any chance, or did you simply run with all the features? I personally found that with all those features my CatBoost model would overfit. Wondering if you pruned the model by cutting some features in the end?
I am not sure I understand your point, maybe my explanation wasn't clear. But the idea here was to use a direct multi-step forecasting strategy: one model per horizon, each trained to predict t+h directly (sketched below). I would be happy to learn if there is anything wrong with it :)
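To be concrete, here is what I mean by a direct strategy (simplified sketch, placeholder names):

```python
from catboost import CatBoostRegressor

def fit_direct_models(df, feature_cols, horizons=(1, 2, 3, 4)):
    """One CatBoost model per horizon h: the target is sellin shifted
    h months ahead, and the features only use information up to t."""
    models = {}
    for h in horizons:
        data = df.copy()
        data["target"] = data.groupby("item")["sellin"].shift(-h)
        data = data.dropna(subset=["target"])
        model = CatBoostRegressor(iterations=1000, verbose=0)
        model.fit(data[feature_cols], data["target"])
        models[h] = model
    return models
```

At inference time, each model h predicts month t+h from the last available training month, so the later months are genuinely forecast several steps ahead rather than from their immediately preceding month.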
Yeah, I did some feature selection, cutting the features with no importance during cross-validation.
Thanks for the response!
Regarding the solution, here is what I mean:
If I understood the competition correctly (I could be wrong!), the goal was to forecast demand 4 months in advance, i.e. predicting November 2021 using data up to July 2021 (07.2021). So if a model is trained to use October data to predict November, that's a 1-month forecast, not a 4-month forecast. Maybe the competition description wasn't clear enough about what it wanted, but if the goal was to forecast 4 months in advance, then the test data should have started from 02/2022, not 11/2021, if that makes sense?
Thank you for answering! I understand what you mean.
I think you misunderstood though: the goal is to predict the next 4 months, i.e. t+1, t+2, t+3 and t+4.
>but if the goal was to forecast 4-months in advance then they should have started test data from 2/2022 not 11/2021 if that makes sense?
Agree :)
Yeah, I think I got that wrong haha. I wish I had read it differently; I reckon I would have been right up there. Thanks anyway :) I learned a few things from your post.
Np, good luck for next time :)
Thanks for sharing!
Thanks for sharing, it was a nice approach!
I've tried several things like you: boosted trees, lags, descriptive stats of the target... but I felt my validation wasn't robust enough, since several features would overfit the CV or the LB.
Would you mind sharing your validation approach?
Great question!
I was wondering the same thing.
Hey Mario, I think we all struggled to find a robust validation method :')
At the beginning, I went with a fast approach (in terms of computation time). The idea was: train on the data before 2021-02-01 (exclusive) and test on the next four months; then train on the data before 2021-07-01 and test on the next four months; and take the mean of the two errors.
That first scheme wasn't reliable enough to keep going, so the second idea was: train on the data before 2021-01-01 (exclusive) and test on the next four months; then add one month to the training set (train on the data before 2021-02-01) and test on the next four months; and continue like this until 2021-07-01. The mean of all the errors was close to the LB, so I went with this method (sketched below).
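In code, the second scheme looks roughly like this; `train_and_score` stands in for the actual fit/predict/metric code:

```python
import pandas as pd

def expanding_window_cv(df, train_and_score,
                        first_cutoff="2021-01-01", last_cutoff="2021-07-01"):
    """Train on everything before each cutoff (exclusive), test on the
    next four months, move the cutoff one month forward, and average."""
    errors = []
    for cutoff in pd.date_range(first_cutoff, last_cutoff, freq="MS"):
        train = df[df["date"] < cutoff]
        test = df[(df["date"] >= cutoff) &
                  (df["date"] < cutoff + pd.DateOffset(months=4))]
        errors.append(train_and_score(train, test))
    return sum(errors) / len(errors)
```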