Hey, I hope you are all doing well. I don't know if my approach will end up in the prize range, but I am sharing part of it before the final scores are out.
During this competition I tried several approaches that I expected to work (at least a little) but they didn't work at all (the score was around 200000) :').
My final approach was based on a CatBoost model, since it was doing well during cross-validation and the score was stable across the test sets.
-------
- If an item had no data for November but two different values for October, I assumed that one of them actually belonged to November.
- I then replaced those values with the mean of the actual value and the surrounding months, to make the series look like other years (see the imputation sketch after this list).
- Lag features of almost all numeric variables, for each item.
- Central-tendency measures of the sellin variable and some of the others (sellout, sellout_ch1, etc.), like the mean for each item over the last 3/6/9/12 months.
- Aggregates by date, like the average sellin across all items for each date (lagged as well; see the lag/rolling sketch after this list).
- One-hot encoding of the month feature.
- Features based on the price (binary variables for whether the price increased/decreased).
- Skewness & kurtosis of the last 6/12 sellin periods.
- A binary flag for new items (no sellin history before).
- The mean sellin of the same calendar month in previous years, for each item (see the last sketch after this list).
-------
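For the November fix, here is a rough sketch of what I did. The dataframe layout (columns `item`, `date`, `sellin`), the year, and the centred 3-month window are simplifications of my actual code:

```python
import pandas as pd

def fix_missing_november(df: pd.DataFrame, year: int = 2020) -> pd.DataFrame:
    """Reassign a duplicated October row to the missing November, then
    smooth both values with the surrounding months. The year and the
    column names here are assumptions, not the exact schema."""
    oct_, nov = pd.Timestamp(year, 10, 1), pd.Timestamp(year, 11, 1)
    df = df.sort_values(["item", "date"]).copy()
    for item, g in df.groupby("item"):
        octs = g.index[g["date"] == oct_]
        # Two different October values and no November row: assume the
        # second October value actually belongs to November.
        if (len(octs) == 2 and not (g["date"] == nov).any()
                and g.loc[octs, "sellin"].nunique() == 2):
            df.loc[octs[1], "date"] = nov
    df = df.sort_values(["item", "date"])
    # Replace the Oct/Nov values with a centred 3-month rolling mean so
    # this period looks like the same months in other years.
    smoothed = df.groupby("item")["sellin"].transform(
        lambda s: s.rolling(3, center=True, min_periods=1).mean())
    mask = df["date"].isin([oct_, nov])
    df.loc[mask, "sellin"] = smoothed[mask]
    return df
```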
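A minimal sketch of the lag / rolling-mean / date-aggregate features. The exact set of lags, windows, and column names here is illustrative, not my full list:

```python
import pandas as pd

def add_lag_and_rolling_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per-item lags, rolling means, and a lagged date-level aggregate.
    Column names (item, date, sellin, sellout, ...) are placeholders."""
    df = df.sort_values(["item", "date"]).copy()
    g = df.groupby("item")

    # Lag features of (almost) all numeric variables, per item.
    for col in ["sellin", "sellout", "sellout_ch1", "price"]:
        for lag in (1, 2, 3, 12):
            df[f"{col}_lag{lag}"] = g[col].shift(lag)

    # Mean of each item's sellin over the last 3/6/9/12 months,
    # shifted by one so the current month is never leaked.
    for w in (3, 6, 9, 12):
        df[f"sellin_mean_{w}m"] = g["sellin"].transform(
            lambda s, w=w: s.shift(1).rolling(w).mean())

    # Date-level aggregate: average sellin across all items per month,
    # lagged by one month (assumes contiguous monthly dates).
    date_mean = df.groupby("date")["sellin"].mean()
    df["date_sellin_mean_lag1"] = df["date"].map(date_mean.shift(1))
    return df
```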
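And a sketch of the remaining features from the list, again with placeholder column names:

```python
import pandas as pd

def add_misc_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["item", "date"]).copy()
    g = df.groupby("item")

    # Binary price-movement flags.
    price_diff = g["price"].diff()
    df["price_up"] = (price_diff > 0).astype(int)
    df["price_down"] = (price_diff < 0).astype(int)

    # Skewness & kurtosis of the last 6/12 sellin periods,
    # shifted so the current month is excluded.
    for w in (6, 12):
        df[f"sellin_skew_{w}m"] = g["sellin"].transform(
            lambda s, w=w: s.shift(1).rolling(w).skew())
        df[f"sellin_kurt_{w}m"] = g["sellin"].transform(
            lambda s, w=w: s.shift(1).rolling(w).kurt())

    # New-item flag: the row has no sellin history before it.
    df["is_new_item"] = g.cumcount().eq(0).astype(int)

    # Mean sellin of the same calendar month in previous years.
    df["cal_month"] = df["date"].dt.month
    df["same_month_mean"] = (
        df.groupby(["item", "cal_month"])["sellin"]
          .transform(lambda s: s.shift(1).expanding().mean()))

    # One-hot encoding of the month.
    return pd.get_dummies(df, columns=["cal_month"], prefix="month")
```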
During the competition I also tried to include (as features) predictions made by ARIMA or linear models to help with extrapolation, but it didn't work, so I removed them.
I think I didn't extract all the information from the channel features, since I didn't join this competition early and I lost a lot of time trying and improving approaches that didn't work at all lol. I also made some mistakes at the beginning that cost me time and motivation hahaha :)
I hope it was clear. I know it's not complete (and my English isn't that good), but I will try to answer any questions you have :)
PS: I had a lot of fun during the competition, thank you Zindi!
Great approach, thanks for sharing!
Have you checked the feature importance?
Yep. I looked at the feature importance during cross-validation and removed the features without any impact on the model.
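Roughly like this, per fold (simplified sketch; the names and the zero threshold are illustrative):

```python
from catboost import CatBoostRegressor

def select_features(X_train, y_train, X_valid, y_valid):
    """One fold of the idea: fit, read the built-in importances,
    and keep only the features with a non-zero impact."""
    model = CatBoostRegressor(iterations=1000, verbose=0)
    model.fit(X_train, y_train, eval_set=(X_valid, y_valid))
    importances = model.get_feature_importance()
    return [col for col, imp in zip(X_train.columns, importances) if imp > 0]
```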
Thanks for sharing, great breakdown! I think that's the pitfall of this test set though: you could train models for 1-month, 2-month, etc. predictions to boost the LB score, when I thought the goal was 4-month forecasting... unless I'm wrong here.
Did you do any feature selection by any chance, or did you simply run with all the features? I personally found that with all those features my CatBoost model would overfit. Wondering if you pruned the model by cutting some features in the end?
I am not sure I understand your point, maybe my explanation wasn't clear. But the idea here was to use a direct multi-step forecasting strategy: one model per horizon, each trained to predict t+h directly (sketched below). I would be happy to learn if there is anything wrong with it :)
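To be concrete, here is what I mean by a direct strategy (simplified sketch, placeholder names):

```python
from catboost import CatBoostRegressor

def fit_direct_models(df, feature_cols, horizons=(1, 2, 3, 4)):
    """One CatBoost model per horizon h: the target is sellin shifted
    h months ahead, and the features only use information up to t."""
    models = {}
    for h in horizons:
        data = df.copy()
        data["target"] = data.groupby("item")["sellin"].shift(-h)
        data = data.dropna(subset=["target"])
        model = CatBoostRegressor(iterations=1000, verbose=0)
        model.fit(data[feature_cols], data["target"])
        models[h] = model
    return models
```

At inference time, each model h predicts month t+h from the last available training month, so the later months are genuinely forecast several steps ahead rather than from their immediately preceding month.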
Yeah, I did some feature selection, cutting the features with no importance during cross-validation.
Thanks for the response!
Regarding the solution, here is what I mean:
If I understood the competition correctly (I could be wrong!), the goal was to forecast demand 4 months in advance, i.e. predicting November 2021 using data up to July 2021 (07.2021). So if a model is trained to use October data to predict November, that's a 1-month forecast, not a 4-month forecast. Maybe the competition description wasn't clear enough about what it wanted, but if the goal was to forecast 4 months in advance, then the test data should have started from 02/2022, not 11/2021, if that makes sense?
Thank you for answering! I understand what you mean.
I think you misunderstood though: the goal is to predict the next 4 months, i.e. t+1, t+2, t+3 and t+4.
>but if the goal was to forecast 4-months in advance then they should have started test data from 2/2022 not 11/2021 if that makes sense?
Agree :)
Yeah, I think I got that wrong haha. I wish I had read it differently; I reckon I would have been right up there. Thanks anyway :) I learned a few things from your post.
Np, good luck for next time :)
Thanks for sharing!
Thanks for sharing, it was a nice approach!
I've tried several things like you: boosted trees, lags, descriptive stats of the target... but I felt my validation wasn't robust enough, since several features would overfit the CV or the LB.
Would you mind sharing your validation approach?
Great question!
I was wondering the same thing.
Hey Mario, I think we all struggled to find a robust validation method :')
At the beginning, I went with a fast approach (in terms of computation time). The idea was: train on the data before 2021-02-01 (exclusive) and test on the next four months; then train on the data before 2021-07-01 and test on the next four months; and take the mean of the two errors.
That first scheme wasn't reliable enough to keep going, so the second idea was: train on the data before 2021-01-01 (exclusive) and test on the next four months; then add one month to the training set (train on the data before 2021-02-01) and test on the next four months; and continue like this until 2021-07-01. The mean of all the errors was close to the LB, so I went with this method (sketched below).
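In code, the second scheme looks roughly like this; `train_and_score` stands in for the actual fit/predict/metric code:

```python
import pandas as pd

def expanding_window_cv(df, train_and_score,
                        first_cutoff="2021-01-01", last_cutoff="2021-07-01"):
    """Train on everything before each cutoff (exclusive), test on the
    next four months, move the cutoff one month forward, and average."""
    errors = []
    for cutoff in pd.date_range(first_cutoff, last_cutoff, freq="MS"):
        train = df[df["date"] < cutoff]
        test = df[(df["date"] >= cutoff) &
                  (df["date"] < cutoff + pd.DateOffset(months=4))]
        errors.append(train_and_score(train, test))
    return sum(errors) / len(errors)
```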