Your baseline is here: 24.8 RMSE

Notebooks · 13 May 2024, 21:40 · 6

Every time I enter a competition, I like to begin with simple baselines. For time series competitions, I believe one of the simplest baselines is the moving average, and that is exactly how you can achieve an RMSE of around 24.8.

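import pandas as pd

# Load the training data and the sample submission
df = pd.read_csv('Train.csv')
submission = pd.read_csv('SampleSubmission.csv')

# Drop rows with a missing target
df = df.dropna(subset=['clicks'])

# Keep only the columns needed for the baseline, sorted by ID and date
df = df[['ID', 'date', 'clicks']]
df = df.sort_values(by=['ID', 'date']).reset_index(drop=True)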

For this competition, you need to decide on the granularity of data you want to work with. I've been experimenting with both df.groupby(['date', 'ID']) and df.groupby(['ID', 'date', 'ad_type']).

When working at the granularity of ads, you have more data, but you may encounter gaps in the time series or insufficient data for specific combinations of ID and ad_type.
When working at the ID granularity, you have less data, and the time series becomes noisy because you are combining trends from different ad types. The upside is that you have continuous data for extended periods of time.
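
One way to see how gappy each granularity is before choosing is to count the missing days per series. The sketch below is illustrative only: it uses a small made-up frame in place of Train.csv, assumes daily data, and the helper name count_missing_days is hypothetical.

```python
import pandas as pd

def count_missing_days(frame, keys, date_col="date"):
    """Per group, count calendar days with no row between the group's
    first and last observed date (daily data is assumed)."""
    dates = pd.to_datetime(frame[date_col])
    grouped = dates.groupby([frame[k] for k in keys])
    span_days = (grouped.max() - grouped.min()).dt.days + 1
    observed_days = grouped.nunique()
    return (span_days - observed_days).rename("missing_days")

# Made-up data, for illustration only
toy = pd.DataFrame({
    "ID": ["a", "a", "a", "a", "b", "b"],
    "ad_type": ["x", "y", "x", "x", "x", "x"],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03",
             "2024-01-04", "2024-01-01", "2024-01-02"],
    "clicks": [1, 2, 3, 4, 5, 6],
})

gaps_by_id = count_missing_days(toy, ["ID"])
gaps_by_ad = count_missing_days(toy, ["ID", "ad_type"])
print(gaps_by_id)
print(gaps_by_ad)
```

In this toy frame the ID-level series for "a" has no gaps, while the ("a", "x") series is missing one day, which is the trade-off described above.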

For this simple baseline, the choice of granularity does not matter, so I will proceed with df.groupby(['date', 'ID'])

# Sum clicks per date and ID, then parse the dates
grouped_df = df.groupby(['date', 'ID']).sum().reset_index()
grouped_df['date'] = pd.to_datetime(grouped_df['date'])

The data now has one row per date and ID, with the total clicks for that pair.

Now we need code that identifies the last date for each ID and forecasts about two weeks ahead using the moving average. The little trick here is that the forecast must be dynamic: each newly forecasted row is included in the calculation of the subsequent forecasts, and so on.
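
A minimal sketch of such a dynamic forecast, not the notebook's actual code. It assumes one row per ID per day with columns ID, date (datetime) and clicks; the function name recursive_moving_average and the toy data are hypothetical. If a series is shorter than the window, it simply averages whatever values exist.

```python
import pandas as pd

def recursive_moving_average(history, window_size=13, forecast_horizon=16):
    """Extend each ID's series with a recursive moving-average forecast."""
    pieces = []
    for series_id, group in history.sort_values("date").groupby("ID"):
        values = group["clicks"].tolist()
        last_date = group["date"].max()
        rows = []
        for step in range(1, forecast_horizon + 1):
            # Average the latest values, earlier forecasts included
            window = values[-window_size:]
            prediction = sum(window) / len(window)
            values.append(prediction)
            rows.append({
                "ID": series_id,
                "date": last_date + pd.Timedelta(days=step),  # daily steps assumed
                "clicks": prediction,
                "is_forecast": True,
            })
        pieces.append(group.assign(is_forecast=False))
        pieces.append(pd.DataFrame(rows))
    return pd.concat(pieces, ignore_index=True)

# Made-up data, for illustration only
toy = pd.DataFrame({
    "ID": ["a"] * 3,
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "clicks": [2.0, 4.0, 6.0],
})
result = recursive_moving_average(toy, window_size=2, forecast_horizon=2)
print(result)
```

With a window of 2, the first forecast averages 4 and 6 to give 5.0, and the second averages 6 and that forecast to give 5.5, which is what makes the forecast dynamic.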

There are two parameters that you can play with:

window_size = 13  # number of most recent values averaged for each forecast
forecast_horizon = 16  # number of steps to forecast ahead

With a window size of 13, you get an RMSE of 24.8. Feel free to try other numbers.
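
If you want to compare window sizes locally before submitting, one option is a simple holdout: hide the last few values of a series, forecast them, and score with RMSE. This is a sketch on a made-up list of numbers, with hypothetical helper names; a local score like this will not necessarily match the leaderboard RMSE.

```python
import math

def recursive_forecast(values, window_size, horizon):
    """Recursive moving average over a plain list of numbers."""
    extended = list(values)
    for _ in range(horizon):
        window = extended[-window_size:]
        extended.append(sum(window) / len(window))
    return extended[len(values):]

def holdout_rmse(values, window_size, horizon):
    """Hide the last `horizon` values, forecast them, and return the RMSE.

    Assumes horizon >= 1 and len(values) > horizon.
    """
    train, actual = values[:-horizon], values[-horizon:]
    predicted = recursive_forecast(train, window_size, horizon)
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up series, for illustration only
toy_series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]
scores = {w: holdout_rmse(toy_series, w, horizon=3) for w in (1, 2, 4)}
best_window = min(scores, key=scores.get)
print(scores, best_window)
```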

Note that I have created a new column, is_forecast, indicating whether a given row is a forecast or comes from the historical period. This makes it easy to identify our train and test dataframes.

train = grouped_df[grouped_df['is_forecast'] == False]

test = grouped_df[grouped_df['is_forecast'] == True]

After that, we simply need to identify the dates in test that match the submission dates, and we are all set.
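
As a rough sketch of that last step, assuming the submission rows can be expressed as (ID, date) pairs. The layout of SampleSubmission.csv is not shown in this post, so the column names and the helper fill_submission are hypothetical, and a real submission file may need its key column parsed first. Filling unmatched rows with 0 is an arbitrary placeholder choice.

```python
import pandas as pd

def fill_submission(submission, test):
    """Left-join forecasts onto the submission rows by (ID, date).

    Both frames are assumed to have 'ID' and 'date' columns.
    """
    forecasts = test[["ID", "date", "clicks"]]
    merged = submission[["ID", "date"]].merge(
        forecasts, on=["ID", "date"], how="left"
    )
    # Placeholder choice: rows without a matching forecast get 0 clicks
    merged["clicks"] = merged["clicks"].fillna(0.0)
    return merged

# Made-up data, for illustration only
toy_test = pd.DataFrame({
    "ID": ["a", "a", "b"],
    "date": pd.to_datetime(["2024-01-04", "2024-01-05", "2024-01-04"]),
    "clicks": [5.0, 5.5, 1.0],
})
toy_submission = pd.DataFrame({
    "ID": ["a", "b", "b"],
    "date": pd.to_datetime(["2024-01-05", "2024-01-04", "2024-01-09"]),
})
filled = fill_submission(toy_submission, toy_test)
print(filled)
```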

The complete code is here:

https://github.com/yanteixeira/forecast_ads_clicks/blob/main/moving_average_24.ipynb

Discussion 6 answers

Alkhamed

appreciated

13 May 2024, 22:44

Upvotes 1

Koleshjr

Multimedia university of kenya

thanks @yanteixeira

14 May 2024, 02:14

Upvotes 1

ahmedo42

I was wondering if you had any luck integrating one of the popular time series libraries with this specific dataset, or are you just doing your experiments "manually" and using sklearn instead?

14 May 2024, 18:47

Upvotes 1

Jaw22

Zindi africa

Yan, you set it off, LB got about 10+ scores of 24.8.

Your notebook and approach are impressive, Yan!!!

14 May 2024, 21:43

Upvotes 1

AdeptSchneider22

Kenyatta University

@yanteixeira Is your choice of window_size = 13 and forecast_horizon = 16 based on the fact that there are some IDs that we're forecasting for 8 days etc., or is the window size just something we play with to see how the moving average, which is a naive mean-based forecasting technique, will behave?

15 May 2024, 07:13

Upvotes 2

michaelawe

Thanks

17 May 2024, 07:13

Upvotes 1
