Every time I enter a competition, I like to begin with simple baselines. For time series competitions, I believe that one of the simplest baselines is the moving average, and that's exactly how you can achieve an RMSE of around 24.
import pandas as pd

df = pd.read_csv('Train.csv')
submission = pd.read_csv('SampleSubmission.csv')
df = df.dropna(subset=['clicks'])
df.reset_index(drop=True, inplace=True)
df = df[['ID', 'date', 'clicks']]
df = df.sort_values(by=['ID', 'date']).reset_index(drop=True)
For this competition, you need to decide on the granularity of the data you want to work with. I've been experimenting with both df.groupby(['date', 'ID']) and df.groupby(['ID', 'date', 'ad_type']).
For this simple baseline, the choice of granularity does not matter, so I will proceed with df.groupby(['date', 'ID']):
grouped_df = df.groupby(['date', 'ID']).sum()
grouped_df = grouped_df.reset_index()
grouped_df['date'] = pd.to_datetime(grouped_df['date'])
This is what our data looks like:
Now we need code that identifies the last date for each ID and forecasts two weeks ahead using the moving average. The little trick here is that the forecast must be dynamic: each newly forecasted row is also included in the calculation of the subsequent forecasts, and so on.
There are two parameters that you can play with:
window_size = 13
forecast_horizon = 16
With a window size of 13, you get an RMSE of 24.8. Feel free to try other numbers.
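The dynamic forecast described above can be sketched roughly as follows. This is only a minimal illustration and not necessarily the code from the notebook: the function name recursive_moving_average and the toy data are made up for the example, and it assumes one row per ID per day with no missing dates, so that "the last window_size rows" and "the last window_size days" mean the same thing.

```python
import pandas as pd


def recursive_moving_average(history, window_size=13, forecast_horizon=16):
    """Extend each ID's daily series by `forecast_horizon` days.

    Each forecast is the mean of the last `window_size` values, and every
    forecast is appended to the series so it feeds into the next one.
    (Hypothetical helper, written for illustration only.)
    """
    frames = []
    for series_id, group in history.groupby('ID'):
        group = group.sort_values('date')
        values = group['clicks'].tolist()
        last_date = group['date'].max()

        forecasts = []
        for _ in range(forecast_horizon):
            window = values[-window_size:]        # may be shorter for short series
            prediction = sum(window) / len(window)
            values.append(prediction)             # the forecast joins the history
            forecasts.append(prediction)

        future = pd.DataFrame({
            'date': pd.date_range(last_date + pd.Timedelta(days=1),
                                  periods=forecast_horizon, freq='D'),
            'ID': series_id,
            'clicks': forecasts,
            'is_forecast': True,
        })
        frames.append(pd.concat([group.assign(is_forecast=False), future]))

    return pd.concat(frames, ignore_index=True)


# Toy data (not from the competition) to show the recursion.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'ID': 'ID_A',
    'clicks': [10.0, 20.0, 30.0],
})
result = recursive_moving_average(toy, window_size=2, forecast_horizon=3)
print(result)
```

With window_size=2, the first forecast averages 20 and 30, the second averages 30 and the first forecast, and so on, which is exactly the "dynamic" behaviour: forecasts gradually flatten out as they feed on themselves.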
Note that I have created a new column indicating whether a given row is from the historical period or not. The reason for this is to make it easy to identify our train and test dataframes.
train = grouped_df[grouped_df['is_forecast'] == False]
test = grouped_df[grouped_df['is_forecast'] == True]
After that, we simply need to identify the dates in test that match the submission dates, and we are all set.
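One way to do that matching is to build a lookup key from ID and date and map it onto the submission file. The sketch below is an illustration under loud assumptions: I am assuming here that the submission has an 'ID' column of the form <ID>_<YYYY_MM_DD> and a 'clicks' column, which may not match the real SampleSubmission.csv, so check the actual file and adjust the key format. The toy frames stand in for test and submission.

```python
import pandas as pd

# Toy forecast rows standing in for the `test` dataframe.
test = pd.DataFrame({
    'date': pd.to_datetime(['2024-02-01', '2024-02-02', '2024-02-01']),
    'ID': ['ID_A', 'ID_A', 'ID_B'],
    'clicks': [12.5, 13.0, 4.0],
})

# ASSUMED submission layout: one key per ID and date, plus a clicks column.
submission = pd.DataFrame({
    'ID': ['ID_A_2024_02_02', 'ID_B_2024_02_01', 'ID_B_2024_02_02'],
    'clicks': 0.0,
})

# Build the same key on the forecast side, then look each submission row up.
test['key'] = test['ID'] + '_' + test['date'].dt.strftime('%Y_%m_%d')
lookup = test.set_index('key')['clicks']

# Rows with no matching forecast fall back to 0 (a choice made for this sketch).
submission['clicks'] = submission['ID'].map(lookup).fillna(0.0)
print(submission)
```

Filling unmatched rows with 0 is just one option; if some IDs need a longer horizon than you forecasted, it is probably better to increase forecast_horizon than to rely on the fallback.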
The complete code is here:
https://github.com/yanteixeira/forecast_ads_clicks/blob/main/moving_average_24.ipynb
appreciated
thanks @yanteixeira
I was wondering if you had any luck integrating one of the popular time series libraries with this specific dataset, or are you just running your experiments "manually" and using sklearn instead?
Yan, you set it off, LB got about 10+ scores of 24.8.
Your notebook and approach are impressive, Yan!!!
@yanteixeira Is there a reason you set window_size to 13 and forecast_horizon to 16? Is it based on the fact that for some IDs we are forecasting 8 days etc., or is the window size just something we play with to see how the moving average, which is a naive mean forecasting technique, will behave?
Thanks