Given the lack of right-hand-side variables for this in the test set, it is debatable whether one can actually gain by using them ... because they are not known in the future. The same goes for the left-hand-side variables ... even if you can model the autocorrelation, you have to predict so far into the unknown that I am not sure it is going to be useful ... so I did a little test ...
... below is an interesting, useful and extremely simple baseline. The host may find this useful too ... on the public leaderboard it scores around 0.2 (putting you firmly into the top 20 atm) ... and ... wait for it ... it only uses dummy variables! It simply predicts a repetitive weekly cycle for each beam.
There are of course many ways in which to improve this ... that is left as an exercise to the reader :-)
# Simple model for each beam
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

# Read data
ss = pd.read_csv("SampleSubmission.csv")
ty = pd.read_csv("traffic_DLThpVol.csv")
sub_fn = "simple dummy sub.csv"

# Configuration
n_base = 30
n_cell = 3
n_beam = 32
rs = 123

# Hour-of-day and day-of-week dummies (last level of each dropped)
def make_dummies(n):
    i = np.arange(n, dtype=int)
    df = pd.DataFrame(index=i)
    # Hour
    for j in range(24 - 1):
        df[f"h{j}"] = 1.0 * ((i % 24) == j)
    # Week
    for j in range(7 - 1):
        df[f"w{j}"] = 1.0 * (((i // 24) % 7) == j)
    return df

x_train = make_dummies(len(ty))
x_test = make_dummies(len(ty) + 1008).iloc[len(ty):]

# Prepare sample submission
ss = ss.set_index("ID", drop=False)
ss["Target"] = ss["Target"].astype("float16")

# Fit one model for each beam
for base in range(n_base):
    for cell in range(n_cell):
        for beam in range(n_beam):
            rs += 1
            mod_col = f"{base}_{cell}_{beam}"
            model = HistGradientBoostingRegressor(random_state=rs, loss="absolute_error")
            model.fit(x_train, ty[mod_col])
            pred = np.clip(model.predict(x_test), 0, 255)
            # Load predictions into the sample submission
            for k in range(168):
                ss.at[f"traffic_DLThpVol_test_5w-6w_{k}_{mod_col}", "Target"] = pred[k]
                ss.at[f"traffic_DLThpVol_test_10w-11w_{168 - k - 1}_{mod_col}", "Target"] = pred[-k - 1]

# Save submission
ss.to_csv(sub_fn, index=False)
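To see why this predicts a repetitive cycle: the dummy features repeat with a weekly (168-hour) period, so anything the model learns from them repeats too. A quick standalone check (reproducing the same make_dummies encoding as above):

import numpy as np
import pandas as pd

def make_dummies(n):
    # Same encoding as the baseline: 23 hour-of-day dummies and 6 day-of-week dummies
    i = np.arange(n, dtype=int)
    df = pd.DataFrame(index=i)
    for j in range(24 - 1):
        df[f"h{j}"] = 1.0 * ((i % 24) == j)
    for j in range(7 - 1):
        df[f"w{j}"] = 1.0 * (((i // 24) % 7) == j)
    return df

x = make_dummies(24 * 7 * 3)  # three weeks of hourly rows
# Every feature row repeats with a 168-hour period ...
print((x.values[:168] == x.values[168:336]).all())  # True

So the 1008-row test block is just the same 168-hour feature pattern repeated six times, which is why only one week of predictions needs to be placed into each test window.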
Thank you @skaak for sharing. But what is the exact score of this baseline? And how much time does it take to train?
For me, this scored 0.2016, but I used a different seed (I cleaned it up a bit when I copied it in here).
This really does not take long, just a few minutes to run.
Thank you.
thanks ... just a small clarification on submission
are we only supposed to submit one file?
Yes
Well, one at a time, not sure I understand your question? You can submit many times, but only one file at a time ...