I know not many people have read the Model Integrity discussion, so just to let you all know: the organizer has clarified that the model MUST NOT take 'future' info as input.
https://zindi.africa/competitions/aiml-for-5g-energy-consumption-modelling/discussions/18079
This is a weird rule, because for some base stations in the test set we need to predict the target from the past. The data are missing from the beginning of the time series, so in fact the model has to recover 'Energy' using future data.
The problem is that at runtime, or in the real world, you won't have a sneak peek at future data.
I understand the data leakage problem, but as I understand the task, the goal is to recover missing data and predict the target for new base stations that are not present in train. If we follow the mentioned rule, we should use only calendar features from the time column...
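To make "only calendar features from the time column" concrete, here is a minimal sketch. The timestamp format and the feature names are my assumptions for illustration, not something the host specified.

```python
from datetime import datetime


def calendar_features(timestamp: str) -> dict:
    """Derive features that depend only on the timestamp itself.

    The "%Y-%m-%d %H:%M:%S" format is an assumption about the time column.
    """
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": t.hour,
        "day_of_week": t.weekday(),           # Monday = 0, Sunday = 6
        "is_weekend": int(t.weekday() >= 5),  # 1 on Saturday or Sunday
    }


print(calendar_features("2023-01-01 13:00:00"))
```

Features like these never depend on other rows, so they cannot leak past or future measurements.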
I was wondering about this, because otherwise it becomes an interpolation problem and you can get a very good score. So I suppose Zindi will check to make sure you do not use future values.
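A rough sketch of why this matters: filling a gap by interpolation needs the next known value, while a past-only fill does not. The numbers and function names below are made up for illustration.

```python
def interpolate_linear(values):
    """Fill None gaps using the previous AND the next known value (looks ahead)."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled


def carry_forward(values):
    """Fill None gaps using only the most recent past value (no look-ahead)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical energy readings with a three-step gap.
energy = [10.0, None, None, None, 18.0]
print(interpolate_linear(energy))  # uses the future value 18.0
print(carry_forward(energy))       # uses only the past value 10.0
```

If the true series changes smoothly, the interpolated fill will usually be much closer, which is exactly why it would need to be checked under a "no future values" rule.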
This makes sense of course, you want to predict at a given time without having seen the future, thanks for also clarifying it here.
I assume this only applies to the energy values?
The model is not intended to be used for live forecasting - the host even stated that it is not a time series problem. The goal is to be able to estimate how a particular BS configuration, especially new ones, will behave in terms of energy consumption.
Therefore, an application scenario could be that at test time (at a static point in time), a day or a week of measurements needs to be evaluated in terms of energy consumption.
Since energy consumption depends on the time of day, temporal information is obviously required. Since this is not a time series forecasting problem, it should not be a problem to use, for example, future load values as input. These would be given at the test time anyway. If I understand the scope correctly, only the energy values should be considered problematic.
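For anyone unsure what "future load values as input" would look like in practice, here is a small sketch of lag (backward-looking) versus lead (forward-looking) features. The `shift` helper and the load numbers are hypothetical.

```python
def shift(values, k):
    """Shift a series by k steps.

    k > 0 gives a lag feature (value from k steps in the past),
    k < 0 gives a lead feature (value from |k| steps in the future).
    Positions with no source value are filled with None.
    """
    n = len(values)
    return [values[i - k] if 0 <= i - k < n else None for i in range(n)]


# Hypothetical hourly load for one base station.
load = [0.2, 0.5, 0.9, 0.4]

load_lag1 = shift(load, 1)    # past load only
load_lead1 = shift(load, -1)  # future load, the kind of input under debate

print(load_lag1)
print(load_lead1)
```

Under the reading above, a feature like `load_lead1` would be fine because load is given at test time, while the same shift applied to the energy column would be the problematic case.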
Hmmmm - interesting argument. I do not agree with you entirely, and I understand the rule to state that you can only look backwards. Even if you have a new configuration, you could perhaps train on forward values, but once trained, the model is not useful in a live environment, because there you do not have access to those values, even if you would have a set containing them a week later. But perhaps the host can clarify? @Zindi ? @Koleshjr ? Does this restriction apply to just energy or to all data? And how will this be verified eventually?
Okay, I'm not the best person to answer this, but @nicolapiovesan may be able to.
Your point would be applicable in a live environment. However, the model is not meant to be used in a live environment, as the three stated objectives are more focused on accurate static prediction and generalization. Hence, I assume the model will be used to assess different base station configurations under otherwise equal conditions. This is also why the test set does not strictly contain future samples. In fact, using time information might even be important to accurately disentangle configuration effects from random time effects.
In general, I understand the first intuition that using future values is somehow wrong.
But if this were a competition about predicting future values, then the whole train/test split and framing of the challenge would be wrong and ideally would have to be redefined. On the other hand, if it's not about seeing into the future, there is no need to prohibit using future values - it is even unclear whether this helps generalization.
I observed that when using future values, the CV validation score of a simple LightGBM model is ~0.75. When not using them, the CV validation score is ~1.15. On the leaderboard, both perform similarly. Hence, using future values leads to strong overfitting on the train set. But take this with a grain of salt, as this was only a small experiment with a simple baseline model.
Totally agree, and the data provided in the test set prove it. We have base stations that are not present in train. For instance, for B_828 all the data with a timeline (the cell info file) available for feature generation cover only the predicted period, so in this case the mentioned rule cannot be followed. So the business value of this competition is exactly as described by @atschalz, or the goal is to recover missing data. Otherwise, the organizers of this competition provided "incorrect" data (if the goal is a time series problem).