Hello,
I've just seen a lot of ridiculous scores, so I tried using future values, and my score dropped from 1.32 without future values to 0.98 with them. For my final selection I will choose the predictions made without future values, but I want to ask @nicolapiovesan to check the first 20 solutions on the leaderboard.
Best regards,
Good point! But why 20 and not 30 or even 50?
The best solution, in my opinion, is to allow using future data and extend the deadline by one month. @nicolapiovesan
What do you mean by "allow using future data"?
I guess it should be based on the rules of the competition. We know there is a gray area, but the hosts should comment on this and clear up the doubt so that we can share the correct solution.
The host has done this several times, though. Their stance is: do not use future values in your final submissions, as such solutions will be disqualified.
Hi, thanks for pointing out this problem.
As stated in many discussions, the goal of the challenge is to model how multiple instantaneous features collected in each hour affect the energy consumption in that hour. It must be clear that using future values as input does not make sense, as in the real world such values will not be available.
To answer your question, could the current top 10 participants in the leaderboard please confirm whether or not they are using future values in their solutions? @Yisakberhanu, @rafael_zimmermann, @Krishna_Priya, @NxGTR, @LROUZZ, @heyyou, @tomy4reel, @imakarov, @Koleshjr, @Hakim04
Finally, I'd like to remind everyone that, at the end of the competition, the top participants will be required to submit a report and the code to train/test the model, which will be used to provide the final score. Solutions in which future values are taken as inputs of the model will not be considered.
Our current score uses future values, but we won't select that submission since, as you have already clarified many times, such solutions won't be considered. Thank you for confirming that again.
Thank you for bringing up this important issue. To be fully transparent, my best score on the leaderboard does indeed involve the use of future values. However, as you clearly outlined in the competition guidelines, only models that do not use future data will be considered for final submissions and validation. The focus is truly on creating a model that is applicable in the real world, where such future data would not be available.
My team's current best score on LB does NOT use future value features as input to the model. As this rule was already established a month back, I stopped creating features using future values.
I'm completely confident that no model can achieve a score below 1.2 without using future data, let alone get down to 0.8. Just take a look at the feature importance plot to see for yourself.
What feature engineering are they doing that we aren't? This is so demotivating, haha.
Please, does this include aggregate base station features like mean, median, std, ...?
If you decide to calculate aggregate features, ideally any measure of central tendency should be calculated using only values from the past. You should not just calculate the mean without filtering the data.
PS: This is my opinion, otherwise it would just be an alternate way to leak the future data.
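To make the point about past-only aggregates concrete, here is a minimal sketch in pandas. The column names (`bs`, `hour`, `load`) and the helper `past_only_mean` are hypothetical, made up for illustration, and this is just one possible reading of "use only values from the past", not an official rule from the hosts.

```python
import pandas as pd

def past_only_mean(df, group_col, time_col, value_col):
    """Mean of value_col over strictly earlier rows of the same group.

    Hypothetical helper: one way to build an aggregate feature that
    never sees the current or any later hour.
    """
    ordered = df.sort_values([group_col, time_col])
    # shift(1) removes the current row from the window; expanding()
    # then averages everything that came before it within the group.
    past_mean = ordered.groupby(group_col)[value_col].transform(
        lambda s: s.shift(1).expanding().mean()
    )
    # Return the feature aligned to the caller's original row order.
    return past_mean.reindex(df.index)

# Toy data with made-up values, two base stations.
df = pd.DataFrame({
    "bs": ["A", "A", "A", "B", "B"],
    "hour": [0, 1, 2, 0, 1],
    "load": [10.0, 20.0, 30.0, 5.0, 7.0],
})
df["load_past_mean"] = past_only_mean(df, "bs", "hour", "load")
```

The first hour of each base station gets NaN because no past exists yet. A plain `groupby("bs")["load"].transform("mean")` would instead average over all hours, including later ones.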
I absolutely agree
So @Krishna_priya, for your current score, are the aggregates computed from the previous hours only?
Hey @Koleshjr, for now I cannot comment on whether I am using agg features, but yes, any feature being used only has data from the previous hours.
Damn you are good👏👏 but we will get there with time.
All the best bro. Let's keep learning from each other. Anyway, we will see a lot of shuffling in the private leaderboard in this one. Fingers crossed, May the best approach win.
@tomy4reel you have to use .shift(1) to ensure no data leakage.
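For anyone unsure what that looks like in practice, a small sketch, assuming a pandas DataFrame with hypothetical columns `bs`, `hour`, and `load` (the values are made up). Whether a given lag feature is allowed is for the hosts to decide; this only shows the direction of the shift.

```python
import pandas as pd

# Toy data with made-up values, two base stations.
df = pd.DataFrame({
    "bs": ["A", "A", "A", "B", "B"],
    "hour": [0, 1, 2, 0, 1],
    "load": [10.0, 20.0, 30.0, 5.0, 7.0],
})

# Sort by time inside each base station so "previous row" means "previous hour".
df = df.sort_values(["bs", "hour"]).reset_index(drop=True)

# shift(1) looks one row back within the group: a past value, no leakage.
df["load_lag1"] = df.groupby("bs")["load"].shift(1)

# shift(-1) looks one row ahead within the group: this is a future value.
df["load_lead1"] = df.groupby("bs")["load"].shift(-1)
```

Grouping before shifting matters: without `groupby("bs")`, the first row of station B would pick up the last value of station A.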
Yes, I used future values, but there is not much difference.
Looking forward to seeing the best solutions and/or approaches. On this one, I'm completely lost 🙌🏿
Me too @ff 😂😂 I have given up. Seeing people get 0.82 with no future values, whatttt!!! That's freaking impressive tbh, and I don't think I could get there even if I were given 30 more days 😂
😂😂 Day after tomorrow there will be a terrible shake up in the ranking!
The use of future data is a problem that can be subjective if the rules aren't clear, such as issues related to aggregation or how to handle null values. It's not necessarily just about using lead functions; training on the complete dataset is also a form of using future data. It would be interesting and fair to have an objective rule to justly choose the top 10.
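As an illustration of the null-handling and "training on the complete dataset" points, here is a rough sketch of one stricter interpretation: split chronologically and compute any fill statistic from the training period only. The column names, the cutoff, and the helper `chronological_split` are hypothetical, and this is not a rule stated by the hosts.

```python
import pandas as pd

def chronological_split(df, time_col, cutoff):
    """Rows strictly before cutoff go to train, the rest to validation."""
    train = df[df[time_col] < cutoff]
    valid = df[df[time_col] >= cutoff]
    return train, valid

# Toy data with made-up values; None marks a missing reading.
df = pd.DataFrame({
    "hour": [0, 1, 2, 3, 4, 5],
    "load": [10.0, None, 30.0, 40.0, None, 60.0],
})

train, valid = chronological_split(df, "hour", cutoff=4)

# The fill value comes from the training period only, so the later
# hours contribute nothing to how earlier gaps are filled.
fill_value = train["load"].mean()
train = train.assign(load=train["load"].fillna(fill_value))
valid = valid.assign(load=valid["load"].fillna(fill_value))
```

Filling with `df["load"].mean()` over the whole frame would let hours 4 and 5 influence the value used for hour 1, which is the subtle kind of future-data use described above.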