
AirQo Low-Cost Air Quality Monitor Calibration Challenge

Helping Uganda
$1 000 USD
Challenge completed over 4 years ago
Prediction
311 joined
161 active
Start: Apr 30, 21
Close: Jun 06, 21
Reveal: Jun 06, 21
A big concern and a suggestion - Data Split and Future Values
Data · 14 May 2021, 13:28 · edited 2 minutes later · 7

I have a serious concern. Because of how the data was split, we will not get a correct estimate of model performance. Put simply, the leaderboard's suggestion that cheaper devices can predict PM2_5 with an RMSE of around 8 is not at all close to reality.

For the same day in the data, some hours are in the train set and some hours are in the test set.

A simple example: if you aggregate pm2_5 per day, you are effectively using future values to predict the past, and the model will easily exploit this information to give you good scores. Such a feature will rank high in your model's feature importances, but it will not help in AirQo's real use case. The data split should have been past-future to give a correct estimate.

By the end of the competition, everyone's models will be using future data, knowingly or unknowingly, so their scores will be poor estimates of real-world performance. In general, for any competition with a time component, we should think about how the model will be used in production. In this case I would want my model to run on future days using past data, so the competition's data split should simulate that.

Good luck to everyone!
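To make the overlap concrete, here is a minimal sketch (not the competition's actual split) that measures how many test timestamps fall on a calendar day that also appears in train. The function name `day_overlap` and the toy timestamps are made up for illustration; on the real data you would pass in the parsed timestamp column.

```python
from datetime import datetime

def day_overlap(train_times, test_times):
    """Fraction of test timestamps whose calendar day also appears in train."""
    train_days = {t.date() for t in train_times}
    if not test_times:
        return 0.0
    hits = sum(1 for t in test_times if t.date() in train_days)
    return hits / len(test_times)

# Toy hourly timestamps (made up): even hours go to train, odd hours to test.
hours = [datetime(2021, 5, 1, h) for h in range(24)]
train_times = hours[0::2]
test_times = hours[1::2]

# Every test hour shares its day with some train hour, so a per-day
# aggregate built from train already "knows" about the test day.
print(day_overlap(train_times, test_times))
```

An overlap close to 1 means any per-day aggregate of the target can carry information from hours later than the row being predicted.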
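A past-future split of the kind suggested here could look roughly like the sketch below. It assumes each row is a dict with a parsed `created_at` timestamp; the helper name `past_future_split`, the cutoff date, and the pm2_5 values are placeholders, not anything taken from the competition data.

```python
from datetime import datetime

def past_future_split(rows, cutoff, time_key="created_at"):
    """Put rows strictly before `cutoff` in train and the rest in test."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

# Made-up rows: one reading per day for four days.
rows = [
    {"created_at": datetime(2021, 5, day, 12), "pm2_5": value}
    for day, value in [(1, 40.0), (2, 55.0), (3, 38.0), (4, 61.0)]
]

train, test = past_future_split(rows, cutoff=datetime(2021, 5, 3))

# Train holds only days before the cutoff, test only days from the cutoff on.
print(len(train), len(test))
```

With this kind of split, the latest train timestamp is always earlier than the earliest test timestamp, so validation scores reflect predicting days the model has not seen.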

Discussion 7 answers

So are you suggesting using the 'created_at' column as an index and running time-series models on it?

14 May 2021, 15:58
Upvotes 0
underfitting
Church of christ

Thank you @devnikhilmishra for raising this issue. I agree with you that we should predict the future and not the past. I have heard of scenarios where the training and test sets are split according to dates. In time-based splits, the training set usually contains past dates and the test set usually contains future dates. I have also heard of scenarios where the train and test sets are made according to IDs. There could also be another scenario where the split is done randomly. I am not sure which kind of split has been used in this competition. Maybe we can hear from anyone who has something to say about this issue.
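For comparison with a date-based or random split, an ID-based split might be sketched as follows: whole IDs are held out so that no ID appears on both sides. The key name `device` and the helper `split_by_id` are hypothetical, and nothing here says this is how the competition data was split.

```python
import random

def split_by_id(rows, id_key, test_fraction=0.25, seed=0):
    """Hold out whole IDs so that no ID appears in both train and test."""
    ids = sorted({r[id_key] for r in rows})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [r for r in rows if r[id_key] not in test_ids]
    test = [r for r in rows if r[id_key] in test_ids]
    return train, test

# Made-up rows: four hypothetical devices with three readings each.
rows = [{"device": d, "reading": i} for d in "ABCD" for i in range(3)]
train, test = split_by_id(rows, id_key="device")

train_ids = {r["device"] for r in train}
test_ids = {r["device"] for r in test}
print(train_ids.isdisjoint(test_ids))
```

A random split would instead shuffle individual rows, which is exactly what allows hours from the same day to land in both train and test.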

otoosakyidavid
Solina

A nice observation from you. @Zindi might want to clarify that too.

14 May 2021, 21:37
Upvotes 0

On the info tab we have "The objective of this challenge is to develop a model that will take low cost device data and other supplementary data and transform it as accurately as possible to the reference value."

So regardless of time, or so I think, the objective is to determine whether these devices measure air quality correctly under various conditions, bearing in mind that conditions such as weather keep changing with time.

15 May 2021, 06:56
Upvotes 0

Hi Nikhil, thanks for participating in the competition and for contributing to the discussion. I would not want to influence the approach taken to the challenge, but in this case, although channel_id is a timestamp, as other contestants have pointed out, this is not a time-series exercise for predicting future values. In fact, the model generally will not be used to forecast. In the first instance it will be applied to all the historic values we have from old measurements; it will then be used in real time to convert data. So there is no forecasting, but time of day, time of year, etc. MIGHT have an influence, or maybe not. The test set is a stratified sample from the full dataset, not future values. Hope this helps. All the best!!

16 May 2021, 10:18
Upvotes 0

Hi Paul, if the model is used on historical data, then it will definitely work well. Thanks for the clarification.

The train-test split should simulate a real-world situation so that the results are relevant when the model is deployed. Even if the model is trained on historical (past) data, it will be used in real time to predict the current value, where we don't have access to future values. If the aim of the competition is to replace reference sensors, from now on, with low-cost devices coupled with this model, then I think it won't work. Is this the objective of the competition, or is it something else?