Layer.ai Air Quality Prediction Challenge
Can you use Sentinel 5P data to predict air quality in Kampala for AirQo?
Prize: $3,000 USD · Ended 4 months ago · 258 active / 663 enrolled participants · Helping: Uganda · Good for beginners · Prediction · Safety
1st place Approach – 2 And a Half Id!ots!
Notebooks · 7 Oct 2022, 21:28 · 10

Overview

Huge thanks to Zindi, Layer and AirQo for hosting such an interesting competition. I'd also like to thank my teammate (@Robert_Selemani), without whose perseverance and grit we might easily have given up after only a few submissions.

In this post, I'll present our winning solution as well as insights obtained from the competition that I feel the organizers will find useful.

Understanding the problem statement, asking a lot of questions about the data, trying a lot of things, and, most importantly, never giving up are the most critical elements in our solution.

A Note on Validation

Our first step, before even trying to code any solution, was to use adversarial validation to get a good picture of how well we could expect our model to generalise to the unseen test data. We dropped the date, device and pm2_5 columns and built a simple Random Forest classifier to predict whether a sample belonged to the train or the test set. The ROC-AUC score was 0.9988, almost a perfect score! Our simple classifier could nearly perfectly distinguish the train and test sets, so we concluded that they come from different distributions. Cross-validation strategies based on random sampling, e.g. KFold, weren't the best choice for this kind of problem.
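For reference, here is a minimal sketch of that adversarial-validation check. The file names and the fill value are placeholders, and it assumes the remaining columns are numeric (drop or encode any categorical identifiers first):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# Hypothetical file names; use the competition files as downloaded
train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

drop_cols = ["date", "device", "pm2_5"]
X = pd.concat(
    [train.drop(columns=drop_cols, errors="ignore"),
     test.drop(columns=drop_cols, errors="ignore")],
    ignore_index=True,
)
y = [0] * len(train) + [1] * len(test)  # 0 = train row, 1 = test row

clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
# Out-of-fold probability that each row belongs to the test set
oof = cross_val_predict(clf, X.fillna(-999), y, cv=5, method="predict_proba")[:, 1]
print("Adversarial ROC-AUC:", roc_auc_score(y, oof))
# ~0.5 means train and test look alike; close to 1.0 means they are trivially separable
```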

Our Approach

At first we tried GBMs (XGBoost, LightGBM, CatBoost), and it was difficult to evaluate the actual quality of our models: lower-scoring models on local validation tended to score better on the public LB. Our best XGBoost model had a score of ~13.7 on the private LB, not really a bad score now that I think of it.

Our experience with GBMs was rather short-lived; we decided to rethink the problem all over again. We started asking questions like: what do we really care about most in this challenge? Can we achieve the outcome with other approaches? That led us into my favourite part of the data science workflow: RESEARCH! MORE RESEARCH! AND MORE RESEARCH! Papers with Code was a really useful tool for getting an idea of things to try out. We came across different LSTM-based architectures like the De-Noising Autoencoder Deep Network (DAEDN). We couldn't find an open-source implementation of that particular architecture, and with the time we had we thought it wasn't the best idea to code it from scratch, but it did point us towards deep learning techniques. We looked at the benchmarks and came across the Temporal Fusion Transformer architecture proposed by the teams at the University of Oxford and Google Cloud AI, which has a pretty good open-source implementation in PyTorch Forecasting.

Before we go into modeling, we'll cover some preliminary work we did:

Imputing Missing Values: We used Bidirectional Recurrent Imputation for Time Series (BRITS). This worked particularly well because of its ability to handle multiple correlated missing values in time series data; it also doesn't impose any assumptions about the underlying data-generating process, so it could learn the dynamics peculiar to our dataset. We repeated the adversarial testing on the imputed data and got an ROC-AUC score of 0.7498, which translates to better generalisability compared to imputing with the mean. We could have used this insight to generate a holdout set composed of the samples most similar to the test set (validation by mimicking the test set). We decided not to, because we felt that chasing a high leaderboard score was somewhat in conflict with building a good model that could more accurately forecast PM2.5 concentrations. We wanted to find a balance between a good score on the leaderboard and a really good, interpretable model. More on that later!
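Although we didn't use it in the end, the "mimic the test set" idea is easy to sketch by reusing the out-of-fold probabilities (`oof`) and the `train` frame from the adversarial-validation snippet above; the 20% holdout fraction is an arbitrary illustration:

```python
import numpy as np

# Probability of "looking like test" for each train row (first len(train) entries of oof)
train_probs = oof[: len(train)]
order = np.argsort(train_probs)[::-1]        # most test-like rows first

holdout_size = int(0.2 * len(train))         # illustrative 20% holdout
valid_fold = train.iloc[order[:holdout_size]]    # validate on the most test-like rows
train_fold = train.iloc[order[holdout_size:]]    # fit on the rest
```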

Feature Engineering:

A. Time-related features: We extracted raw time-related features like year, month and day, as well as trigonometric features like the sine and cosine transformations of month and weekday. We also extracted cyclic (periodic) spline features from month and weekday. The rationale is that Monday is as close to Sunday as Saturday is; label encoding the days from 1 to 7 wouldn't expose our model to this kind of periodicity.
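As an illustration, here is roughly what those transforms look like, assuming a `date` column in the `train` frame loaded earlier; the periodic splines use scikit-learn's SplineTransformer, which may differ in detail from what we actually ran:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import SplineTransformer

df = train.copy()
df["date"] = pd.to_datetime(df["date"])

# Raw calendar features
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.weekday

# Trigonometric encoding: Sunday ends up close to Monday, December close to January
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
df["weekday_sin"] = np.sin(2 * np.pi * df["weekday"] / 7)
df["weekday_cos"] = np.cos(2 * np.pi * df["weekday"] / 7)

# Periodic spline features: a smoother alternative to a single sine/cosine pair
spline = SplineTransformer(degree=3, n_knots=6, extrapolation="periodic")
weekday_splines = spline.fit_transform(df[["weekday"]])
for i in range(weekday_splines.shape[1]):
    df[f"weekday_spline_{i}"] = weekday_splines[:, i]
```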

B. Geo-spatial features: We treated the site latitude and site longitude as rectangular Cartesian coordinates and transformed them into polar form.

We also extracted rotated Cartesian coordinates by rotating the points through 30, 45 and 60 degrees:

rot_x = x * cos(θ) + y * sin(θ)
rot_y = x * sin(θ) - y * cos(θ)

Lastly, we performed Principal Component Analysis (this is not really feature engineering, but rather a data transformation).

We also frequency encoded the is_weekend feature. A sketch covering these geo-spatial transforms and encodings follows below.
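Here is that sketch, keeping the rotation formula exactly as written above. The `site_latitude`, `site_longitude` and `is_weekend` column names follow the data description, `df` is the feature frame built so far, and applying PCA to the coordinate pair is our reading of the step (the notebook has the exact version):

```python
import numpy as np
from sklearn.decomposition import PCA

x = df["site_longitude"].values  # treated as plain Cartesian coordinates
y = df["site_latitude"].values

# Polar form
df["radius"] = np.sqrt(x ** 2 + y ** 2)
df["angle"] = np.arctan2(y, x)

# Rotated coordinates (formula above) for 30, 45 and 60 degrees
for deg in (30, 45, 60):
    theta = np.radians(deg)
    df[f"rot_{deg}_x"] = x * np.cos(theta) + y * np.sin(theta)
    df[f"rot_{deg}_y"] = x * np.sin(theta) - y * np.cos(theta)

# PCA on the coordinate pair (a data transformation rather than feature engineering)
coords_pca = PCA(n_components=2).fit_transform(np.column_stack([x, y]))
df["coord_pca_1"], df["coord_pca_2"] = coords_pca[:, 0], coords_pca[:, 1]

# Frequency encoding of is_weekend: replace each category with its relative frequency
freq = df["is_weekend"].value_counts(normalize=True)
df["is_weekend_freq"] = df["is_weekend"].map(freq)
```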

C. Binning temperature and humidity: We discretized the temp_mean and humidity columns into 5 bins (sketched below). We hoped this would help remove noise and errors in the data (we really couldn't make sense of some of the values, like a temperature of 0 when there are already NaNs, or, even worse, 0.07796).

What we are thinking: either they are errors from the data collection process, or errors introduced during data wrangling, or both. We think the competition organisers would benefit from a more data-centric approach, i.e. having strict data quality guidelines to adhere to, so they can mitigate the occurrence of such errors. Models can only be as good as the data; as they say, "garbage in, garbage out."
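The binning itself is a one-liner per column; equal-width bins via pd.cut are shown here as an illustration, though quantile bins (pd.qcut) would work the same way:

```python
import pandas as pd

# Discretise temperature and humidity into 5 bins; labels=False keeps the bin index (0-4)
for col in ["temp_mean", "humidity"]:
    df[f"{col}_bin"] = pd.cut(df[col], bins=5, labels=False)
```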

D. Time series features: We created lagged features on the humidity and temp_mean columns, as well as rolling- and expanding-window statistics like mean, max, min, sum and standard deviation (sketched below).

We also noticed that the monitoring station at latitude 0.332609275 and longitude 32.610047 only appears in the test set. To create its time series features, we located the 3 nearest stations, simulated the station as belonging to each of those 3, and then averaged the results.
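A rough sketch of the per-station lag and window features; the lag sizes and window lengths below are illustrative rather than our exact values, and `station`/`date` are assumed identifier columns:

```python
# Sort within each station so lags and windows look backwards in time
df = df.sort_values(["station", "date"])

for col in ["humidity", "temp_mean"]:
    g = df.groupby("station")[col]
    # Lagged values
    for lag in (1, 2, 3, 7):
        df[f"{col}_lag_{lag}"] = g.shift(lag)
    # Rolling-window statistics (illustrative window sizes)
    for window in (3, 7):
        df[f"{col}_roll_mean_{window}"] = g.transform(lambda s: s.rolling(window, min_periods=1).mean())
        df[f"{col}_roll_std_{window}"] = g.transform(lambda s: s.rolling(window, min_periods=1).std())
        df[f"{col}_roll_max_{window}"] = g.transform(lambda s: s.rolling(window, min_periods=1).max())
        df[f"{col}_roll_min_{window}"] = g.transform(lambda s: s.rolling(window, min_periods=1).min())
    # Expanding-window statistics
    df[f"{col}_expanding_mean"] = g.transform(lambda s: s.expanding(min_periods=1).mean())
    df[f"{col}_expanding_sum"] = g.transform(lambda s: s.expanding(min_periods=1).sum())
```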

E. Target encoding: We target encoded the device column to represent the average expected PM2.5 value for that device. For devices aq_91 and aq_98, which belong to the unique test-only location, we averaged the values from the 3 nearby locations as described earlier.
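A sketch of the device target encoding, without the out-of-fold or smoothing machinery you would normally add to reduce leakage:

```python
# Mean PM2.5 per device, learned on train only, then mapped onto train and test
device_means = train.groupby("device")["pm2_5"].mean()
train["device_te"] = train["device"].map(device_means)
test["device_te"] = test["device"].map(device_means)

# aq_91 and aq_98 only exist at the unseen test location, so their encodings are
# filled with the average encoding of the devices at the 3 nearest stations
# (that averaging step is done separately, as described above).
```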

Modeling:

We used the PyTorch Forecasting and PyTorch Lightning libraries. We set the maximum prediction length equal to the number of unique dates in the test set, so we could predict the entire 134 days. The maximum encoder length was the number of unique dates in the train set, though this could be lowered slightly to accommodate lower computational resources without severely hurting accuracy. We grouped our time series dataset on the stations column; since there are 34 stations, the output dimension of our predictions is 34 * 134.

The 2 unique stations in the test set are treated as part of the time series of each of the 3 nearby locations, and we combine the results by distance-weighted averaging (rationale: the characteristics of a particular station are most similar to those of its closest neighbouring stations).

We also log-transformed the target within the station groups, which removed skewness. The model was evaluated using the symmetric mean absolute percentage error (SMAPE).
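To make the setup concrete, here is a condensed sketch following the standard PyTorch Forecasting pattern. The column names, feature lists and hyperparameters are placeholders rather than our exact configuration, and exact imports can vary between library versions; the notebook has the full details:

```python
import pytorch_lightning as pl
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import SMAPE

# `train_df` is assumed to hold one row per (station, date), with an integer
# `time_idx` column counting days since the start of each series.
max_prediction_length = 134                       # unique dates in the test set
max_encoder_length = train_df["date"].nunique()   # unique dates in the train set (can be lowered)

training = TimeSeriesDataSet(
    train_df,
    time_idx="time_idx",
    target="pm2_5",
    group_ids=["station"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    time_varying_known_reals=["time_idx", "month_sin", "month_cos"],   # placeholder feature list
    time_varying_unknown_reals=["pm2_5", "humidity", "temp_mean"],
    # Log-transform the target within each station group to reduce skewness
    target_normalizer=GroupNormalizer(groups=["station"], transformation="log1p"),
)
train_dataloader = training.to_dataloader(train=True, batch_size=32)

tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=1e-3,
    hidden_size=32,
    attention_head_size=4,
    dropout=0.1,
    loss=SMAPE(),           # evaluate with symmetric MAPE, as described above
)

trainer = pl.Trainer(max_epochs=30, gradient_clip_val=0.1)
trainer.fit(tft, train_dataloaders=train_dataloader)
```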

This discussion is already quite long, so we'll leave the finer details to the notebook.

TL;DR

1. We tried different things and not everything worked. And that's absolutely fine!

2. We struggled to create a good validation scheme, but a good choice of imputation technique made the overfitting less pronounced.

Thank you for taking the time to go through this discussion. We hope it was worth your while and that you learnt a thing or two. We are also looking forward to reading about how you approached the problem and the insights you gained, so we can all learn from each other.

Happy hacking!

Discussion 10 answers

Epic! Thanks for sharing the approach!

7 Oct 2022, 21:33

Amazing. Thanks for sharing.

7 Oct 2022, 22:15

Outstanding !!!

7 Oct 2022, 22:35

Wow! This is a lot of hard work, congratulations once again! 🥂

8 Oct 2022, 03:01

Wow this is soooo amazing and congrats👏

Just a question, while you were creating the time series features like lag & roll features, did you impute the missing dates?

8 Oct 2022, 03:46

We didn't impute the missing dates, but that seems like a good idea. Not sure how it's going to work with that amount of missing values, but it's definitely worth trying. Thanks

Awesome! This is a great contribution. Kudos for the work you put in and the dedication to keep going.

8 Oct 2022, 03:58

Amazing, what an approach

10 Oct 2022, 11:49

Thank you for sharing!! An insight into the complexity required to do so well!!

19 Oct 2022, 10:33