| Zn | 9.7168 | B | 0.6794 | dp2a9PUF
Data and Features
For this submission, I integrated Sentinel-1 and Sentinel-2 data (merged on PID, lat and lon).
I engineered features from lat and lon: embeddings, clusters, UMAP, PCA and angular rotations. I also engineered distance-related features: haversine, Manhattan and Euclidean distance. Finally, I calculated the mean lat and lon and added each point's distance from that mean as an additional feature.
I converted the pH column to the corresponding hydrogen ion concentration as an extra feature.
I took the mean and PCA of the ‘bio*’ features: ['bio1', 'bio12', 'bio15', 'bio7'].
Number of features: 71
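For illustration, here is a minimal sketch of the haversine distance and the distance-from-mean feature. The function names, the Earth radius constant and the use of haversine for the distance to the mean point are assumptions, not the exact code behind this submission.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distance_to_mean_point(points):
    """Distance of each (lat, lon) point from the mean lat/lon of all points."""
    mean_lat = sum(lat for lat, _ in points) / len(points)
    mean_lon = sum(lon for _, lon in points) / len(points)
    return [haversine_km(lat, lon, mean_lat, mean_lon) for lat, lon in points]
```

Calling `distance_to_mean_point` on the list of site coordinates gives one extra numeric column per row; the Manhattan and Euclidean variants would follow the same pattern with a different distance function.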
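Since pH is defined as the negative base-10 logarithm of the hydrogen ion concentration, the conversion can be sketched as below. The function name is hypothetical, and the result is in mol/L under the usual definition.

```python
def ph_to_hydrogen_ion_concentration(ph):
    """Invert pH = -log10([H+]) to recover [H+] in mol/L."""
    return 10.0 ** (-ph)
```

For example, a neutral pH of 7 maps to roughly 1e-7 mol/L, and each unit drop in pH multiplies the concentration by 10.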
CV Split
KFold with 5 splits on the targets, nothing fancy here.
Model & Training
I used a random forest regressor with 100 estimators, trained on just a single fold (fold 0).
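A minimal sketch of this setup with scikit-learn, assuming synthetic data in place of the real 71 features, shuffled folds with a fixed seed, and RMSE as the validation metric (none of these are confirmed by the post):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the real feature matrix and one target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 71))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 5-fold split; shuffle and random_state are assumptions.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, valid_idx = next(iter(kf.split(X)))  # keep only fold 0

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[train_idx], y[train_idx])

pred = model.predict(X[valid_idx])
rmse = float(np.sqrt(mean_squared_error(y[valid_idx], pred)))
print(f"fold 0 RMSE: {rmse:.4f}")
```

Training on a single fold keeps the run cheap, at the cost of a noisier validation estimate than averaging over all five folds.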
| Cu | 3.2162 | 9SfR74jK
Data and Features
For this submission, I did not integrate any extra data.
I only replicated the feature engineering described above.
CV Split
Same as above
Model & Training
I picked randomly initialised CatBoost, LightGBM and XGBoost regressors as base regressors and trained a voting regressor on all features.
No tuning of model parameters was done.
PS: Since joining the challenge (in the last week), I noticed that some columns visible in the starter notebook (such as 'x' and 'y') were absent from the datasets I downloaded, so I began to wonder whether the datasets had been updated. I think a related discussion was raised, but no feedback was given.
Hello @100i, thanks for sharing your approach. I still have a question: how did you deal with the large number of missing values in Sentinel-2? (Sites=39.16%, PIDs=30.75%, Train_sites=42.49%, Train_PIDs=33.28%, Test_sites=31.32%, Test_PIDs=22.66%)
Thank you @100i