Our solution can be divided into two approaches.
The first approach can be divided into 4 parts:
Step 1:
The vegetation index dataset is created from statistics over NDVI, GRNDVI, EVI, SAVI and CCCI. The statistics used are:
Step 2: From some visualizations we discovered that NDVI, SAVI and GRNDVI have the same distribution for specific months, so we created a function that creates features over those months:
Step 3: Apply the Yeo-Johnson transformation to bring the data distribution closer to a normal distribution.
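As a rough illustration of Step 3, here is a minimal sketch using scikit-learn's `PowerTransformer`. The function name `fit_yeo_johnson`, the helper `skewness` and the synthetic data are assumptions made for the example, and the post does not say which library the team actually used.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def fit_yeo_johnson(train_features, test_features):
    """Fit Yeo-Johnson on the training features, then apply it to both sets."""
    # standardize=True also rescales each column to zero mean and unit variance.
    transformer = PowerTransformer(method="yeo-johnson", standardize=True)
    train_t = transformer.fit_transform(train_features)
    test_t = transformer.transform(test_features)  # reuse the lambdas fitted on train
    return train_t, test_t, transformer


def skewness(x):
    """Sample skewness, used only to check how symmetric a column is."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Synthetic, heavily right-skewed features (hypothetical stand-in for the real data).
rng = np.random.default_rng(0)
train = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 3))
test = rng.lognormal(mean=0.0, sigma=1.0, size=(100, 3))
train_t, test_t, transformer = fit_yeo_johnson(train, test)
```

Fitting on the training set only and reusing the fitted transformer on the test set keeps the two sets on the same scale.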
After doing some research and consulting some experts, we found that:
So we created a function that transforms the additional data by calculating the mean over 4 years for each month from month 3 to month 10. For example:
we take month 3 and create a feature average_per_4Years_on_month_3, which is the mean over [month_3_2016, month_3_2017, month_3_2018, month_3_2019], and so on for the other months.
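The monthly averaging described above could look something like this in pandas. It is only a sketch: it assumes the additional data has one column per month and year named like `month_3_2016`, as in the example, and the function name `add_monthly_means` is made up for illustration.

```python
import pandas as pd

YEARS = [2016, 2017, 2018, 2019]
MONTHS = range(3, 11)  # months 3 to 10 inclusive


def add_monthly_means(df):
    """Add one feature per month: the mean of that month over the 4 years."""
    out = df.copy()
    for month in MONTHS:
        # Assumed column naming: month_<month>_<year>
        cols = [f"month_{month}_{year}" for year in YEARS]
        out[f"average_per_4Years_on_month_{month}"] = out[cols].mean(axis=1)
    return out


# Tiny hypothetical example with a single field.
example = pd.DataFrame(
    {f"month_{m}_{y}": [float(m + y - 2016)] for m in MONTHS for y in YEARS}
)
features = add_monthly_means(example)
```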
We created features from statistics over the relations between red bands (this was also recommended by some experts in this field). For example, for month 5 we calculate:
Finally, we concatenated those datasets to get a dataset with 233 features.
By the way, in this approach we use only quality 1 and 3 fields; adding quality 2 improves CV but makes the LB score much worse.
With this approach our CV is 1.59 and our LB is 1.65.
The second approach involved splitting the training set into two: good quality fields and bad quality fields.
Only good quality data was used for validation as the test set had only quality 2 and 3 fields.
Vegetation features used in this approach included WDRVI, GNDVI, NDVI and NDRE only. Raw image pixel data was not used in training. LightGBM was trained across five folds with a CV score of 1.66 and a private LB score of 1.64.
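One way to set up five folds where only good quality fields are used for validation is sketched below. Adding the bad quality rows to every training fold is an assumption, since the post only states what was validated on. The function name `good_quality_folds` and the boolean mask `is_good` are hypothetical, and which quality codes count as "good" is left to the caller.

```python
import numpy as np
from sklearn.model_selection import KFold


def good_quality_folds(is_good, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; validation rows are always good quality."""
    is_good = np.asarray(is_good, dtype=bool)
    good_idx = np.flatnonzero(is_good)
    bad_idx = np.flatnonzero(~is_good)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Split only the good quality rows into folds.
    for train_pos, valid_pos in kfold.split(good_idx):
        # Assumption: bad quality rows are kept in every training fold.
        train_idx = np.concatenate([good_idx[train_pos], bad_idx])
        yield train_idx, good_idx[valid_pos]


# Hypothetical mask: 10 good quality fields followed by 5 bad quality ones.
mask = np.array([True] * 10 + [False] * 5)
folds = list(good_quality_folds(mask))
```

Any regressor (LightGBM in this approach) would then be fit on `train_idx` and scored on `valid_idx` in each fold.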
Very well done. 👏
Nice, well done, it must have been tough
Great feature engineering
Nice work, well done. This should have been mentioned somewhere in the competition: maize season in Kenya is from March to October. My approach was taking the vegetation indices NDVI, GRNDVI and EVI and the red bands for the whole year.
For clarification, in the first approach we applied the median to the vegetation indices over the whole year, but for the max and min we worked around the maize season.
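To make that concrete, here is a small sketch of those statistics for one vegetation index. It assumes a 12-column array with one value per month for each field, and it reads "around the maize season" as restricting max and min to months 3 to 10. Both the layout and the function name `index_statistics` are assumptions.

```python
import numpy as np

# Months 3 to 10 (March to October) are zero-based columns 2 to 9.
SEASON = slice(2, 10)


def index_statistics(monthly_values):
    """Median over the whole year, max and min over the maize season only."""
    values = np.asarray(monthly_values, dtype=float)  # shape: (n_fields, 12)
    return {
        "median_year": np.median(values, axis=1),
        "max_season": values[:, SEASON].max(axis=1),
        "min_season": values[:, SEASON].min(axis=1),
    }


# Hypothetical field whose index value equals the month number.
stats = index_statistics([list(range(1, 13))])
```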
Just to add to the wonderful work done by my teammates: we also applied a bit of post-processing at the end. As per my analysis, the model was performing poorly on high yields, so post-processing was applied to increase the predicted yield of fields with high values (>4). This helped boost our score both in CV and on the leaderboard.
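A minimal sketch of that kind of post-processing is below. Only the threshold of 4 comes from the description above. Scaling by a multiplicative `factor` and the default value of 1.05 are purely illustrative assumptions, not the adjustment we actually used.

```python
import numpy as np


def boost_high_yields(predictions, threshold=4.0, factor=1.05):
    """Scale up predictions above the threshold; leave the rest unchanged."""
    preds = np.asarray(predictions, dtype=float)
    # factor is a hypothetical value; it should be tuned on validation data.
    return np.where(preds > threshold, preds * factor, preds)


adjusted = boost_high_yields([2.0, 4.0, 5.0], factor=1.1)
```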
Thanks for sharing!
Did you handle cloudy data somehow before calculating the indices?
I read about it but I really didn't find time to try it.
Thank you all 🤜🤛. For those who want to learn from code, and since the solution can be used by the sponsor, I will post some code in the next few days that gives a place in the top 20.