Catch up with Zindi superstars Azer Ksouri (ASSAZZIN), Darius Moruri (Brainiac) and Nikhil Mishra (devnikhilmishra) as they tell us a bit more about how they created the model that won 1st place in the CGIAR Crop Yield Prediction Challenge.
Team a_MAIZE_ing claimed top place in the recent CGIAR Crop Yield Prediction Challenge, beating 635 other competitors for the $3000 USD prize pool. We asked them about their winning solution.
Tell us about your solution for the CGIAR Crop Yield Prediction Challenge.
Our solution can be divided into two approaches:
The first approach can be divided into four parts:
1st Part: Vegetation Index Data Creation
Step 1: The vegetation index dataset is created from statistics over NDVI, GRNDVI, EVI, SAVI and CCCI. Statistics used are:
Step 2: From some visualisations, we discovered that NDVI, SAVI, and GRNDVI have the same distribution for specific months. So we created a function that creates over those months:
Step 3: Apply the Yeo-Johnson transformation to bring the data distribution closer to a normal distribution.
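To make Step 3 concrete, here is a minimal sketch of the Yeo-Johnson transform for a fixed lambda, written in plain Python. It is an illustration only, not the team's code: the function name, the sample values and the choice of lambda = 0 are all assumptions. In practice lambda is usually estimated from the data, for example with scipy.stats.yeojohnson or scikit-learn's PowerTransformer.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if x >= 0:
        if abs(lmbda) < 1e-12:
            # lambda == 0: logarithmic branch for non-negative values
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        # lambda == 2: logarithmic branch for negative values
        return -math.log1p(-x)
    return -(((-x + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


# Hypothetical right-skewed feature values (not from the competition data).
skewed = [0.1, 0.2, 0.25, 0.4, 0.9, 3.5]

# Assumed lambda of 0.0; a fitted lambda would normally be used instead.
transformed = [yeo_johnson(v, 0.0) for v in skewed]
print(transformed)
```

With lambda = 1 the transform leaves values unchanged, which is a handy sanity check when wiring it into a feature pipeline.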
2nd Part: Transform the Additional Data
After doing some research and referring to some experts, we found that:
So we created a function that transforms the additional data by calculating the mean over 4 years for each month from month 3 to month 10. For example:
We take month 3 and create a feature average_per_4Years_on_month_3, which is the mean over [month_3_2016, month_3_2017, month_3_2018, month_3_2019], and so on for the other months.
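The averaging described above could be sketched roughly as follows. The feature name average_per_4Years_on_month_3 and the month_3_2016 column pattern come from the description; the function name, the dictionary-based row and the sample values are assumptions made for illustration.

```python
from statistics import mean

YEARS = (2016, 2017, 2018, 2019)
MONTHS = range(3, 11)  # months 3 to 10 inclusive


def average_per_month(row, years=YEARS, months=MONTHS):
    """Average each month's value across the given years for one field."""
    features = {}
    for month in months:
        values = [row[f"month_{month}_{year}"] for year in years]
        features[f"average_per_4Years_on_month_{month}"] = mean(values)
    return features


# Hypothetical row: one value per month and year (not real competition data).
row = {
    f"month_{m}_{y}": float(m + (y - 2016))
    for m in range(1, 13)
    for y in YEARS
}

features = average_per_month(row)
print(features["average_per_4Years_on_month_3"])
```

Months outside the 3 to 10 window are simply ignored, so the output has eight averaged features per source variable.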
3rd Part: Create Red Bands Dataset
We created features from statistics over the relations between Red Bands (this is also recommended by some experts in this field). For example, we calculate for month 5:
Finally, we concatenated those datasets to get a dataset with 233 features. In this approach, we use only quality 1 and 3; adding quality 2 improves CV but makes the LB score much worse.
4th Part: Modeling
In this approach, our CV score is 1.59 and our LB score is 1.65.
The second approach involved splitting the training set into two: good quality fields and bad quality fields.
Only good quality data was used for validation, as the test set had only quality 2 and 3 fields.
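One possible reading of this validation scheme is sketched below: good quality fields are split into folds for validation, while bad quality fields are always kept in the training portion. This is an assumption about the setup, not the team's actual code. The function name is hypothetical, and treating quality codes 2 and 3 as "good" is likewise assumed from the remark about the test set.

```python
import random

# Assumption: quality codes 2 and 3 count as "good" because the test set
# is described as containing only those two codes.
GOOD_QUALITIES = {2, 3}


def quality_aware_folds(qualities, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; validation uses good fields only."""
    good = [i for i, q in enumerate(qualities) if q in GOOD_QUALITIES]
    bad = [i for i, q in enumerate(qualities) if q not in GOOD_QUALITIES]
    random.Random(seed).shuffle(good)
    for k in range(n_splits):
        valid_idx = good[k::n_splits]
        held_out = set(valid_idx)
        # Bad quality fields always stay on the training side.
        train_idx = [i for i in good if i not in held_out] + bad
        yield train_idx, valid_idx


# Hypothetical quality labels for 30 fields.
qualities = [1, 2, 3] * 10
folds = list(quality_aware_folds(qualities))
for train_idx, valid_idx in folds:
    print(len(train_idx), len(valid_idx))
```

Keeping the validation folds restricted to good quality fields means the CV score is measured on data that resembles the test set, while the model can still learn from the noisier fields.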
Vegetation features used in this approach included only WDRVI, GNDVI, NDVI and NDRE. Raw image pixel data was not used in training. LightGBM was trained across five folds, with a CV score of 1.66 and a private LB score of 1.64.
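For readers unfamiliar with these indices, the sketch below shows their commonly used band formulas in plain Python. The function names, the sample reflectance values, the WDRVI weighting coefficient of 0.2 and the zero-denominator fallback are assumptions; the exact bands and parameters the team used are not stated here.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    # Assumption: fall back to 0.0 for an empty pixel rather than raising.
    return (a - b) / total if total != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return normalized_difference(nir, red)


def gndvi(nir, green):
    """Green Normalized Difference Vegetation Index."""
    return normalized_difference(nir, green)


def ndre(nir, red_edge):
    """Normalized Difference Red Edge index."""
    return normalized_difference(nir, red_edge)


def wdrvi(nir, red, alpha=0.2):
    """Wide Dynamic Range Vegetation Index; alpha down-weights the NIR band."""
    return normalized_difference(alpha * nir, red)


# Hypothetical surface reflectances for a single field and date.
bands = {"nir": 0.5, "red": 0.1, "green": 0.15, "red_edge": 0.3}

indices = {
    "NDVI": ndvi(bands["nir"], bands["red"]),
    "GNDVI": gndvi(bands["nir"], bands["green"]),
    "NDRE": ndre(bands["nir"], bands["red_edge"]),
    "WDRVI": wdrvi(bands["nir"], bands["red"]),
}
print(indices)
```

Each index yields one value per field and date; summary statistics over those values can then serve as tabular features for a gradient boosting model.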
For more information on following this solution, please see our GitHub repository.