27 May 2021, 10:56

Meet Team a_MAIZE_ing, winners of the CGIAR Crop Yield Prediction Challenge

Catch up with Zindi superstars Azer Ksouri (ASSAZZIN), Darius Moruri (Brainiac) and Nikhil Mishra (devnikhilmishra) as they tell us a bit more about how they created the model that won 1st place in the CGIAR Crop Yield Prediction Challenge.

Team a_MAIZE_ing claimed top place in the recent CGIAR Crop Yield Prediction Challenge, beating 635 other competitors for the $3000 USD prize pool. We asked them about their winning solution.

Tell us about your solution for the CGIAR Crop Yield Prediction Challenge.

Our solution can be divided into two approaches:

First Approach:

The first approach can be divided into 4 parts:

1st Part: Vegetation Index Data Creation

Step 1: The vegetation index dataset is created from statistics over NDVI, GRNDVI, EVI, SAVI, and CCCI. The statistic used is:

  • MEDIAN (recommended by experts) [NDVI, GRNDVI, EVI]
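As a minimal sketch of this step (the band arrays and function name here are illustrative, not the team's actual code), the median NDVI for a field can be computed from its near-infrared and red reflectance pixels:

```python
import numpy as np

def median_ndvi(nir, red, eps=1e-9):
    """Median NDVI over a field's pixels.

    NDVI = (NIR - Red) / (NIR + Red); eps guards against division by zero.
    """
    ndvi = (nir - red) / (nir + red + eps)
    return float(np.median(ndvi))

# Toy reflectance arrays standing in for one field's pixels
nir = np.array([0.80, 0.70, 0.75])
red = np.array([0.20, 0.30, 0.25])
print(median_ndvi(nir, red))  # ≈ 0.5
```

The same pattern extends to the other indices (GRNDVI, EVI, SAVI, CCCI) by swapping in their band formulas.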

Step 2: Referring to some visualisations, we discovered that NDVI, SAVI, and GRNDVI have the same distribution for specific months. So we created a function that, over those months, creates:

  • Products of NDVI, SAVI, GRNDVI features
  • std of NDVI, SAVI, GRNDVI features
  • mean of NDVI, SAVI, GRNDVI features
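A sketch of that function, assuming per-month index columns named like `5_NDVI` (the column-naming scheme is an assumption, not taken from the team's repository):

```python
import pandas as pd

def cross_index_features(df, months, indices=("NDVI", "SAVI", "GRNDVI")):
    """For each month, combine the similarly-distributed vegetation indices
    into product, std, and mean features."""
    out = pd.DataFrame(index=df.index)
    for m in months:
        cols = [f"{m}_{idx}" for idx in indices]
        vals = df[cols]
        out[f"{m}_idx_product"] = vals.prod(axis=1)
        out[f"{m}_idx_std"] = vals.std(axis=1)
        out[f"{m}_idx_mean"] = vals.mean(axis=1)
    return out

# Example: one field with month-5 index values
df = pd.DataFrame({"5_NDVI": [0.5], "5_SAVI": [0.4], "5_GRNDVI": [0.6]})
print(cross_index_features(df, [5]))
```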

Step 3: Apply a Yeo-Johnson transformation to bring the data distribution closer to a normal distribution.
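This step maps directly onto scikit-learn's `PowerTransformer` (the toy data below is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson (unlike Box-Cox) also handles zero and negative values.
pt = PowerTransformer(method="yeo-johnson", standardize=True)

# Toy right-skewed column standing in for a vegetation-index feature
X = np.random.default_rng(0).exponential(scale=2.0, size=(500, 1))
X_t = pt.fit_transform(X)

# With standardize=True the output is also centred and scaled
print(round(float(X_t.mean()), 3), round(float(X_t.std()), 3))
```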

2nd Part: Transform the Additional Data

After doing some research and referring to some experts, we found that:

  • Maize season in Kenya is from March to October.
  • Precipitation, minimum temperature, maximum temperature are key
  • Soil features are very useful

So we created a function that transforms the additional data by calculating the mean over four years, from month 3 to month 10. For example:

We take month 3 and create a feature average_per_4Years_on_month_3, which is the mean over [month_3_2016, month_3_2017, month_3_2018, month_3_2019], and so on.
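A sketch of that transformation, assuming the additional data arrives as one column per month-year pair (the `month_{m}_{y}` naming is an assumption):

```python
import pandas as pd

def monthly_means_over_years(df, months=range(3, 11),
                             years=(2016, 2017, 2018, 2019)):
    """Average each monthly climate/soil feature across the four years,
    covering the Kenyan maize season (months 3 through 10)."""
    out = pd.DataFrame(index=df.index)
    for m in months:
        cols = [f"month_{m}_{y}" for y in years]
        out[f"average_per_4Years_on_month_{m}"] = df[cols].mean(axis=1)
    return out

# Example: one field's month-3 precipitation over the four years
df = pd.DataFrame({"month_3_2016": [1.0], "month_3_2017": [2.0],
                   "month_3_2018": [3.0], "month_3_2019": [4.0]})
print(monthly_means_over_years(df, months=[3]))  # mean is 2.5
```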

3rd Part: Create Red Bands Dataset

We created features from statistics over the relations between red bands (this is also recommended by some experts in the field). For example, for month 5 we calculate:

  • step 1: b7_b6_array = 5_S2_B7 / 5_S2_B6
  • step 2: we calculate the median over the resulting array
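The two steps above can be sketched as follows; the pixel arrays stand in for the month-5 Sentinel-2 red-edge bands `5_S2_B7` and `5_S2_B6`, and the function name is ours:

```python
import numpy as np

def band_ratio_median(band_a, band_b, eps=1e-9):
    """Step 1: per-pixel ratio of two bands (e.g. S2 B7 / B6 for one month).
    Step 2: median over the resulting array."""
    ratio = band_a / (band_b + eps)
    return float(np.median(ratio))

# Toy pixel values for 5_S2_B7 and 5_S2_B6
b7 = np.array([0.30, 0.28, 0.32])
b6 = np.array([0.25, 0.26, 0.27])
print(band_ratio_median(b7, b6))
```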

And finally, we concatenated those datasets to get a dataset with 233 features. In this approach we used only quality 1 and quality 3 fields; adding quality 2 improved our CV score but made the LB score much worse.

4th Part: Modeling

  • Using 5 KFold splits with shuffle=True.
  • Working with XGBoost with colsample_bytree = 0.65.

In this approach, our CV score is 1.59 and our LB score is 1.65.

Second Approach:

The second approach involved splitting the training set in two: good-quality fields and bad-quality fields.

  • Good-quality data included quality 2 and 3 fields.
  • Bad-quality data included quality 1 fields.

Only good quality data was used for validation as the test set had only quality 2 and 3 fields.

Vegetation features used in this approach included WDRVI, GNDVI, NDVI, and NDRE only. Raw image pixel data was not used in training. LightGBM was trained across five folds with a CV score of 1.66 and a private LB score of 1.64.

For more information on following this solution, please see our GitHub repository.