CGIAR Crop Yield Prediction Challenge

Helping Kenya
$3 000 USD
Completed (~5 years ago)
Prediction
Earth Observation
890 joined
195 active
Start: Oct 21, 20
Close: Feb 07, 21
Reveal: Feb 07, 21
ASSAZZIN
First Place Solution
Connect · 10 Feb 2021, 15:33 · edited less than a minute later · 11

Thanks to my amazing teammates @brainiac and @devnikhilmishra, we were able to secure 1st place in this great competition.

Our solution can be divided into two approaches:

First Approach:

The first approach can be divided into 4 parts:

1st Part: Vegetation Index Data Creation

Step 1:

The vegetation index dataset is built from statistics over NDVI, GRNDVI, EVI, SAVI, and CCCI. The statistics used are:

  • MEDIAN (recommended by experts): [NDVI, GRNDVI, EVI]
  • MAX: [NDVI, GRNDVI, EVI, SAVI, CCCI]
  • MIN: [NDVI, GRNDVI]
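A minimal sketch of Step 1 in Python, not the team's actual code. The index formulas are the commonly cited definitions; the function names, the band argument names, and the random demo reflectances are assumptions made for illustration.

```python
import numpy as np

def vegetation_indices(blue, green, red, red_edge, nir):
    """Per-pixel vegetation indices from reflectance arrays of equal shape."""
    ndvi = (nir - red) / (nir + red)
    grndvi = (nir - (green + red)) / (nir + green + red)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    savi = 1.5 * (nir - red) / (nir + red + 0.5)  # soil factor L = 0.5
    ndre = (nir - red_edge) / (nir + red_edge)
    ccci = ndre / ndvi
    return {"NDVI": ndvi, "GRNDVI": grndvi, "EVI": evi, "SAVI": savi, "CCCI": ccci}

# Which statistic is kept for which index, as listed in the post.
STATS = {
    "median": (np.median, ["NDVI", "GRNDVI", "EVI"]),
    "max": (np.max, ["NDVI", "GRNDVI", "EVI", "SAVI", "CCCI"]),
    "min": (np.min, ["NDVI", "GRNDVI"]),
}

def field_features(indices):
    """Collapse each per-pixel index array into one number per statistic."""
    return {
        f"{name}_{stat}": float(fn(indices[name]))
        for stat, (fn, names) in STATS.items()
        for name in names
    }

# Hypothetical reflectances for one 8x8 field, only to exercise the functions.
rng = np.random.default_rng(0)
bands = {
    "blue": rng.uniform(0.02, 0.10, size=(8, 8)),
    "green": rng.uniform(0.03, 0.12, size=(8, 8)),
    "red": rng.uniform(0.02, 0.10, size=(8, 8)),
    "red_edge": rng.uniform(0.10, 0.30, size=(8, 8)),
    "nir": rng.uniform(0.30, 0.60, size=(8, 8)),
}
features = field_features(vegetation_indices(**bands))
```

This yields 10 features per field and per time step (3 medians, 5 maxima, 2 minima).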

Step 2: From some visualizations, we discovered that NDVI, SAVI, and GRNDVI have the same distribution in specific months, so we created a function that computes, over those months:

  • products of the NDVI, SAVI, GRNDVI features
  • std of the NDVI, SAVI, GRNDVI features
  • mean of the NDVI, SAVI, GRNDVI features
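One way Step 2 could look with pandas, as a rough sketch. The column naming scheme (`NDVI_month_5` and so on) and the choice of month are hypothetical, since the post does not say which months were used.

```python
import pandas as pd

INDICES = ["NDVI", "SAVI", "GRNDVI"]

def combine_indices(df, months):
    """Row-wise product, std, and mean across the three indices, per month."""
    out = pd.DataFrame(index=df.index)
    for m in months:
        block = df[[f"{name}_month_{m}" for name in INDICES]]
        out[f"prod_month_{m}"] = block.prod(axis=1)
        out[f"std_month_{m}"] = block.std(axis=1)  # pandas default: sample std (ddof=1)
        out[f"mean_month_{m}"] = block.mean(axis=1)
    return out

# Two hypothetical fields with one month of index features.
df = pd.DataFrame({
    "NDVI_month_5": [0.6, 0.4],
    "SAVI_month_5": [0.5, 0.3],
    "GRNDVI_month_5": [0.2, 0.1],
})
combined = combine_indices(df, months=[5])
```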

Step 3: Apply the Yeo-Johnson transformation to bring the feature distributions closer to a normal distribution.

2nd Part: Transform the Additional Data
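For reference, the Yeo-Johnson transform itself can be written in a few lines of NumPy. This sketch applies it for a given lambda only; libraries such as scikit-learn (`PowerTransformer(method="yeo-johnson")`) also estimate lambda per feature by maximum likelihood, and the post does not say which implementation was used. The function name here is an assumption.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of x for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0

    # Non-negative values: Box-Cox-like transform of (x + 1).
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])

    # Negative values: mirrored transform of (1 - x) with exponent 2 - lambda.
    if abs(lmbda - 2.0) > 1e-12:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

# lambda = 1 leaves the data unchanged; lambda < 1 compresses a right tail.
skewed = np.array([0.0, 0.5, 1.0, 4.0, 20.0])
compressed = yeo_johnson(skewed, 0.0)
```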

After doing some research and referring to some experts, we found that :

  • The maize season in Kenya runs from March to October.
  • Precipitation, minimum temperature, maximum temperature, and soil features are very useful.

So we created a function that transforms the additional data by calculating, for each month from month 3 to month 10, the mean over 4 years. For example:

We take month 3 and create a feature average_per_4Years_on_month_3, which is the mean over [month_3_2016, month_3_2017, month_3_2018, month_3_2019], and so on for the other months.
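The averaging described above could be sketched like this. The column names follow the example in the post; the optional `prefix` argument (for example a variable name such as precipitation) and the demo values are assumptions.

```python
import pandas as pd

YEARS = [2016, 2017, 2018, 2019]
MONTHS = range(3, 11)  # maize season: month 3 through month 10

def average_over_years(df, prefix=""):
    """One feature per month: the mean of that month over the four years."""
    out = pd.DataFrame(index=df.index)
    for m in MONTHS:
        cols = [f"{prefix}month_{m}_{y}" for y in YEARS]
        out[f"{prefix}average_per_4Years_on_month_{m}"] = df[cols].mean(axis=1)
    return out

# Two hypothetical fields: the first grows by 1 each year, the second is constant.
data = {f"month_{m}_{y}": [float(y - 2016), 10.0] for m in MONTHS for y in YEARS}
averaged = average_over_years(pd.DataFrame(data))
```

This turns 32 monthly columns per variable into 8 seasonal averages.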

3rd Part: Create the Red Bands Dataset

We created features from statistics over the relations between red bands (this was also recommended by some experts in this field). For example, for month 5 we calculate:

  • Step 1: b7_b6_array = 5_S2_B7 / 5_S2_B6
  • Step 2: the median over the resulting array
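The two steps above can be sketched as follows. The helper name and the tiny demo arrays are hypothetical, and skipping zero-valued pixels in the denominator is an added safeguard, not something the post mentions.

```python
import numpy as np

def band_ratio_median(numerator, denominator):
    """Median of the pixel-wise ratio of two bands, ignoring zero denominators."""
    num = np.asarray(numerator, dtype=float)
    den = np.asarray(denominator, dtype=float)
    ratio = np.full(num.shape, np.nan)
    np.divide(num, den, out=ratio, where=den != 0)  # untouched cells stay NaN
    return float(np.nanmedian(ratio))

# Stand-ins for the month-5 arrays 5_S2_B7 and 5_S2_B6 of one field.
b7 = np.array([[0.30, 0.40], [0.50, 0.0]])
b6 = np.array([[0.15, 0.20], [0.25, 0.0]])
b7_b6_median = band_ratio_median(b7, b6)
```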

Finally, we concatenated those datasets to get a dataset with 233 features.

By the way, in this approach we only use quality 1 and 3 fields; adding quality 2 improves CV but makes the LB score much worse.

4th Part: Modeling

  • 5-fold KFold splits with shuffle=True.
  • XGBoost with colsample_bytree = 0.65.

With this approach, our CV is 1.59 and LB is 1.65.
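A rough sketch of the validation loop, assuming the score is RMSE (the post does not name the metric). To keep it runnable without extra dependencies, the model is passed in as a factory and a trivial mean predictor stands in for XGBoost; all names and the random data are hypothetical.

```python
import numpy as np

def shuffled_kfold(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs from a shuffled K-fold split."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def cross_validate(make_model, X, y, n_splits=5, seed=0):
    """Fit a fresh model per fold and score the out-of-fold predictions."""
    oof = np.zeros(len(y))
    for train_idx, valid_idx in shuffled_kfold(len(y), n_splits, seed):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict(X[valid_idx])
    return rmse(y, oof), oof

class MeanBaseline:
    """Stand-in model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))
y = rng.uniform(0.0, 8.0, size=50)
score, oof = cross_validate(MeanBaseline, X, y)
```

To mirror the setup in the post, `make_model` could be `lambda: xgboost.XGBRegressor(colsample_bytree=0.65)`, assuming the xgboost package is installed.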

Second Approach:

The second approach involved splitting the training set into two groups: good quality fields and bad quality fields.

  • Good quality data included quality 2 and 3 fields.
  • Bad quality data included quality 1 fields.

Only good quality data was used for validation, as the test set had only quality 2 and 3 fields.
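One possible reading of this scheme, sketched below: folds are drawn from the good quality fields only, and the bad quality fields are always kept on the training side. The post does not spell out that second part, so treat it as an assumption; the function name and demo labels are hypothetical too.

```python
import numpy as np

def quality_aware_folds(quality, n_splits=5, seed=0, good=(2, 3)):
    """Yield (train_idx, valid_idx); validation uses good quality rows only."""
    quality = np.asarray(quality)
    is_good = np.isin(quality, good)
    good_idx = np.flatnonzero(is_good)
    bad_idx = np.flatnonzero(~is_good)

    shuffled = np.random.default_rng(seed).permutation(good_idx)
    folds = np.array_split(shuffled, n_splits)
    for i in range(n_splits):
        train_good = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Assumption: bad quality rows are trained on in every fold.
        yield np.concatenate([train_good, bad_idx]), folds[i]

# Hypothetical quality labels for 30 fields.
quality = np.array([1, 2, 3] * 10)
splits = list(quality_aware_folds(quality))
```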

Vegetation features used in this approach included WDRVI, GNDVI, NDVI, and NDRE only. Raw image pixel data was not used in training. LightGBM was trained across five folds, with a CV score of 1.66 and a private LB score of 1.64.

Discussion 11 answers
Muhamed_Tuo
Inveniam

Very well done. 👏

10 Feb 2021, 15:44
Upvotes 0
University of lagos

Nice, well done, it must have been tough

10 Feb 2021, 15:58
Upvotes 0

Great feature engineering

10 Feb 2021, 16:07
Upvotes 0

Nice work, well done. It should have been mentioned somewhere in the competition that the maize season in Kenya runs from March to October. My approach took the vegetation indices (NDVI, GRNDVI, EVI) and red bands over the whole year.

10 Feb 2021, 16:08
Upvotes 0
ASSAZZIN

For clarification: in the first approach we applied the median to the vegetation indices over the whole year, but for max and min we worked with the months around the maize season.

University of uyo

Source code please, so I can learn.

ASSAZZIN

10 Feb 2021, 23:58
Upvotes 0

Just to add to the wonderful work done by my teammates: we also applied a bit of post-processing at the end. According to my analysis, the model was performing poorly on high yields, so post-processing was applied to increase the predicted yield of fields with high values (>4). This helped boost our score both in CV and on the leaderboard.
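A tiny sketch of what such a post-processing step might look like. The comment only gives the threshold (>4); the multiplicative form and the factor 1.05 are purely hypothetical.

```python
import numpy as np

def boost_high_yields(pred, threshold=4.0, factor=1.05):
    """Scale up predictions above the threshold; leave the rest unchanged."""
    pred = np.asarray(pred, dtype=float)
    return np.where(pred > threshold, pred * factor, pred)

# Hypothetical predictions; only the value above 4 is changed.
adjusted = boost_high_yields([2.0, 4.0, 5.0], factor=1.1)
```

In practice the factor would be tuned on out-of-fold predictions rather than picked by hand.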

11 Feb 2021, 04:25
Upvotes 1

Thanks for sharing!

Did you handle cloudy data somehow before calculating the indices?

11 Feb 2021, 16:51
Upvotes 0
ASSAZZIN

I read about it but I really didn't find time to try it.

ASSAZZIN

Thank you all 🤜🤛. For those who want to learn from code, and since the solution can be used by the sponsor, I will post some code in the next few days that gives a place in the top 20.

This is great. I'd like to connect with you and partner on the next hack; I have sent you an inbox message.