Digital Green Crop Yield Estimate Challenge

Helping India
€9 400 EUR
Completed (over 2 years ago)
Prediction
1368 joined
678 active
Start
Sep 04, 23
Close
Dec 03, 23
Reveal
Dec 03, 23
Winners solutions
Notebooks · 4 Dec 2023, 07:33 · 2

We'd like to see the top 3 solutions for the sake of learning

Discussion 2 answers
GIrum
Adama Science and Technology University

and from all i wanna see the first one from

4 Dec 2023, 09:44
Upvotes 0
rapsoj
University of Oxford

My team was third :)

We used an ensemble that averaged the predictions of three tree-based models: XGBoost, CatBoost, and LightGBM. Tree-based methods are great for handling tabular data with non-linear relationships. We also used cross-validation to train and test our model in order to reduce overfitting.
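As a rough illustration of the averaging-plus-cross-validation idea described above (a sketch, not the team's actual code), the snippet below averages the predictions of several regressors and scores the blend with k-fold cross-validation. `MeanRegressor` is a hypothetical stand-in so the example runs with the standard library only; in practice any object exposing `fit` and `predict`, such as `xgboost.XGBRegressor`, `catboost.CatBoostRegressor`, or `lightgbm.LGBMRegressor`, could be passed instead. The RMSE score, the contiguous fold layout, and the function names are assumptions made for this example.

```python
from statistics import mean


class MeanRegressor:
    """Hypothetical stand-in model: predicts the training mean times a scale."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def fit(self, X, y):
        self.mean_ = mean(y)
        return self

    def predict(self, X):
        return [self.mean_ * self.scale for _ in X]


def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def average_predictions(models, X):
    """Blend fitted models by taking the plain mean of their predictions."""
    per_model = [m.predict(X) for m in models]
    return [mean(column) for column in zip(*per_model)]


def cv_rmse(make_models, X, y, k=5):
    """Mean RMSE of the averaged ensemble across k folds.

    make_models is a zero-argument callable returning fresh, unfitted models,
    so no fold ever sees a model trained on its own validation rows.
    """
    fold_scores = []
    for train, val in kfold_indices(len(X), k):
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        models = [m.fit(X_train, y_train) for m in make_models()]
        preds = average_predictions(models, [X[i] for i in val])
        mse = mean((p - y[i]) ** 2 for p, i in zip(preds, val))
        fold_scores.append(mse ** 0.5)
    return mean(fold_scores)
```

A call such as `cv_rmse(lambda: [XGBRegressor(), CatBoostRegressor(verbose=0), LGBMRegressor()], X, y)` would score the three-model blend, assuming those libraries are installed and `X`, `y` are plain Python lists.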

However, I think the most important part was the data cleaning stage. We probably spent 95% of the time on data cleaning, which involved working to understand the variables and reading literature on which variables affect crop yield in Bihar/India. As a result, we learned about different agricultural traditions in North Bihar vs. South Bihar (which is more agriculturally productive and employs the Ahar Pyne agricultural system, leveraging channels and retention ponds to manage water resources and adapt to Bihar's unpredictable weather). We also learned about the importance of the monsoon in Bihar agricultural cycles. Kharif crops, such as rice, are sown during the monsoon season from June to September and are watered by monsoon rainfall. These crops do well with high rainfall during the monsoon. Rabi crops, such as wheat, are sown in mid-November, preferably after the monsoon rains are over, and are watered by percolated rainfall. These crops are spoiled by high rain in winter. We also learned about nitrogen cycles, fertilizer application methods, and irrigation techniques.

With all of this information, we had a pretty good idea that the region in which the crops were grown was important (North vs. South Bihar), as were the dates on which key agricultural steps were taken and the fertilisation choices. We engineered our features to reflect this, and selected the top variables using recursive feature elimination with cross-validation.
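To make the feature engineering idea more concrete, here is a minimal sketch of how region and date information might be turned into model inputs. It is not the team's code: the field names (`district`, `sowing_date`, `harvest_date`), the district lookup, and the choice of features are all assumptions for illustration, and the recursive feature elimination step is not shown.

```python
from datetime import date

# Illustrative lookup only: districts lying south of the Ganges.
# This mapping is an assumption for the example, not taken from the post.
SOUTH_BIHAR_DISTRICTS = {"Gaya", "Nalanda", "Jamui"}


def engineer_features(record):
    """Derive region and timing features from one raw record.

    `record` is a dict with hypothetical keys: "district" (str) and
    "sowing_date" / "harvest_date" (ISO strings, YYYY-MM-DD).
    """
    sowing = date.fromisoformat(record["sowing_date"])
    harvest = date.fromisoformat(record["harvest_date"])
    return {
        # Region flag: North vs. South Bihar
        "is_south_bihar": int(record["district"] in SOUTH_BIHAR_DISTRICTS),
        # Position of sowing within the year (1 = 1 January)
        "sowing_day_of_year": sowing.timetuple().tm_yday,
        # Monsoon season taken as June to September, as described in the post
        "sown_in_monsoon": int(6 <= sowing.month <= 9),
        # Length of the growing period in days
        "days_sowing_to_harvest": (harvest - sowing).days,
    }


example = engineer_features(
    {"district": "Gaya", "sowing_date": "2022-07-01", "harvest_date": "2022-11-10"}
)
```

Features like these can then be passed to a selection step; scikit-learn's `RFECV` is one existing implementation of recursive feature elimination with cross-validation.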

Overall, we only spent around 5% of the time building the actual model (and only started fine-tuning the model two days before the deadline). The trick is to understand the data and not overfit to the public test set.

4 Dec 2023, 11:38
Upvotes 5