🚜 Join the Buzz: 17th Place Solution (3rd on pu...

Digital Green Crop Yield Estimate Challenge

Helping India

€9 400 EUR

Completed (over 2 years ago)

Skills you will learn

Prediction

1370 joined

677 active

Info Data Chat Leaderboard

Start

Sep 04, 23

Dec 03, 23

Reveal

Dec 03, 23

cliff003

17th Place Solution (3rd on public)

Notebooks · 4 Dec 2023, 15:04 · 0

Hello, everyone. I hope you're all enjoying this competition, which feels like a lottery game.

I've been using basic models with some tweaks to score high on the public leaderboard, and surprisingly, these models are doing well on the private leaderboard, too (17th place with an RMSE of 120).

My approach involved using the DBSCAN method to identify outliers. I then adjusted these outliers by either multiplying or dividing them by 10 to fit the main regression line better. For the predictions, I used three simple models - Extra Trees, Catboost, and LightGBM - without any special adjustments to their settings. These models worked fast, giving results in less than 20 seconds.

Initially, this method gave me an RMSE score above 400 without any manual changes, which wasn't very good. However, by manually adjusting two outliers that I found from the public leaderboard, I saw a big improvement. I changed the value of ID_PMSOXFT4FYDW, which many people discussed, to 8000. The second one, ID_BI4VNVU7JAXF, was harder to figure out, but I estimated it to be 3200. These changes helped me climb to 3rd place on the public leaderboard.

I know that these manual changes won't help on the private leaderboard. So, I tried another method to make the model work better for the entire dataset, but unfortunately, it didn't succeed. The original simple model has better performance for the private dataset.

Among my simple models, I found that Extra Trees gave the best results. But I don't think the choice of model is the main issue. After looking at other top solutions, I realized this competition's unpredictability makes it feel like a lottery. Different models and settings (or even seed selection) can lead to different outcomes. Additionally, many competitors will discover that some 'poor' results they previously submitted might actually achieve better final scores due to the unreliable nature of public scores and local cross-validation.

I've shared my basic models on GitHub at (https://github.com/cliff003/Digital-Green-Crop-Yield-Estimate-Challenge ). I hope they are helpful and explain how I did well on the public leaderboard.

Discussion 0 answers

Join the largest network for
data scientists and AI builders

About FAQs

Status