
Turtle Recall: Conservation Challenge

Helping Kenya · $10 000 USD · Completed
Classification · Computer Vision
753 joined · 247 active
Start: 19 Nov 2021 · Close: 21 Apr 2022 · Reveal: 21 Apr 2022
Cross validation scores vs leaderboard
Help · 28 Feb 2022, 18:30 · 4

Most of you have expressed concerns about the cross-validation scores being way better than the leaderboard scores; thank you for raising the concerns.

1. Most of the people I have talked to are using stratified folds, k-fold, etc., to validate their scores locally, and there is a huge difference in the scores.

2. How many people are validating with a plain train/test split (i.e., completely setting aside 30% of the train data, training the model on the rest, and scoring on the held-out part)? Are you still getting the same huge difference between your score and that on the leaderboard?

I have set aside 490 rows from the train data as my validation set, then trained my model (using the starter notebook) on the remaining 1656 rows - the score I get on the 490 held-out rows is 0.054965986394557825, which is not very different from the leaderboard. This is just my observation. My request: could you faithfully do the same and share the scores you get (no folds, please) :)
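For anyone who wants to reproduce this check, here is a minimal sketch of the hold-out validation described above. It assumes a MAP@5-style metric, where each query image gets a ranked list of up to five candidate turtle IDs; the column names and the random dummy predictor are illustrative stand-ins for the real data and model.

```python
import numpy as np
import pandas as pd

def map_at_5(y_true, y_pred_lists):
    """Mean average precision @ 5: each prediction is a ranked
    list of up to 5 candidate turtle IDs per query image."""
    scores = []
    for truth, preds in zip(y_true, y_pred_lists):
        score = 0.0
        for rank, p in enumerate(preds[:5]):
            if p == truth:
                score = 1.0 / (rank + 1)  # credit decays with rank
                break
        scores.append(score)
    return float(np.mean(scores))

# Illustrative hold-out split: keep 490 rows aside, train on the rest.
rng = np.random.default_rng(0)
turtle_ids = [f"t_{j}" for j in range(100)]
train = pd.DataFrame({
    "image_id": [f"img_{i}" for i in range(2146)],
    "turtle_id": rng.choice(turtle_ids, size=2146),
})
holdout = train.sample(n=490, random_state=42)
fit_set = train.drop(holdout.index)  # 1656 rows, as in the post

# ... train your model on fit_set here, then predict on holdout ...
# Dummy predictor for the sketch: 5 random candidate IDs per row.
preds = [list(rng.choice(turtle_ids, size=5, replace=False))
         for _ in range(len(holdout))]

print(round(map_at_5(holdout["turtle_id"].tolist(), preds), 4))
```

With a random predictor over 100 IDs the score stays near chance level; the point is only that the held-out score and the leaderboard score are computed the same way and should roughly agree.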

Let's work together to get the best solutions to conserve the turtles.

Discussion 4 answers

"Are you still getting the same huge difference between your score and that on the leaderboard?"

The answer is yes. Please read the discussion below carefully, where people checked this by labelling part of the test set themselves - the metric should be much higher. That is the first issue. The second is very strange behaviour of the Zindi backend: I changed a huge part of my submission and my LB score didn't change at all! That can only mean the public score is estimated on very few samples, probably with a different number in the denominator. And finally, if you just look at the predictions on the test set, it turns out they are very good (the turtles are the same) - the model gets the right answer in the majority of samples.

Setting aside ~430 rows, training on the rest, and scoring on the rows held back during training gives 0.54 locally - drastically different from the leaderboard. Notebook: https://colab.research.google.com/drive/1Ijn9CpBaekJIZ6rYQ_Mw0jbq-PyrbXbq?usp=sharing

Manually reviewing some random predictions from a submission showed that close to half of them seemed right: https://colab.research.google.com/drive/1HJsQbj7pCvokgP9yvbGSKUnGZxiUwD9k?usp=sharing
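A quick way to run that kind of spot-check yourself. The column names and the inline DataFrame are hypothetical stand-ins for a loaded submission file; adapt them to the actual submission format.

```python
import pandas as pd

# Stand-in for a loaded submission (hypothetical columns).
sub = pd.DataFrame({
    "image_id": [f"img_{i}" for i in range(100)],
    "turtle_id": [f"t_{i % 7}" for i in range(100)],
})

# Draw a fixed random sample so the review is reproducible.
sample = sub.sample(n=20, random_state=0)
for _, row in sample.iterrows():
    # For each sampled row, open the query image and the reference
    # photos of the predicted turtle side by side and judge the match.
    print(row["image_id"], "->", row["turtle_id"])
```

Twenty eyeballed pairs is already enough to tell ~5% accuracy apart from ~50%.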

I suggest picking a submission from someone high on the leaderboard and reviewing the predictions yourself using the notebook above - it should quickly become obvious whether they're right ~5% of the time or ~50% of the time; if it's the latter, there is an issue with the scoring ;) Are you sure there isn't an errant .head(30) or something in the scoring code that's only scoring the first N rows rather than the whole submission?
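To illustrate the suspicion: this is not Zindi's actual scoring code (nobody outside Zindi has seen it), just a toy sketch of the mechanism, with plain accuracy standing in for the real metric. If the scorer truncated the submission before computing the metric, the score would be deflated silently.

```python
import pandas as pd

# Toy ground truth and a submission that is mostly correct,
# except the first 30 rows are deliberately wrong.
truth = pd.DataFrame({
    "image_id": range(1000),
    "turtle_id": ["t1" if i % 2 == 0 else "t2" for i in range(1000)],
})
sub = truth.copy()
sub.loc[:29, "turtle_id"] = "wrong"

def score(submission, ground_truth):
    merged = submission.merge(ground_truth, on="image_id",
                              suffixes=("_pred", "_true"))
    return (merged["turtle_id_pred"] == merged["turtle_id_true"]).mean()

full = score(sub, truth)            # scores the whole submission
buggy = score(sub.head(30), truth)  # an errant .head(30): first 30 rows only

print(full, buggy)  # 0.97 0.0
```

A bug of this shape would also explain the observation above that changing a huge part of a submission leaves the LB score unchanged: edits outside the scored slice are invisible to the metric.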

1 Mar 2022, 12:23
Upvotes 0

I got a response from Zindi support saying they'll review the situation this week. Hope they can debug their backend :)

Thanks, @kiryusha. My local validation score does not correlate with the LB score at all. I strongly believe the LB score calculation process is not correct.