NASA Harvest Field Boundary Detection Challenge
Can you detect field boundaries for Rwandan smallholder farmers in a satellite image dataset?
$5 000 USD
~1 month to go
126 active · 568 enrolled
Earth Observation
Error metric for semantic segmentation
Platform · 6 Jan 2023, 04:19 · 1

The way the error metric (F1 score) is calculated for this competition makes it very sensitive to small deviations in the predictions. The F1 score, as the harmonic mean of precision and recall, is a good metric for a classification task. For semantic segmentation, however, the Dice coefficient (the Sorensen-Dice index, which is also called the F1 score, creating some confusion there) should be computed over the *areas* of the masks of individual fields. The Dice coefficient is closely related to the IoU (Intersection over Union) metric. Identifying individual masks for individual fields could also have been given some weight in the evaluation. I wonder what the primary motivation was for choosing the F1 metric for this competition.
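To make the suggestion concrete, here is a minimal sketch of the area-based Dice coefficient described above, computed on toy field-area masks (the masks and shapes are invented for illustration, not taken from the competition data):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Sorensen-Dice index on binary masks: 2*|A ∩ B| / (|A| + |B|).

    On binary masks this equals pixel-wise F1, but computed over field
    *areas* it degrades gracefully under small boundary shifts, unlike
    F1 computed on 1-pixel-wide border masks.
    """
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy 6x6 field-area masks: the predicted field is shifted by one pixel.
truth = np.zeros((6, 6), dtype=bool); truth[1:5, 1:5] = True  # 16 px
pred  = np.zeros((6, 6), dtype=bool); pred[2:6, 2:6] = True   # 16 px, shifted
# Overlap is the 3x3 region = 9 px, so Dice = 2*9 / (16+16) = 0.5625
print(dice_coefficient(pred, truth))
```

A one-pixel shift of a whole field still scores 0.5625 here, whereas the same shift applied to a 1-pixel-wide boundary mask can drive the score to zero.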

Discussion 1 answer

To complain effectively, you should describe the problem and propose solutions.

Returning to the selected error metric for this competition: it can be frustrating to try various approaches and get different leaderboard results, as if it were a matter of luck (there is low correlation between local and leaderboard scores). I think the issue isn't the metric itself, but the way we evaluate models locally, since that is the criterion we use to select one model over another. As shown in the attached figure, false positives hurt the score heavily because the ground truth borders are only 1 pixel wide: a 1x and a 2x dilation of the ground truth mask reduces the F1 score from 1.00 to 0.54 and 0.37, respectively. Based on this, you should carefully reconsider the criterion you use to select a model over others.
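The dilation experiment behind the figure can be reproduced with a toy example. This is a sketch only: the 32x32 square boundary below is invented, and the exact F1 values depend on the mask shape and the structuring element, so they will not match the figure's 0.54 and 0.37 exactly. The qualitative effect (recall stays perfect, precision collapses) is the point:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def pixel_f1(pred: np.ndarray, truth: np.ndarray) -> float:
    """Pixel-wise F1 = 2*TP / (2*TP + FP + FN) on binary masks."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    if 2 * tp + fp + fn == 0:
        return 1.0  # both masks empty
    return 2 * tp / (2 * tp + fp + fn)

# Toy ground truth: a 1-pixel-wide square field boundary in a 32x32 tile.
truth = np.zeros((32, 32), dtype=bool)
truth[5:25, 5] = truth[5:25, 24] = True
truth[5, 5:25] = truth[24, 5:25] = True

# Thicken the exact boundary: every extra pixel is a false positive,
# so F1 against the 1-pixel-wide truth drops steeply with each dilation.
for n in (0, 1, 2):
    pred = binary_dilation(truth, iterations=n) if n else truth
    print(f"{n}x dilation: F1 = {pixel_f1(pred, truth):.2f}")
```

In other words, a prediction that contains the entire true boundary but is one or two pixels too thick is punished almost as hard as a mostly wrong one, which is exactly why local model selection on this metric feels noisy.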

6 Jan 2023, 09:19
Upvotes 3