The current reliance on the ROUGE score is fundamentally misaligned with the competition's goals. ROUGE rewards surface-level word overlap, not clinical correctness, reasoning quality, or safety, all of which are vital in healthcare applications. This pushes participants to optimize for phrasing tricks rather than real, explainable medical logic.
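To make the overlap problem concrete, here is a minimal sketch of a unigram ROUGE-1 F1 score. It is a simplified stand-in that I wrote for illustration, not the competition's actual scorer (it does no stemming or special tokenization), and the three sentences are invented examples, not competition data. Under those assumptions, a reply that drops a single "not" and reverses the clinical advice still scores far higher than a safe paraphrase.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented purely for illustration.
reference = "do not give aspirin to the patient"
unsafe = "do give aspirin to the patient"  # opposite advice, nearly identical wording
safe = "aspirin should be avoided for this patient"  # same advice, different wording

print(f"unsafe answer: {rouge1_f1(reference, unsafe):.3f}")
print(f"safe answer:   {rouge1_f1(reference, safe):.3f}")
```

In this toy case the unsafe answer shares six of the reference's seven words, while the safe paraphrase shares only two, so the metric ranks the dangerous answer well above the correct one.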
As a participant, I experienced this first-hand: my stronger submission was not uploaded in time because of a file mix-up. Since the leaderboard is driven by a flawed metric, this error now unfairly penalizes work that genuinely focused on safe, structured clinical reasoning.
I urge the organizers to consider reopening and extending the competition, and to adopt a more appropriate evaluation method, one that aligns with clinical standards and captures real-world utility. This would not only ensure fair judgment for all contributors but also serve the end-users with more trustworthy, high-impact solutions.
Warm regards, Khushi Malik