I've recently returned to working on the Kenya Clinicians Response Generation competition, and after taking a closer look at how submissions are evaluated, I genuinely believe we need to reconsider ROUGE as the primary metric.
While ROUGE is popular for summarization tasks, it’s not well-suited for assessing clinically meaningful, accurate responses. In this competition, it’s possible to achieve high ROUGE scores without truly answering the clinical question.
Here’s a quick experiment I'd encourage anyone to try: take the input prompt, remove the references to the nurse, clean out the newlines and punctuation, and submit it back as your response, essentially just repeating and lightly editing the original input. I tried this, and it scored 0.3983, nearly 0.40. That’s alarmingly close to what thoughtful, well-crafted answers from advanced LLMs are getting.
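To make the failure mode concrete, here's a minimal sketch of the unigram-overlap F1 at the heart of ROUGE-1. It's pure Python with none of the official implementation's refinements (stemming, ROUGE-L, etc.), and the clinical strings are made up purely for illustration, but it shows how an echoed prompt can outscore a genuinely useful answer:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (the core idea behind ROUGE-1), no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy illustration (invented strings, not competition data):
prompt = ("a 34 year old patient presents to the clinic with fever "
          "headache and joint pain for three days what should the nurse do")
reference = ("the patient presents with fever headache and joint pain "
             "suspect malaria do a rapid test and start treatment if positive")

echo = prompt  # just hand the (lightly cleaned) prompt back
real_answer = "suspect malaria order a rapid diagnostic test then treat accordingly"

print(rouge1_f1(echo, reference))         # high, despite answering nothing
print(rouge1_f1(real_answer, reference))  # lower, despite being clinically useful
```

The echo wins simply because it shares many surface tokens with the reference, which is exactly the lexical-overlap bias being discussed.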
And here's the kicker:
Our best model doesn’t even use an LLM. It relies purely on clever statistical combinations of the input prompt, and it's performing near the top of the leaderboard.
This tells me that the metric rewards lexical overlap more than clinical reasoning. It also explains why T5-style models perform surprisingly well: they're just good at summarizing or echoing the prompt. Meanwhile, more capable LLMs that generate nuanced responses (with lower surface-level overlap) are effectively penalized.
My suggestion is that once the competition ends, we consider adopting a more robust evaluation approach, such as semantic similarity measures or LLM-as-a-judge scoring.
I believe we should aim for a more clinically meaningful evaluation that better reflects the task’s real-world purpose. After all, this isn’t just academic: these models are being built to support real clinicians, for real patients. The evaluation metric needs to reflect that level of responsibility.
Curious to hear what others think and please do try the prompt-repeating trick to see the metric’s behavior for yourself.
I also had this feeling @Koleshjr. I wondered why T5 was surprisingly doing better than other LLMs. I read up a bit on the ROUGE score and realized it just looks for exact word matches, which is problematic if you are simply giving back a preprocessed input prompt. I think a more justifiable evaluation must be done.
I do agree that rouge score is a bad metric for this challenge and that LLM-as-a-judge is a better evaluation strategy.
Since this is clinical data, the treatments for different conditions are known, so evaluation should focus on how well-grounded the model's responses are in some trusted biological source text, and how helpful and concise the response is for a user. LLM evaluators are much better at measuring this semantic quality than ROUGE, which is better suited to translation and summarization tasks.
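A grounded LLM-as-a-judge setup could look roughly like the sketch below. The rubric and criterion names are illustrative assumptions, not anything Zindi has specified; the judge model (e.g. Qwen3 32B) would be called separately with the built prompt:

```python
# Illustrative rubric only; criteria and wording are assumptions, not a spec.
JUDGE_TEMPLATE = """You are an experienced clinician acting as an evaluator.

Clinical question:
{question}

Candidate response:
{response}

Reference guidance (trusted source text):
{source}

Rate the candidate response from 1 to 5 on each criterion:
- grounding: is it consistent with the reference guidance?
- safety: could any of its advice plausibly harm the patient?
- helpfulness: does it actually answer the question, concisely?

Return JSON like {{"grounding": 4, "safety": 5, "helpfulness": 3}}."""

def build_judge_prompt(question: str, response: str, source: str) -> str:
    """Fill the rubric template for one (question, response) pair."""
    return JUDGE_TEMPLATE.format(question=question, response=response,
                                 source=source)
```

Scoring on explicit criteria like these, and averaging over multiple judge runs, tends to be more robust than a single free-form quality rating.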
Yes, I agree. I also wondered why a summarizer model like T5 is getting a better score than other language models. The dataset also being so small is a problem for generalisation.
For sure, the dataset is also too small for good generalisation. That's why I was looking forward to finetuning and seeing how to avoid overfitting.
Can't we use early stopping to control overfitting?
I think this is the reason why there is a second evaluation phase.
Nearly all of the top 50 models are T5 models that aren't actually answering anything. How will they filter for superior solutions before the second evaluation phase? We need a better metric after the competition ends.
I'm not sure a reliable metric for such a challenge really exists. I personally searched for cases where similar work was done, but I could not find any where free-flowing text was evaluated against some kind of ground truth. Most cases involve multiple-choice questions, which are easy to evaluate.
That's why I recommended LLM-as-a-judge, which is a very popular evaluation method in the LLM space: you use a superior model to judge inferior models. And given that we have very powerful open-source reasoning models like Qwen3 32B, I don't think it would be hard for Zindi to implement this after the competition ends.
Yes, you are correct. Even when a response is clinically good, ROUGE gives it a low score, while simply copying and pasting the prompt itself scores around 0.40. This is unfair evaluation; they must change the evaluation metric.
Excellent point on the ROUGE score limitation @Koleshjr
Your prompt-echoing experiment is a perfect demonstration of why lexical overlap metrics can be misleading for clinical tasks. I've been working on a T5-based approach for this competition, and I've observed similar patterns. My models achieve decent ROUGE scores, but when I manually review the outputs, there's often a disconnect between what scores well and what would actually be clinically useful. The metric rewards surface-level similarity rather than clinical reasoning, diagnostic accuracy, or appropriate treatment recommendations.
I've made a few additional observations along these lines in my experiments.
For post-competition evaluation, I'd suggest a multi-faceted approach rather than any single metric.
The real-world implications are crucial. These models are intended to support healthcare workers in resource-constrained settings. A model that echoes prompts well but provides poor clinical guidance could be genuinely harmful.
Well said. By the way, using an LLM and generating a response of around 64 tokens (which means the answer gets cut off) performs better than one with 256 tokens, which is nearly the whole response.
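That truncation effect falls straight out of the F1 formulation: extra tokens beyond what overlaps the reference hurt precision even when they're clinically sound. A small self-contained check (same unigram-F1 idea that underlies ROUGE-1; the strings are invented for illustration):

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, the core idea behind ROUGE-1."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

reference = "give oral rehydration salts and zinc monitor for danger signs"

# Truncated answer: cut off early, but every token overlaps the reference.
truncated = "give oral rehydration salts and zinc"
# Full answer: covers the whole reference, then adds reasonable extra advice.
full = ("give oral rehydration salts and zinc monitor for danger signs "
        "also counsel the caregiver on hygiene feeding and follow up visits "
        "and document everything in the patient file for the records")

print(unigram_f1(truncated, reference))  # wins on precision
print(unigram_f1(full, reference))       # perfect recall, but penalized for length
```

So a generation cap that happens to match the reference length can outscore a more complete answer, which is another way the metric diverges from clinical quality.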
That's right @Koleshjr
The most ridiculous thing was that I had set my maximum tokens to 380.
And that's because, after preprocessing, all I wanted was for the model to paraphrase and regenerate longer sentences with high token-prediction accuracy, boosting the score that way.
But in hindsight, that wasn't the right thing to do, simply because I was training the model to memorize rather than attain contextual reasoning...