I've recently returned to working on the Kenya Clinicians Response Generation competition, and after taking a closer look at how submissions are evaluated, I genuinely believe we need to reconsider ROUGE as the primary metric.
While ROUGE is popular for summarization tasks, it’s not well-suited for assessing clinically meaningful, accurate responses. In this competition, it’s possible to achieve high ROUGE scores without truly answering the clinical question.
Here’s a quick experiment I'd encourage anyone to try: take the input prompt, remove the references to the nurse, clean out the newlines and punctuation, and submit it back as your response, essentially just repeating and lightly editing the original input. I tried this, and it scored 0.3983, nearly 0.40. That’s alarmingly close to what thoughtful, well-crafted answers from advanced LLMs are getting.
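To make the failure mode concrete, here's a minimal sketch of the unigram-overlap F1 at the heart of ROUGE-1. It's pure Python with none of the official implementation's refinements (stemming, ROUGE-L, etc.), and the clinical strings are made up purely for illustration, but it shows how an echoed prompt can outscore a genuinely useful answer:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (the core idea behind ROUGE-1), no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy illustration (invented strings, not competition data):
prompt = ("a 34 year old patient presents to the clinic with fever "
          "headache and joint pain for three days what should the nurse do")
reference = ("the patient presents with fever headache and joint pain "
             "suspect malaria do a rapid test and start treatment if positive")

echo = prompt  # just hand the (lightly cleaned) prompt back
real_answer = "suspect malaria order a rapid diagnostic test then treat accordingly"

print(rouge1_f1(echo, reference))         # high, despite answering nothing
print(rouge1_f1(real_answer, reference))  # lower, despite being clinically useful
```

The echo wins simply because it shares many surface tokens with the reference, which is exactly the lexical-overlap bias being discussed.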
And here's the kicker:
Our best model doesn’t even use an LLM. It relies purely on clever statistical combinations of the input prompt, and it's performing near the top of the leaderboard.
This tells me that the metric rewards lexical overlap more than clinical reasoning. It also explains why T5-style models perform surprisingly well: they're just good at summarizing or echoing the prompt. Meanwhile, more capable LLMs that generate nuanced responses (with lower surface-level overlap) are effectively penalized.
My suggestion is that once the competition ends, we consider adopting a more robust evaluation approach, such as semantic similarity measures or LLM-as-a-judge scoring.
I believe we should aim for a more clinically meaningful evaluation that better reflects the task’s real-world purpose. After all, this isn’t just academic: these models are being built to support real clinicians, for real patients. The evaluation metric needs to reflect that level of responsibility.
Curious to hear what others think and please do try the prompt-repeating trick to see the metric’s behavior for yourself.
I also had this feeling @Koleshjr. I wondered why T5 was surprisingly doing better than other LLMs. I read up a bit on the ROUGE score and realized it just looks for exact word matches, which is problematic if you are simply giving back a preprocessed input prompt. I think a more justifiable evaluation must be done.
I do agree that rouge score is a bad metric for this challenge and that LLM-as-a-judge is a better evaluation strategy.
Since this is clinical data, the treatments for different conditions are known, so evaluation should focus on how well-grounded the model's responses are in some trusted biological source text, and how helpful and concise the response is for a user. LLM evaluators are much better at measuring this semantic quality than ROUGE, which is better suited to translation and summarization tasks.
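A grounded LLM-as-a-judge setup could look roughly like the sketch below. The rubric and criterion names are illustrative assumptions, not anything Zindi has specified; the judge model (e.g. Qwen3 32B) would be called separately with the built prompt:

```python
# Illustrative rubric only; criteria and wording are assumptions, not a spec.
JUDGE_TEMPLATE = """You are an experienced clinician acting as an evaluator.

Clinical question:
{question}

Candidate response:
{response}

Reference guidance (trusted source text):
{source}

Rate the candidate response from 1 to 5 on each criterion:
- grounding: is it consistent with the reference guidance?
- safety: could any of its advice plausibly harm the patient?
- helpfulness: does it actually answer the question, concisely?

Return JSON like {{"grounding": 4, "safety": 5, "helpfulness": 3}}."""

def build_judge_prompt(question: str, response: str, source: str) -> str:
    """Fill the rubric template for one (question, response) pair."""
    return JUDGE_TEMPLATE.format(question=question, response=response,
                                 source=source)
```

Scoring on explicit criteria like these, and averaging over multiple judge runs, tends to be more robust than a single free-form quality rating.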
Yes, I agree. I also wondered why a summarizer model like T5 is getting a better score than other language models. The dataset also being so small is a problem for generalisation.
For sure, the dataset is also too small for good generalisation. That's why I was looking forward to finetuning and seeing how to avoid overfitting.
Can't we use early stopping to control overfitting?
I think this is the reason why there is a second evaluation phase.
Nearly all of the top 50 models are T5 models that aren't actually answering anything. How will they filter for superior solutions before the second evaluation phase? We need a better metric after the competition ends.
I'm not sure a reliable metric for such a challenge really exists. I personally searched for cases where similar work was done, but I could not find any where free-flowing text was evaluated against some kind of ground truth. Most cases involve multiple-choice questions, which are easy to evaluate.
That's why I recommended LLM-as-a-judge, which is a very popular evaluation method in the LLM space: you use a superior model to judge inferior models. And given that we have very powerful open-source reasoning models like Qwen3 32B, I don't think it would be hard for Zindi to implement this after the competition ends.
Yes, you are correct. Even when a response is clinically good, ROUGE gives it a low score, while simply copying and pasting the prompt itself scores around 0.40. This is unfair evaluation; they must change the evaluation metric.
Excellent point on the ROUGE score limitation @Koleshjr
Your prompt-echoing experiment is a perfect demonstration of why lexical overlap metrics can be misleading for clinical tasks. I've been working on a T5-based approach for this competition, and I've observed similar patterns. My models achieve decent ROUGE scores, but when I manually review the outputs, there's often a disconnect between what scores well and what would actually be clinically useful. The metric rewards surface-level similarity rather than clinical reasoning, diagnostic accuracy, or appropriate treatment recommendations.
I've made a few additional observations along these lines in my experiments.
For post-competition evaluation, I'd suggest a multi-faceted approach rather than any single metric.
The real-world implications are crucial. These models are intended to support healthcare workers in resource-constrained settings. A model that echoes prompts well but provides poor clinical guidance could be genuinely harmful.
Well said. By the way, using an LLM and generating a response of around 64 tokens (which means the answer gets cut off) performs better than one with 256 tokens, which is nearly the whole response.
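That truncation effect falls straight out of the F1 formulation: extra tokens beyond what overlaps the reference hurt precision even when they're clinically sound. A small self-contained check (same unigram-F1 idea that underlies ROUGE-1; the strings are invented for illustration):

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, the core idea behind ROUGE-1."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

reference = "give oral rehydration salts and zinc monitor for danger signs"

# Truncated answer: cut off early, but every token overlaps the reference.
truncated = "give oral rehydration salts and zinc"
# Full answer: covers the whole reference, then adds reasonable extra advice.
full = ("give oral rehydration salts and zinc monitor for danger signs "
        "also counsel the caregiver on hygiene feeding and follow up visits "
        "and document everything in the patient file for the records")

print(unigram_f1(truncated, reference))  # wins on precision
print(unigram_f1(full, reference))       # perfect recall, but penalized for length
```

So a generation cap that happens to match the reference length can outscore a more complete answer, which is another way the metric diverges from clinical quality.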
That's right @Koleshjr
The most ridiculous thing was that I had set my maximum tokens to 380.
And that's because, after preprocessing, all I wanted was for the model to paraphrase and regenerate longer sentences with high token-prediction accuracy, boosting the score that way.
But in hindsight, that wasn't the right thing to do, simply because I was training the model to memorize rather than attain contextual reasoning...