
Kenya Clinical Reasoning Challenge

Helping Kenya
$10 000 USD
Completed (9 months ago)
Prediction
Natural Language Processing
SLM
1664 joined
440 active
Start
Apr 03, 25
Close
Jun 29, 25
Reveal
Jun 30, 25
mail_liw
Challenging ROUGE: Why We Need Better Metrics for Clinical Text Generation
Help · 8 Apr 2025, 07:51 · 21

@Amy_Bray After working deeply with this Kenyan clinical dataset, I’m becoming increasingly convinced that ROUGE is not the right metric to evaluate our models' performance.

Here’s why.

This challenge asks us to replicate real clinician reasoning in rural, resource-constrained environments. That means our models should generate medically accurate, coherent, and context-sensitive responses—not just repeat phrases that appear in the gold answers.

Yet ROUGE, the primary metric being used, rewards surface-level token overlap. It penalizes deeper reasoning, and even medically sound elaborations if they don’t match the reference verbatim. Ironically, a generic summarization model (which doesn't even attempt a clinical decision) can sometimes score higher than a clinically sound response—simply because it rephrases the vignette in a way that overlaps more with the reference. That’s not just misleading; it actively discourages progress.
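To make this concrete, here is a minimal pure-Python sketch of ROUGE-1 F1 (unigram overlap with clipped counts). The clinical sentences are invented for illustration, and real ROUGE implementations add details like stemming, but the failure mode is the same:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: unigram overlap with clipped counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped match count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "start iv fluids and refer the patient for urgent review"
paraphrase = "begin intravenous rehydration and arrange an urgent referral"
copycat = "start iv fluids and refer the patient"

print(rouge1_f1(reference, paraphrase))  # low: different words, same meaning
print(rouge1_f1(reference, copycat))     # high: verbatim overlap, no decision added
```

The medically equivalent paraphrase scores far below the partial verbatim copy, which is exactly the incentive problem.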

So what’s a better alternative?

I’d strongly argue for BERTScore (Zhang et al., 2019). It uses contextual embeddings from pre-trained language models (like BERT) to compare the semantic similarity between the model’s output and the reference. This allows it to reward meaning over exact wording.

In a clinical context, this is a game changer. Two doctors can give equally valid answers in totally different words. BERTScore captures that. ROUGE doesn’t.
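A toy sketch of BERTScore's core mechanic (greedy cosine matching of token embeddings) shows why. The two-dimensional vectors below are hand-picked stand-ins for BERT's contextual embeddings, not real model output; synonyms are deliberately placed close together:

```python
import math

# Toy word vectors standing in for BERT's contextual embeddings;
# synonyms are deliberately placed close together.
VECS = {
    "hypertension": (0.9, 0.1), "high": (0.85, 0.2), "pressure": (0.8, 0.15),
    "patient": (0.1, 0.9), "has": (0.2, 0.8), "suffers": (0.25, 0.75),
    "from": (0.3, 0.7), "blood": (0.82, 0.18),
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def bertscore_recall(reference: str, candidate: str) -> float:
    """Greedy matching: each reference token takes its best match
    in the candidate (the recall half of BERTScore)."""
    ref, cand = reference.split(), candidate.split()
    return sum(max(cos(VECS[r], VECS[c]) for c in cand) for r in ref) / len(ref)

r = bertscore_recall("patient has hypertension",
                     "patient suffers from high blood pressure")
print(round(r, 3))  # close to 1.0 despite near-zero word overlap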

What I’m seeing in my own submissions:

I ran an experiment. One submission was the starter notebook, which just summarized the case; it scored ~0.30 ROUGE. Another was a carefully tuned model that generated structured clinical reasoning; it scored lower (0.28). Why? Because it tried harder. Because it thought harder. But ROUGE couldn't see it.

This isn't just a modeling issue—it’s a metric problem.

Let’s reframe what “good” looks like

In a task this important—emulating clinicians in real-world Kenya—our evaluation should prioritize clinical soundness and reasoning quality, not just textual overlap. ROUGE might help us get a rough sense of performance, but it should not be the final word. We need to combine it with semantically-aware metrics like BERTScore, or even use human review on a subset of outputs to validate that we’re heading in the right direction.

We owe it to the nurses and clinicians this dataset represents to build models that not only sound right—but think right.

P. S.: I can share my submission file to be manually checked against the starter notebook's submission file. Thanks!

Discussion · 21 answers
MuhammadQasimShabbeer
Engmatix

Sounds right to me as well.

8 Apr 2025, 08:02
Upvotes 0
Ashesi University

I agree

8 Apr 2025, 08:16
Upvotes 0
isaacOluwafemiOg
Kwame Nkrumah University of Science and Technology

@marching_learning and I were looking for the exact ROUGE variant to use.

How did you come across ROUGE-L?

8 Apr 2025, 08:18
Upvotes 0
mail_liw

It is on the info page. Click on the ROUGE link in the Evaluation section; it will take you to the hyperlink.

Evaluation

The evaluation metric for this challenge is the ROUGE Score.

All clinician responses have been turned to lower case, punctuation removed and all paragraphs replaced with a space.
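A small helper that matches that description (lowercase, drop punctuation, collapse paragraph breaks to a single space) might look like this; the exact rules Zindi applies are not published, so treat it as an approximation:

```python
import re
import string

def normalize(text: str) -> str:
    """Approximate the stated preprocessing: lowercase, drop punctuation,
    replace paragraph breaks (and any whitespace runs) with one space."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Start IV fluids.\n\nRefer, urgently!"))
# → "start iv fluids refer urgently"
```

Running your own outputs through the same normalization before scoring locally keeps your offline ROUGE numbers comparable to the leaderboard.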

Edit: we don't know which ROUGE variant is being used.

nymfree

Great observation. It is a long competition and I hope that @Zindi will be flexible enough to change the metric. Otherwise we will have to optimize for something else.

8 Apr 2025, 08:21
Upvotes 0
mail_liw

Yeah. Hope so too

Koleshjr
Multimedia University of Kenya

great explanation.

8 Apr 2025, 08:26
Upvotes 0

I had also observed that. It will take time for them to notice the flaw, so for now we will have to tackle it differently and optimize for something else.

10 Apr 2025, 08:24
Upvotes 0
chater_marzougui
Sup'com

Hey, I just wanted to say I completely agree with your take here. I’ve been working on this dataset too, and I ran into exactly the kind of situation you're describing.

I initially trained a model with max_input_length=512 and max_target_length=128, which gave me a decent score of ~0.40. But here's the catch — when I increased the max_target_length (to fully include the reference clinician responses), my score actually dropped to 0.33, despite the outputs being more complete, structured, and clinically sound.

After analyzing the dataset, I found that around 50% of the reference responses fall under 128 tokens. So technically, I was only covering half the scope. Then I made a better plan using actual token length stats to cover the full context — but ROUGE ended up penalizing me for going deeper into clinical reasoning.
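For anyone wanting to reproduce that length check, a minimal sketch (whitespace tokens stand in for the real tokenizer, and the reference responses here are invented placeholders; swap in the competition data):

```python
# The tokenizer and the list of reference responses are placeholders:
# use the competition data and your model's tokenizer in practice.
responses = [
    "start iv fluids and refer the patient for urgent review",
    "give oral rehydration salts and monitor",
    "admit for observation overnight and repeat malaria test in the morning",
]

def token_len(text: str) -> int:
    return len(text.split())  # stand-in for len(tokenizer(text)["input_ids"])

lengths = sorted(token_len(r) for r in responses)

def percentile(p: float) -> int:
    return lengths[min(int(p * len(lengths)), len(lengths) - 1)]

# Pick max_target_length to cover ~95% of references instead of ~50%.
print("median:", percentile(0.5), "p95:", percentile(0.95))
```

Setting `max_target_length` near the 95th percentile of reference lengths avoids silently truncating half the targets during training.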

It’s honestly frustrating to see models getting punished for generating clinically thoughtful answers that don’t just regurgitate phrasing from the gold responses.

@Amy_Bray please update the metric before we get deeper into the models, so we can be sure we're on the right track.

13 Apr 2025, 17:53
Upvotes 2
Amy_Bray
Zindi

Hello, this thread is well noted and we are talking to researchers. We will provide an update when we have one.

14 Apr 2025, 07:12
Upvotes 3
hark99
Self-employed

Any updates?

mail_liw

It will take quite some time. I hope the time will be added to the competition timeline, though.

Koleshjr
Multimedia University of Kenya

@Amy_Bray any updates?

Aman_Deva

@Amy_Bray any updates? Or are the metrics fine?


okonp07

Thanks for pointing this out. I agree with your worry regarding ROUGE omitting richer semantic meaning, particularly in a sensitive, reasoning-intensive task such as clinical decision support. However, I'd like you to look at it from a different perspective. I believe ROUGE likely was chosen here for ease of use and simplicity. It is easy to compute, stable, and allows for quick iteration over a wide range of models. In that sense, it acts more like a comparison baseline.

That said, blending ROUGE with BERTScore, or even human judgment at later stages or post-competition, would be a great way to build on the work done during this competition. This process would be rather involved, as you can imagine, but such a combined strategy would offer both scalability and semantic soundness, which is critical for deploying a project like this in real life.

Perhaps a compromise is to keep ROUGE for the leaderboard, as we are doing now, while requiring BERTScore as well at the stage of selecting winners or outstanding models. This would inform model development in parallel, so the subtleties of clinical reasoning are preserved while the ranking process remains straightforward.

Interested to know what you think about this approach.

18 Apr 2025, 13:04
Upvotes 0
Koleshjr
Multimedia University of Kenya

I tend to disagree with you.

I agree that ROUGE is convenient and allows for quick iteration. However, I believe that relying primarily on ROUGE, even as an early-stage filter, poses a significant risk in tasks where semantic reasoning is critical, like clinical decision support.

The issue is that optimizing for ROUGE can actively bias model development toward surface-level similarity rather than deeper understanding. ROUGE rewards n-gram overlap, which is not necessarily correlated with clinical accuracy, completeness, or sound reasoning. In fact, it's quite possible for a model to score highly on ROUGE while producing responses that are semantically shallow or even clinically misleading.

Introducing BERTScore or another semantically informed metric only at the final evaluation stage doesn't fully solve the problem. By that point, the models that might have excelled in reasoning but underperformed on ROUGE may already have been excluded. This creates a misalignment between what’s being optimized during development and what actually matters in deployment.

Amy_Bray
Zindi

Thank you for your patience while we reviewed this internally. It sparked an interesting discussion within our team, allowing us to pull in experts in the field and start thinking about future challenges.

For this challenge, we were impressed by and grateful for the well-laid-out discussion; it was very professional and prompted genuinely interesting comments. Thank you, @mail_liw, for your input to Zindi and for pushing the internal team.

We will be keeping the error metric as is, ROUGE. ROUGE is “simple”, widely used for this task, and easy to interpret. Keeping the same error metric allows us to stay consistent and transparent with you, our users.

BERTScore is more advanced in some ways, but not a metric we can implement "overnight"; it will need a lot of testing.

Going forward, we will be doing more benchmarking before a challenge starts to ensure we choose the right metric for new and exciting challenges in the GenAI space, especially those that rely on semantic understanding, rather than exact phrasing.

Once again, thank you for your commitment to Zindi and for creating a thought-provoking discussion for the community and for Zindi!

13 May 2025, 12:32
Upvotes 2
Koleshjr
Multimedia University of Kenya

I have another concern regarding the initial training data.

Some models are being trained on text that, as presented earlier, lacks punctuation entirely. My question is: how will these models be used in production? Personally, I find such output very difficult to read and interpret. If these models are intended for real-world applications, shouldn't they be trained on human-readable input-output pairs that include proper punctuation?

It’s generally easier to convert well-punctuated, human-readable output into a cleaner, stripped-down format when needed, rather than trying to reconstruct punctuation and structure from raw, unpunctuated text.

I personally think the model should produce well-punctuated output that is easier to understand; we can then post-process the human-readable output into the stripped-down version for evaluation.
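That post-processing step is cheap. A sketch, assuming the stripping rules the organisers describe (lowercase, no punctuation, whitespace collapsed); the example sentence is invented:

```python
import re
import string

def strip_for_eval(punctuated: str) -> str:
    """Convert a readable, punctuated model output into the
    stripped format used for scoring (assumed rules: lowercase,
    no punctuation, whitespace collapsed)."""
    s = punctuated.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", s).strip()

readable = "Assess airway, breathing, circulation. Start IV fluids; refer urgently."
print(strip_for_eval(readable))
# → "assess airway breathing circulation start iv fluids refer urgently"
```

Going the other way, reconstructing punctuation from stripped text, is much harder, which is the point of the argument above.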

mail_liw

Interesting!

It Only Measures Overlap

  • Problem: ROUGE focuses on surface-level n-gram overlaps (n-grams of size 1 = unigrams, size 2 = bigrams).
  • Consequence: A model can score high on ROUGE by copying large parts of the input verbatim or repeating phrases, even if the output is redundant, incoherent, or off-topic.
Example: Reference: "The cat sat on the mat." Model output: "The cat sat on the mat mat mat."
The ROUGE-2 score will still be high due to repeated bigrams like "cat sat", "sat on", etc.

Synonymy and Paraphrasing

  • Two valid summaries may use different terms with the same meaning.
  • Example: Reference: "Patient has hypertension." Model output: "Patient suffers from high blood pressure."
  • ROUGE will give a low score, even though both are correct.
25 May 2025, 12:42
Upvotes 0