import pandas as pd

# Prompt-echo baseline: drop the prompt's first sentence and submit the rest verbatim as the answer.
test = pd.read_csv('Data/test_raw.csv')
test['Clinician'] = test['Prompt'].apply(lambda x: "summary " + " ".join(x.split('.')[1:]).replace('\n', ' '))
test[['Master_Index', 'Clinician']].to_csv("Data/test_baseline.csv", index=False)
And I repeat, ROUGE WAS NOT THE CORRECT METRIC FOR THIS CHALLENGE!
The prize money alone should warrant a better evaluation metric, to be honest. Organizers, kindly look into this.
And if you look at all the T5 submissions, this is what the T5 models were doing: they were not generating any medical responses, but rather paraphrasing the input prompt as a summary, which is why they scored well. I don't think the client will benefit from any of the T5 submissions.
More detailed discussions:
Dear Clinician AI Builders: We Need a Better Metric - Zindi
Challenging ROUGE: Why We Need Better Metrics for Clinical Text Generation - Zindi
Lol. My number 132 submission says hello. Glad I didn't put any effort in this one.
I can understand the pain. I have no doubt that people with 0.37 to 0.39 scores using proper LLMs have better medical responses than anyone scoring over 0.40 with T5 models. People can challenge me on this; I am happy to be proven wrong.
The choice of error metric was not great in this competition. The challenge would have been very interesting, had we had a proper metric.
To be honest, I understand the difficulty: there are currently no universally reliable automatic metrics for evaluating tasks of this nature. This is why many researchers and large organizations, including OpenAI, Anthropic, and Meta, have increasingly turned to using large language models (LLMs) themselves as evaluators. For instance, MT-Bench (Zheng et al., 2023) shows that GPT-4 can provide consistent, high-quality comparative judgments between model outputs, and such setups are already widely adopted in evaluation pipelines. Right now we even have reasoning models (including open-source ones) that could be the best judges for this kind of task.
Given that LLM-based evaluations can be computationally expensive and hard to scale across all submissions, my suggestion would be for the organizers to extend the evaluation window, say by a week, and allow participants to select the two submissions they believe best represent clinically sound responses. The organizers could then run LLM-based evaluations only on these final submissions. This approach balances feasibility, cost, and evaluation quality, while still giving deeper medical reasoning space to be rewarded.
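As a rough sketch of the pairwise LLM-as-judge setup suggested above (all names and the verdict format are hypothetical, loosely in the MT-Bench style; the organizers would plug in their own vignettes and judge model):

```python
def build_judge_prompt(vignette: str, response_a: str, response_b: str) -> str:
    """Build a pairwise comparison prompt for a judge LLM: it sees the case
    and both candidate answers, and must return a bracketed verdict label."""
    return (
        "You are an experienced clinician evaluating two AI responses to a case.\n"
        f"Case vignette:\n{vignette}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Judge which response is more clinically sound (correct assessment, "
        "safe management plan). Reply with exactly one of: [[A]], [[B]], [[TIE]]."
    )

def parse_verdict(judge_output: str) -> str:
    """Map the judge's free-text reply to a winner label; tie if unparseable."""
    for label in ("[[A]]", "[[B]]", "[[TIE]]"):
        if label in judge_output:
            return label.strip("[]")
    return "TIE"
```

With only two finalist submissions per participant, a few thousand such pairwise calls would stay affordable, and running each pair in both (A, B) orders would control for position bias.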
Leaving it to ROUGE, given the prize money, sounds really unfair to me, because I believe there are good submissions scoring even 0.35 that are being penalized by this metric.
It is truly disappointing that a competition with such immense potential to revolutionize clinical reasoning was ultimately undermined by the choice of ROUGE-1 as its core evaluation metric.
Honestly, your discussion was very good. Thank you so much for that. I wish they had changed the metric after that discussion.
Agree! That discussion was excellent
I agree. If one of the top 20 could share their solution and how they got there, it would help in analysing the matter.
You will be very surprised that you can get 0.44 with no LLM!!
Lol, really? I'm gonna try that, thanks.
Yep, the first thing I did when I joined this challenge was to calculate ROUGE on the training data for the GPT-4.0, LLAMA and GEMINI responses that were provided. I thought that would give me a rough baseline of what a language model might achieve without any fine-tuning. The ROUGE scores were ridiculously low, while ROUGE for the prompt itself was 5x or more higher. At that point I decided not to spend much time on this challenge, because the metric is meaningless. The metric should at least have some correlation with human preference for the better answer, and no human would say that just repeating the prompt is a better answer than the GPT-4.0, LLAMA and GEMINI responses that were provided. It's a pity, because the premise of the challenge was interesting and deserved better.
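The sanity check described above is easy to reproduce. A minimal ROUGE-1 F1 (clipped unigram overlap; the official metric additionally applies tokenization and possibly stemming) is enough to see the effect. The strings below are made-up toy examples standing in for the real vignette, reference answer, and model output:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap, whitespace tokens, no stemming."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, ref[w]) for w, n in pred.items())
    if overlap == 0:
        return 0.0
    p = overlap / sum(pred.values())   # precision
    r = overlap / sum(ref.values())    # recall
    return 2 * p * r / (p + r)

# Toy case: the reference answer restates the vignette's wording, so echoing
# the prompt outscores a clinically sound but differently-worded response.
reference = ("patient presents with fever and productive cough suggest "
             "community acquired pneumonia start amoxicillin")
prompt_echo = "patient presents with fever and productive cough for three days"
llm_answer = "likely bacterial chest infection treat with antibiotics such as amoxicillin"

print(rouge1_f1(prompt_echo, reference))  # higher: shared surface wording
print(rouge1_f1(llm_answer, reference))   # lower: correct but paraphrased
```

On these toy strings the prompt echo scores roughly 3x the paraphrased answer, which mirrors the ~5x gap observed on the actual training data.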
💯
That's true regarding T5. How could those models provide better clinical reasoning, when standard pre-LLM models were never good in specialised areas like medicine? The pace of AI advancement is sometimes too much to cope with, and yet we still stick to ROUGE metrics, which is surprising. I even applied RL fine-tuning to improve my model, but in the end we are judged word by word, focusing only on surface overlap. Hilarious.
I don't know whether some competitors are still producing good results despite the metric and the solution constraints. If not, which I think is likely, maybe the right way is to relaunch the competition with a 1-to-2-week deadline and a semantic-based evaluation criterion. With a better metric, the solutions would be far better in terms of true accuracy.
I agree. Also, regarding the 100 ms generation constraint, I found it pretty hard to meet. It should be set to something like 30 seconds per vignette, one minute maximum.
I can barely run a BPE tokenizer in 100 ms on a long prompt 😂
True 😂
I completely agree with you @Koleshjr; T5 models are not the right type of model if the host is truly looking for medical answers. That's why I used an LLM throughout this competition.
I also used an LLM for one of my subs, but to get near the top scores I had to post-process by adding nonsensical statistic combinations, which massively improved the score given how flawed the metric is. If the raw LLM output were compared to the post-processed one with a good semantic evaluation metric or an LLM as a judge, I'm 100% sure the raw output would outperform the post-processed one.
The LLM is not essentially the issue; the metric is. With a better metric, T5 would have been among the worst-performing models for this task compared to recent LLMs.
Yes, exactly the metric used isn’t the most suitable for this competition. That’s why I focused on fine-tuning the LLM to generate answers that score well under this metric. Hopefully, @zindi will take these constraints into account during the second evaluation phase.
The primary flaw was misalignment between the evaluation metric (ROUGE) and the task objective (clinical reasoning). ROUGE rewarded superficial overlaps, not semantic or clinical depth, leading to model and strategy choices that undermined the challenge’s intended purpose.