import pandas as pd

# Prompt-echo baseline: drop the prompt's first sentence and submit the rest verbatim as the answer.
test = pd.read_csv('Data/test_raw.csv')
test['Clinician'] = test['Prompt'].apply(lambda x: "summary " + " ".join(x.split('.')[1:]).replace('\n', ' '))
test[['Master_Index', 'Clinician']].to_csv("Data/test_baseline.csv", index=False)
And I repeat, ROUGE WAS NOT THE CORRECT METRIC FOR THIS CHALLENGE!
The prize money alone should warrant a better evaluation metric, to be honest. Organizers, kindly look into this.
And if you look at all the T5 submissions, this is what the T5 models were doing: they were not generating any medical responses, but rather paraphrasing the input prompt as a summary, which is why they scored well. I don't think the client will benefit from any of the T5 submissions.
More detailed discussions:
Dear Clinician AI Builders: We Need a Better Metric - Zindi
Challenging ROUGE: Why We Need Better Metrics for Clinical Text Generation - Zindi
Lol. My number 132 submission says hello. Glad I didn't put any effort in this one.
I can understand the pain. I have no doubt that people with 0.37 to 0.39 scores using proper LLMs have better medical responses than anyone scoring over 0.40 with T5 models. People can challenge me on this; I am happy to be proven wrong.
The choice of error metric was not great in this competition. The challenge would have been very interesting, had we had a proper metric.
To be honest, I understand the difficulty: there are currently no universally reliable automatic metrics for evaluating tasks of this nature. This is why many researchers and large organizations, including OpenAI, Anthropic, and Meta, have increasingly turned to using large language models (LLMs) themselves as evaluators. For instance, MT-Bench (Zheng et al., 2023) shows that GPT-4 can provide consistent, high-quality comparative judgments between model outputs, and such setups are already widely adopted in evaluation pipelines. Right now we even have reasoning models (including open-source ones) that could be the best judges for this kind of task.
Given that LLM-based evaluations can be computationally expensive and hard to scale across all submissions, my suggestion would be for the organizers to extend the evaluation window, say by a week, and allow participants to select the two submissions they believe best represent clinically sound responses. The organizers could then run LLM-based evaluations only on these final submissions. This approach balances feasibility, cost, and evaluation quality, while still giving deeper medical reasoning space to be rewarded.
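As a rough sketch of the pairwise LLM-as-judge setup suggested above (all names and the verdict format are hypothetical, loosely in the MT-Bench style; the organizers would plug in their own vignettes and judge model):

```python
def build_judge_prompt(vignette: str, response_a: str, response_b: str) -> str:
    """Build a pairwise comparison prompt for a judge LLM: it sees the case
    and both candidate answers, and must return a bracketed verdict label."""
    return (
        "You are an experienced clinician evaluating two AI responses to a case.\n"
        f"Case vignette:\n{vignette}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Judge which response is more clinically sound (correct assessment, "
        "safe management plan). Reply with exactly one of: [[A]], [[B]], [[TIE]]."
    )

def parse_verdict(judge_output: str) -> str:
    """Map the judge's free-text reply to a winner label; tie if unparseable."""
    for label in ("[[A]]", "[[B]]", "[[TIE]]"):
        if label in judge_output:
            return label.strip("[]")
    return "TIE"
```

With only two finalist submissions per participant, a few thousand such pairwise calls would stay affordable, and running each pair in both (A, B) orders would control for position bias.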
Leaving it to ROUGE, given the prize money, sounds really unfair to me, because I believe there are good submissions scoring even 0.35 that are being penalized by this metric.
It is truly disappointing that a competition with such immense potential to revolutionize clinical reasoning was ultimately undermined by the choice of ROUGE-1 as its core evaluation metric.
Honestly, your discussion was very good. Thank you so much for that. I wish they had changed the metric after that discussion.
Agree! That discussion was excellent
I agree. If one of the top 20 could share their solution and how they got there, it would help in analysing the matter.
You will be very surprised that you can get 0.44 with no LLM!!
Lol, really? I'm gonna try that, thanks.
Yep, the first thing I did when I joined this challenge was to calculate ROUGE on the training data for the GPT-4.0, LLAMA and GEMINI responses that were provided. I thought that would give me a rough baseline of what a language model might achieve without any fine-tuning. The ROUGE scores were ridiculously low, while ROUGE for the prompt itself was 5x or more higher. At that point I decided not to spend much time on this challenge, because the metric is meaningless. The metric should at least have some correlation with human preference for the better answer, and no human would say that just repeating the prompt is a better answer than the GPT-4.0, LLAMA and GEMINI responses that were provided. It's a pity, because the premise of the challenge was interesting and deserved better.
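The sanity check described above is easy to reproduce. A minimal ROUGE-1 F1 (clipped unigram overlap; the official metric additionally applies tokenization and possibly stemming) is enough to see the effect. The strings below are made-up toy examples standing in for the real vignette, reference answer, and model output:

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap, whitespace tokens, no stemming."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, ref[w]) for w, n in pred.items())
    if overlap == 0:
        return 0.0
    p = overlap / sum(pred.values())   # precision
    r = overlap / sum(ref.values())    # recall
    return 2 * p * r / (p + r)

# Toy case: the reference answer restates the vignette's wording, so echoing
# the prompt outscores a clinically sound but differently-worded response.
reference = ("patient presents with fever and productive cough suggest "
             "community acquired pneumonia start amoxicillin")
prompt_echo = "patient presents with fever and productive cough for three days"
llm_answer = "likely bacterial chest infection treat with antibiotics such as amoxicillin"

print(rouge1_f1(prompt_echo, reference))  # higher: shared surface wording
print(rouge1_f1(llm_answer, reference))   # lower: correct but paraphrased
```

On these toy strings the prompt echo scores roughly 3x the paraphrased answer, which mirrors the ~5x gap observed on the actual training data.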
💯
That's true regarding T5. How could those models provide better clinical reasoning, when standard pre-LLM models were never good in specialised areas like medicine? The pace of AI advancement is sometimes too much to cope with, and yet we still stick to ROUGE metrics, which is surprising. I even applied RL fine-tuning to improve my model, but in the end we are judged word by word, focusing only on surface overlap. Hilarious.
I don't know whether some competitors are still producing good results despite the metric and the solution constraints. If not, which I think is likely, maybe the right way is to relaunch the competition with a 1-to-2-week deadline and a semantic-based evaluation criterion. With a better metric, the solutions would be far better in terms of true accuracy.
I agree. Also, regarding the 100 ms generation constraint, I found it pretty hard to meet. It should be set to something like 30 seconds per vignette, one minute maximum.
I can barely run a BPE tokenizer in 100 ms on a long prompt 😂
True 😂
I completely agree with you @Koleshjr; T5 models are not the right type of model if the host is truly looking for medical answers. That's why I used an LLM throughout this competition.
I also used an LLM for one of my subs, but to get near the top scores I had to post-process by adding nonsensical statistic combinations, which massively improved the score given how flawed the metric is. If the raw LLM output were compared to the post-processed one with a good semantic evaluation metric or an LLM as a judge, I'm 100% sure the raw output would outperform the post-processed one.
The LLM is not essentially the issue; the metric is. With a better metric, T5 would have been among the worst-performing models for this task compared to recent LLMs.
Yes, exactly the metric used isn’t the most suitable for this competition. That’s why I focused on fine-tuning the LLM to generate answers that score well under this metric. Hopefully, @zindi will take these constraints into account during the second evaluation phase.
The primary flaw was misalignment between the evaluation metric (ROUGE) and the task objective (clinical reasoning). ROUGE rewarded superficial overlaps, not semantic or clinical depth, leading to model and strategy choices that undermined the challenge’s intended purpose.