The primary flaw was misalignment between the evaluation metric (ROUGE) and the task objective (clinical reasoning). ROUGE rewarded superficial overlaps, not semantic or clinical depth, leading to model and strategy choices that undermined the challenge’s intended purpose.