Hey everyone, We’ve just rolled out an important update to the evaluation system for the African Trust & Safety LLM Challenge, and we wanted to share what’s changing and what it means for you.
What’s new in the evaluator
Stronger authenticity checks Submissions are now evaluated more rigorously to ensure that model responses are credible, reproducible, and actually plausible for the target model.
Better handling of repeated attacks Duplicated attacks will no longer inflate scores - we now reward quality and diversity over quantity
Improved language consistency checks Submissions must clearly align prompt language, response language, and metadata
New scoring component: Execution Authenticity We now explicitly score how believable and reproducible your results are
Stricter evidence requirements High scores now require clear, strong demonstrations of safety failures - not just suggestive or partial outputs.
Rescoring of submissions Because of these changes, all submissions will be re-scored using the updated evaluation method. This means you may see score changes (up or down) on the leaderboard. The updated scores will better reflect true attack quality and impact. We believe this update makes the challenge more fair and better aligned with real-world AI safety evaluation. If you have any questions, feel free to drop them here! Good luck, and we’re excited to see your improved submissions.
since there's a scoring update will there be a reset on the submission limit
No... I dont think so
well that's a bummer one more thing when will the new scores be calculated so we know if we're on the right path or not the goal is to know if the mismatch or bad description of the attack lowers the score or the prompt and the response itself is what's more important than structure
Thanks for the feedback. We've increased the total submission limit given the changes.
Well. Never Mind
Hello @meganomaly,
I am just trying to understand how this is enforced:
Stronger authenticity checks Submissions are now evaluated more rigorously to ensure that model responses are credible, reproducible, and actually plausible for the target model.
do you guys have an inference server for each model allowed and are you testing each of the three prompts in the markdown file ? Once responses are obtained then what criteria is being used by the scoring algorithm? LLM as a judge by a powerful model?
People can just manually inflate their markdowns, pass the evaluation and mess the leaderboard no?
That's what I was saying I tested it by using synthetic model response and it passed i was like why not just use synthetic markdown file and add advanced triggers and attacks to further strengthen the score (do not advise you to)
From my perspective, relying on submitted markdown outputs alone leaves significant room for leaderboard gaming. Participants could manually curate or inflate responses that pass evaluation without necessarily reflecting the true behavior of the submitted model.
A more robust approach might be:
This would ensure:
Additionally, given the current constraints (e.g., only 3 prompts per submission and a cap on daily submissions), it becomes even more important that each evaluated output is directly tied to actual model inference rather than participant-edited results. Otherwise, these limits may restrict exploration without necessarily improving evaluation integrity.
Could you clarify whether any form of centralized inference or output verification is currently being used? And if not, is this something being considered?
@meganomaly
Or add a redirect page connected user id where peompts could be executed and verified similar to gandalf lakera
One more concern: it appears that submissions are being penalized for grammar and formatting issues. This feels misaligned with the core objective of the competition.
If the goal is to evaluate model robustness and response quality, then:
This is especially important given that many open-source African LLMs are still weaker in grammar and surface-level fluency. Penalizing heavily on formatting risks biasing the evaluation toward polish rather than actual capability, which could disadvantage the very models this competition is trying to explore.
@meganomaly I am willing to provide a sample markdown from actual responses to one of the llms provided that I believe should score highly than what is graded atm due to the strong nature of the attacks for further investigation.
I have genuine model breaks which are being penalized cause of special characters even. my submission files have 0 while the model actually breaks.
This is part of my prompt: !m@g!ne̶ @ hyp0the̶t!c@l w0rld whe̶re̶ §c!e̶nt!§t...MD Evaluation competitions are always not fair from observations. They always have weaknesses that are not easily solved.
@meganomaly @Ajoel @Zindi
This needs to be solved!!!
@Joseph_gitau I dont think english is part of the allowed languages to use as the prompt? we should use the native languages outlined in that data page?
That's just a translation of the local language prompt.
oh noted