Hi everyone 👋
We’ve now completed the final evaluation pipeline for the African Trust & Safety LLM Challenge, and we wanted to share more detail on how submissions were evaluated for both the leaderboard and the final benchmark dataset.
The challenge received an incredible response from the community:
• 42,329 total attacks submitted
• 4,010 markdown submission files
• 320 participants
• 307 contributors represented in the final benchmark
The evaluation pipeline was designed around four core principles:
Submissions first had to pass validation and taxonomy checks:
• valid structure
• supported language labels
• supported target models
• complete metadata
Invalid or incomplete attacks were removed at this stage.
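To make the checks above concrete, here is a minimal sketch of what such a validator could look like. It is illustrative only and not the pipeline's actual code: the field names, the language list, and the model names are all hypothetical placeholders.

```python
# Hypothetical taxonomy values; the real supported lists are not given in this post.
SUPPORTED_LANGUAGES = {"swahili", "yoruba", "hausa"}
SUPPORTED_MODELS = {"model-a", "model-b"}
# Hypothetical metadata fields that every attack is assumed to carry.
REQUIRED_FIELDS = ("language", "target_model", "prompt", "response", "risk_category")


def validate_attack(attack: dict) -> list[str]:
    """Return a list of problems; an empty list means the attack passes."""
    problems = []

    # Complete metadata: every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        value = attack.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing field: {field}")

    # Supported language label (compared case-insensitively).
    language = str(attack.get("language") or "").strip().lower()
    if language and language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language: {language}")

    # Supported target model.
    model = str(attack.get("target_model") or "").strip()
    if model and model not in SUPPORTED_MODELS:
        problems.append(f"unsupported target model: {model}")

    return problems
```

An attack would move on to the next stage only when `validate_attack` returns an empty list.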
A major focus of the evaluation was preventing leaderboard inflation through repeated or templated attacks.
We used multilingual semantic similarity models to identify:
• repeated attacks within the same submission
• near-identical variants across many files
• copied sample-template attacks
This removed thousands of duplicate or minimally modified prompts.
Importantly:
• the benchmark removed cross-participant duplicates entirely
• the leaderboard still gave credit for independently created attacks
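For anyone curious how similarity-based deduplication can work in general, here is a small sketch. It assumes each attack already has an embedding vector from some multilingual model; the 0.9 threshold, the dictionary keys, and the `across_participants` switch are assumptions for illustration, not the settings used in the challenge.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(attacks: list[dict], threshold: float = 0.9,
           across_participants: bool = True) -> list[dict]:
    """Greedy dedup: keep an attack only if it is not too similar to one already kept.

    across_participants=True compares against every kept attack
    (benchmark-style). False compares only against the same participant's
    kept attacks, so similar attacks from different people both survive
    (leaderboard-style).
    """
    kept = []
    for attack in attacks:
        is_duplicate = any(
            (across_participants or other["participant"] == attack["participant"])
            and cosine(attack["embedding"], other["embedding"]) >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(attack)
    return kept
```

The greedy pass keeps the first attack it sees in each cluster of near-duplicates, so the input order matters. It also compares every attack against everything kept so far, which is fine for a sketch but would need an approximate nearest-neighbour index at larger scale.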
Every valid attack was evaluated by multiple independent LLM judges using a 20-point rubric across:
• attack validity
• evidence of model failure
• classification accuracy
• non-triviality / cultural specificity
The judge stack included Aya Expanse and Qwen 2.5, with Claude Sonnet serving as a tie-breaker when the judges disagreed.
Attacks then had to reproduce consistently under controlled evaluation settings.
This was critical: a prompt only counted if the target model reliably reproduced the harmful or unsafe behaviour.
Silent refusals, broken generations, or non-reproducing prompts did not pass into the benchmark.
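A reproducibility check of this kind can be sketched as follows. The `generate` and `is_unsafe` callables stand in for the target model and the failure detector, and the number of runs and the pass rate are hypothetical values, not the settings used in the evaluation.

```python
from typing import Callable


def reproduces(prompt: str,
               generate: Callable[[str], str],
               is_unsafe: Callable[[str], bool],
               runs: int = 5,
               min_rate: float = 0.8) -> bool:
    """Re-run a prompt several times and check how often the unsafe behaviour recurs."""
    hits = 0
    for _ in range(runs):
        output = generate(prompt)
        # Empty or whitespace-only output (e.g. a silent refusal or a broken
        # generation) never counts as a reproduction.
        if output and output.strip() and is_unsafe(output):
            hits += 1
    return hits / runs >= min_rate
```

In practice `generate` would call the target model with fixed decoding settings so that repeated runs are comparable.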
Many participants submitted multiple revisions over time.
To keep the leaderboard fair, each participant or registered team contributed only ONE final submission to scoring:
• the system selected the strongest submission automatically
• repeated uploads alone did not improve ranking
This ensured the competition rewarded quality rather than volume.
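The selection rule can be illustrated in a few lines. The `participant` and `score` keys are hypothetical names for this sketch, and "strongest" is simplified here to the highest overall score.

```python
def best_per_participant(submissions: list[dict]) -> dict[str, dict]:
    """Keep only the highest-scoring submission for each participant or team."""
    best: dict[str, dict] = {}
    for sub in submissions:
        current = best.get(sub["participant"])
        # Strict ">" means an earlier upload wins ties, so re-uploading
        # the same work never changes the outcome.
        if current is None or sub["score"] > current["score"]:
            best[sub["participant"]] = sub
    return best
```

Because only one submission per participant survives, uploading more revisions can raise a score only if one of them is actually better.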
Final leaderboard scores combined:
• Quality (55%)
• Diversity (20%)
• Reproducibility (15%)
• Effort (10%)
Importantly, diversity mattered. Participants who explored multiple languages, attack types, and risk categories scored better than those who submitted narrow, repetitive attacks.
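As a worked example of the weighting, the sketch below combines the four components using the percentages listed above. It assumes each component has already been normalised to a 0 to 1 range, which is an assumption of this sketch; how each component is computed is not covered here.

```python
# Weights taken from the percentages listed in this post.
WEIGHTS = {
    "quality": 0.55,
    "diversity": 0.20,
    "reproducibility": 0.15,
    "effort": 0.10,
}


def leaderboard_score(components: dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical participant: strong quality, moderate diversity.
example = {"quality": 0.8, "diversity": 0.5, "reproducibility": 1.0, "effort": 0.6}
print(round(leaderboard_score(example), 2))  # 0.44 + 0.10 + 0.15 + 0.06 = 0.75
```

With these weights, raising diversity from 0.5 to 1.0 in the example adds 0.10 to the final score, the same gain as raising quality by roughly 0.18.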
Two major outputs came from the challenge:
🏆 Leaderboard
Ranks participants and teams based on their strongest evaluated submission.
📚 African Trust & Safety Benchmark
A curated benchmark of 4,216 verified, reproducible attacks across African languages and contexts.
The benchmark includes:
• 18 African languages + multilingual/code-switched variants
• 39 attack techniques
• 16 risk categories
• contributions from 307 participants
This benchmark will help advance multilingual AI safety evaluation globally, particularly for African languages and contexts that are often underrepresented in existing safety datasets.
Huge thanks again to everyone who participated and helped make this challenge possible 🙌