Hi everyone 👋
Thanks for all the thoughtful questions and feedback. Here are some key clarifications on submissions and final evaluation:
Submissions & Number of Attacks
- There is no strict 3-prompt limit per submission
- You can include multiple attacks in a single submission
- We recommend focusing on strong, high-quality attacks
Final Evaluation (Important)
- We will consider your two best submissions (not cumulative across multiple submissions)
- Final evaluation will be done by humans, not just the evaluator
This means:
- Attacks will be re-run against the target models
- We will verify whether the claimed safety failures actually reproduce (a rough sketch of this kind of check is below)
- Only real, validated failures will count toward final scores
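As a purely illustrative sketch of what "verifying reproduction" could look like in practice, here is a minimal Python check that compares a claimed response against a freshly generated one. The function name, similarity threshold, and matching criterion are all assumptions made for illustration; the organizers' actual validation pipeline is not specified in this thread.

```python
# Illustrative only: a hypothetical reproduction check. The matching criterion
# (SequenceMatcher ratio >= 0.9) is an assumption, not the official pipeline.
from difflib import SequenceMatcher


def reproduces(claimed_response: str, rerun_response: str, threshold: float = 0.9) -> bool:
    """Treat a claimed failure as reproduced if a fresh run closely matches the submitted output."""
    similarity = SequenceMatcher(None, claimed_response, rerun_response).ratio()
    return similarity >= threshold


# Example: compare the submitted response with one regenerated from the same prompt.
print(reproduces("Sure, here are the restricted steps...",
                 "Sure, here are the restricted steps..."))  # True
```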
Selection for Final Evaluation
- Final evaluation will not be limited strictly to the current top 10 or 20
- We will review the best submissions across participants
Focus on one strong submission with clear, reproducible attacks
Thanks again for the engagement - we really appreciate it 🙏
Question on Reproduction Environment
@meganomaly Thanks for the clarification on final evaluation! Quick technical question on reproducibility:
When re-running attacks against the target models, what inference setup will you use? Specifically, which transformers version, model precision (bfloat16/float16), and temperature setting?
Asking because even with greedy decoding (temperature=0.0), outputs can vary across transformers versions and precision configurations, so knowing the exact environment would help ensure that submitted responses reproduce faithfully.
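For concreteness, here is a minimal sketch of the kind of setup I mean, assuming a placeholder model name and prompt; the actual target models, transformers version, and dtype are exactly what I'm asking you to confirm.

```python
# Sketch of a pinned inference environment for reproducing attacks.
# "org/target-model" is a placeholder, not the competition's actual target.
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

print(transformers.__version__)  # report/pin the exact library version used for re-runs

MODEL_NAME = "org/target-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,  # vs. float16: tiny numeric differences can flip greedy token choices
    device_map="auto",
)

prompt = "Submitted attack prompt goes here."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: do_sample=False ignores temperature, but outputs can still
# differ across transformers versions, kernels, and dtype settings.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```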