Thanks for giving us the opportunity to work on a project like this. I want to praise how this challenge teaches us model deployment, fine-tuning, quantization, and real codebase architecture patterns: knowledge that we could not have gained from textbooks, and practice that will actually pave a path toward specialization. However, I have to follow that praise with criticism of how the evaluation is handled.
Evaluating real applications by text-parsing a transcript that describes what the application does is, in my view, a suboptimal evaluation method. @snazon said as much, and I agree with him. (I used your model by the way, @snazon; black count gave me trouble 🙂).
My concern is not with the rubric itself, but with the reliance on transcript-based evaluation as the primary scoring mechanism before executable verification. A transcript can only describe system behavior; it cannot fully represent runtime reliability, inference correctness, performance stability, or real offline operation. These properties are best evaluated through direct interaction with the application or executable artifact.
I propose that you check each user's best submission more directly, by interacting with the application or executable artifact itself. To reduce the evaluation workload, you could apply the same structured evaluation logic used in the scoring script in a more automated and scalable way, potentially using reasoning-based evaluation tools to help reviewers assess completeness, consistency, and technical plausibility across submissions.
My intention with this feedback is not to dismiss the evaluation effort, but to highlight a structural limitation that may unintentionally disadvantage participants who prioritized engineering implementation and runtime correctness over transcript iteration strategy. Since this challenge is fundamentally about building a working, field-ready mobile system, aligning evaluation more closely with executable behavior would better reflect the true technical objectives of the competition.
Thank you again for organizing this challenge and providing us with the opportunity to gain real-world deployment experience.
Exactly