Hello @AntonioDeDomenico ,
I have two clarification questions regarding the evaluation and track rules:
1. If a model produces the correct final answer but the accompanying reasoning is partially or completely incorrect, is the evaluation based solely on the final answer (as it currently is), or is there any assessment of reasoning quality during judging or post-evaluation?
2. For a given track (e.g., Qwen-1.5B), is it acceptable to use a larger model (e.g., Qwen3-32B) in intermediate steps such as preprocessing, while using the track's model only for answer generation? Or must all stages exclusively use the track-specified model?
Hi, 1) good point, but I do not think we will have time to run this (LLM-as-a-judge) evaluation. 2) Yes, you can do it.
Hi @AntonioDeDomenico, Thanks for the clarification on (2)! Just to make sure I understand the intended scope of preprocessing: if larger models are allowed in preprocessing, are there any constraints on using them for semantic reasoning steps (e.g., decomposition, chain-of-thought generation, or intermediate solution drafting), as opposed to purely mechanical steps like retrieval, filtering, or formatting?
I'm asking because, depending on how preprocessing is defined, allowing larger models at inference time for preprocessing would let the problem be split into two stages where reasoning is heavily offloaded to the larger model (e.g., generating CoT), with the track model used only for final summarization. That could blur the distinction between compute-constrained tracks.
Thank you for your prompt replies and clarifications!
Hi @ahuvam, @neuron_x, indeed this was too blurry, and I should have been more precise. Larger models should not be used during inference time. I think this is clear enough and leaves no room for misunderstanding.
Yes, thanks a lot!