I had a quick question to better understand how fairness is being ensured for all participants. For top submissions, do you re-run the training using the submitted code and declared data? I’m asking because if Phase 2 test data were used during training, this would be hard to detect just by running inference on already-trained models—and re-training models for verification also seems quite time-consuming and GPU-intensive.
If a top-ranked submission (for example, within the top 5) is found to have used Phase 2 data during training, does the review then move on to the next submission further down the leaderboard? And what happens if all of the top 10 used Phase 2 test data during training?
Of course you can use it to check the percentage of correct answers and improve the question types where the model behaves badly. But a score below 0.6 is basically not on the right track.
Good point. Using test data, either directly or by generating synthetic samples similar to it, will cause overfitting, and despite better benchmark results the model will be inferior due to poor generalization. So how this is tracked is definitely important.
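One simple way to operationalize this kind of tracking, sketched with purely hypothetical names and numbers (the threshold and the leaderboard entries are illustrative, not from the competition): compare each submission's score on Phase 2 with its score on genuinely unseen data. A model that memorized Phase 2 shows a much larger drop than one that generalized.

```python
def generalization_gap(score_phase2: float, score_unseen: float) -> float:
    """Difference between the public Phase 2 score and the unseen-data score.

    A model that memorized Phase 2 typically shows a large positive gap;
    a model that generalized shows a small one.
    """
    return score_phase2 - score_unseen

SUSPICIOUS_GAP = 0.15  # hypothetical threshold, not from the rules

# Hypothetical leaderboard entries: (team, Phase 2 score, unseen score)
submissions = [
    ("team_a", 0.92, 0.88),  # small gap: consistent with genuine generalization
    ("team_b", 0.99, 0.61),  # large gap: consistent with test-data leakage
]

flagged = [team for team, p2, unseen in submissions
           if generalization_gap(p2, unseen) > SUSPICIOUS_GAP]
print(flagged)  # → ['team_b']
```

This only flags candidates for closer review, of course; a gap can also come from a distribution shift in the unseen data itself.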
Where in the rules does it say not to use test data to train a model?
Yeah, but then one could get a near-100% score by using the test data in a clever way. That is why using test data in training is a grey area, and it needs some clarification from the Host on exactly what use is allowed, if any.
I doubt the near-100% point, to be honest. But it would be good if the host could clarify. Typical approaches like pseudo-labeling are considered legitimate on other platforms like Kaggle.
Hi, in the review phase we plan to use unseen data. Obviously, we cannot test all of the participants' models; we would not have the time to meet our deadline.
Re-running with unseen data seems like a fair solution for everyone. It also tests whether the work can actually be useful in production.
Is this new test set similar to the training set and the Phase 1/2 test sets, or is it an entirely new format? This has implications for many people's pipelines (mine at least).
Thanks for the clarification. Using unseen data helps. One small concern, though: if this new data follows the same structure, tables, and distribution as Phase 2, a model that overfitted on Phase 2 may still perform well. To really flag this, it would help if the unseen data included new table formats and different distributions, so that generalization rather than memorization is being tested. That would make the fairness check much stronger.
This is very tricky. It depends on how 'new' and how 'different' the data is. If we train a model and test it on a test set, we should expect the test set to have a similar distribution to the training data. Generalization holds only for certain patterns, not for everything.
The main goal of this competition has always been generalization, which is why the Phase 2 set follows a different data distribution than the training set. Based on that, I would expect the same differences observed on Phase 2 to also appear on truly unseen data that was not part of either the training set or Phase 2.
I agree it's about generalization, but there should be a boundary to it. As an extreme example, if all the unseen data were general questions, it would just be testing the underlying Qwen model's generalization. Of course this won't happen, but the more difference we want, the closer we get to that extreme. Too much difference may deviate from the original scope, which is fine-tuning for detecting certain network failures.