It seems that some participants are fine-tuning on the test set. How do the organizers plan to detect models trained on the test data? If unseen data will be used for local evaluation, will it follow the same format/structure as train/phase1/phase2?
We are aware of this. Some participants are worried because of a single unusually high score. However, it looks like that participant did not fine-tune on the test set. We will try to check and provide a fair evaluation.
got it, thanks!
Yes, it is very important.