After some data exploration it's clear that the test set is derived from Common Voice (CV). Even sticking to good DS principles (train on train, validate on dev, only use test for final scoring), we will still see misleadingly low WER, because 2330/7067 of the Zindi test set are present in the CV training split and a further 1218 in the CV dev split.
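For anyone who wants to reproduce those overlap counts, here is a minimal sketch. It assumes the Zindi test IDs and the CV `path` column both reduce to clip filenames; the column names and file paths in the comments are hypothetical, so adapt them to the actual data layout.

```python
# Hypothetical sketch: quantify overlap between a competition test set and
# Common Voice splits by comparing clip filenames.
import csv

def split_overlap(test_ids, cv_paths):
    """Return the set of test clip IDs that also appear in a CV split."""
    # Normalise to bare filenames so "clips/abc.mp3" matches "abc.mp3".
    norm = lambda p: p.rsplit("/", 1)[-1].strip()
    return {norm(t) for t in test_ids} & {norm(p) for p in cv_paths}

def load_column(path, column, delimiter="\t"):
    """Read one column from a CSV/TSV file (e.g. the 'path' column of CV's train.tsv)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f, delimiter=delimiter)]

if __name__ == "__main__":
    # In practice something like (filenames are assumptions):
    #   test_ids    = load_column("Test.csv", "ID", delimiter=",")
    #   train_paths = load_column("cv-corpus/lg/train.tsv", "path")
    # Toy illustration:
    test_ids = ["clip_001.mp3", "clip_002.mp3", "clip_003.mp3"]
    train_paths = ["clips/clip_002.mp3", "clips/clip_999.mp3"]
    leaked = split_overlap(test_ids, train_paths)
    print(f"{len(leaked)}/{len(test_ids)} test clips found in CV train split")
```

Running the same check against the CV dev split gives the second overlap figure.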
Within the competition rules it would be trivial to achieve 0 WER, but that does nothing to further STT for Luganda speakers.
Could the organisers please provide an independent test set so that this becomes a useful challenge. If no other data is available, I would suggest using the official CV test set and making it clear in the rules that it may not be used for training. Our solutions will be made available and validated independently, so there is no need to obfuscate the test set.
Yes, it would be nice to get some comments from the Zindi team about this, as we also asked a while ago in another thread: https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition/discussions/8939
@ZINDI
Hi all, thank you for your concern.
The data leak has been addressed in this discussion post - https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition/discussions/9446
All the best
I appreciate the reply, but I would strongly suggest that the leak hasn't actually been addressed, or at least that insufficient information is available.
Can you confirm that the "unseen validation set" will not be derived from Common Voice? If not, the same leakage issue applies.
Even if that is the case, as competitors we currently have no way of validating our models against the test set. Without any indication of our relative performance, it's unclear where, or whether, to dedicate additional resources to improving our entries.