You will train your models on the complete Luganda dataset.
...any use of data leaks will result in your disqualification.
The complete dataset contains the test set, and since you changed the filenames, it looks like you're setting us up to use the data leak.
@zindi should probably specify which files in the complete dataset we can't use for training before someone uses them by mistake.
Hi, I had assumed that the Zindi test set for this competition was truly unseen and had never been published before. But after a few minutes of checking, as mentioned by @kiminya, I already found a few Zindi audio files that also exist in the Common Voice test set, for example:
Zindi: ID_XSSVO2NA.mp3 -> CV (test.tsv): common_voice_lg_23779789.mp3
Zindi: ID_RUJKZDE2.mp3 -> CV (test.tsv): common_voice_lg_25415420.mp3
I still have to check whether the Zindi test set also contains all the other 4276 audio files from Common Voice's test.tsv. If it does, I am not sure this is a valid competition. At least the two examples above appear only in the Common Voice test set, and we did not use test.tsv when we trained our model. But how would Zindi be able to check whether a model was trained on the leaked Zindi data or not?
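For anyone who wants to run this kind of check themselves, here is a minimal sketch. It assumes the leaked files are byte-identical copies under new names, which may not hold if the audio was re-encoded (in that case matching on transcripts or audio fingerprints would be needed). The function names and directory layout are hypothetical, not anything provided by Zindi or Common Voice.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_byte_identical(zindi_dir, cv_dir, pattern="*.mp3"):
    """Map each Zindi filename to a Common Voice filename with the same bytes.

    Assumption: renamed copies are byte-identical. Re-encoded audio
    will NOT be detected by this check.
    """
    # Index the Common Voice clips by content hash.
    cv_by_digest = {}
    for p in sorted(Path(cv_dir).glob(pattern)):
        cv_by_digest.setdefault(file_digest(p), p.name)

    # Look up every Zindi clip in that index.
    matches = {}
    for p in sorted(Path(zindi_dir).glob(pattern)):
        digest = file_digest(p)
        if digest in cv_by_digest:
            matches[p.name] = cv_by_digest[digest]
    return matches
```

Calling `find_byte_identical("zindi_test", "cv/clips")` (both paths are placeholders) would return a dict such as `{"ID_....mp3": "common_voice_lg_....mp3"}` for every clip that appears in both folders.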
In the absence of any response from @Zindi, and since we have to get a score on the leaderboard by 31st October while models take a long time to train, I am using the entire Common Voice dataset for Luganda despite the leak.
I have compiled the data leaks between the Zindi Luganda ASR dataset and version 7 of the Luganda Common Voice dataset: https://gist.github.com/cahya-wirawan/5b31d604056422356bbeaa484e688940
Please review it in case I made a mistake. If I compiled it correctly, 5066 of the 7068 audio files (71.7%) in the Zindi Luganda ASR dataset are also contained in the Luganda Common Voice dataset: 33% in the Common Voice train split, 21.5% in the test split and 17.2% in the validation split.
This raises a question for @Zindi: how was the dataset for this competition created?
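To make the percentages above easier to double-check, here is a small sketch of how the per-split shares could be recomputed from a list of matches. It assumes you already have a mapping from each leaked Zindi file to the Common Voice split it was found in. The function name and the input format are hypothetical and not necessarily how the gist was built.

```python
from collections import Counter


def overlap_by_split(matches, total_files):
    """Return the percentage of Zindi files found in each Common Voice split.

    matches: dict mapping a Zindi filename to the split name
             ("train", "test" or "validation") where it was found.
    total_files: number of audio files in the Zindi dataset.
    """
    counts = Counter(matches.values())
    summary = {split: round(100 * n / total_files, 1) for split, n in counts.items()}
    # Share of Zindi files found in any split at all.
    summary["any"] = round(100 * len(matches) / total_files, 1)
    return summary


# Sanity check on the totals reported in this thread.
overall = round(100 * 5066 / 7068, 1)
split_sum = round(33 + 21.5 + 17.2, 1)
```

With the figures from this thread, `overall` and `split_sum` both come out as 71.7, so the three split percentages are consistent with the overall count.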