Mozilla Luganda Automatic Speech Recognition

Helping Uganda
$3 000 USD
Challenge completed almost 4 years ago
Automatic Speech Recognition
Natural Language Processing
179 joined
20 active
Start: Oct 13, 21
Close: Jan 16, 22
Reveal: Jan 16, 22
Strathmore University
Rules contradiction
Data · 25 Oct 2021, 09:27 · 3

You will train your models on the complete Luganda dataset.

...any use of data leaks will result in your disqualification.

The complete dataset contains the test set, and you changed the filenames, so it looks like you're setting us up to use the data leak.

@zindi should probably specify which files in the complete dataset we can't use for training before someone makes the mistake.

Discussion 3 answers

Hi, I had assumed that the Zindi test set for this competition was truly unseen and had never been published before. But after a few minutes of checking, as @kiminya mentioned, I already found a few Zindi audio files that also exist in the Common Voice test set, for example:

Zindi: ID_XSSVO2NA.mp3 -> CV (test.tsv): common_voice_lg_23779789.mp3

Zindi: ID_RUJKZDE2.mp3 -> CV (test.tsv): common_voice_lg_25415420.mp3

I still have to check whether the Zindi test set also contains all the other 4276 audio files from Common Voice's test.tsv. If it does, I am not sure this is a valid competition. At least the two examples above appear only in the Common Voice test split, and we did not use test.tsv when we trained our model. But how would Zindi be able to check whether a model was trained on the leaked Zindi test data or not?
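For anyone who wants to reproduce this kind of check, here is a minimal sketch. The thread does not say how the matches above were found, so the approach is an assumption: it hashes the raw bytes of each file and pairs files with identical digests. The directory layout and the function names `file_digest` and `find_renamed_duplicates` are hypothetical. Byte hashing only catches exact copies; if the clips were re-encoded, matching on transcripts or audio fingerprints would be needed instead.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in 1 MiB chunks so large audio files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_renamed_duplicates(zindi_dir: Path, cv_dir: Path) -> dict[str, str]:
    """Map each Zindi filename to a Common Voice filename with identical bytes.

    Both directories are assumed (hypothetically) to hold flat lists of .mp3 files.
    """
    cv_by_digest = {file_digest(p): p.name for p in sorted(cv_dir.glob("*.mp3"))}
    matches = {}
    for p in sorted(zindi_dir.glob("*.mp3")):
        cv_name = cv_by_digest.get(file_digest(p))
        if cv_name is not None:
            matches[p.name] = cv_name
    return matches
```

Calling `find_renamed_duplicates(Path("zindi/clips"), Path("cv/clips"))` (both paths are placeholders) would return a dictionary of renamed pairs, in the same form as the two examples listed above.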

AkashPB

In the absence of any response from @Zindi, and considering that we have to get a score on the leaderboard by 31st October while models take a long time to train, I am using the entire Common Voice dataset for Luganda despite the leak.

27 Oct 2021, 06:12
Upvotes 0

I have compiled a list of the data leaks between the Zindi Luganda ASR dataset and version 7 of the Luganda Common Voice dataset: https://gist.github.com/cahya-wirawan/5b31d604056422356bbeaa484e688940

Please review it in case I made a mistake. But if I compiled it correctly, 71.7% (5066 of 7068) of the audio files in the Zindi Luganda ASR dataset are also contained in the Luganda Common Voice dataset: 33% in the Common Voice train split, 21.5% in the test split and 17.2% in the validation split.
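To make the breakdown easier to verify, here is a rough sketch of how the per-split shares could be computed from a list of matches. It is not the script behind the gist; the function name `leak_report` and the input shapes (a Zindi-to-Common-Voice filename mapping plus one set of filenames per Common Voice split) are assumptions for illustration.

```python
from collections import Counter


def leak_report(zindi_ids, matches, cv_splits):
    """Return the percentage of Zindi clips found in each Common Voice split.

    zindi_ids: list of all Zindi clip filenames
    matches:   dict mapping Zindi filename -> Common Voice filename
    cv_splits: dict mapping split name -> set of Common Voice filenames
    """
    # Invert the split listing so each Common Voice file points to its split.
    split_of = {name: split for split, names in cv_splits.items() for name in names}
    counts = Counter(split_of[cv] for cv in matches.values() if cv in split_of)
    total = len(zindi_ids)
    return {split: 100.0 * n / total for split, n in counts.items()}


# Toy data only; the filenames below are made up.
report = leak_report(
    zindi_ids=["ID_1.mp3", "ID_2.mp3", "ID_3.mp3", "ID_4.mp3"],
    matches={"ID_1.mp3": "cv_a.mp3", "ID_2.mp3": "cv_b.mp3", "ID_3.mp3": "cv_c.mp3"},
    cv_splits={"train": {"cv_a.mp3", "cv_b.mp3"}, "test": {"cv_c.mp3"}},
)
print(report)
```

The reported figures are at least internally consistent: 5066 / 7068 rounds to 71.7%, and the three split shares (33 + 21.5 + 17.2) add up to the same total.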

This raises a question for @Zindi: how was the dataset for this competition created?