
Mozilla Luganda Automatic Speech Recognition

Helping Uganda

$3 000 USD

Completed (over 4 years ago)

Skills you will learn

Automatic Speech Recognition

Natural Language Processing

184 joined

20 active


Start

Oct 13, 21

Jan 16, 22

Reveal

Jan 16, 22

kiminya

Strathmore university

Rules contradiction

Data · 25 Oct 2021, 09:27 · 3

You will train your models on the complete Luganda dataset.

...any use of data leaks will result in your disqualification.

The complete dataset contains the test set, and you changed the filenames, so it looks like you're setting us up to use the data leak.

@zindi should probably specify which files in the complete dataset we can't use for training before someone makes the mistake.
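If such an exclusion list were published, dropping the affected clips before training could look like the sketch below. It assumes the training manifest is a Common Voice style TSV with a `path` column; the function name `drop_leaked_rows`, the toy manifest, and the exclusion list are hypothetical and only illustrate the idea.

```python
import csv
import io


def drop_leaked_rows(tsv_text, leaked_paths):
    """Return the rows of a Common Voice style TSV whose 'path' is not leaked."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row for row in reader if row["path"] not in leaked_paths]


# Toy manifest; the sentences are placeholders, not real dataset content.
manifest = (
    "path\tsentence\n"
    "common_voice_lg_0001.mp3\tplaceholder one\n"
    "common_voice_lg_0002.mp3\tplaceholder two\n"
)
leaked = {"common_voice_lg_0002.mp3"}  # hypothetical exclusion list
kept = drop_leaked_rows(manifest, leaked)
print([row["path"] for row in kept])
```

Filtering on the manifest rather than deleting audio files keeps the original download intact, so the exclusion list can be revised later without re-downloading anything.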

Discussion 3 answers

cahya

Hi, I had assumed that the Zindi test set for this competition was genuinely unseen and had never been published before. But after a few minutes of checking, as @kiminya mentioned, I already found a few Zindi audio files that also exist in the Common Voice test set, for example:

Zindi: ID_XSSVO2NA.mp3 -> CV (test.tsv): common_voice_lg_23779789.mp3

Zindi: ID_RUJKZDE2.mp3 -> CV (test.tsv): common_voice_lg_25415420.mp3

I still have to check whether the Zindi test set also contains all of the other 4276 audio files from Common Voice's test.tsv. If it does, I am not sure the competition is valid. At least the two examples above appear only in the Common Voice test set, and when we trained our model we did not use test.tsv. But how would Zindi be able to check whether a model was trained on the leaked data or not?
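A check like this could be run over the whole dataset by comparing files by content rather than by name, since the Zindi files were renamed. The sketch below is only one possible approach: it assumes a leaked clip is a byte-for-byte copy of the Common Voice original, which this thread does not confirm. If the audio was re-encoded, matching on transcripts or audio fingerprints would be needed instead. The directory arguments and the function names `hash_file` and `find_overlap` are hypothetical.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_overlap(zindi_dir, cv_dir, pattern="*.mp3"):
    """Map each Zindi filename to a Common Voice filename with identical bytes.

    Assumption: a leaked clip is a byte-for-byte copy of the original.
    """
    cv_by_hash = {
        hash_file(p): p.name for p in sorted(Path(cv_dir).glob(pattern))
    }
    overlap = {}
    for zindi_path in sorted(Path(zindi_dir).glob(pattern)):
        match = cv_by_hash.get(hash_file(zindi_path))
        if match is not None:
            overlap[zindi_path.name] = match
    return overlap
```

Calling `find_overlap` with the Zindi audio folder and the Common Voice clips folder would return a dictionary of renamed files and their originals; an empty result under this assumption would not prove the absence of a leak, only the absence of identical bytes.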

25 Oct 2021, 19:00 (edited ~8 hours later)


AkashPB

In the absence of any response from @Zindi, and given that we have to get a score on the leaderboard by 31st October and that models take a long time to train, I am using the entire Common Voice dataset for Luganda despite the leak.

27 Oct 2021, 06:12


cahya

I have compiled the data leaks between the Zindi Luganda ASR dataset and the Luganda Common Voice dataset version 7: https://gist.github.com/cahya-wirawan/5b31d604056422356bbeaa484e688940

Please review it in case I made a mistake. If I compiled it correctly, 71.7% (5066) of the 7068 audio files in the Zindi Luganda ASR dataset are also in the Luganda Common Voice dataset: 33% in the Common Voice train split, 21.5% in the test split, and 17.2% in the validation split.
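For anyone reviewing the numbers, a per-split breakdown like the one above could be reproduced from a mapping of Zindi file IDs to Common Voice splits. The sketch below is not taken from the gist; the function name `summarize_leak` and the toy mapping are hypothetical. The last line only re-checks the headline figure quoted above, and the three split percentages (33 + 21.5 + 17.2) do add up to the same 71.7.

```python
from collections import Counter


def summarize_leak(split_by_zindi_id, total_zindi_files):
    """Return {split: (count, percent of all Zindi files)} plus an 'all' entry."""
    counts = Counter(split_by_zindi_id.values())
    summary = {
        split: (n, round(100 * n / total_zindi_files, 1))
        for split, n in counts.items()
    }
    leaked = sum(counts.values())
    summary["all"] = (leaked, round(100 * leaked / total_zindi_files, 1))
    return summary


# Toy data for illustration only, not the real mapping.
toy = {"ID_1": "train", "ID_2": "train", "ID_3": "test", "ID_4": "validation"}
print(summarize_leak(toy, total_zindi_files=8))

# Sanity check of the headline figure: 5066 of 7068 files.
print(round(100 * 5066 / 7068, 1))  # 71.7
```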

This raises a question for @Zindi: how was the dataset for this competition created?

27 Oct 2021, 17:44 (edited 1 minute later)

