After some data exploration it's clear that the test set is derived from Common Voice (CV). Even sticking to good DS principles (train on train, validate on dev, only use test for final scoring), we will still see misleadingly low WER, because 2330/7067 of the Zindi test set are present in the CV training split and a further 1218 in the CV dev split.
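For anyone who wants to reproduce those overlap counts, here is a minimal sketch. It assumes the Zindi test IDs and the CV `path` column both reduce to clip filenames; the column names and file paths in the comments are hypothetical, so adapt them to the actual data layout.

```python
# Hypothetical sketch: quantify overlap between a competition test set and
# Common Voice splits by comparing clip filenames.
import csv

def split_overlap(test_ids, cv_paths):
    """Return the set of test clip IDs that also appear in a CV split."""
    # Normalise to bare filenames so "clips/abc.mp3" matches "abc.mp3".
    norm = lambda p: p.rsplit("/", 1)[-1].strip()
    return {norm(t) for t in test_ids} & {norm(p) for p in cv_paths}

def load_column(path, column, delimiter="\t"):
    """Read one column from a CSV/TSV file (e.g. the 'path' column of CV's train.tsv)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f, delimiter=delimiter)]

if __name__ == "__main__":
    # In practice something like (filenames are assumptions):
    #   test_ids    = load_column("Test.csv", "ID", delimiter=",")
    #   train_paths = load_column("cv-corpus/lg/train.tsv", "path")
    # Toy illustration:
    test_ids = ["clip_001.mp3", "clip_002.mp3", "clip_003.mp3"]
    train_paths = ["clips/clip_002.mp3", "clips/clip_999.mp3"]
    leaked = split_overlap(test_ids, train_paths)
    print(f"{len(leaked)}/{len(test_ids)} test clips found in CV train split")
```

Running the same check against the CV dev split gives the second overlap figure.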
Within the competition rules it would be trivial to achieve 0 WER, but that does nothing to further STT for Luganda speakers.
Could the organisers please provide an independent test set so that this becomes a useful challenge. If no other data is available, I would suggest using the official CV test set and making it clear in the rules that it may not be used for training. Our solutions will be made available and validated independently, so there is no need to obfuscate the test set.
Yes, it would be nice to get some comments from the Zindi team about this, as we also asked a while ago in another thread: https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition/discussions/8939
@ZINDI
Hi all, thank you for your concern.
The data leak has been addressed in this discussion post - https://zindi.africa/competitions/mozilla-luganda-automatic-speech-recognition/discussions/9446
All the best
I appreciate the reply, but I would strongly suggest that the leak hasn't actually been addressed, or at least that insufficient information is available.
Can you confirm that the "unseen validation set" will not be derived from Common Voice? If not, the same leakage issue applies.
Even if that is the case, as competitors we currently have no way of validating our models against the test set. Without any indication of our relative performance, it's unclear where, or whether, to dedicate additional resources to improving our entries.