Mozilla Luganda Automatic Speech Recognition

Helping Uganda · $3,000 USD · Challenge completed
Automatic Speech Recognition · Natural Language Processing
179 joined · 20 active
Start: Oct 13, 2021 · Close: Jan 16, 2022 · Reveal: Jan 16, 2022
A nearly perfect score (and why I'm sharing the code)
Notebooks · 19 Dec 2021, 18:44 · edited 32 minutes later

N.B. - My submissions shown on the leaderboard are STT models trained only on the Common Voice train split and validated on dev.

Intro

After stumbling across the problem of leakage that remains in the test set, I was curious to find the simplest model that would give a 0 WER. The approach I took was to extract MFCC feature statistics for each utterance in validated.tsv (with librosa's default 20 MFCCs and four statistics per coefficient, each clip becomes an 80-dimensional vector), give each a unique integer label, and train a nearest-neighbours classifier with K=1. As expected, this identifies 100% of the Zindi test set due to the aforementioned leakage.

Despite this, my initial model only scored 0.156 WER on Zindi. I realised this was due to different preprocessing of the transcripts: despite the Evaluation section saying otherwise, the submitted transcripts should have no punctuation, and case doesn't matter. Fixing this gave a WER of 0.0005; not quite 0, but close enough.
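To see how much this preprocessing matters, here's a minimal sketch using the jiwer package (not part of my pipeline, and the Luganda phrase is a made-up example):

import jiwer

ref = "Ekibuga kya Kampala."  # reference with case and trailing punctuation
hyp = "ekibuga kya kampala"   # hypothesis normalized to lowercase, no punctuation

print(jiwer.wer(ref, hyp))    # 1.0: "Ekibuga" vs "ekibuga", "Kampala." vs "kampala" etc. count as errors
print(jiwer.wer(ref.lower().replace(".", ""), hyp))  # 0.0 once case and punctuation are stripped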

Why I'm sharing the code

Although my solution fulfils the rules of the challenge (it's ML based and uses only Common Voice 7), it serves no purpose other than producing a low score.

By sharing this 'perfect' solution I hope to disincentivise cheating, so that we can spend the rest of the competition coming up with great, robust, useful Luganda STT models. I also don't want this knowledge about the submission format to give me an unfair advantage.

In that spirit, I'd encourage people to share how they're getting on using the official Common Voice data splits. I'll start:

Finetuning English --> Luganda QuartzNet 15x5: ~0.4 WER on the dev set.

Finetuning English --> Kinyarwanda QuartzNet 15x5: ~0.35 WER on the dev set.
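For anyone who wants to try this kind of baseline, here's a minimal sketch of the finetuning setup, assuming NVIDIA NeMo; the manifest paths and training hyperparameters are placeholders, not the exact values I used:

import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Start from the pretrained English QuartzNet 15x5 checkpoint.
model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Swap the CTC decoder vocabulary for the Luganda character set
# (the same characters as normalize_str in the code below).
model.change_vocabulary(new_vocabulary=list(" 'abcdefghijklmnopqrstuvwxyz"))

# NeMo-style JSON manifests built from the Common Voice train/dev splits.
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "labels": model.decoder.vocabulary,
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": True,
}))
model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",  # placeholder path
    "labels": model.decoder.vocabulary,
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": False,
}))

pl.Trainer(max_epochs=50, accelerator="gpu", devices=1).fit(model)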

Code:

import pandas as pd
import numpy as np
import librosa
from sklearn import neighbors
import warnings
import os
import tqdm.notebook as tqdm
import glob
import unidecode

def feature_extractor(file):
    """Load a clip and summarise its MFCCs with per-coefficient statistics."""
    warnings.filterwarnings('ignore')  # librosa warns on some mp3s
    x, fs = librosa.load(file)
    feat = librosa.feature.mfcc(y=x, sr=fs)
    warnings.filterwarnings("default")
    # 20 MFCCs x 4 statistics = one 80-dimensional vector per clip.
    return np.concatenate((feat.mean(axis=1), feat.min(axis=1), feat.max(axis=1), feat.std(axis=1)))

def normalize_str(txt):
    """Lowercase, strip accents and drop any character outside the vocabulary."""
    invalids = set()  # collects unknown characters (useful when debugging)
    # vocabulary
    valid_chars = (" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
                   "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'")
    # lowercase
    new_txt = txt.lower().strip()
    # remove characters not in the vocabulary
    res_arr = []
    for c in new_txt:
        if c in valid_chars:
            res_arr.append(c)
        else:
            # strip the accent and see if the result is valid
            non_accent_c = unidecode.unidecode(c)
            if non_accent_c in valid_chars:
                res_arr.append(non_accent_c)
            else:
                # a character we don't know; replace with a space
                invalids.update(c)
                res_arr.append(' ')
    res = ''.join(res_arr).strip()
    # collapse runs of whitespace to single spaces
    return ' '.join(res.split())

datasetdir = "/data/CV_unpacked/cv-corpus-7.0-2021-07-21/lg"

df = pd.read_csv(f'{datasetdir}/validated.tsv', delimiter="\t")

X = [feature_extractor(f"{datasetdir}/clips/{row.path}") for _, row in tqdm.tqdm(df.iterrows(), total=len(df))]

# That took a while, so save the features for later.
np.save("feats", X)

X = np.load("feats.npy")

# Assign a unique number to each utterance and create lookup.
y = df.index
labels = [normalize_str(_) for _ in df.sentence]

clf = neighbors.KNeighborsClassifier(n_neighbors=1, leaf_size=100)
clf.fit(X, y)

# Generate test predictions
testdir = "/data/test_audio/test_audio"
testfiles = glob.glob(f"{testdir}/*.mp3")
testfeats = [feature_extractor(_) for _ in tqdm.tqdm(testfiles)]

testpreds = clf.predict(testfeats)
testtrans = [labels[_].replace("'", "") for _ in testpreds]  # strip apostrophes to match the expected transcript format

with open("predictions.csv", "w") as f:
    f.write("Clip_ID,Target\n")
    for file, transcription in zip(testfiles,testtrans):
        ID = os.path.split(os.path.splitext(file)[0])[1]
        f.write(f"{ID},{transcription}\n")

Discussion (3 answers)

Thanks for your code. I don't have much experience in speech processing, so I used your code as an initial test and study version. But I made a submission... @Zindi, how can I delete a submission?

22 Dec 2021, 22:18
Upvotes 0

I can't comment on removing it, but if you submit a dummy file with incorrect transcripts you can select that as your leaderboard entry.

For a better place to start, check out DeepSpeech, Hugging Face's speech-to-text models, Kaldi, and NVIDIA's NeMo packages. My code was just to demonstrate that we can get ~0 WER by training with all of the Common Voice data.

Thanks for the suggestion. I thought that 'selecting it as your leaderboard entry' only worked in the final submission phase.