Mozilla Luganda Automatic Speech Recognition

Helping Uganda · $3,000 USD · Challenge completed
Automatic Speech Recognition · Natural Language Processing
179 joined · 20 active
Start: Oct 13, 2021 · Close: Jan 16, 2022 · Reveal: Jan 16, 2022
A nearly perfect score (and why I'm sharing the code)
Notebooks · 19 Dec 2021, 18:44 · edited 32 minutes later

N.B. - My submissions shown on the leaderboard are STT models trained only on the Common Voice train split and validated on dev.

Intro

After stumbling across the problem of leakage that remains in the test set, I was curious to find the simplest model that would give a 0 WER. The approach I took was to extract MFCC feature statistics for each utterance in validated.tsv (with librosa's default 20 MFCCs and four statistics per coefficient, each clip becomes an 80-dimensional vector), give each a unique integer label, and train a nearest-neighbours classifier with K=1. As expected, this identifies 100% of the Zindi test set due to the aforementioned leakage.

Despite this, my initial model only scored 0.156 WER on Zindi. I realised this was due to different preprocessing of the transcripts: despite the Evaluation section saying otherwise, the submitted transcripts should have no punctuation, and case doesn't matter. Fixing this gave a WER of 0.0005; not quite 0, but close enough.
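To see how much this preprocessing matters, here's a minimal sketch using the jiwer package (not part of my pipeline, and the Luganda phrase is a made-up example):

import jiwer

ref = "Ekibuga kya Kampala."  # reference with case and trailing punctuation
hyp = "ekibuga kya kampala"   # hypothesis normalized to lowercase, no punctuation

print(jiwer.wer(ref, hyp))    # 1.0: "Ekibuga" vs "ekibuga", "Kampala." vs "kampala" etc. count as errors
print(jiwer.wer(ref.lower().replace(".", ""), hyp))  # 0.0 once case and punctuation are stripped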

Why I'm sharing the code

Although my solution fulfils the rules of the challenge (it's ML based and uses only Common Voice 7), it serves no purpose other than producing a low score.

By sharing this 'perfect' solution I hope to disincentivise cheating, so that we can spend the rest of the competition coming up with great, robust, useful Luganda STT models. I also don't want this knowledge about the submission format to give me an unfair advantage.

In that spirit, I'd encourage people to share how they're getting on using the official Common Voice data splits. I'll start:

Finetuning English --> Luganda QuartzNet 15x5: ~0.4 WER on the dev set.

Finetuning English --> Kinyarwanda QuartzNet 15x5: ~0.35 WER on the dev set.
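For anyone who wants to try this kind of baseline, here's a minimal sketch of the finetuning setup, assuming NVIDIA NeMo; the manifest paths and training hyperparameters are placeholders, not the exact values I used:

import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Start from the pretrained English QuartzNet 15x5 checkpoint.
model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Swap the CTC decoder vocabulary for the Luganda character set
# (the same characters as normalize_str in the code below).
model.change_vocabulary(new_vocabulary=list(" 'abcdefghijklmnopqrstuvwxyz"))

# NeMo-style JSON manifests built from the Common Voice train/dev splits.
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "labels": model.decoder.vocabulary,
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": True,
}))
model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",  # placeholder path
    "labels": model.decoder.vocabulary,
    "sample_rate": 16000,
    "batch_size": 32,
    "shuffle": False,
}))

pl.Trainer(max_epochs=50, accelerator="gpu", devices=1).fit(model)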

Code:

import pandas as pd
import numpy as np
import librosa
from sklearn import neighbors
import warnings
import os
import tqdm.notebook as tqdm
import glob
import unidecode

def feature_extractor(file):
    """Load a clip and summarise its MFCCs with per-coefficient statistics."""
    warnings.filterwarnings('ignore')  # librosa warns on some mp3s
    x, fs = librosa.load(file)
    feat = librosa.feature.mfcc(y=x, sr=fs)
    warnings.filterwarnings("default")
    # 20 MFCCs x 4 statistics = one 80-dimensional vector per clip.
    return np.concatenate((feat.mean(axis=1), feat.min(axis=1), feat.max(axis=1), feat.std(axis=1)))

def normalize_str(txt):
    """Lowercase, strip accents and drop any character outside the vocabulary."""
    invalids = set()  # collects unknown characters (useful when debugging)
    # vocabulary
    valid_chars = (" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
                   "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'")
    # lowercase
    new_txt = txt.lower().strip()
    # remove characters not in the vocabulary
    res_arr = []
    for c in new_txt:
        if c in valid_chars:
            res_arr.append(c)
        else:
            # strip the accent and see if the result is valid
            non_accent_c = unidecode.unidecode(c)
            if non_accent_c in valid_chars:
                res_arr.append(non_accent_c)
            else:
                # a character we don't know; replace with a space
                invalids.update(c)
                res_arr.append(' ')
    res = ''.join(res_arr).strip()
    # collapse runs of whitespace to single spaces
    return ' '.join(res.split())

datasetdir = "/data/CV_unpacked/cv-corpus-7.0-2021-07-21/lg"

df = pd.read_csv(f'{datasetdir}/validated.tsv', delimiter="\t")

X = [feature_extractor(f"{datasetdir}/clips/{row.path}") for _, row in tqdm.tqdm(df.iterrows(), total=len(df))]

# That took a while, so save the features for later.
np.save("feats", X)

X = np.load("feats.npy")

# Assign a unique number to each utterance and create lookup.
y = df.index
labels = [normalize_str(_) for _ in df.sentence]

clf = neighbors.KNeighborsClassifier(n_neighbors=1, leaf_size=100)
clf.fit(X, y)

# Generate test predictions
testdir = "/data/test_audio/test_audio"
testfiles = glob.glob(f"{testdir}/*.mp3")
testfeats = [feature_extractor(_) for _ in tqdm.tqdm(testfiles)]

testpreds = clf.predict(testfeats)
testtrans = [labels[_].replace("'", "") for _ in testpreds]  # strip apostrophes to match the expected transcript format

with open("predictions.csv", "w") as f:
    f.write("Clip_ID,Target\n")
    for file, transcription in zip(testfiles,testtrans):
        ID = os.path.split(os.path.splitext(file)[0])[1]
        f.write(f"{ID},{transcription}\n")

Discussion (3 answers)

Thanks for your code. I don't have much experience in speech processing, so I used your code as an initial test and study version. But I made a submission... @Zindi, how can I delete a submission?

22 Dec 2021, 22:18
Upvotes 0

I can't comment on removing it, but if you submit a dummy file with incorrect transcripts you can select that as your leaderboard entry.

For a better place to start, check out DeepSpeech, Hugging Face's speech-to-text models, Kaldi, and NVIDIA's NeMo packages. My code was just to demonstrate that we can get ~0 WER by training with all of the Common Voice data.

Thanks for the suggestion. I thought that 'selecting it as your leaderboard entry' only worked in the final submission phase.