
AI4D Africa’s Anglophone Research Lab Tanzania ASR Challenge

Helping Africa
$1 000 USD
Challenge completed over 2 years ago
Automatic Speech Recognition
Natural Language Processing
153 joined
34 active
Start: Jul 31, 23
Close: Aug 27, 23
Reveal: Aug 27, 23
jpandeinge
University of Manchester
Loading the dataset
Data · 18 Aug 2023, 06:06 · 4

Does anyone have an idea of how I can load the dataset? I can't seem to get the actual audio of the dataset to use. Here is my code:

```
import os
import pandas as pd
from datasets import Dataset, DatasetDict, load_dataset

# Step 1: Read the train.csv file
csv_path = "/content/drive/MyDrive/data/train.csv"
train_df = pd.read_csv(csv_path)
dataset = load_dataset('csv', data_files=csv_path)
```

Discussion 4 answers

The audio data is in the test_audios folder.

18 Aug 2023, 06:21
Upvotes 0
jpandeinge
University of Manchester

I would rather have my `path="/content/drive/MyDrive/train_audios/"`, because I tried that as well.

jpandeinge
University of Manchester

Could you give a code example please? I would really appreciate it!

You can try this! It will load the training audios and the associated sentence in one dataset.

```
import pandas as pd
from datasets import load_dataset

# Read the labels and rename the columns for the "audiofolder" loader:
# "file_name" links each row to an audio file in the folder.
train_df = pd.read_csv('train.csv', usecols=["path", "sentence"]).rename(
    columns={'path': 'file_name', 'sentence': 'transcription'}
)
print(train_df.head(5))

# Save the labels as metadata.csv in the same folder as the training audios
train_df.to_csv("train_audios/metadata.csv", index=False)

# Load the audio files and their transcriptions as one dataset
dataset = load_dataset("audiofolder", data_dir="train_audios")
```

For more details, see: https://huggingface.co/docs/datasets/audio_load
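
If the audio still does not show up after loading with `audiofolder`, one thing worth checking is whether the `file_name` values in `metadata.csv` actually match files in the audio folder (a missing file extension or a different subfolder would be possible causes). Below is a minimal sketch using only the standard library. `find_missing_audio` is a hypothetical helper written for illustration, and the paths in the usage comment are assumptions about your folder layout.

```python
import csv
from pathlib import Path


def find_missing_audio(metadata_csv, audio_dir, column="file_name"):
    """Return the values of `column` that have no matching file in audio_dir."""
    audio_dir = Path(audio_dir)
    missing = []
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Each entry is resolved relative to the audio folder
            if not (audio_dir / row[column]).is_file():
                missing.append(row[column])
    return missing


# Example usage (paths are assumptions; adjust to your own layout):
# missing = find_missing_audio("train_audios/metadata.csv", "train_audios")
# print(len(missing), "files listed in metadata.csv were not found")
```

An empty list means every row in `metadata.csv` points at an existing file. Otherwise the returned names show which entries need fixing before calling `load_dataset`.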