This Week on Zindi: Loading the dataset

AI4D Africa’s Anglophone Research Lab Tanzania ASR Challenge

Helping Africa

$1 000 USD

Completed (over 2 years ago)

Skills you will learn

Automatic Speech Recognition

Natural Language Processing


Start

Jul 31, 23

Aug 27, 23

Reveal

Aug 27, 23

jpandeinge

University of Manchester

Loading the dataset

Data · 18 Aug 2023, 06:06 · 4

Does anyone know how I can load the dataset? I can't seem to get the actual audio of the dataset to use. Here is my code:

```
import os
import pandas as pd
from datasets import Dataset, DatasetDict, load_dataset

# Step 1: Read the train.csv file
csv_path = "/content/drive/MyDrive/data/train.csv"
train_df = pd.read_csv(csv_path)
dataset = load_dataset('csv', data_files=csv_path)
```

Discussion 4 answers

Incarceron

The audio data is in the test_audios folder.

18 Aug 2023, 06:21

Upvotes 0

jpandeinge

University of Manchester

I would rather have my `path="/content/drive/MyDrive/train_audios/"`, because I tried that as well.

replied to Incarceron · 18 Aug 2023, 06:28

Upvotes 0

jpandeinge

University of Manchester

Could you give a code example, please? I would really appreciate it!

replied to Incarceron · 18 Aug 2023, 07:07

Upvotes 0

Incarceron

You can try this! It will load the training audios and the associated sentence in one dataset.

```
import pandas as pd
from datasets import load_dataset

# Keep only the audio path and its sentence, renamed to the
# "file_name" column that the audiofolder loader looks for
train_df = pd.read_csv('train.csv', usecols=["path", "sentence"]).rename(
    columns={'path': 'file_name', 'sentence': 'transcription'}
)
print(train_df.head(5))

# Save the df with labels in the same folder as the training audios
train_df.to_csv("train_audios/metadata.csv", index=False)

# Load the audio files together with their transcriptions
dataset = load_dataset("audiofolder", data_dir="train_audios")
```

For more reference: https://huggingface.co/docs/datasets/audio_load

replied to jpandeinge · 18 Aug 2023, 10:04

Upvotes 1
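
If the `audiofolder` approach from the thread still does not pick up the audio, one thing worth checking is whether every `file_name` entry in `metadata.csv` actually points to a file inside the audio folder. The following is only a minimal sketch of such a check using the standard library. The helper name `find_missing_audio` is hypothetical (it is not part of `datasets`), and it assumes the `file_name` column holds paths relative to the audio folder, which may not match how `train.csv` stores them.

```python
import csv
from pathlib import Path


def find_missing_audio(data_dir, metadata_name="metadata.csv"):
    """Return the file_name entries in the metadata file that have no
    matching file inside data_dir.

    Hypothetical helper: assumes file_name values are relative to data_dir.
    """
    data_dir = Path(data_dir)
    missing = []
    with open(data_dir / metadata_name, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # A row is "missing" when its audio file is not on disk
            if not (data_dir / row["file_name"]).is_file():
                missing.append(row["file_name"])
    return missing
```

Calling `find_missing_audio("train_audios")` after writing `metadata.csv` should return an empty list if every row has its audio file. A non-empty list would suggest a path mismatch, for example a missing file extension or an extra folder prefix in the `path` column.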
