I'm currently unable to load the Common Voice 17.0 dataset (Swahili split) from Hugging Face using the datasets library. Here's the code I'm using:
from datasets import load_dataset
cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train", streaming=True)
print(next(iter(cv_17)))
I also attempted to load it using:
cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train", data_dir="parquet")
But the same error persists.
It seems Hugging Face has dropped support for script-based datasets like common_voice_17_0.py, and this dataset hasn't yet been fully migrated to the newer "data-only" format (e.g., Parquet). As a result, it’s currently not possible to load it via code as expected.
Could you please clarify whether this is the case, and what the recommended workaround is?
Same here, I'm facing this too.
Maybe try pinning the version: !pip install datasets==3.6.0
Still getting the same error.
You can create an access token on Hugging Face (name it anything you like), then use that hf_token to load these datasets.
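For reference, here's a minimal sketch of passing a token explicitly. It assumes you've created an access token in your Hugging Face account settings and exported it as an environment variable (HF_TOKEN here); the helper name is my own:

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read a Hugging Face access token from the environment, failing loudly."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} to your Hugging Face access token")
    return token

# Pass the token explicitly when loading a gated dataset, e.g.:
# cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw",
#                      split="train", token=get_hf_token())
```

Note that Common Voice is a gated dataset, so the token only helps after you've accepted the terms on the dataset page.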
Not working; I have tried many times but nothing changed.
I'm trying with an hf_token but it doesn't work. I think the solution is to download the files manually from Mozilla, or to wait for instructions.
Yeah, sure, let's wait for instructions.
Please update your datasets library, and it will work just fine!
Okay
https://github.com/Sartify/Swahili-Challenge-Competition---Pan-African-Wide-Alignment-PAWA-ASR
Try
!pip install -U datasets
Okay, let me try.
I get an error, but the pasted traceback is truncated (the notebook collapsed it to "3 frames").
Please do as follows:
1st step:
!pip uninstall datasets -y
2nd step:
!pip install datasets==3.5.1
Please let me know @AshaNasri
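If you're unsure whether your environment still has a too-new version, a small stdlib-only check like this can fail fast before you try loading anything. A sketch under this thread's assumption that 3.5.1 is the last version that still loads script-based datasets; the helper name is my own, and it only handles plain X.Y.Z version strings:

```python
def needs_downgrade(installed: str, last_known_good: str = "3.5.1") -> bool:
    """True if the installed `datasets` version is newer than the last
    version reported in this thread to still load script-based datasets."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) > parse(last_known_good)

# Example usage in a notebook:
# import datasets
# if needs_downgrade(datasets.__version__):
#     print("Downgrade first: pip install datasets==3.5.1")
```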
Complete code that I tested and that worked:
!pip uninstall datasets -y
!pip install datasets==3.5.1

from datasets import load_dataset, Features, Value, Audio, DownloadConfig
import soundfile as sf

download_config = DownloadConfig(force_download=True)

# 1. Define a schema that matches the *actual* Arrow types in CV-17
cv_sw_features = Features({
    "client_id": Value("string"),
    "path": Value("string"),
    "sentence_id": Value("string"),
    "sentence": Value("string"),
    "sentence_domain": Value("string"),
    "up_votes": Value("string"),    # <- string, not int64
    "down_votes": Value("string"),  # <- string, not int64
    "age": Value("string"),
    "gender": Value("string"),
    "variant": Value("string"),
    "locale": Value("string"),
    "segment": Value("string"),
    "accent": Value("string"),
    # keep audio decoded so we get "array" + "sampling_rate"
    "audio": Audio(sampling_rate=48_000, mono=True, decode=True),
})

# 2. Stream the train split with that schema
cv_17 = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "sw",
    split="train",
    streaming=True,
    features=cv_sw_features,          # <- custom schema solves the CastError
    download_config=download_config,  # <- force fresh downloads of the metadata
)

# 3. Grab the first row
first_row = next(iter(cv_17))

# 4. Save the audio clip
audio_array = first_row["audio"]["array"]
sr = first_row["audio"]["sampling_rate"]
sf.write("first_row_audio.wav", audio_array, sr)

print("Saved:", first_row["sentence"])
print("→ first_row_audio.wav | shape:", audio_array.shape, " sr:", sr)
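One side note on the schema above: since up_votes/down_votes stream as strings, you'll want to cast them before any numeric filtering, and the decoded audio makes duration checks easy. A small sketch with a synthetic row shaped like a CV-17 streaming row (the clip_stats helper is my own; in real rows the audio "array" is a NumPy array, a plain list is used here only for illustration):

```python
def clip_stats(row: dict) -> dict:
    """Cast the string vote counts to ints and compute clip duration in seconds."""
    audio = row["audio"]
    return {
        "up_votes": int(row["up_votes"]),
        "down_votes": int(row["down_votes"]),
        "duration_s": len(audio["array"]) / audio["sampling_rate"],
    }

# Synthetic example: one second of silence at 48 kHz
row = {
    "up_votes": "2",
    "down_votes": "0",
    "audio": {"array": [0.0] * 48_000, "sampling_rate": 48_000},
}
print(clip_stats(row))  # {'up_votes': 2, 'down_votes': 0, 'duration_s': 1.0}
```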
Okay, I will update you when I'm finished.
This works, thank you very much!
More can be found at:
https://github.com/Sartify/Swahili-Challenge-Competition---Pan-African-Wide-Alignment-PAWA-ASR
Also, Kolesh has many starter Python scripts for the competition, with tricks and tweaks.
Oh, okay, thanks so much!