
Your Voice, Your Device, Your Language Challenge

Helping Africa
1 000 CHF
Challenge completed ~1 month ago
Automatic Speech Recognition
Natural Language Processing
278 joined
73 active
Start: Jul 22, 25
Close: Sep 22, 25
Reveal: Sep 22, 25
Blocked from Accessing Common Voice 17.0 (Swahili) on Hugging Face — Deprecated Script Error
1 Aug 2025, 06:56 · 19

I'm currently unable to load the Common Voice 17.0 dataset (Swahili split) from Hugging Face using the datasets library. Here's the code I'm using:

from datasets import load_dataset

cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train", streaming=True)

print(next(iter(cv_17)))

I also attempted to load it using:

cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train", data_dir="parquet")

But the same error persists.

It seems Hugging Face has dropped support for script-based datasets like common_voice_17_0.py, and this dataset hasn't yet been fully migrated to the newer "data-only" format (e.g., Parquet). As a result, it’s currently not possible to load it via code as expected.

Could you please clarify:

  1. Whether we are supposed to use a different method to access the data (e.g., manual download)?
  2. If Zindi could provide a preprocessed or alternative link to the dataset that works with the updated datasets library?
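While waiting for an official answer, the failure mode above can at least be detected and routed to a fallback instead of crashing. A minimal sketch (the `loader` and `fallback` callables are hypothetical stand-ins for the real script-based and Parquet-based loading paths):

```python
def load_with_script_fallback(loader, fallback):
    """Try the script-based loading path first; if the installed `datasets`
    rejects script datasets, run the fallback (e.g. a Parquet export or a
    manual download) instead of crashing."""
    try:
        return loader()
    except RuntimeError as err:
        # This is the message newer `datasets` releases raise for script datasets.
        if "no longer supported" in str(err):
            return fallback()
        raise  # unrelated RuntimeErrors still propagate
```

For example, pass `lambda: load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train", streaming=True)` as the loader and your own Parquet-based routine as the fallback.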

Discussion: 19 answers

Sure, me too.

1 Aug 2025, 08:22
Upvotes 0

Maybe you should use !pip install datasets==3.6.0

1 Aug 2025, 09:34
Upvotes 0

Still getting the same error.

MICADEE
LAHASCOM

You can just create an access token (using the model name or any name of your choice), then use the hf_token you created to load the dataset.

1 Aug 2025, 10:28
Upvotes 0
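The token route mentioned above amounts to passing a Hugging Face access token through to load_dataset. A hedged sketch (the env-var name HF_TOKEN and the helper itself are my own additions; `token=` is the keyword accepted by recent `datasets` releases):

```python
import os

def cv17_kwargs(lang: str = "sw", split: str = "train") -> dict:
    """Build keyword arguments for load_dataset, pulling the Hugging Face
    access token from the environment instead of hard-coding it."""
    return {
        "path": "mozilla-foundation/common_voice_17_0",
        "name": lang,
        "split": split,
        "streaming": True,
        "token": os.environ.get("HF_TOKEN"),
    }

# Usage: load_dataset(**cv17_kwargs())
```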

Not working; I have tried many times but nothing changed.

I am trying with an hf_token but it doesn't work. I think the solution is to download the data manually from Mozilla, or to wait for instructions.

1 Aug 2025, 11:55
Upvotes 0

Yeah, sure, let's wait for instructions.

msamwelmollel
University of Glasgow

Please update your datasets library, and it will work just fine!

1 Aug 2025, 13:18
Upvotes 0

I get this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipython-input-3388558349.py in <cell line: 0>()
      1 from datasets import load_dataset
      2 
----> 3 ds = load_dataset("mozilla-foundation/common_voice_17_0", "sw", token = token)

3 frames

/usr/local/lib/python3.11/dist-packages/datasets/load.py in dataset_module_factory(path, revision, download_config, download_mode, data_dir, data_files, cache_dir, **download_kwargs)
    987                     proxies=download_config.proxies,
    988                 )
--> 989                 raise RuntimeError(f"Dataset scripts are no longer supported, but found (unknown)")
    990             except EntryNotFoundError:
    991                 # Use the infos from the parquet export except in some cases:
RuntimeError: Dataset scripts are no longer supported, but found common_voice_17_0.py
msamwelmollel
University of Glasgow

Please do as follows.

1st step:

!pip uninstall datasets

2nd step:

!pip install datasets==3.5.1

I am sure this works just fine. Please let me know @AshaNasri

1 Aug 2025, 13:55
Upvotes 0
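The pin to a 3.x release works because, judging by the RuntimeError pasted earlier in this thread, script-based dataset loading was dropped in the `datasets` 4.x line (an assumption on my part, inferred from the error rather than release notes). A small sanity check under that assumption:

```python
def supports_dataset_scripts(version: str) -> bool:
    """Return True if this `datasets` version can still run loading scripts
    such as common_voice_17_0.py (assumed removed in the 4.x line)."""
    major = int(version.split(".")[0])
    return major < 4

# Usage: supports_dataset_scripts(datasets.__version__)
```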
msamwelmollel
University of Glasgow

Complete code, tested and working:

!pip uninstall datasets -y
!pip install datasets==3.5.1

from datasets import load_dataset, Features, Value, Audio, DownloadConfig
import soundfile as sf

download_config = DownloadConfig(force_download=True)

# 1. Define a schema that matches the *actual* Arrow types in CV-17
cv_sw_features = Features({
    "client_id": Value("string"),
    "path": Value("string"),
    "sentence_id": Value("string"),
    "sentence": Value("string"),
    "sentence_domain": Value("string"),
    "up_votes": Value("string"),    # <- string, not int64
    "down_votes": Value("string"),  # <- string, not int64
    "age": Value("string"),
    "gender": Value("string"),
    "variant": Value("string"),
    "locale": Value("string"),
    "segment": Value("string"),
    "accent": Value("string"),
    # keep audio decoded so we get "array" + "sampling_rate"
    "audio": Audio(sampling_rate=48_000, mono=True, decode=True),
})

# 2. Stream the train split with that schema
cv_17 = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "sw",
    split="train",
    streaming=True,
    features=cv_sw_features,  # <- custom schema solves CastError
)

# 3. Grab the first row
first_row = next(iter(cv_17))

# 4. Save the audio clip
audio_array = first_row["audio"]["array"]
sr = first_row["audio"]["sampling_rate"]
sf.write("first_row_audio.wav", audio_array, sr)

print("Saved:", first_row["sentence"])
print("→ first_row_audio.wav | shape:", audio_array.shape, " sr:", sr)
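One practical extra when streaming clips like this: the decoded "audio" dict exposes the raw array and sampling rate, so clip duration (useful for filtering very short or very long ASR samples) is just a ratio. A minimal, numpy-only sketch (the helper name is mine, not part of the code above):

```python
import numpy as np

def clip_duration_seconds(audio_array: np.ndarray, sampling_rate: int) -> float:
    """Duration of a decoded Common Voice clip in seconds."""
    return audio_array.shape[0] / sampling_rate

# A 48 kHz mono clip with 96_000 samples lasts 2.0 seconds.
```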

Okay, I will update you when I finish.

This works, thank you very much!

msamwelmollel
University of Glasgow

More can be found in

https://github.com/Sartify/Swahili-Challenge-Competition---Pan-African-Wide-Alignment-PAWA-ASR

Also, Kolesh has many Python starter scripts for the competition with tricks and tweaks.

Oh, okay, thanks a lot!