I'm currently unable to load the Common Voice 17.0 dataset (Swahili split) from Hugging Face using the datasets library. Here's the code I'm using:
from datasets import load_dataset
cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train", streaming=True)
print(next(iter(cv_17)))
I also attempted to load it using:
cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train", data_dir="parquet")
But the same error persists.
It seems Hugging Face has dropped support for script-based datasets like common_voice_17_0.py, and this dataset hasn't yet been fully migrated to the newer "data-only" format (e.g., Parquet). As a result, it’s currently not possible to load it via code as expected.
Could you please clarify whether this is the case, and what the recommended workaround is?
Same here, I'm facing this too.
Maybe try pinning the version: !pip install datasets==3.6.0
Still getting the same error.
You can create an access token on Hugging Face (name it anything you like), then use that hf_token to load these datasets.
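For reference, here's a minimal sketch of passing a token explicitly. It assumes you've created an access token in your Hugging Face account settings and exported it as an environment variable (HF_TOKEN here); the helper name is my own:

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read a Hugging Face access token from the environment, failing loudly."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set {env_var} to your Hugging Face access token")
    return token

# Pass the token explicitly when loading a gated dataset, e.g.:
# cv_17 = load_dataset("mozilla-foundation/common_voice_17_0", "sw",
#                      split="train", token=get_hf_token())
```

Note that Common Voice is a gated dataset, so the token only helps after you've accepted the terms on the dataset page.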
Not working; I have tried many times but nothing changed.
I'm trying with an hf_token but it doesn't work. I think the solution is to download the files manually from Mozilla, or to wait for instructions.
Yeah, sure, let's wait for instructions.
Please update your datasets library, and it will work just fine!
Okay
https://github.com/Sartify/Swahili-Challenge-Competition---Pan-African-Wide-Alignment-PAWA-ASR
Try
!pip install -U datasets
Okay, let me try.
I get an error, but the pasted traceback is truncated (the notebook collapsed it to "3 frames").
Please do as follows:
1st step:
!pip uninstall datasets -y
2nd step:
!pip install datasets==3.5.1
Please let me know @AshaNasri
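If you're unsure whether your environment still has a too-new version, a small stdlib-only check like this can fail fast before you try loading anything. A sketch under this thread's assumption that 3.5.1 is the last version that still loads script-based datasets; the helper name is my own, and it only handles plain X.Y.Z version strings:

```python
def needs_downgrade(installed: str, last_known_good: str = "3.5.1") -> bool:
    """True if the installed `datasets` version is newer than the last
    version reported in this thread to still load script-based datasets."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(installed) > parse(last_known_good)

# Example usage in a notebook:
# import datasets
# if needs_downgrade(datasets.__version__):
#     print("Downgrade first: pip install datasets==3.5.1")
```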
Complete code that I tested and that worked:
!pip uninstall datasets -y
!pip install datasets==3.5.1

from datasets import load_dataset, Features, Value, Audio, DownloadConfig
import soundfile as sf

download_config = DownloadConfig(force_download=True)

# 1. Define a schema that matches the *actual* Arrow types in CV-17
cv_sw_features = Features({
    "client_id": Value("string"),
    "path": Value("string"),
    "sentence_id": Value("string"),
    "sentence": Value("string"),
    "sentence_domain": Value("string"),
    "up_votes": Value("string"),    # <- string, not int64
    "down_votes": Value("string"),  # <- string, not int64
    "age": Value("string"),
    "gender": Value("string"),
    "variant": Value("string"),
    "locale": Value("string"),
    "segment": Value("string"),
    "accent": Value("string"),
    # keep audio decoded so we get "array" + "sampling_rate"
    "audio": Audio(sampling_rate=48_000, mono=True, decode=True),
})

# 2. Stream the train split with that schema
cv_17 = load_dataset(
    "mozilla-foundation/common_voice_17_0",
    "sw",
    split="train",
    streaming=True,
    features=cv_sw_features,          # <- custom schema solves the CastError
    download_config=download_config,  # <- force fresh downloads of the metadata
)

# 3. Grab the first row
first_row = next(iter(cv_17))

# 4. Save the audio clip
audio_array = first_row["audio"]["array"]
sr = first_row["audio"]["sampling_rate"]
sf.write("first_row_audio.wav", audio_array, sr)

print("Saved:", first_row["sentence"])
print("→ first_row_audio.wav | shape:", audio_array.shape, " sr:", sr)
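One side note on the schema above: since up_votes/down_votes stream as strings, you'll want to cast them before any numeric filtering, and the decoded audio makes duration checks easy. A small sketch with a synthetic row shaped like a CV-17 streaming row (the clip_stats helper is my own; in real rows the audio "array" is a NumPy array, a plain list is used here only for illustration):

```python
def clip_stats(row: dict) -> dict:
    """Cast the string vote counts to ints and compute clip duration in seconds."""
    audio = row["audio"]
    return {
        "up_votes": int(row["up_votes"]),
        "down_votes": int(row["down_votes"]),
        "duration_s": len(audio["array"]) / audio["sampling_rate"],
    }

# Synthetic example: one second of silence at 48 kHz
row = {
    "up_votes": "2",
    "down_votes": "0",
    "audio": {"array": [0.0] * 48_000, "sampling_rate": 48_000},
}
print(clip_stats(row))  # {'up_votes': 2, 'down_votes': 0, 'duration_s': 1.0}
```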
Okay, I will update you when I'm finished.
This works, thank you very much!
More can be found at:
https://github.com/Sartify/Swahili-Challenge-Competition---Pan-African-Wide-Alignment-PAWA-ASR
Also, Kolesh has many starter Python scripts for the competition, with tricks and tweaks.
Oh, okay, thanks so much!