Intron AfriSpeech-200 Automatic Speech Recognition Challenge

$5,000 USD
Challenge completed over 2 years ago
Automatic Speech Recognition
430 joined · 41 active
Start: 17 Feb 2023
Close: 28 May 2023
Reveal: 28 May 2023
Isma
EOFError: Compressed file ended before the end-of-stream marker was reached
Data · 21 Feb 2023, 07:28 · 10

Hello, I tried to download the dataset from Hugging Face using the following script:

```python
from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all")
```

and I get an EOFError. Is anyone else having the same issue? I am using Python 3.9.

Discussion · 10 answers

Can you share the full stack trace? At what point are you getting this error?

21 Feb 2023, 12:12
Upvotes 0
Isma

```
Downloading and preparing dataset afri_speech/all to /Users/iseck/.cache/huggingface/datasets/tobiolatunji___afri_speech/all/1.0.0/041d7776b1a6e1fe90f0fdf148e58de8d8fa44fc176977bf3efbc5dcabb9f0c6...
Downloading data files: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 694.48it/s]
Extracting data files:   0%|                                         | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/iseck/Documents/carreer_growth/competitions/afrispeech_200/download_afrispeech200.py", line 3, in <module>
    afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all")
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/load.py", line 1691, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/builder.py", line 605, in download_and_prepare
    self._download_and_prepare(
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/builder.py", line 1104, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/builder.py", line 672, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/Users/iseck/.cache/huggingface/modules/datasets_modules/datasets/tobiolatunji--afrispeech-200/041d7776b1a6e1fe90f0fdf148e58de8d8fa44fc176977bf3efbc5dcabb9f0c6/afrispeech-200.py", line 193, in _split_generators
    local_extracted_archive_paths = dl_manager.extract(archive_paths) if not dl_manager.is_streaming else {}
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 355, in extract
    extracted_paths = map_nested(
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 314, in map_nested
    mapped = [
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 315, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 269, in _single_map_nested
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 269, in <listcomp>
    mapped = [_single_map_nested((function, v, types, None, True, None)) for v in pbar]
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 251, in _single_map_nested
    return function(data_struct)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 262, in cached_path
    output_path = ExtractManager(cache_dir=download_config.cache_dir).extract(
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/extract.py", line 40, in extract
    self.extractor.extract(input_path, output_path, extractor=extractor)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/extract.py", line 179, in extract
    return extractor.extract(input_path, output_path)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/site-packages/datasets/utils/extract.py", line 53, in extract
    tar_file.extractall(output_path)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/tarfile.py", line 2045, in extractall
    self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/tarfile.py", line 2086, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/tarfile.py", line 2159, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/tarfile.py", line 2208, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/tarfile.py", line 247, in copyfileobj
    buf = src.read(bufsize)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/gzip.py", line 300, in read
    return self._buffer.read(size)
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Users/iseck/opt/anaconda3/envs/env_sp/lib/python3.9/gzip.py", line 506, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
```
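[Editor's note] A gzip EOFError in the final frames of a trace like this usually means the cached archive ended mid-stream, i.e. an earlier download was truncated and the corrupt copy keeps being reused. One common fix, sketched here under the assumption that the cache path matches the one in the trace, is to delete the cached dataset directory and re-download (passing `download_mode="force_redownload"` to `load_dataset` is an alternative). The `clear_dataset_cache` helper name is made up for illustration:

```python
import shutil
from pathlib import Path

def clear_dataset_cache(cache_dir: str) -> bool:
    """Remove a cached dataset directory so the next load_dataset call
    starts the download from scratch. Returns True if anything was removed."""
    path = Path(cache_dir).expanduser()
    if path.is_dir():
        shutil.rmtree(path)  # delete the truncated archives and extracted files
        return True
    return False

# Path taken from the trace above; adjust for your own machine:
# clear_dataset_cache("~/.cache/huggingface/datasets/tobiolatunji___afri_speech")
```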

Siwar_NASRI

I think you may not have enough free disk space: downloading the whole "tobiolatunji/afrispeech-200" dataset requires more than 100 GB free.

Try streaming mode by passing `streaming=True` to `load_dataset`; that way you can iterate over the data without downloading it.

21 Feb 2023, 13:40
Upvotes 1
Siwar_NASRI

```python
afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all", streaming=True)
```

https://huggingface.co/docs/datasets/v1.10.1/dataset_streaming.html
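[Editor's note] To flesh out the suggestion above: with `streaming=True`, `load_dataset` returns an `IterableDataset` that yields examples lazily instead of first extracting the archives to disk. Below is a sketch of the take-first-N iteration pattern; the real call (shown in the comment) needs the `datasets` library and network access, so the runnable part uses a stand-in generator, and both `fake_stream` and the `"transcript"` field name are assumptions:

```python
from itertools import islice

# The real streaming call would look like:
#
#     from datasets import load_dataset
#     afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all", streaming=True)
#     for sample in islice(afrispeech["train"], 3):
#         print(sample)
#
# Same pattern with a stand-in generator so this snippet runs offline:
def fake_stream():
    for i in range(1000):
        yield {"transcript": f"utterance {i}"}

# islice stops after 3 items, so the remaining 997 are never produced.
first_three = list(islice(fake_stream(), 3))
print([s["transcript"] for s in first_three])
```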

Isma

I don't think that's the cause; I have plenty of free space.

Siwar_NASRI

Perhaps that's not the reason (I'm trying to help without seeing all the facts), but it may still work for you, so give it a try:

```python
dataset = load_dataset("tobiolatunji/afrispeech-200", "all", streaming=True)
```

Isma

Yes, I really appreciate that you take the time to try and help.

jpandeinge
University of Manchester

Try:

```python
dataset = load_dataset("tobiolatunji/afrispeech-200", "all", use_auth_token=True, streaming=True)
```

21 Feb 2023, 15:19
Upvotes 0

Try the download again; it should work fine now.

21 Feb 2023, 23:14
Upvotes 1