I encounter this error every time I attempt to train the Whisper small model on the training data. The training runs until about 0.18 epochs, then fails with:
RuntimeError: The size of tensor a (474) must match the size of tensor b (448) at non-singleton dimension 1
Hi,
This error occurs because Whisper doesn't "accept" (by default) any audio input longer than 30 s, and its decoder is limited to 448 label tokens, which is where the 448 in the error comes from (your label sequence is 474 tokens long).
A simple fix is to discard any example whose audio is longer than 30 s or whose transcript tokenizes to more than 448 tokens.
# Whisper max audio length
MAX_DURATION_IN_SECONDS = 30.0
MAX_INPUT_LENGTH = MAX_DURATION_IN_SECONDS * 16000  # samples at 16 kHz
MAX_LABEL_LENGTH = 448  # Whisper decoder's maximum label length

# Add these functions to your code
# (assumes `feature_extractor` and `tokenizer` are already loaded, e.g. WhisperFeatureExtractor / WhisperTokenizer)
def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["transcript"]).input_ids
    batch["input_length"] = len(audio["array"])  # number of audio samples
    batch["labels_length"] = len(tokenizer(batch["transcript"], add_special_tokens=False).input_ids)
    return batch

def filter_inputs(input_length):
    """Filter inputs with zero input length or longer than 30s"""
    return 0 < input_length < MAX_INPUT_LENGTH

def filter_labels(labels_length):
    """Filter label sequences longer than max length (448)"""
    return labels_length < MAX_LABEL_LENGTH

...
afrispeech = afrispeech.map(prepare_dataset)
afrispeech = afrispeech.filter(filter_inputs, input_columns=["input_length"])
afrispeech = afrispeech.filter(filter_labels, input_columns=["labels_length"])
afrispeech = afrispeech.remove_columns(['labels_length', 'input_length'])
...
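For context, the 448 limit comes from the decoder's positional embeddings (max_target_positions in the model config), and the 30 s limit from the feature extractor's fixed 30-second log-mel window. If you want to check how many examples the filters above would drop before committing to them, a minimal sketch along these lines should work (assuming, as in the snippet above, a dataset named afrispeech with a "train" split and "audio"/"transcript" columns, and the openai/whisper-small checkpoint):

from transformers import WhisperConfig, WhisperTokenizer

MAX_INPUT_LENGTH = 30.0 * 16000   # 30 s at 16 kHz
MAX_LABEL_LENGTH = 448

# 448 is the decoder's positional-embedding limit -> the "tensor b (448)" in the error
config = WhisperConfig.from_pretrained("openai/whisper-small")
print(config.max_target_positions)  # 448

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

def exceeds_limits(example):
    # True if the audio is >= 30 s or the transcript tokenizes to >= 448 tokens
    n_samples = len(example["audio"]["array"])
    n_tokens = len(tokenizer(example["transcript"]).input_ids)
    return n_samples >= MAX_INPUT_LENGTH or n_tokens >= MAX_LABEL_LENGTH

n_dropped = sum(exceeds_limits(ex) for ex in afrispeech["train"])
print(f"{n_dropped} of {len(afrispeech['train'])} training examples would be filtered out")

If only a handful of examples exceed the limits, dropping them is usually the simplest option; otherwise you may want to look into chunking the long audio instead.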
Thank you so much Muhamed. I am currently trying out your fix, hope to report some good progress soon.