This guide shows how to get started with audio-to-text conversion using the SeamlessM4T model. It is based on the Sartify ITU test dataset on the Zindi platform and targets a score of approximately 0.48. It walks step by step through preparing the environment, processing the audio data, running the Automatic Speech Recognition (ASR) model for Swahili (swh), and creating the final output file for submission.
Readers will learn how to install the required libraries (fairseq2, pydub, sentencepiece, and seamless_communication), load the data, preprocess the audio files, and run the transcription with the medium version of the SeamlessM4T model together with vocoder_36langs. The guide also shows how to process multiple audio files in batches to improve throughput, and how to post-process the results into a complete CSV file ready for submission on Zindi.
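The batching and CSV steps mentioned above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the guide's actual code: the helper names (`batched`, `transcribe_in_batches`, `write_submission`), the batch size, and the `filename`/`text` column names are all assumptions, and `fake_transcribe_batch` is a stand-in for the real SeamlessM4T call, whose exact API is not shown here.

```python
import csv
from typing import Callable, Iterator, List, Sequence, TextIO, Tuple


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def transcribe_in_batches(
    paths: Sequence[str],
    transcribe_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumed value; tune to the available memory
) -> List[Tuple[str, str]]:
    """Run `transcribe_batch` on each batch and pair every path with its text."""
    rows: List[Tuple[str, str]] = []
    for batch in batched(paths, batch_size):
        texts = transcribe_batch(batch)
        if len(texts) != len(batch):
            raise ValueError("transcribe_batch must return one text per input path")
        rows.extend(zip(batch, texts))
    return rows


def write_submission(rows: Sequence[Tuple[str, str]], out_file: TextIO) -> None:
    """Write (path, text) pairs as CSV, collapsing stray whitespace in the text."""
    writer = csv.writer(out_file)
    writer.writerow(["filename", "text"])  # hypothetical column names
    for path, text in rows:
        writer.writerow([path, " ".join(text.split())])


def fake_transcribe_batch(batch: List[str]) -> List[str]:
    """Placeholder for the real ASR call; returns dummy text for each path."""
    return [f"transcript of {path}" for path in batch]
```

To try it, call `transcribe_in_batches(paths, fake_transcribe_batch, batch_size=2)` on a list of file names and pass the result to `write_submission` together with a file opened using `open("submission.csv", "w", newline="", encoding="utf-8")`. In a real run, the placeholder would be replaced by a function that runs the model on each file in the batch, and the column names would be changed to match the competition's sample submission file.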