# 1 The architecture
First, the challenge is in English, so you can use existing English models. Here is an idea borrowed from Modular Deep Learning: insert a module (a neural net of your choice; personally I would use a few transformer layers) between the sound extraction part and the transformer part of the model. Training only the added module sounds like a good idea, since it is compute- and time-efficient.
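To make the idea concrete, here is a minimal PyTorch sketch, not a definitive implementation. The names `AdapterBlock` and `SpeechModelWithAdapter` are hypothetical, and the two `nn.Linear` layers are only stand-ins for the pretrained sound extractor and transformer, whose real shapes depend on the model you pick.

```python
import torch
from torch import nn


class AdapterBlock(nn.Module):
    """Hypothetical trainable module: a small stack of transformer layers."""

    def __init__(self, dim, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # Residual connection: the pretrained features still pass through.
        return x + self.encoder(x)


class SpeechModelWithAdapter(nn.Module):
    """Pretrained extractor -> new adapter -> pretrained transformer."""

    def __init__(self, feature_extractor, transformer, dim):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.adapter = AdapterBlock(dim)
        self.transformer = transformer
        # Freeze the pretrained parts so only the adapter is trained.
        for part in (self.feature_extractor, self.transformer):
            for p in part.parameters():
                p.requires_grad = False

    def forward(self, x):
        return self.transformer(self.adapter(self.feature_extractor(x)))


# Stand-ins (assumptions): 80 input features, 32 hidden units, 30 output tokens.
dim = 32
model = SpeechModelWithAdapter(nn.Linear(80, dim), nn.Linear(dim, 30), dim)

# Only the adapter parameters are handed to the optimizer.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

out = model(torch.randn(2, 50, 80))  # (batch, frames, features)
```

Freezing the pretrained parts means gradients are only stored for the adapter, which is where the compute and time savings come from.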
# 2 Dealing with numerical data (and more)
You can use the num2words package (pip install num2words) to convert the numbers in the labels to words (add a tag that tells where the words that represent a number start and end...). Train your model (only the module you added) and make your predictions with letters only. Then use the word2num package to convert the words back to numbers.
You can use the same idea to convert "." into "full stop" for model training and back to "." after the prediction.
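The round trip described above could be sketched as follows. This is only an illustration under assumptions: the `<num>` / `</num>` tags, the function names, and the tiny lookup tables are hypothetical, and it handles plain integers only (no decimals). In practice you would pass real converters instead of the lookup tables, for example `num2words.num2words` for encoding and `w2n.word_to_num` from the word2number package for decoding.

```python
import re

NUM_START, NUM_END = "<num>", "</num>"  # hypothetical tag tokens


def encode_label(text, to_words):
    """Pre-processing: spell out integers (with tags) and full stops."""

    def spell(match):
        return f"{NUM_START} {to_words(int(match.group()))} {NUM_END}"

    text = re.sub(r"\d+", spell, text)
    # Replace each "." (and any space before it) with the words "full stop".
    return re.sub(r"\s*\.", " full stop", text)


def decode_prediction(text, to_number):
    """Post-processing: turn tagged words and 'full stop' back into symbols."""

    def unspell(match):
        return str(to_number(match.group(1)))

    pattern = rf"{re.escape(NUM_START)} (.*?) {re.escape(NUM_END)}"
    text = re.sub(pattern, unspell, text)
    return re.sub(r" full stop\b", ".", text)


# Tiny stand-in converters so the sketch runs without extra packages.
WORDS = {2: "two", 15: "fifteen"}
NUMBERS = {word: number for number, word in WORDS.items()}

label = "I have 2 cats."
encoded = encode_label(label, WORDS.__getitem__)
decoded = decode_prediction(encoded, NUMBERS.__getitem__)
print(encoded)
print(decoded)
```

The tags make the decoding step unambiguous: only words between the tags are converted back, so an ordinary "one" in a sentence is left alone.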
Pre-processing and Post-processing are all you need.
That's really great.