Dear organisation team & fellow data scientists,
I have noticed that there are some special characters in the training set.
Among those, some are inherent to the language, e.g. - or ', but others are artefacts of the human translators, e.g. ( or ) or "
1) Will these be removed from the training and test sets in the future?
2) If not, I would assume that the training and test sets are drawn from the same distribution, meaning that we can expect similar special characters in the test set as in the training set.
Additionally, I have noted that the audio data in the test set with ID=e3a74a8998f03c320f5a4923272247485832b1cd803528f5eb5a50aef3d29a78b436b3ea37c47763e9b9be8b3ee53435b51d3466345217ce5d6fcb9b48a53c63
Thanks a lot for setting up this very interesting challenge!
Hi Roman18, you are correct: there are some special characters and letters. These are called diacritics and are important for how the language is spoken. Regarding regular punctuation, the distribution is the same across the test and train sets.
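
For anyone who wants to check which special characters actually occur in the training transcripts, here is a minimal sketch. It is not the organisers' method. The function name `count_special_characters` and the sample transcripts are hypothetical, and it assumes the transcripts are available as a list of Python strings. It uses Unicode categories from the standard `unicodedata` module to separate punctuation and symbols from letters that carry diacritics.

```python
import unicodedata
from collections import Counter


def count_special_characters(transcripts):
    """Tally punctuation/symbols and diacritic characters separately.

    `transcripts` is assumed to be an iterable of strings (hypothetical input).
    """
    punctuation = Counter()
    diacritics = Counter()
    for text in transcripts:
        for ch in text:
            category = unicodedata.category(ch)
            if category[0] in ("P", "S"):
                # Punctuation (P*) and symbols (S*), e.g. - ' ( ) "
                punctuation[ch] += 1
            elif category[0] == "M":
                # Standalone combining marks
                diacritics[ch] += 1
            elif category[0] == "L" and unicodedata.normalize("NFD", ch) != ch:
                # Precomposed letters that decompose into base letter + mark
                diacritics[ch] += 1
    return punctuation, diacritics


# Hypothetical sample transcripts, for illustration only
sample = ["a-b (c)", "\u00e9 \"x\" y'z"]
punct_counts, diacritic_counts = count_special_characters(sample)
print(punct_counts.most_common())
print(diacritic_counts.most_common())
```

Running the same tally on the training and test transcripts and comparing the two counters gives a rough view of whether the punctuation distribution matches. Note that the NFD check only catches letters with a canonical decomposition, so language-specific letters without one will not be counted as diacritics.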