The data contains phrases from a movie. All special characters and spaces have been removed.
There are ~56 000 phrases in the training set and ~2 500 in the test set.
The training dataset has 3 columns:
- Plain_text: The original text in the transcript, with special characters and spaces removed
- encrypted_text: The original text with special characters and spaces replaced by X, then encrypted using the Enigma machine
- encryption_key: The encrypted message key used to encrypt and decrypt the phrase
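
The difference between the two text columns before encryption can be sketched as follows. This is only an illustration: the function names are hypothetical, and it assumes that "special characters and spaces" means every non-letter character, that each such character becomes exactly one X, and that text is upper-cased. None of these details are confirmed by the description above, and the Enigma encryption step itself is not shown.

```python
import re

def to_plain_text(phrase):
    # Assumption: drop every character that is not a letter, then upper-case.
    return re.sub(r"[^A-Za-z]", "", phrase).upper()

def to_pre_encryption_text(phrase):
    # Assumption: replace every non-letter character with a single X,
    # then upper-case. The result would then be fed to the Enigma machine.
    return re.sub(r"[^A-Za-z]", "X", phrase).upper()

phrase = "Hello, world!"
print(to_plain_text(phrase))           # HELLOWORLD
print(to_pre_encryption_text(phrase))  # HELLOXXWORLDX
```

Under these assumptions, the text that gets encrypted is longer than Plain_text whenever the original phrase contains spaces or punctuation.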
For evaluation we one-hot encode the test set and score submissions with log loss. We therefore ask you to submit the raw probabilities output by your model: for a given ID, each row represents a position in the sequence and each column holds the probability of one label (token).
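
As a rough sketch of how such a score could be computed for one phrase, the snippet below one-hot encodes a target string and computes log loss against a probability matrix with one row per position and one column per label. The 26-letter label set, the function names, and the averaging over positions are assumptions made for illustration; the actual label set and scoring code are not given here.

```python
import numpy as np

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # assumed label set

def one_hot(text, alphabet=ALPHABET):
    # One row per position in the sequence, one column per label.
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = np.zeros((len(text), len(alphabet)))
    for pos, ch in enumerate(text):
        encoded[pos, index[ch]] = 1.0
    return encoded

def log_loss(y_true, y_prob, eps=1e-15):
    # Clip to avoid log(0), then average the per-position cross-entropy.
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

target = one_hot("HELLO")

# A model that knows nothing assigns equal probability to every label.
uniform = np.full(target.shape, 1.0 / len(ALPHABET))

print(target.shape)               # (5, 26)
print(log_loss(target, uniform))  # equals ln(26) for a uniform guess
```

A uniform guess gives a loss of ln(26) under this setup, which can serve as a baseline: any useful model should score below it.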