You can do a lot without the unlabelled data for sure. It's shared in case you have a way to make use of it (since this kind of data is very abundant). We're curious to see what happens :) If there were only a few labelled sequences, the unlabelled data would become more useful for language modelling etc. But in this case, with something like 1M+ rows in train+test, the extra data may well be unnecessary even if you're looking at different ways of doing the pre-training.
But why does the "unlabelled data" contain only a "sequence" column and no "Creature" column? This looks challenging anyway, especially when it comes to working out which "creature" each of the sequences in this "unlabelled data" belongs to.
These sequences come from many different organisms - a lot more than the labelled training data. The theory is that you might just want as much varied, unlabelled sequence data as possible to build a language model equivalent (which only needs the sequences themselves to generate input-output pairs for training).
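To make the "input-output pairs from the sequences themselves" idea concrete, here is a minimal sketch of one common way to do it: hide a few characters of each sequence and ask a model to predict them back. This is only an illustration, not the organisers' method. The `make_masked_pair` helper, the `#` mask symbol, the 15% masking rate and the toy sequences are all assumptions made up for the example.

```python
import random

MASK = "#"  # hypothetical mask symbol; pick one that never occurs in the real sequences


def make_masked_pair(sequence, mask_prob=0.15, rng=None):
    """Build one (masked_input, targets) training pair from an unlabelled sequence.

    targets maps each masked position to the character that was hidden there.
    """
    if rng is None:
        rng = random.Random()
    chars = list(sequence)
    targets = {}
    for i, ch in enumerate(chars):
        if rng.random() < mask_prob:
            targets[i] = ch
            chars[i] = MASK
    if not targets and chars:
        # Make sure every non-empty sequence contributes at least one target.
        i = rng.randrange(len(chars))
        targets[i] = chars[i]
        chars[i] = MASK
    return "".join(chars), targets


# Toy sequences invented for illustration, not taken from the competition data.
unlabelled = ["MKTAYIAKQR", "GAVLIPFMW"]
rng = random.Random(0)  # fixed seed so the masking is repeatable
pairs = [make_masked_pair(seq, rng=rng) for seq in unlabelled]
for masked, targets in pairs:
    print(masked, targets)
```

No "Creature" label is needed at any point: the sequence provides both the input (the masked string) and the output (the hidden characters), which is why a file with only a "sequence" column is enough for this kind of pre-training.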
It's very much an optional extra - we're curious to see if and how it gets used by participants. If you're just starting on the competition, I'd recommend focusing on the main task+dataset first, and only coming back to this extra data if inspiration strikes with a good way to make use of it.
Hmmm... "focusing on the main task+dataset first, and only coming back to this extra data if inspiration strikes with a good way to make use of it." Okay, I got you there @Johnwhitaker. Yeah, I expected a response along those lines. If the sequences in the unlabelled data had been provided along with their corresponding creatures, it would have been easy to fish out creatures 6 & 7, which are not present in the train dataset, from this unlabelled data. Thanks a lot for your quick response and for sharing an alternative idea on this extra data.