InstaDeep Enzyme Classification Challenge

Job Interview
Challenge completed almost 5 years ago
Classification
520 joined
70 active
Start: Nov 17, 20
Close: Feb 21, 21
Reveal: Feb 21, 21
deleted_ye3hd4fYkVTnbB9q9SHbHfv8
unlabelled_data
Help · 20 Nov 2020, 03:28 · 4

Good day Zindians, do I necessarily need the unlabelled_data for training?

Can I still get an accurate model without using the unlabelled_data?

Discussion 4 answers

You can do a lot without the unlabelled data for sure. It's shared in case you have a way to make use of it (since this kind of data is very abundant). We're curious to see what happens :) If there were only a few labelled sequences then it would become more useful for language modelling etc. But in this case with something like 1M+ rows in train+test the extra data may well be unnecessary even if you're looking at different ways of doing the pre-training.

20 Nov 2020, 07:56
Upvotes 0
MICADEE
LAHASCOM

But why does the "unlabelled data" contain only a "sequence" column and no "Creature" column? This looks challenging anyway, especially when it comes to working out which type of "creature" each of the sequences in this "unlabelled data" belongs to.

These sequences come from many different organisms - a lot more than the labelled training data. The theory is that you might just want as much varied, unlabelled sequence data as possible to build a language model equivalent (which only needs the sequences themselves to generate input-output pairs for training).
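To illustrate how sequences alone can yield input-output pairs, here is a minimal sketch in plain Python. It is not the organisers' method: the function names, the `#` mask symbol, the 15% mask rate, and the 20-letter amino-acid alphabet are all assumptions made for illustration, and the competition data may use a different alphabet.

```python
import random

# Assumption: sequences use the 20 standard amino-acid letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol, chosen because it is not an amino-acid letter


def make_masked_pair(sequence, mask_rate=0.15, rng=None):
    """Hide random residues; return (masked_sequence, {position: original_residue})."""
    rng = rng or random.Random()
    chars = list(sequence)
    targets = {}
    for i, residue in enumerate(chars):
        if rng.random() < mask_rate:
            targets[i] = residue  # the model would be trained to predict this
            chars[i] = MASK
    return "".join(chars), targets


def make_next_token_pairs(sequence, context=5):
    """Slide a window over the sequence; return (context_window, next_residue) pairs."""
    return [
        (sequence[i - context:i], sequence[i])
        for i in range(context, len(sequence))
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # A random toy sequence standing in for one row of the unlabelled data.
    toy_sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(30))
    masked, targets = make_masked_pair(toy_sequence, rng=rng)
    print(masked, targets)
    print(make_next_token_pairs(toy_sequence)[:3])
```

Either kind of pair could feed a pre-training step whose learned representation is later reused for the labelled classification task. No label column is needed at any point.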

It's very much an optional extra - we're curious to see if and how it gets used by participants. If you're just starting on the competition, I'd recommend focusing on the main task+dataset first, and only coming back to this extra data if inspiration strikes with a good way to make use of it.

MICADEE
LAHASCOM

Hmm... "focusing on the main task+dataset first, and only coming back to this extra data if inspiration strikes with a good way to make use of it." Okay, I get you there @Johnwhitaker. I expected as much. Had the sequences in the unlabelled data been provided along with their corresponding creatures, it would have been easy to fish out creatures 6 & 7, which were not present in the train dataset, from this unlabelled data. Thanks a lot for your quick response and for sharing an alternative idea on this extra data.