🧬 Join the Buzz: unlabelled_data

InstaDeep Enzyme Classification Challenge

Job Interview

Completed (almost 5 years ago)


Start

Nov 17, 20

Feb 21, 21

Reveal

Feb 21, 21

deleted_ye3hd4fYkVTnbB9q9SHbHfv8

unlabelled_data

Help · 20 Nov 2020, 03:28 · 4

Good day Zindians, do I necessarily need the unlabelled_data for training?

Can I still get an accurate model without using the unlabelled_data?

Discussion 4 answers

Johnowhitaker

You can do a lot without the unlabelled data for sure. It's shared in case you have a way to make use of it (since this kind of data is very abundant). We're curious to see what happens :) If there were only a few labelled sequences then it would become more useful for language modelling etc. But in this case with something like 1M+ rows in train+test the extra data may well be unnecessary even if you're looking at different ways of doing the pre-training.

20 Nov 2020, 07:56

MICADEE

LAHASCOM

But why does the "unlabelled data" contain only a "sequence" column and no "Creature" column? This looks challenging anyway, especially when it comes to fishing out which type of "creature" each of these sequences belongs to.

replied to Johnowhitaker · 5 Dec 2020, 10:20

Johnowhitaker

These sequences come from many different organisms - a lot more than the labelled training data. The theory is that you might just want as much varied, unlabelled sequence data as possible to build a language model equivalent (which only needs the sequences themselves to generate input-output pairs for training).

It's very much an optional extra - we're curious to see if and how it gets used by participants. If you're just starting on the competition, I'd recommend focusing on the main task+dataset first, and only coming back to this extra data if inspiration strikes with a good way to make use of it.
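The point about sequences alone being enough to generate input-output pairs can be sketched roughly as below. This is only an illustration, not the organisers' or any participant's method: the function names, the `context_size` value, the `?` mask token, and the toy sequences are all assumptions made up for the example.

```python
from typing import List, Tuple


def next_token_pairs(sequence: str, context_size: int = 3) -> List[Tuple[str, str]]:
    """Slide a window over one sequence and emit (context, next residue) pairs."""
    pairs = []
    for i in range(context_size, len(sequence)):
        # Input: the previous `context_size` residues. Target: the residue that follows.
        pairs.append((sequence[i - context_size:i], sequence[i]))
    return pairs


def masked_pairs(sequence: str, mask_token: str = "?") -> List[Tuple[str, str]]:
    """Hide one position at a time; the hidden residue becomes the target."""
    return [
        (sequence[:i] + mask_token + sequence[i + 1:], sequence[i])
        for i in range(len(sequence))
    ]


# Hypothetical toy sequences standing in for rows of the unlabelled file.
unlabelled = ["MKVLA", "GAVL"]

# No "Creature" or class label is needed: every pair comes from the sequence itself.
training_pairs = [pair for seq in unlabelled for pair in next_token_pairs(seq)]
```

Either kind of pair could feed a pre-training step before fine-tuning on the labelled data, which is why the missing label columns are not a problem for this use.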

replied to MICADEE · 5 Dec 2020, 12:20

MICADEE

LAHASCOM

Hmmm.... "focusing on the main task+dataset first, and only coming back to this extra data if inspiration strikes with a good way to make use of it." Okay, I got you there @Johnowhitaker. Yeah, I expected as much. If the sequences in the unlabelled data had been provided along with their corresponding creatures, it would have been easy to fish out creatures 6 & 7, which were not present in the train dataset, from this unlabelled data. Thanks a lot for your quick response and for sharing an alternative idea on this extra data.

replied to Johnowhitaker · 5 Dec 2020, 18:47
