🐼 Let's Talk About: Train csv and extra csv

Turtle Recall: Conservation Challenge

Helping Kenya

$10 000 USD

Completed (~4 years ago)

Skills you will learn

Classification

Computer Vision

Start

Nov 19, 21

Apr 21, 22

Reveal

Apr 21, 22

polymathAB

Train csv and extra csv

Data · 11 Feb 2022, 15:46 · 6

There are 100 unique IDs in the train csv, while extra + train together have close to 2.2k unique IDs. I don't understand why the extra csv isn't simply included in the train csv. Is it because the image location for those images isn't one of top, left or right, or is there some other reason? I haven't inspected the extra images myself to note any differences between them and the train images, and I couldn't find any such description in the info section or the tutorial.

More importantly, since extra contains more IDs, these IDs must be registered in the DeepMind (or whichever organisation holds the data) database, i.e. the test images could also have these IDs, so using extra should intuitively increase accuracy.

Discussion 6 answers

astenuz

Maybe the reason was to keep the test set to a very constrained subset of turtles

11 Feb 2022, 15:53

Upvotes 0

astenuz

Following that, it might be that in extra the ids are there but these have not been verified

11 Feb 2022, 15:55

Upvotes 0

AnneDeepMind

Hello, it’s Anne from DeepMind here. Just to clarify - the test set contains 101 labels, which are the 100 turtle IDs from the training set plus an extra ‘new turtle’ label which is used to label any other turtle.

The ‘extra’ dataset can certainly be used in this challenge - we just chose not to use it in the tutorial to keep things as simple as possible. As you’ve noticed, the extra set contains more turtles, and has not been labeled with the image location (top/left/right). 639 of the images in the extra set are additional images of some of the training set turtles.

Here are a couple of initial ideas as to how you could think about using this extra set:

- Use the 639 extra images of the training set turtles to increase the amount of data available for training

- Use all of the extra turtles to try to combat overfitting and improve generalisation

- Use it as a validation set to measure overfitting
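As a rough illustration of the ideas above, here is a minimal pandas sketch that separates the extra set into images of training-set turtles and images of other turtles. The toy frames stand in for train.csv and extra_images.csv, and the column names (`image_id`, `image_location`, `turtle_id`) are assumptions, not something confirmed in this thread.

```python
import pandas as pd

# Toy stand-ins for train.csv and extra_images.csv.
# Column names are assumed for illustration only.
train = pd.DataFrame({
    "image_id": ["a1", "a2", "a3"],
    "image_location": ["top", "left", "right"],
    "turtle_id": ["t_001", "t_001", "t_002"],
})
extra = pd.DataFrame({
    "image_id": ["b1", "b2", "b3", "b4"],
    "turtle_id": ["t_001", "t_900", "t_901", "t_002"],
})

train_ids = set(train["turtle_id"])

# True for extra images whose turtle also appears in the training set.
known_mask = extra["turtle_id"].isin(train_ids)
extra_known = extra[known_mask]

# Idea 1: add the extra images of training turtles to the training data.
# image_location is missing for these rows, so it becomes NaN.
train_plus = pd.concat([train, extra_known], ignore_index=True)

# Ideas 2 and 3: keep the remaining turtles aside, labelled "new_turtle",
# for regularisation or as a held-out validation set.
extra_other = extra[~known_mask].assign(label="new_turtle")
```

With the real files, `train` and `extra` would presumably come from `pd.read_csv`, and the missing image location would need handling if a model uses it as an input.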

I hope that helps!

11 Feb 2022, 17:12

Upvotes 0

astenuz

Thanks, was just wondering whether the test set might include different turtles

replied to AnneDeepMind · 11 Feb 2022, 17:22

Upvotes 0

polymathAB

I see thanks for the clarification!

So essentially, if we use extra_images for training and any of a test image's top 5 predictions is one of the extra IDs, we just replace it with "new_turtle".

I'm curious: are these extra turtles not in your "validated" database, or have the train and test sets been specifically chosen to cover these 100 IDs for some reason (e.g. they belong to a specific targeted species or region)?

replied to AnneDeepMind · 11 Feb 2022, 17:41 (edited 3 minutes later)
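A small sketch of that replacement step, under a few assumptions: the model returns a ranked list of predicted IDs longer than five, and repeated "new_turtle" entries are collapsed so the top 5 stays free of duplicates. The function name and the example IDs are hypothetical.

```python
def top5_with_new_turtle(ranked_ids, train_ids, k=5):
    """Map any ID outside the training set to "new_turtle", then keep
    the first k distinct labels in ranked order."""
    out = []
    for tid in ranked_ids:
        label = tid if tid in train_ids else "new_turtle"
        if label not in out:  # skip repeats, e.g. a second extra ID
            out.append(label)
        if len(out) == k:
            break
    return out


# Hypothetical IDs: t_9xx stand for turtles that only occur in the extra set.
train_ids = {"t_001", "t_002", "t_003", "t_004", "t_005"}
ranked = ["t_001", "t_900", "t_002", "t_901", "t_003", "t_004", "t_005"]
top5 = top5_with_new_turtle(ranked, train_ids)
```

Collapsing duplicates is a design choice here; it avoids spending several of the five slots on the same "new_turtle" label.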

Upvotes 0

flamethrower

Hello @AnneDeepMind, @Zindi, @amyflorida626

I just wanted to confirm: I noticed there are about 598 images that are not in extra_images.csv but are provided as part of the Images data available for loading. Although these are unlabelled, are we allowed to explore using them? That would make all 13891 images in total available for this challenge. Even though Images is provided as part of the Data section, I'm a bit unclear whether this is allowed.

Thank you for your response.
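For anyone wanting to reproduce a count like this, one possible sketch is a set difference between the image files on disk and the image IDs listed in the CSVs. The paths and IDs below are made up, and it is an assumption that each image ID equals the file name without its extension.

```python
from pathlib import Path


def unlabelled_image_ids(image_paths, labelled_ids):
    """Return IDs of images present on disk but absent from the CSVs.

    Assumes an image ID is the file name without its extension.
    """
    on_disk = {Path(p).stem for p in image_paths}
    return sorted(on_disk - set(labelled_ids))


# Made-up example: three files on disk, two of them listed in a CSV.
image_paths = ["images/ID_A.JPG", "images/ID_B.JPG", "images/ID_C.JPG"]
labelled_ids = ["ID_A", "ID_C"]
unlabelled = unlabelled_image_ids(image_paths, labelled_ids)
```

In practice `image_paths` could come from listing the Images folder, and `labelled_ids` from the union of the image ID columns in whichever CSVs are being compared.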

replied to AnneDeepMind · 21 Mar 2022, 21:01 (edited ~7 hours later)

Upvotes 0
