There are 100 unique IDs in the train CSV, while extra and train together have close to 2.2k unique IDs. I don't understand why the extra CSV isn't simply included in the train CSV. Is it because the image location of those images isn't top, left or right, or is there some other reason? I haven't inspected the extra images myself to note any differences between them and the train images, and I couldn't find any such description in the info section or the tutorial.
More importantly, since extra contains more IDs, these IDs must be registered in the DeepMind (or whichever organisation holds the data) database, i.e. the test images could also have these IDs, so using extra should intuitively increase accuracy.
Maybe the reason was to keep the test set to a very constrained subset of turtles.
Following that, it might be that the IDs in extra exist but have not been verified.
Hello, it’s Anne from DeepMind here. Just to clarify - the test set contains 101 labels, which are the 100 turtle IDs from the training set plus an extra ‘new turtle’ label which is used to label any other turtle.
The ‘extra’ dataset can certainly be used in this challenge - we just chose not to use it in the tutorial to keep things as simple as possible. As you’ve noticed, the extra set contains more turtles, and has not been labeled with the image location (top/left/right). 639 of the images in the extra set are additional images of some of the training set turtles.
Here are a couple of initial ideas as to how you could think about using this extra set:
- Use the 639 extra images of the training set turtles to increase the amount of data available for training
- Use all of the extra turtles to try to combat overfitting and improve generalisation
- Use it as a validation set to measure overfitting
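As a rough sketch of the first idea, the extra rows could be split into turtles that already appear in the training set and turtles that do not. The column names `image_id` and `turtle_id` and the inline rows are assumptions for illustration; the real CSV files may be laid out differently.

```python
# Hypothetical rows standing in for the train and extra CSV files;
# the column names are assumed, not taken from the competition data.
train_rows = [
    {"image_id": "img_a", "turtle_id": "t_001"},
    {"image_id": "img_b", "turtle_id": "t_002"},
]
extra_rows = [
    {"image_id": "img_c", "turtle_id": "t_001"},  # turtle also in train
    {"image_id": "img_d", "turtle_id": "t_900"},  # turtle only in extra
]


def split_extra(train_rows, extra_rows):
    """Split extra rows into known-turtle rows and other-turtle rows."""
    train_ids = {row["turtle_id"] for row in train_rows}
    known = [r for r in extra_rows if r["turtle_id"] in train_ids]
    other = [r for r in extra_rows if r["turtle_id"] not in train_ids]
    return known, other


known, other = split_extra(train_rows, extra_rows)
# Idea 1: add the extra images of training-set turtles to the training data.
augmented_train = train_rows + known
```

The `other` rows are what would remain for the second and third ideas.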
I hope that helps!
Thanks, I was just wondering whether the test set might include different turtles.
I see, thanks for the clarification!
So essentially, if we use extra_images for training and one of the top 5 predictions for a test image is an extra ID, we just replace it with "new_turtle".
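A minimal sketch of that replacement step, assuming predictions arrive as a ranked list of ID strings. The function name `map_to_known`, the example IDs, and the choice to drop repeated labels are all assumptions rather than anything specified by the challenge.

```python
def map_to_known(top5, train_ids, new_label="new_turtle"):
    """Replace IDs outside train_ids with new_label, keeping rank order."""
    result = []
    for turtle_id in top5:
        label = turtle_id if turtle_id in train_ids else new_label
        if label not in result:  # assumption: keep each label only once
            result.append(label)
    return result


train_ids = {"t_001", "t_002", "t_003"}  # hypothetical training IDs
preds = ["t_900", "t_001", "t_901", "t_002", "t_003"]  # hypothetical top 5
mapped = map_to_known(preds, train_ids)
```

Because several extra IDs can collapse into a single "new_turtle" entry, the mapped list may end up with fewer than five predictions; how to fill the freed slots is left open here.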
I'm curious: are these extra turtles not in your "validated" database, or have the train and test sets been specifically chosen to cater to these 100 IDs for some reason (e.g. they belong to a specifically targeted species or region)?
Hello @AnneDeepMind, @Zindi, @amyflorida626
I just wanted to confirm: I noticed there are about 598 images that are not in extra_images.csv but are provided as part of the Images data available for loading. Although these are unlabelled, are we allowed to explore using them? That would make 13891 images in total for this challenge. Even though Images is provided as part of the Data section, I'm a bit unclear on whether this is allowed.
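For reference, a count like this can be reproduced with a set difference between the image file names and the IDs listed in the CSV files. This is only a sketch: it assumes each file stem equals the `image_id` used in the CSVs, and the file names below are made up.

```python
from pathlib import PurePath


def unlabelled_images(image_files, labelled_ids):
    """Return image IDs (file stems) that appear in no CSV, sorted."""
    stems = {PurePath(name).stem for name in image_files}
    return sorted(stems - set(labelled_ids))


# Hypothetical inputs; in practice these would come from listing the
# Images folder and reading the image ID column of each CSV.
image_files = ["img_a.JPG", "img_b.JPG", "img_z.JPG"]
labelled_ids = ["img_a", "img_b"]
missing = unlabelled_images(image_files, labelled_ids)
```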
Thank you for your response.