The task seems challenging and interesting to me but I have a feeling that the dataset size is too small (only 616 samples) to be suitable for modern NLP models.
Yeah.
I had the same reaction the first time I opened the competition. I also think the number of submissions per day is too low! It would be good for us if it were increased.
The idea behind reducing submissions per day was to encourage people to develop better cross-validation strategies rather than relying on the public LB, which easily leads to overfitting. See https://zindi.africa/discussions/539
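For anyone unsure what a local cross-validation setup might look like on a dataset this small, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the helper names `stratified_kfold` and `cross_val_accuracy` are made up, the toy texts and labels are stand-ins for the real data, and accuracy is only a placeholder for whatever metric the competition actually uses.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with roughly equal class balance per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's samples round-robin so every fold gets its share.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


def cross_val_accuracy(texts, labels, fit, predict, k=5, seed=0):
    """Average validation accuracy of a (fit, predict) pair over k folds."""
    scores = []
    for train, val in stratified_kfold(labels, k, seed):
        model = fit([texts[i] for i in train], [labels[i] for i in train])
        preds = predict(model, [texts[i] for i in val])
        correct = sum(p == labels[i] for p, i in zip(preds, val))
        scores.append(correct / len(val))
    return sum(scores) / len(scores)


# Toy stand-in data; the real competition texts and labels are not shown here.
texts = ["sample %d" % i for i in range(20)]
labels = ["a"] * 12 + ["b"] * 8

# Majority-class baseline: any real model should beat this number locally.
fit = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, X: [model] * len(X)

baseline = cross_val_accuracy(texts, labels, fit, predict, k=4)
print(round(baseline, 3))
```

The idea is to compare models on the averaged fold score (ideally repeated over a few seeds, since folds are tiny with so few samples) and only spend a daily submission on the candidates that look best locally.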
Yeah, 5 submissions per day is really not enough. I hope they increase it.
Exactly. It's too strict.
This dataset is so small that anyone who wants to can easily hand-label it and get a perfect score. I fear many people will use this to cheat.
It's much smaller than we would like, but this is all that is available. Treat it as another constraint to work with (after all, lack of data is a major problem for many organizations).
Hand-labelling is definitely not allowed - that's an easy way to get disqualified :)
5 submissions a day isn't bad - some competitions only allow one submission every 5 days! Test locally, and submit your best predictions. The downside of allowing hundreds of submissions is that some folks use that to gain an edge by submitting lots of predictions hoping to learn about the test set or just trying to get lucky - behaviour we don't want to encourage.
With such a small dataset, it's going to be impossible to get a perfect score. This challenge involves human issues, and tricky ones at that. Some folks struggling with depression will talk about drugs or alcohol, and some entries are short and vague enough that predictions are essentially guesses. BUT, it's possible to get something that's able to make the right sort of guesses, and that's the goal here.
You might find this dataset too small, or feel that the task is too hard. That's OK - I hope there are other competitions on here to keep you busy. But I'm looking forward to seeing the results from those who give it a go - creative solutions to this challenge could mean that, with a little more data and a bit of luck, we could end up with something cool.
Good luck all,
J
PS: Opinions are my own and all that, hooray for Zindi not minding me lurking in the forums and chiming in all over the place :)
@Johnowhitaker, thank you for resolving all our queries and doubts. I too would like to try something really creative. I'd also like to take this opportunity to thank you for all your posts; honestly, I always find something new and inspiring to learn from them.
John, thanks for your response.
If people try to manipulate their submission by hand-picking labels, their code would not be able to reproduce it, so they would be disqualified by Zindi.
The dataset size just zeros out any chance of using LSTMs for this task, and that's sad for me because this problem would have been such a nice one to tackle with LSTMs, which I am trying to master.
Also, the size makes it really easy to cheat.
Still, you should try. My current best score is with an LSTM (LB: 0.42). It'll also be a good way to master it.