Basic Needs Basic Rights Kenya - Tech4MentalHealth
Classify text from university students in Kenya towards a mental health chatbot
Prize: $4 200 USD
Time: Ended over 2 years ago
Participants: 499 active · 1066 enrolled
Helping: Kenya
Intermediate · Classification · Health
Isn't the dataset size too small?
Data · 28 Apr 2020, 18:38 · 10

The task seems challenging and interesting to me but I have a feeling that the dataset size is too small (only 616 samples) to be suitable for modern NLP models.

What do you think?

Discussion 10 answers

Yeah.

I had the same reaction the first time I opened the competition. I also think the number of submissions per day is not enough! It would be good for us if it were increased!

28 Apr 2020, 19:02

The idea behind reducing submissions per day was to encourage people to develop better cross-validation strategies rather than relying on the public LB, which easily leads to overfitting. See https://zindi.africa/discussions/539
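
For anyone wondering what "a cross-validation strategy" might look like on a dataset this small, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the helper names (`stratified_folds`, `cross_validate`), the toy texts and labels, the majority-class placeholder model, and the use of accuracy as the score. It is not the competition's metric or anyone's actual solution; swap in your own model and metric.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a share of each class.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validate(texts, labels, fit, predict, k=5, seed=0):
    """Return the accuracy measured on each held-out fold."""
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = fit([texts[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        preds = predict(model, [texts[i] for i in held_out])
        correct = sum(p == labels[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return scores


# Placeholder model (hypothetical): always predicts the most common
# training label. Replace fit/predict with a real text classifier.
def fit_majority(train_texts, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, texts):
    return [model for _ in texts]


# Toy data invented for this example; not taken from the competition.
texts = [f"example text {i}" for i in range(20)]
labels = ["a"] * 12 + ["b"] * 8

scores = cross_validate(texts, labels, fit_majority, predict_majority, k=4)
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

With only a few hundred samples, each fold is tiny, so the per-fold scores will jump around a lot. Repeating the split with several seeds and looking at the spread, not just the mean, is probably a more trustworthy guide than a handful of public LB submissions.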

Yeah, 5 submissions per day is really not enough. I hope they increase it.

28 Apr 2020, 19:07

This dataset is so small that anyone who wants to can easily hand-label it and get a perfect score. I fear many people will use this to cheat.

29 Apr 2020, 03:46

It's much smaller than we would like, but this is all that is available. Treat it as another constraint to work with (after all, lack of data is a major problem for many organizations).

Hand-labelling is definitely not allowed - that's an easy way to get disqualified :)

5 submissions a day isn't bad - some competitions only allow one submission every 5 days! Test locally, and submit your best predictions. The downside of allowing hundreds of submissions is that some folks use that to gain an edge by submitting lots of predictions hoping to learn about the test set or just trying to get lucky - behaviour we don't want to encourage.

With such a small dataset, it's going to be impossible to get a perfect score. This challenge involves human issues, and tricky ones at that. Some folks struggling with depression will talk about drugs or alcohol, and some entries are short and vague enough that predictions are essentially guesses. BUT, it's possible to get something that's able to make the right sort of guesses, and that's the goal here.

You might find this dataset too small, or feel that the task is too hard. That's OK - I hope there are other competitions on here to keep you busy. But I'm looking forward to seeing the results from those who give it a go - creative solutions to this challenge could mean that, with a little more data and a bit of luck, we could end up with something cool.

Good luck all,

J

PS: Opinions are my own and all that, hooray for Zindi not minding me lurking in the forums and chiming in all over the place :)

29 Apr 2020, 08:45

@Johnowhitaker, thank you for resolving all our queries and doubts. I too would like to try something really creative. I would also like to take this opportunity to thank you for all your posts; I honestly always find something new and inspiring to learn from them.

John, thanks for your response.

If people try to manipulate their submission by hand-labelling, their code will not be able to reproduce it, so they will be disqualified by Zindi.

2 May 2020, 08:42

The dataset size just zeroes out any chance of using LSTMs for this task, and that's sad for me because this problem would have been so nice to tackle with LSTMs, which I am trying to master.

Also, the size makes it really easy to cheat.

3 May 2020, 20:12

Still, you should try. My current best score is with an LSTM (LB: 0.42). It'll also be a good way to master it.