Sustainable Development Goals (SDGs): Text Classification Challenge
$1,000 USD
Classify text and documents by relevance to the 27 indicators of SDG #3 (Health and Well-Being)
5 September–12 November 2018 23:59
223 data scientists enrolled, 50 on the leaderboard
Solutions
published 13 Nov 2018, 12:34

Hi all,

I'm curious what the best models (and ideas) are for this competition. I used a 1-D convolutional neural network (CNN) with 300-dimensional fastText word embeddings. I didn't engineer any new features, and I also didn't use the ID or type of a given text (grant, news, etc.). I trained a single model and it scores 0.0403, which is 5th place at the moment. Once I clean up the code a bit, I will share everything on my GitHub. It feels like I'm missing something, as the model itself is quite big. Any hints from top submissions?
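
For anyone curious about the general shape of such a model, here is a rough sketch (not my exact code; the filter counts and kernel sizes are assumptions) of a 1-D CNN over pretrained 300-d word embeddings in PyTorch:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, pretrained_vectors, num_classes, num_filters=128):
        super().__init__()
        # pretrained_vectors: FloatTensor of shape (vocab_size, 300),
        # e.g. fastText vectors looked up for the competition vocabulary.
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        # Convolutions with several kernel widths over the word dimension.
        self.convs = nn.ModuleList(
            [nn.Conv1d(300, num_filters, kernel_size=k) for k in (3, 4, 5)]
        )
        self.fc = nn.Linear(num_filters * 3, num_classes)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, seq_len, 300) -> (batch, 300, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Convolve with each kernel size and max-pool over time.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))
```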

I created an LSTM with word embeddings just like yours, but I kept getting 0.073. I will also make my code available on GitHub. I guess I missed something.
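
For reference, a minimal Keras LSTM classifier along those lines might look like the sketch below; the vocabulary size, embedding dimension, units, and the assumption of one sigmoid output per SDG 3 indicator are all placeholders, not my exact setup:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

model = Sequential([
    Embedding(input_dim=50000, output_dim=300),  # assumed vocabulary size and dimension
    LSTM(128),
    Dense(27, activation="sigmoid"),  # assuming one output per SDG 3 indicator
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```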

I tried that, but with GloVe, and I even used gensim to train word vectors and load them into the embedding layer. For the gensim word2vec vectors I set the embedding layer to not trainable, but I was stuck at 0.05. I'd be glad to see how you did it. Thanks.
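
Roughly, the gensim-to-frozen-embedding-layer step can be sketched as below; the tokenizer's word index and the dimensions are assumptions, not my exact code:

```python
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

def build_frozen_embedding(sentences, word_index, dim=300):
    # Train word2vec on the tokenized corpus (gensim >= 4 uses vector_size).
    w2v = Word2Vec(sentences, vector_size=dim, window=5, min_count=1)
    # Copy the learned vectors into a matrix indexed like the Keras tokenizer.
    weights = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in w2v.wv:
            weights[idx] = w2v.wv[word]
    # trainable=False keeps the pretrained embeddings fixed during training.
    return Embedding(len(word_index) + 1, dim,
                     embeddings_initializer=Constant(weights),
                     trainable=False)
```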

But I later used an ensemble, mixing different models; this kind of ensemble is called blending, and it gave me 0.040... For each model I combined, I used trigrams for the TfidfVectorizer. I also used a Naive Bayes SVM (LinearSVC), which gave me around 0.042. Then I checked the models' correlation to see whether they were less correlated before joining them together.
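
A rough sketch of that blending idea is below; the second base model, the correlation threshold, and the simple averaging are illustrative assumptions rather than my exact pipeline:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def blend_predictions(train_texts, train_labels, test_texts):
    # Base model 1: trigram TF-IDF + LinearSVC; base model 2 is an assumed
    # bigram TF-IDF + logistic regression, just for illustration.
    svc = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
    logreg = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                           LogisticRegression(max_iter=1000))
    svc.fit(train_texts, train_labels)
    logreg.fit(train_texts, train_labels)

    # Per-class decision scores from each model.
    score_a = svc.decision_function(test_texts)
    score_b = logreg.decision_function(test_texts)

    # Blend only if the two models are not too highly correlated.
    corr = np.corrcoef(score_a.ravel(), score_b.ravel())[0, 1]
    return (score_a + score_b) / 2.0 if corr < 0.95 else score_a
```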

I learned this method from how Kaggle grandmasters win the top positions: http://www.chioka.in/stacking-blending-and-stacked-generalization/ and https://mlwave.com/kaggle-ensembling-guide/.

As for the neural nets, I didn't end up submitting the CNN one; it was the LSTM one I submitted. I was discouraged by the val_accuracy not passing 0.95, and the LSTM wasn't giving a better score either.

I also tried a 1-D CNN with fastText, but I used the pre-trained wiki vectors. Early stopping ended training after 63 epochs and left me with a score of 0.061. My best practical model was a LinearSVC trained on a 30K TF-IDF vector built from cleaned and lemmatized text (0.0419). My best overall model was an attempt at the usual Kaggle massive ensemble, running over about 6 different derived vectors and 12 models for each vector. I then took the 72 predictions and used those as the features for the final model, which resulted in 0.0407. I was training a more complex collection of vectors + models, but RAM became an issue and it wouldn't have been a practical model anyway, so I stopped :)
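
In outline, that stacking step works like the sketch below; the base models shown are placeholders, and labels are assumed to be integer class indices rather than my actual 72-prediction setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def stack(train_texts, train_labels, test_texts):
    # Two illustrative base models; in practice there were many more.
    base_models = [
        make_pipeline(TfidfVectorizer(max_features=30000), LinearSVC()),
        make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB()),
    ]
    # Out-of-fold predictions on the training set avoid leaking labels into
    # the meta-model's features (train_labels assumed integer-encoded).
    train_meta = np.column_stack([
        cross_val_predict(m, train_texts, train_labels, cv=5)
        for m in base_models
    ])
    test_meta = np.column_stack([
        m.fit(train_texts, train_labels).predict(test_texts)
        for m in base_models
    ])
    # The base models' predictions become features for the final model.
    meta_model = LogisticRegression(max_iter=1000)
    meta_model.fit(train_meta, train_labels)
    return meta_model.predict(test_meta)
```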

Here is my model: https://github.com/pawelmorawiecki/Zindi_SDG_competition

It looks like I should have tried ensembles to get slightly better results. Similarly to Kaggle, it's hard to win a competition without ensembling ;-) Feel free to ask if you have any questions about the model.

You used PyTorch, cool. I can see why my CNN did not perform well. Thanks.

Yes, PyTorch together with the fastText library. Why did your CNN not perform well? Too few filters?


Funny enough, I used a simple ensemble model with MultiOutputClassifier and got a score of 0.070, I think.
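
For anyone who hasn't used it, a minimal sketch of that kind of MultiOutputClassifier setup is below; it assumes a multi-label target matrix with one binary column per SDG 3 indicator, and the base classifier shown is just an illustrative choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# One independent classifier is fit per indicator column of y
# (y shape: n_samples x n_indicators).
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)

# Usage (hypothetical variable names):
# model.fit(train_texts, train_indicator_matrix)
# predictions = model.predict(test_texts)
```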