What I learned from this competition:
1. My first model was a trivial baseline that predicted POLITICS for every row of the test dataset. That gave me an accuracy of around 19.01, and for a while that was my best score.
2. I tried building simple machine learning models using CountVectorizer and TF-IDF features, and trained a bunch of models: Decision Tree, Random Forest, XGBoost. These models score around 12-13 on the leaderboard, but on my validation set they score around 30-45, so I'm not sure why the validation and test results differ so much. I hope someone can help me with this.
3. Naive Bayes gave me my best score, around 19.34. I also tried to train an LSTM but failed to get it working. I applied data cleaning such as removing multiple spaces, punctuation, and stop words, but nothing improved my accuracy by even a single percentage point. I'd like to know how you all approached this. I know only the top 3 get the prize, but at least we can all learn, because sharing is learning.
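The cleaning + TF-IDF + Naive Bayes combination described above can be sketched roughly like this. This is a minimal illustration on a tiny made-up corpus (not the competition data), and the cleaning function only covers the steps mentioned here (lowercasing, punctuation, multiple spaces); stop-word removal would be an extra step:

```python
import re

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
    text = re.sub(r"\s+", " ", text)      # collapse multiple spaces
    return text.strip()

# tiny illustrative corpus, NOT the actual competition data
texts = [
    "The president spoke to parliament today!",
    "The striker scored two goals in the match.",
    "Parliament passed the new election bill.",
    "The team won the football league final.",
]
labels = ["POLITICS", "SPORTS", "POLITICS", "SPORTS"]

# TF-IDF features feeding a multinomial Naive Bayes classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit([clean_text(t) for t in texts], labels)

preds = model.predict([clean_text("Parliament debated the bill.")])
```

On real data you would fit on the training set and call `predict` on the test set, then check whether the validation split is representative of the test distribution.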
Thanks for any help
Anish Jain
@itsanishjain - Twitter
I don't know if it's worth sharing a model from rank 137 on the LB, but here is my approach:
1. Train fastText embeddings on the concatenation of the train & test data
2. Train and validate a CatBoost classifier on the fastText embeddings using 5 folds, predicting on the test data with each fold's model
3. Average the prediction probabilities of the 5 folds and submit
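The 5-fold averaging in steps 2-3 can be sketched as below. To keep the example self-contained I use synthetic numeric features from `make_classification` in place of the fastText embeddings, and scikit-learn's `LogisticRegression` as a stand-in for the CatBoost classifier; the fold-and-average pattern is the same either way:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in for fastText document embeddings (3 classes)
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_probs = np.zeros((len(X_test), 3))

for tr_idx, val_idx in skf.split(X_train, y_train):
    # stand-in model; the original approach used CatBoostClassifier here
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[tr_idx], y_train[tr_idx])
    # accumulate the average of the 5 folds' test probabilities
    test_probs += model.predict_proba(X_test) / skf.get_n_splits()

preds = test_probs.argmax(axis=1)  # final class per test row
```

Averaging probabilities across folds usually gives a more stable submission than any single fold's model.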
It's definitely worth sharing. I got around 19 and I'm still sharing, so yours is worth sharing too. Thanks! I also tried to train fastText but failed, so if you could share your code, we could all learn from it.
https://www.kaggle.com/aninda/word2vec-malawi?scriptVersionId=59483860
Thanks
My score was 0.61 using RandomForestClassifier. The classes were imbalanced, so you can use the SMOTE method to rebalance them in the pipeline: https://github.com/Linafe313/Mini-projects/blob/main/Zindi_Chichewa_News_Classification_Challenge.ipynb