
AI4D Malawi News Classification Challenge
Helping Malawi · $2 000 USD · Classification
Completed (almost 5 years ago) · 830 joined · 322 active
Start: Jan 22, 21 · Close: May 09, 21 · Reveal: May 09, 21
Solution for 0.6548 Private LB
Connect · 12 May 2021, 13:07 · edited 3 minutes later

Congrats to all winners and participants!

This is one of the most challenging problems I've worked on in NLP.

You can guess from my numerous submissions how many ideas I experimented with. Some failed surprisingly, while a few did surprisingly well on the private LB, lol.

The texts vary widely in length, contain a mixture of languages (I detected English and Chichewa), and have a serious class imbalance.

I applied a sequence of augmentations (multilingual BERT word substitution, Chichewa fastText embedding word substitution, and random word substitution) to all Chichewa samples. To avoid data leakage, I kept all augmented samples in the train folds (5 stratified folds).
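One common way to realize the leak-free setup described above is to split the original samples into stratified folds first, then augment only the train side of each fold, so no augmented view of a validation sample ever reaches training. This is a minimal sketch under that assumption: `random_word_substitution` is a stand-in for the actual augmenters (BERT / fastText substitution), and the function names are mine, not the author's.

```python
import random
from sklearn.model_selection import StratifiedKFold

def random_word_substitution(text, vocab, p=0.15, rng=None):
    """Replace each word with a random vocabulary word with probability p."""
    rng = rng or random.Random(0)
    return " ".join(
        rng.choice(vocab) if rng.random() < p else w for w in text.split()
    )

def leak_free_folds(texts, labels, n_splits=5, seed=42):
    """Yield (train_texts, train_labels, val_texts, val_labels) per fold,
    augmenting only the train portion of each fold."""
    vocab = sorted({w for t in texts for w in t.split()})
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(texts, labels):
        rng = random.Random(seed)
        train_texts = [texts[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # augmented copies go into the train fold only; validation stays original
        aug_texts = [random_word_substitution(t, vocab, rng=rng) for t in train_texts]
        yield (train_texts + aug_texts, train_labels + train_labels,
               [texts[i] for i in val_idx], [labels[i] for i in val_idx])
```

The key design point is that augmentation happens inside the fold loop, after the split, never before it.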

Model 1 - I trained a random forest classifier on TF-IDF features using the 5-fold augmented Chichewa data.

Model 2 - I trained Longformer (a transformer model specially designed for long sequences) on all English data using 5-fold stratified splits (no augmentations). Auxiliary features such as binned sequence lengths and word-based statistical features were used to train the Longformer model.

The final submission was obtained by merging the English and Chichewa predictions.
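Model 1 can be sketched as a scikit-learn pipeline. The hyperparameters below are illustrative placeholders, not the author's tuned values, and the toy texts are invented; in the actual solution this would be fit per stratified fold on the augmented Chichewa data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# TF-IDF features feeding a random forest, as in Model 1.
model1 = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),           # word unigrams + bigrams
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# Toy usage with made-up samples (real training happened per fold):
texts = [
    "msika market price kwacha trade",
    "market price kwacha economy trade",
    "mpira football goal team match",
    "football goal team mpira league",
]
labels = ["economy", "economy", "sport", "sport"]
model1.fit(texts, labels)
```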

Text cleaning and FE: removing punctuation, integers, and English stopwords, then normalizing the text by lemmatization (this was a better alternative to stemming).
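A minimal sketch of that cleaning step, assuming regex-based stripping and a pluggable lemmatizer (the tiny stopword set is illustrative; in practice you would use a full list, e.g. NLTK's, and pass in something like `WordNetLemmatizer().lemmatize`):

```python
import re

# Illustrative subset only; use a full English stopword list in practice.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}

def clean_text(text, lemmatize=None):
    """Lowercase, strip punctuation and integers, drop English stopwords,
    then optionally lemmatize each remaining word."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    text = re.sub(r"\d+", " ", text)       # integers -> space
    words = [w for w in text.split() if w not in ENGLISH_STOPWORDS]
    if lemmatize:
        words = [lemmatize(w) for w in words]
    return " ".join(words)
```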

What did not work (for me): pretraining and fine-tuning (tried BERT, multilingual BERT, and xlm-roberta-base), chunking each sample into 6 chunks of size 229, concatenating TF-IDF with n-gram/statistical features, removing Chichewa stop words... (some of the major failed ideas)
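For reference, the chunking idea above (one of the failed experiments) presumably amounts to something like this sketch, where each token sequence is cut into 6 fixed-size windows of 229 tokens, the function name being mine:

```python
def chunk_tokens(tokens, n_chunks=6, chunk_size=229):
    """Split a token sequence into n_chunks fixed-size windows.
    Anything beyond n_chunks * chunk_size is truncated; short inputs
    yield short or empty trailing chunks."""
    return [tokens[i * chunk_size:(i + 1) * chunk_size]
            for i in range(n_chunks)]
```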

Wish I had tried: training the Longformer model on all the data.

My transformer models did poorly on the public LB but great on the private LB. That is the main reason I failed to select any transformer-based solution before the competition closed.

Discussion (4 answers)

Great work

12 May 2021, 13:10
Upvotes 0

Thanks for sharing and congratulations on your results!

12 May 2021, 15:52
Upvotes 0
flamethrower

Comprehensive work, thanks for sharing.

Since the transformer models you mentioned aren't exactly state of the art for Chichewa, I'm wondering if you tried any of the mT5 variants in your transformer experiments?

MICADEE
LAHASCOM

"My transformer models did poorly on the public LB but great on the private LB. That is the main reason I failed to select any transformer-based solution before the competition closed." Transformer models!!! I would love to see that. Thanks for sharing. Great ideas @drcod.

All I know is that there's nothing we didn't try on this challenge; we just unluckily failed to choose our best local CV (which eventually translated to 0.664 on the private LB). Though this was pretty hard to know, since that local CV looked unpromising from our end. But it's all part of the Data Science journey. Congratulations to all the winners. Great and amazing work.