
AI4D Malawi News Classification Challenge
Helping Malawi · $2 000 USD · Classification
Completed (almost 5 years ago) · 830 joined · 322 active
Start: Jan 22, 21 · Close: May 09, 21 · Reveal: May 09, 21
Solution for 0.6548 Private LB
Connect · 12 May 2021, 13:07 · edited 3 minutes later

Congrats to all winners and participants!

This is one of the most challenging problems I've worked on in NLP.

You can guess from my numerous submissions how many ideas I experimented with. Some failed surprisingly, while a few did surprisingly well on the private LB, lol.

The texts vary widely in length, contain a mixture of languages (I detected English and Chichewa), and have a serious class imbalance.

I applied a sequence of augmentations (multilingual BERT word substitution, Chichewa fastText embedding word substitution, and random word substitution) to all Chichewa samples. To avoid data leakage, I kept all augmented samples in the train folds (5 stratified folds).
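One common way to realize the leak-free setup described above is to split the original samples into stratified folds first, then augment only the train side of each fold, so no augmented view of a validation sample ever reaches training. This is a minimal sketch under that assumption: `random_word_substitution` is a stand-in for the actual augmenters (BERT / fastText substitution), and the function names are mine, not the author's.

```python
import random
from sklearn.model_selection import StratifiedKFold

def random_word_substitution(text, vocab, p=0.15, rng=None):
    """Replace each word with a random vocabulary word with probability p."""
    rng = rng or random.Random(0)
    return " ".join(
        rng.choice(vocab) if rng.random() < p else w for w in text.split()
    )

def leak_free_folds(texts, labels, n_splits=5, seed=42):
    """Yield (train_texts, train_labels, val_texts, val_labels) per fold,
    augmenting only the train portion of each fold."""
    vocab = sorted({w for t in texts for w in t.split()})
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(texts, labels):
        rng = random.Random(seed)
        train_texts = [texts[i] for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        # augmented copies go into the train fold only; validation stays original
        aug_texts = [random_word_substitution(t, vocab, rng=rng) for t in train_texts]
        yield (train_texts + aug_texts, train_labels + train_labels,
               [texts[i] for i in val_idx], [labels[i] for i in val_idx])
```

The key design point is that augmentation happens inside the fold loop, after the split, never before it.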

Model 1 - I trained a random forest classifier on TF-IDF features using the 5-fold augmented Chichewa data.

Model 2 - I trained Longformer (a transformer model specially designed for long sequences) on all English data using 5-fold stratified splits (no augmentations). Auxiliary features such as binned sequence lengths and word-based statistical features were used to train the Longformer model.

The final submission was obtained by merging the English and Chichewa predictions.
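Model 1 can be sketched as a scikit-learn pipeline. The hyperparameters below are illustrative placeholders, not the author's tuned values, and the toy texts are invented; in the actual solution this would be fit per stratified fold on the augmented Chichewa data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# TF-IDF features feeding a random forest, as in Model 1.
model1 = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),           # word unigrams + bigrams
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# Toy usage with made-up samples (real training happened per fold):
texts = [
    "msika market price kwacha trade",
    "market price kwacha economy trade",
    "mpira football goal team match",
    "football goal team mpira league",
]
labels = ["economy", "economy", "sport", "sport"]
model1.fit(texts, labels)
```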

Text cleaning and FE: removing punctuation, integers, and English stopwords, then normalizing the text by lemmatization (this was a better alternative to stemming).
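A minimal sketch of that cleaning step, assuming regex-based stripping and a pluggable lemmatizer (the tiny stopword set is illustrative; in practice you would use a full list, e.g. NLTK's, and pass in something like `WordNetLemmatizer().lemmatize`):

```python
import re

# Illustrative subset only; use a full English stopword list in practice.
ENGLISH_STOPWORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}

def clean_text(text, lemmatize=None):
    """Lowercase, strip punctuation and integers, drop English stopwords,
    then optionally lemmatize each remaining word."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation -> space
    text = re.sub(r"\d+", " ", text)       # integers -> space
    words = [w for w in text.split() if w not in ENGLISH_STOPWORDS]
    if lemmatize:
        words = [lemmatize(w) for w in words]
    return " ".join(words)
```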

What did not work (for me): pretraining and fine-tuning (tried BERT, multilingual BERT, and xlm-roberta-base), chunking each sample into 6 chunks of size 229, concatenating TF-IDF with n-gram/statistical features, removing Chichewa stop words... (some of the major failed ideas)
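For reference, the chunking idea above (one of the failed experiments) presumably amounts to something like this sketch, where each token sequence is cut into 6 fixed-size windows of 229 tokens, the function name being mine:

```python
def chunk_tokens(tokens, n_chunks=6, chunk_size=229):
    """Split a token sequence into n_chunks fixed-size windows.
    Anything beyond n_chunks * chunk_size is truncated; short inputs
    yield short or empty trailing chunks."""
    return [tokens[i * chunk_size:(i + 1) * chunk_size]
            for i in range(n_chunks)]
```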

Wish I had tried: training the Longformer model on all the data.

My transformer models did poorly on the public LB but great on the private LB. That is the main reason I failed to select any transformer-based solution before the competition closed.

Discussion (4 answers)

Great work

12 May 2021, 13:10
Upvotes 0

Thanks for sharing and congratulations on your results!

12 May 2021, 15:52
Upvotes 0
flamethrower

Comprehensive work, thanks for sharing.

Since the transformer models you mentioned aren't exactly state of the art for Chichewa, I'm wondering if you tried any of the mT5 variants in your transformer experiments?

MICADEE
LAHASCOM

"My transformer models did poorly on the public LB but great on the private LB. That is the main reason I failed to select any transformer-based solution before the competition closed." Transformer models!!! I would love to see that. Thanks for sharing. Great ideas @drcod.

All I know is that there's nothing we didn't try on this challenge; we just unluckily failed to choose our best local CV (which eventually translated to 0.664 on the private LB). Though this was pretty hard to know, since that local CV looked unpromising from our end. But it's all part of the Data Science journey. Congratulations to all the winners. Great and amazing work.