Thank you to Zindi and Lelapa for hosting a competition on such an important topic. Congratulations to @yvan_carre whose private LB score is on a different level to the rest of us!
This competition challenged us not only to improve the model's accuracy but also to make it smaller. I began by instruction-tuning InkubaLM to get a sense of what score might be achievable. After getting to about 0.45 on the public LB, I decided to start focusing on making InkubaLM smaller. Consider the scoring formula:
zindi_score = (PrivateLB_score + (1 - size/PARAM_SIZE) * PrivateLB_score) / 2
My understanding of the formula is that a full InkubaLM with 0.4B parameters achieving an LB score of 0.5 gets zindi_score = (0.5 + (1-1)*0.5)/2 = (0.5 + 0)/2 = 0.25. In comparison, a model with half the number of parameters and an LB score of 0.4 gets zindi_score = (0.4 + (1-0.5)*0.4)/2 = (0.4 + 0.2)/2 = 0.3.
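As a quick check of that arithmetic, the formula can be written as a small function. This is a sketch: the parameter names are mine, and using 420M as PARAM_SIZE is an assumption based on InkubaLM's size.

```python
def zindi_score(private_lb, size, param_size=420_000_000):
    """Zindi's scoring formula: the raw LB score averaged with a
    size-discounted copy of itself, so smaller models score higher."""
    return (private_lb + (1 - size / param_size) * private_lb) / 2

# Full-size InkubaLM at LB 0.5:
print(zindi_score(0.5, 420_000_000))   # → 0.25
# Half the parameters at LB 0.4:
print(zindi_score(0.4, 210_000_000))   # ≈ 0.3
```

In other words, halving the model size buys a sizeable score bonus even at a lower raw LB score.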
As a sidenote, it has been a little frustrating that the application of the scoring formula hasn't been more transparent, and that the private LB is still only based on F1 scores.
In addition to the data provided in the competition's Data page, I also used data from Inkuba-Instruct and XNLI. I'm still not sure if these datasets were considered 'external' - the data page refers to the competition data only as 'samples' - but I thought using it was allowed regardless given this clarification on the Lelapa Discord server.
InkubaLM has 420M parameters. With a vocabulary of 61,788 tokens and an embedding dimension of 2,048, the first and last layers of the model alone account for 253M parameters, or 60% of the total. In contrast, each hidden layer consists of only 16M parameters. Therefore, to make the model significantly smaller, the number of parameters in the first and last layers must be reduced. My approach consisted of the following:
The following models were trained:
50M:
100M:
40M:
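The embedding-size arithmetic from earlier can be sanity-checked in a few lines (this assumes an untied input embedding and output head, consistent with the 253M figure quoted above):

```python
# Sanity check of the embedding-size arithmetic in the write-up.
vocab_size = 61_788
d_model = 2_048
total_params = 420_000_000  # InkubaLM, ~0.4B

embed_params = vocab_size * d_model  # input embedding table
head_params = vocab_size * d_model   # output projection (lm_head)
first_last = embed_params + head_params

print(f"{first_last / 1e6:.0f}M")          # → 253M
print(f"{first_last / total_params:.0%}")  # → 60%
```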
All three models were submitted during the competition. The 50M and 100M models were my two selections for consideration. The 100M was identified by the private LB as the better-performing of the two, but that does not take the number of parameters into account. I believe the 50M model was the better submission once the number of parameters is considered.
The 40M could not be considered for the competition because it wasn't one of my two selected submissions. I only trained it right at the end of the competition, didn't have time to test it thoroughly, and was therefore hesitant to select it as a submission. For me this is the most exciting model because it is both the smallest and the most performant. Additionally, for the 40M model, I used a two-stage training process: it was first fine-tuned on translation data only, and then further fine-tuned on all three tasks. This makes the model particularly strong at translation while still performing well on the other tasks.
All the code is on Github: https://github.com/stefan027/zindi-competitions/tree/main/lelapa_buzuzu_mavi.
I learned a lot from the source code that accompanies the book Build a Large Language Model (From Scratch) by Sebastian Raschka. My training code is heavily influenced by this notebook from the repo.
Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166.
Wooooow, woooooow, woooooow, woooooooooowwww!!! Thanks @stefan027, well deserved!!!!
@stefan027 was your eos token respected, and if it was, how did you enforce that? I struggled to get the eos token respected; all I did was set a max output length and use the @snow-inspired approach of repeating the target.
thanks @Koleshjr. My eos token was respected for the most part, so I didn't spend much time thinking about that problem. It's probably standard in most fine-tuning frameworks, but because I trained in pure PyTorch I made sure to set padding tokens to -100 in the targets so they are not included in the loss calculation. I also turned sampling off during inference for sentiment analysis and QA, and set max_new_tokens to 15. But no special tricks to deal with eos.
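For reference, that masking trick is small in pure PyTorch. This is a minimal sketch with toy tensors and an assumed pad id of 0, not the repo's exact training code:

```python
import torch
import torch.nn.functional as F

vocab_size = 61_788
pad_token_id = 0  # assumed pad id, for illustration only

# Toy target batch: the second sequence ends in two padding tokens.
targets = torch.tensor([
    [17, 42, 7, 99],
    [23, 8, pad_token_id, pad_token_id],
])

# Set padded positions to -100 so cross_entropy ignores them.
targets = targets.masked_fill(targets == pad_token_id, -100)

logits = torch.randn(2, 4, vocab_size)  # stand-in for model outputs
loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch*seq, vocab)
    targets.view(-1),             # (batch*seq,)
    ignore_index=-100,            # PyTorch's default, made explicit here
)
```

Positions marked -100 contribute nothing to the loss or its gradient, so the model never learns to emit padding.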
There is a function called llm_classify in my inference notebook that I wrote as a fallback in case my model didn't respect instructions for sentiment and QA, but I didn't use it in the end. Given an input text and a set of possible class labels, it calculates the log softmax of each class label conditioned on the input text, and returns the class label with the highest likelihood.
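The idea can be sketched like this. The interface is hypothetical (a `model(input_ids)` callable returning logits of shape (batch, seq, vocab), and a `tokenizer.encode` returning token ids); it is not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def llm_classify(model, tokenizer, text, labels):
    """Fallback classifier: score each candidate label by the sum of its
    token log-probabilities conditioned on the input text, and return the
    label with the highest total log-likelihood."""
    scores = []
    for label in labels:
        ids = torch.tensor([tokenizer.encode(text)])
        score = 0.0
        for tok in tokenizer.encode(label):
            next_logits = model(ids)[:, -1, :]  # logits for the next token
            score += F.log_softmax(next_logits, dim=-1)[0, tok].item()
            # Teacher-force the label token before scoring the next one.
            ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
        scores.append(score)
    return labels[scores.index(max(scores))]
```

Because every candidate is scored against the same prompt, the output is always one of the allowed labels, which is what makes it a safe fallback when free-form generation ignores instructions.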
Nice congratulations and thanks again for sharing!!
I really have to learn from your approach @stefan027. Great work done. 🔥🔥🔥🔥🔥
Congratulations @stefan027. Very impressive work and thanks for sharing 👏
Great work!!!👏 Congratulations 🎉, thanks for the walkthrough