Thank you to Zindi and Lelapa for hosting a competition on such an important topic. Congratulations to @yvan_carre whose private LB score is on a different level to the rest of us!
This competition challenged us not only to improve the model's accuracy but also to make it smaller. I began by instruction-tuning InkubaLM to get a sense of what score might be achievable. After getting to about 0.45 on the public LB, I decided to start focusing on making InkubaLM smaller. Consider the scoring formula:
zindi_score = (PrivateLB_score + (1 - size/PARAM_SIZE) * PrivateLB_score) / 2
My understanding of the formula is that a full InkubaLM with 0.4B parameters achieving an LB score of 0.5 gets zindi_score = (0.5 + (1-1)*0.5)/2 = (0.5 + 0)/2 = 0.25. In comparison, a model with half the number of parameters and an LB score of 0.4 gets zindi_score = (0.4 + (1-0.5)*0.4)/2 = (0.4 + 0.2)/2 = 0.3.
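As a quick check of that arithmetic, the formula can be written as a small function. This is a sketch: the parameter names are mine, and using 420M as PARAM_SIZE is an assumption based on InkubaLM's size.

```python
def zindi_score(private_lb, size, param_size=420_000_000):
    """Zindi's scoring formula: the raw LB score averaged with a
    size-discounted copy of itself, so smaller models score higher."""
    return (private_lb + (1 - size / param_size) * private_lb) / 2

# Full-size InkubaLM at LB 0.5:
print(zindi_score(0.5, 420_000_000))   # → 0.25
# Half the parameters at LB 0.4:
print(zindi_score(0.4, 210_000_000))   # ≈ 0.3
```

In other words, halving the model size buys a sizeable score bonus even at a lower raw LB score.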
As a sidenote, it has been a little frustrating that the application of the scoring formula hasn't been more transparent, and that the private LB is still only based on F1 scores.
In addition to the data provided in the competition's Data page, I also used data from Inkuba-Instruct and XNLI. I'm still not sure if these datasets were considered 'external' - the data page refers to the competition data only as 'samples' - but I thought using it was allowed regardless given this clarification on the Lelapa Discord server.
InkubaLM has 420M parameters. With a vocabulary of 61,788 tokens and an embedding dimension of 2,048, the first and last layers of the model alone account for 253M parameters, or 60% of the total. In contrast, each hidden layer consists of only 16M parameters. Therefore, to make the model significantly smaller, the number of parameters in the first and last layers must be reduced. My approach consisted of the following:
The following models were trained:
50M:
100M:
40M:
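The embedding-size arithmetic from earlier can be sanity-checked in a few lines (this assumes an untied input embedding and output head, consistent with the 253M figure quoted above):

```python
# Sanity check of the embedding-size arithmetic in the write-up.
vocab_size = 61_788
d_model = 2_048
total_params = 420_000_000  # InkubaLM, ~0.4B

embed_params = vocab_size * d_model  # input embedding table
head_params = vocab_size * d_model   # output projection (lm_head)
first_last = embed_params + head_params

print(f"{first_last / 1e6:.0f}M")          # → 253M
print(f"{first_last / total_params:.0%}")  # → 60%
```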
All three models were submitted during the competition. The 50M and 100M models were my two selections for consideration. The 100M was identified by the private LB as the better-performing of the two, but that does not take the number of parameters into account. I believe the 50M model was the better submission once the number of parameters is considered.
The 40M could not be considered for the competition because it wasn't one of my two selected submissions. I only trained it right at the end of the competition, didn't have time to test it thoroughly, and was therefore hesitant to select it as a submission. For me this is the most exciting model because it is both the smallest and the most performant. Additionally, for the 40M model, I used a two-stage training process: it was first fine-tuned on translation data only, and then further fine-tuned on all three tasks. This makes the model particularly strong at translation while still performing well on the other tasks.
All the code is on Github: https://github.com/stefan027/zindi-competitions/tree/main/lelapa_buzuzu_mavi.
I learned a lot from the source code that accompanies the book Build a Large Language Model (From Scratch) by Sebastian Raschka. My training code is heavily influenced by this notebook from the repo.
Raschka, Sebastian. Build A Large Language Model (From Scratch). Manning, 2024. ISBN: 978-1633437166.
Wooooow, woooooow, woooooow, woooooooooowwww!!! Thanks @stefan027, well deserved!!!!
@stefan027 was your eos token respected, and if it was, how did you enforce that? I struggled to get the eos token respected; all I did was set a max output length and use the @snow-inspired approach of repeating the target.
thanks @Koleshjr. My eos token was respected for the most part, so I didn't spend much time thinking about that problem. It's probably standard in most fine-tuning frameworks, but because I trained in pure PyTorch I made sure to set padding tokens to -100 in the targets so they are not included in the loss calculation. I also turned sampling off during inference for sentiment analysis and QA, and set max_new_tokens to 15. But no special tricks to deal with eos.
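For reference, that masking trick is small in pure PyTorch. This is a minimal sketch with toy tensors and an assumed pad id of 0, not the repo's exact training code:

```python
import torch
import torch.nn.functional as F

vocab_size = 61_788
pad_token_id = 0  # assumed pad id, for illustration only

# Toy target batch: the second sequence ends in two padding tokens.
targets = torch.tensor([
    [17, 42, 7, 99],
    [23, 8, pad_token_id, pad_token_id],
])

# Set padded positions to -100 so cross_entropy ignores them.
targets = targets.masked_fill(targets == pad_token_id, -100)

logits = torch.randn(2, 4, vocab_size)  # stand-in for model outputs
loss = F.cross_entropy(
    logits.view(-1, vocab_size),  # (batch*seq, vocab)
    targets.view(-1),             # (batch*seq,)
    ignore_index=-100,            # PyTorch's default, made explicit here
)
```

Positions marked -100 contribute nothing to the loss or its gradient, so the model never learns to emit padding.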
There is a function called llm_classify in my inference notebook that I wrote as a fallback in case my model didn't respect instructions for sentiment and QA, but I didn't use it in the end. Given an input text and a set of possible class labels, it calculates the log softmax of each class label conditioned on the input text, and returns the class label with the highest likelihood.
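The idea can be sketched like this. The interface is hypothetical (a `model(input_ids)` callable returning logits of shape (batch, seq, vocab), and a `tokenizer.encode` returning token ids); it is not the repo's exact code:

```python
import torch
import torch.nn.functional as F

def llm_classify(model, tokenizer, text, labels):
    """Fallback classifier: score each candidate label by the sum of its
    token log-probabilities conditioned on the input text, and return the
    label with the highest total log-likelihood."""
    scores = []
    for label in labels:
        ids = torch.tensor([tokenizer.encode(text)])
        score = 0.0
        for tok in tokenizer.encode(label):
            next_logits = model(ids)[:, -1, :]  # logits for the next token
            score += F.log_softmax(next_logits, dim=-1)[0, tok].item()
            # Teacher-force the label token before scoring the next one.
            ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
        scores.append(score)
    return labels[scores.index(max(scores))]
```

Because every candidate is scored against the same prompt, the output is always one of the allowed labels, which is what makes it a safe fallback when free-form generation ignores instructions.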
Nice congratulations and thanks again for sharing!!
I really have to learn from your approach @stefan027. Great work done. 🔥🔥🔥🔥🔥
Congratulations @stefan027. Very impressive work and thanks for sharing 👏
Great work!!!👏 Congratulations 🎉, thanks for the walkthrough