This was a very interesting and exciting challenge. Huge thanks to @Zindi and the hosts, led by @Nevenka, for presenting such a research-rich area. It was my first time diving into a biology-related machine learning application.
I also encountered MPEG-G for the first time. Of course, I'd heard of MP4, but not of MPEG in the context of genomic data. It was fascinating to see how powerful the MPEG-G format is for data compression and transfer: over 200 GB of FASTQ data compressed to only about 20 GB of MPEG-G files!
Another first for me in this competition was exploring federated learning, which opened up a new world of possibilities for privacy-preserving machine learning. This is exactly what makes Zindi such a great platform for emerging African talent in ML and AI: you get to explore cutting-edge ideas in real-world applications.
From my submissions and experiments:
1. The public leaderboard felt a bit unstable: my local cross-validation log loss averaged around 0.03, while the public leaderboard often showed much smaller (and inconsistent) scores, sometimes in the 0.0000x range!
2. Neural networks performed better than tree models on the private leaderboard: 0.015 vs 0.018 for my best tree model (XGBoost).
3. I tested DNA-BERT, a BERT-style language model for genomic sequences; however, its predictive performance was notably poor (public 0.37, private 0.28), even though its feature extraction was several times faster than the feature extraction for the tree models, which took me a whopping 6 hours to generate only about 94 features!
5. Federated learning using XGBoost shows promise, outperforming centralized models on the private leaderboard (0.0152 vs 0.0018 on public).
👉 Check out my solution in this repo: https://github.com/JuliusFx131/cyclic-federated-learning. There I demonstrate cyclic federated learning, a setup that doesn't require a central server, unlike traditional frameworks such as Flower or PySyft. You can now run the notebooks directly on Kaggle, thanks to @MuhammadQasimShabbeer, who took the pain of uploading over 200 GB of FASTQ files to Kaggle!
6. K-mer frequencies (di- and tri-nucleotides) emerged as the most important features - aligning with the idea that genomic composition biases can act as biomarkers for microbial community origins.
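For readers wondering what "cyclic" means in practice, here is a rough sketch of the idea described above: each client trains on its own data and then hands the model to the next client in a ring, so nothing is aggregated on a central server. This is not the code from the repo. To stay self-contained it swaps XGBoost for a tiny logistic regression in plain Python, and the `Client` class, `cyclic_train`, `make_toy_data`, and the toy task are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, x):
    # weights[0] is the bias; the rest pair up with the features.
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


class Client:
    """Hypothetical client holding private (features, label) pairs."""

    def __init__(self, data):
        self.data = data  # raw data never leaves the client

    def local_update(self, weights, lr=0.1, epochs=5):
        # Continue training the received model on local data only (plain SGD).
        w = list(weights)
        for _ in range(epochs):
            for x, y in self.data:
                err = predict(w, x) - y
                w[0] -= lr * err
                for i, xi in enumerate(x):
                    w[i + 1] -= lr * err * xi
        return w


def cyclic_train(clients, n_features, rounds=10):
    # The model travels client -> client in a ring; only weights are passed on.
    weights = [0.0] * (n_features + 1)
    for _ in range(rounds):
        for client in clients:
            weights = client.local_update(weights)
    return weights


def make_toy_data(n, rng):
    # Assumed toy task: the label is 1 when x0 + x1 > 0.
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        rows.append((x, 1 if x[0] + x[1] > 0 else 0))
    return rows


rng = random.Random(0)
clients = [Client(make_toy_data(50, rng)) for _ in range(3)]
weights = cyclic_train(clients, n_features=2)
print(predict(weights, [0.8, 0.8]), predict(weights, [-0.8, -0.8]))
```

With a tree model, the "continue training" step would presumably mean adding boosting rounds to the model received from the previous client, but that detail is an assumption here rather than something taken from the repo.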
Overall, it was a challenging but deeply rewarding experience for all participants. Cheers to @Amy_Bray and the whole @Zindi team for their patience in running very heavy feature extraction pipelines over many hours.
Congratulations.
Thanks Ran. Your team's score is so good too bruv, what did you do? Keep it up!
Same here! It was absolutely amazing! 🎉
Huge thanks to @Zindi, @Nevenka, and all the hosts for organizing this incredible competition.
Here’s how our journey went 👇
We prioritized speed above all else. ⚡ @Knowledge_Seeker101 was truly the brain behind this challenge (the guy literally knows everything 😅).
We started by asking ourselves:
Well, there was, and we managed to pull it off! 💪 Because we were so focused on speed, we selected only the top 100 ranks (readings that appeared most frequently). That made our preprocessing pipeline blazingly fast: about 3 hours 16 minutes on my local PC.
Next, we explored Shannon entropy and canonical k-mers, with a lot of help (again) from @Knowledge_Seeker101 — absolute genius! 🧠
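For anyone else who had to look these terms up: a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement, and Shannon entropy measures how evenly the k-mer counts are spread. Below is a minimal sketch of both in plain Python. It is not the team's actual pipeline, and the function names are purely illustrative. Running the same counting with k=2 and k=3 gives di- and tri-nucleotide frequencies like the ones mentioned earlier in the thread.

```python
import math
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    # Smaller of the k-mer and its reverse complement, so both strands
    # of the same site are counted under one key.
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmer_counts(seq, k):
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[canonical(kmer)] += 1
    return counts


def frequencies(counts):
    # Normalize raw counts so reads of different lengths are comparable.
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def shannon_entropy(counts):
    # Entropy in bits of the k-mer count distribution.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


counts = canonical_kmer_counts("ACGTACGT", k=2)
print(dict(counts))
print(frequencies(counts))
print(shannon_entropy(counts))
```

In the toy sequence, `GT` is folded into `AC` because they are reverse complements of each other, which is why the canonical count table is smaller than a plain k-mer table would be.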
Then came the modeling phase. Based on what @Juliuss mentioned, deep learning models seemed to outperform tree-based ones on the private board. But since the public board had us thinking otherwise, we decided to blend both worlds.
That’s how the EverClassifier was born — an XGBoost-distilled MLP model. We trained the MLP using XGBoost’s probability outputs, and it didn’t just outperform XGBoost on the private board — it gave the raw MLP a serious challenge on the private board too. 🔥
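The distillation step can be sketched without any libraries: the student is fit to the teacher's predicted probabilities (soft labels) instead of the hard 0/1 labels. This is only an illustration under loud assumptions, not the EverClassifier itself: the teacher below is a stand-in function rather than a trained XGBoost model, the student is a single logistic unit rather than an MLP, and every name is hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def teacher_proba(x):
    # Stand-in for a trained tree model's predicted probability of class 1.
    return sigmoid(3.0 * x[0] - 2.0 * x[1])


def student_proba(x, w, b):
    return sigmoid(b + w[0] * x[0] + w[1] * x[1])


def soft_cross_entropy(p_student, p_teacher, eps=1e-12):
    # Cross-entropy of the student's prediction against a soft target.
    return -(p_teacher * math.log(p_student + eps)
             + (1.0 - p_teacher) * math.log(1.0 - p_student + eps))


def distill(xs, soft_targets, lr=0.5, epochs=200):
    # SGD on the soft cross-entropy; its gradient w.r.t. the logit is (p - t).
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in zip(xs, soft_targets):
            grad = student_proba(x, w, b) - t
            b -= lr * grad
            w[0] -= lr * grad * x[0]
            w[1] -= lr * grad * x[1]
    return w, b


def mean_loss(xs, soft_targets, w, b):
    losses = [soft_cross_entropy(student_proba(x, w, b), t)
              for x, t in zip(xs, soft_targets)]
    return sum(losses) / len(losses)


rng = random.Random(0)
xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
soft_targets = [teacher_proba(x) for x in xs]  # teacher outputs, not true labels

loss_before = mean_loss(xs, soft_targets, [0.0, 0.0], 0.0)
w, b = distill(xs, soft_targets)
loss_after = mean_loss(xs, soft_targets, w, b)
print(loss_before, loss_after)
```

The only change from ordinary supervised training is the target: the student sees the teacher's probabilities, which carry more information per sample than a hard label does.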
Finally, for the federated learning setup, @Knowledge_Seeker101 pulled another magic trick — using Flower with the EverClassifier in a clever way across all clients. Result? Great score, happy team, and big smiles all around 😄
Now here’s the funny part — I had zero background in bioinformatics. Like, nothing. But with a bit of determination, solid prompting of LLMs (very, very important 😅), and a good teammate (obviously😅)— we pulled through.
Honestly, I still remember those days I spent prompting ChatGPT for hours straight, trying to understand what on earth “k-mers” even were 😂. Those late-night rabbit holes paid off in the end.
To everyone still exploring — keep pushing, keep prompting, and keep learning. The journey is wild, but it’s worth it. 🚀
I think I may have to pursue bioinformatics later on, who knows 🤷‍♂️?
Good stuff! Relatable to my experience... I like how your initial questions prompted you to come up with the data handling approach. Congrats to you and your teammate.
Thanks, big man! Your federated learning approach was absolutely top-tier — super clean implementation and rock-solid results! 💪🔥
Congratulations @CodeJoe and your teammate. You're very correct, the prompting of LLMs is very important. I had lots of chats with it. In fact, it was with its help that I developed the federated learning script in "our" own way.
Exactly Brother😅, very necessary😂
Congratulations @Juliuss
Thanks. You hacked a great score too, man. Good job!