Microbiome research is rapidly becoming one of the most exciting frontiers in modern healthcare. From chronic disease diagnostics to personalised nutrition and therapeutics, understanding the microbial ecosystems living on and inside the human body has never been more important.
But while sequencing technologies are advancing quickly, making sense of massive metagenomic datasets remains a major challenge.
My name is Julius Mwangi, and this is the story of how I approached the problem, the engineering decisions I made, and the lessons I learned from building an ML system capable of predicting the body-site origin of microbiome samples.
In this project, I set out to build a machine learning model capable of predicting the body-site origin of microbiome samples—specifically gut, oral, nasal, and skin—using 16S rRNA sequencing data encoded in MPEG-G, the emerging ISO standard for genomic data compression.
This end-to-end workflow spanned:
Here’s a detailed breakdown of how it all came together.
The raw samples were provided in MPEG-G format, a highly compressed genomic file format optimised for efficient storage and streaming.
To work with these samples, I used the Genie MPEG-G reference Docker image to decompress the data into FASTQ format.
From these FASTQ files, I extracted 94 biologically relevant features, covering:
These features capture the microbial community structure and sequencing quality necessary for body-site classification.
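The post doesn't enumerate the 94 features, but a few representative ones can be sketched with nothing beyond the standard library. The function names and the exact feature choices below (read length, GC content, Phred+33 mean quality) are my own illustrations, not the project's feature list:

```python
import statistics


def parse_fastq(lines):
    """Yield (sequence, quality) pairs from FASTQ text, 4 lines per record."""
    it = iter(lines)
    for _header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield seq, qual


def sample_features(fastq_lines):
    """Compute a few illustrative per-sample features (names are mine)."""
    gc, lengths, quals = [], [], []
    for seq, qual in parse_fastq(fastq_lines):
        lengths.append(len(seq))
        gc.append((seq.count("G") + seq.count("C")) / len(seq))
        # Phred+33 encoding: per-base quality = ord(char) - 33
        quals.append(statistics.mean(ord(c) - 33 for c in qual))
    return {
        "mean_read_length": statistics.mean(lengths),
        "mean_gc_content": statistics.mean(gc),
        "mean_phred_quality": statistics.mean(quals),
    }
```

In practice a real extractor would stream records instead of holding per-read lists, but the shape of the output, one flat feature vector per sample, is what the classifier consumes.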
Raw features rarely capture the nuances of microbial community structure. To improve separability, I engineered additional features derived from:
This expanded the feature set to 117 total features, providing a richer input space for downstream machine learning.
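The engineered features aren't listed in the post, but one common way to summarise community structure from raw reads is entropy over the k-mer distribution. As a hedged sketch (the k and the diversity index are my assumptions, not necessarily what the project used):

```python
import math
from collections import Counter


def kmer_counts(seqs, k=3):
    """Count all overlapping k-mers across a sample's reads."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i : i + k]] += 1
    return counts


def shannon_diversity(counts):
    """Shannon entropy (in nats) of the k-mer frequency distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A sample dominated by a single k-mer scores near zero, while an even spread across many k-mers scores high, giving the model a compact proxy for community complexity.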
I performed Stratified 10-Fold cross-validation to ensure stable body-site performance and fair class representation.
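With four body sites, stratification guarantees every fold sees all classes in proportion. A minimal scikit-learn sketch (the sample counts and random seed here are placeholders, not the project's):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(40, 117)  # 117 engineered features per sample
y = np.repeat(["gut", "oral", "nasal", "skin"], 10)  # body-site labels

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold keeps the four body sites in the same
    # proportion as the full dataset, so per-class metrics stay stable.
    ...
```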
Using Optuna, I optimised the key XGBoost hyperparameters:
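The post doesn't publish the search space, and the real study would call `optuna.create_study(...).optimize(objective)`. As a dependency-free stand-in, here is a plain random search over the kind of XGBoost space typically tuned; the parameter names match XGBoost's, but the ranges are my assumptions:

```python
import random

# Typical XGBoost search space (ranges are assumptions, not the post's)
SPACE = {
    "max_depth":        lambda rng: rng.randint(3, 10),
    "learning_rate":    lambda rng: 10 ** rng.uniform(-3, -0.5),
    "n_estimators":     lambda rng: rng.randint(100, 1000),
    "subsample":        lambda rng: rng.uniform(0.5, 1.0),
    "colsample_bytree": lambda rng: rng.uniform(0.5, 1.0),
}


def random_search(objective, n_trials=50, seed=0):
    """Stand-in for an Optuna study: sample params, keep the best score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in SPACE.items()}
        score = objective(params)  # e.g. mean CV macro-F1 for these params
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Optuna's TPE sampler replaces the uniform draws with a model of which regions score well, but the objective-in, best-params-out contract is the same.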
EmissionsTracker was used throughout training to measure carbon and energy consumption—a step toward sustainable ML.
Results (Centralised Training):
Each fold produced class probabilities, which were combined using a geometric mean ensemble for robustness.
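The geometric mean rewards classes that all folds agree on and punishes any class a single fold rates near zero, which is why it tends to be more robust than a plain average. A minimal sketch for one sample:

```python
import math


def geometric_mean_ensemble(fold_probs):
    """Combine per-fold class probabilities via a renormalised geometric mean.

    fold_probs: one probability vector per CV fold for a single sample,
    e.g. [[p_gut, p_oral, p_nasal, p_skin], ...].
    """
    n_folds = len(fold_probs)
    n_classes = len(fold_probs[0])
    # Geometric mean per class: exp of the mean log-probability
    gm = [
        math.exp(sum(math.log(fold[c]) for fold in fold_probs) / n_folds)
        for c in range(n_classes)
    ]
    total = sum(gm)  # renormalise so the combined vector sums to 1
    return [g / total for g in gm]
```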
Interpreting ML in biology is critical. SHAP values helped uncover the sequence features most predictive of each microbiome body site:
Key Insights:
This provided confidence that the model wasn't just performing well; it was learning real biological signals.
To simulate privacy-preserving, decentralised genomics scenarios, I implemented Cyclic Federated Learning (CFL), a server-less alternative to FedAvg: unlike traditional frameworks such as Flower and PySyft, it needs no central aggregation server, because the model is handed directly from one client to the next.
Why CFL?
Client Setup
Samples were grouped by metabolic phenotype:
Each phenotype acted as a federated client, training the model locally before passing it to the next.
CFL Performance
CFL matched, and in some cases exceeded, centralised performance, showcasing its potential for privacy-first microbiome ML.
This project demonstrates the full lifecycle of machine learning for metagenomics, combining:
As metagenomic sequencing becomes more routine in healthcare, such pipelines will play a crucial role in diagnostics, personalised medicine, and biological discovery.
My name is Julius Mwangi, a research-driven data scientist who is always learning. I am deeply passionate about exploring how ML, AI, and statistical methods can be applied responsibly to solve real-world challenges. I may not know everything, but I love the process of discovering, experimenting, and building solutions that make a difference.