Winning Solution to the MPEG-G Microbiome Classification Challenge
Meet the winners · 21 Nov 2025, 06:24 · 4 mins read

Microbiome research is rapidly becoming one of the most exciting frontiers in modern healthcare. From chronic disease diagnostics to personalised nutrition and therapeutics, understanding the microbial ecosystems living on and inside the human body has never been more important.

But while sequencing technologies are advancing quickly, making sense of massive metagenomic datasets remains a major challenge.

My name is Julius Mwangi, and this is the story of how I approached the problem, the engineering decisions I made, and the lessons I learned from building an ML system capable of predicting the body-site origin of microbiome samples.

Follow along on GitHub.

In the repo, we showcase how to work with cyclic federated learning, which does not require a central server, unlike traditional frameworks such as Flower and PySyft.

In this project, I set out to build a machine learning model capable of predicting the body-site origin of microbiome samples—specifically gut, oral, nasal, and skin—using 16S rRNA sequencing data encoded in MPEG-G, the emerging ISO standard for genomic data compression.

This end-to-end workflow spanned:

  • Large-scale genomic data handling
  • Feature extraction from FASTQ files
  • Extensive feature engineering
  • XGBoost model tuning and cross-validation
  • SHAP-driven biological interpretation
  • And finally, a Federated Learning extension using a decentralised Cyclic FL scheme

Here’s a detailed breakdown of how it all came together.

🔍 Step 1 — Handling & Preprocessing 200GB of Genomic Data

The raw samples were provided in MPEG-G format, a highly compressed genomic file format optimised for efficient storage and streaming.

To work with these samples, I used the Genie MPEG-G reference Docker image to decompress the data into FASTQ format.

  • 20GB of MPEG-G expanded to over 200GB of FASTQ data
  • Decompression took ~1 hour on a local machine
  • Feature extraction took ~4 hours for the train set and ~2 hours for the test set

From these FASTQ files, I extracted 94 biologically relevant features, covering:

  • Read counts, GC content
  • Sequence complexity
  • Quality score distributions
  • k-mer frequencies
  • Entropy and compositional patterns

These features capture the microbial community structure and sequencing quality necessary for body-site classification.
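
To make the feature types concrete, here is a minimal sketch of three of the metrics listed above (GC content, Shannon entropy, and k-mer counts). The `reads` list stands in for sequences parsed out of a FASTQ file; the actual 94-feature extractor is not shown.

```python
# Sketch: a few of the per-read metrics described above.
# `reads` is a toy stand-in for sequences parsed from FASTQ.
import math
from collections import Counter

def gc_content(read: str) -> float:
    """Fraction of G/C bases in a read."""
    return sum(b in "GC" for b in read) / len(read)

def shannon_entropy(read: str) -> float:
    """Base-2 entropy of the nucleotide composition."""
    counts = Counter(read)
    n = len(read)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def kmer_freqs(read: str, k: int = 3) -> Counter:
    """Counts of overlapping k-mers in a read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

reads = ["ACGTACGTGG", "TTTTACGCGC"]  # toy example reads
sample_features = {
    "mean_gc": sum(gc_content(r) for r in reads) / len(reads),
    "mean_entropy": sum(shannon_entropy(r) for r in reads) / len(reads),
}
```

In the real pipeline, per-read values like these are aggregated (mean, variance, distribution statistics) into one feature vector per sample.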

🧬 Step 2 — Feature Engineering for Biological Signal Enhancement

Raw features rarely capture the nuances of microbial community structure. To improve separability, I engineered additional features derived from:

  • Nucleotide composition
  • k-mer diversity
  • Codon usage patterns
  • Non-linear interactions between sequence metrics

This expanded the feature set to 117 total features, providing a richer input space for downstream machine learning.
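
The kinds of derived features involved can be sketched as follows; the function names (`at_skew`, `kmer_diversity`, the cross-term) are illustrative, not the author's exact 117-feature definitions.

```python
# Sketch of engineered features of the kind described above.

def at_skew(seq: str) -> float:
    """(A - T) / (A + T): strand compositional bias."""
    a, t = seq.count("A"), seq.count("T")
    return (a - t) / (a + t) if a + t else 0.0

def kmer_diversity(seq: str, k: int = 3) -> int:
    """Number of distinct k-mers observed in the sequence."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def gc_entropy_interaction(gc: float, entropy: float) -> float:
    """A simple non-linear cross-term between two base metrics."""
    return gc * entropy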

🤖 Step 3 — Modeling With XGBoost + Optuna Hyperparameter Tuning

I performed Stratified 10-Fold cross-validation to ensure stable body-site performance and fair class representation.

Using Optuna, I optimised the key XGBoost hyperparameters:

  • learning_rate
  • max_depth
  • subsample
  • colsample_bytree
  • min_child_weight

CodeCarbon's EmissionsTracker was used throughout training to measure carbon and energy consumption—a step toward sustainable ML.

Results (Centralised Training):

  • Cross-Validation Log Loss: 0.038
  • Public Leaderboard: 0.0037
  • Private Leaderboard: 0.018

Each fold produced class probabilities, which were combined using a geometric mean ensemble for robustness.
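
A minimal sketch of that geometric-mean combination: each fold's class-probability matrix is averaged in log space, then rows are renormalised so the ensembled probabilities sum to 1.

```python
# Sketch: geometric-mean ensemble of per-fold class probabilities.
import numpy as np

def geometric_mean_ensemble(fold_probs):
    """fold_probs: list of (n_samples, n_classes) probability arrays."""
    clipped = [np.clip(p, 1e-15, 1.0) for p in fold_probs]  # avoid log(0)
    log_mean = np.mean([np.log(p) for p in clipped], axis=0)
    probs = np.exp(log_mean)
    return probs / probs.sum(axis=1, keepdims=True)  # renormalise rows
```

Compared with an arithmetic mean, the geometric mean penalises folds that are confidently wrong, which tends to make the ensemble more robust.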

🔬 Step 4 — SHAP Analysis: Understanding What the Model Learned

Interpreting ML in biology is critical. SHAP values helped uncover the sequence features most predictive of each microbiome body site:

Key Insights:

  • AT_skew and CT frequencies ranked high in global importance
  • Distinct k-mers dominated predictions per body site (e.g., ATT, CTC, TCT)
  • These patterns reveal biologically meaningful motifs characteristic of microbial habitats

This provided confidence that the model wasn't just performing well; it was learning real biological signals.

🔄 Step 5 — Cyclic Federated Learning (CFL) Extension

To simulate privacy-preserving, decentralised genomics scenarios, I implemented Cyclic Federated Learning (CFL)—a server-less alternative to FedAvg.

Why CFL?

  • Requires no central server
  • Ideal for distributed hospitals/labs
  • Maintains patient privacy
  • Each “client” updates the global model sequentially

Client Setup

Samples were grouped by metabolic phenotype:

  • Control
  • Crossover
  • Diabetic
  • Prediabetic

Each phenotype acted as a federated client, training the model locally before passing it to the next.

CFL Performance

  • Cross-Validation: 0.0281
  • Public Leaderboard: 0.0018
  • Private Leaderboard: 0.015

CFL matched, and in some cases exceeded, centralised performance, showcasing its potential for privacy-first microbiome ML.

📈 Final Thoughts

This project demonstrates the full lifecycle of machine learning for metagenomics, combining:

  • Large-scale genomic data processing
  • Feature engineering rooted in biological insight
  • Rigorous model validation and interpretability
  • And privacy-preserving Federated Learning

As metagenomic sequencing becomes more routine in healthcare, such pipelines will play a crucial role in diagnostics, personalised medicine, and biological discovery.

About Julius Mwangi

My name is Julius Mwangi, a research-driven data scientist who is always learning. I am deeply passionate about exploring how ML, AI, and statistical methods can be applied responsibly to solve real-world challenges. I may not know everything, but I love the process of discovering, experimenting, and building solutions that make a difference.
