Winning Solution to the MPEG-G Microbiome Classification Challenge
Meet the winners · 21 Nov 2025, 06:24 · 4 mins read

Microbiome research is rapidly becoming one of the most exciting frontiers in modern healthcare. From chronic disease diagnostics to personalised nutrition and therapeutics, understanding the microbial ecosystems living on and inside the human body has never been more important.

But while sequencing technologies are advancing quickly, making sense of massive metagenomic datasets remains a major challenge.

My name is Julius Mwangi, and this is the story of how I approached the problem, the engineering decisions I made, and the lessons I learned from building an ML system capable of predicting the body-site origin of microbiome samples.

Follow along on GitHub.

In the repo, we showcase how to work with cyclic federated learning, which does not require a central server, unlike traditional frameworks such as Flower and PySyft.

In this project, I set out to build a machine learning model capable of predicting the body-site origin of microbiome samples—specifically gut, oral, nasal, and skin—using 16S rRNA sequencing data encoded in MPEG-G, the emerging ISO standard for genomic data compression.

This end-to-end workflow spanned:

  • Large-scale genomic data handling
  • Feature extraction from FASTQ files
  • Extensive feature engineering
  • XGBoost model tuning and cross-validation
  • SHAP-driven biological interpretation
  • And finally, a Federated Learning extension using a decentralised Cyclic FL scheme

Here’s a detailed breakdown of how it all came together.

🔍 Step 1 — Handling & Preprocessing 200GB of Genomic Data

The raw samples were provided in MPEG-G format, a highly compressed genomic file format optimised for efficient storage and streaming.

To work with these samples, I used the Genie MPEG-G reference Docker image to decompress the data into FASTQ format.

  • 20GB of MPEG-G expanded to over 200GB of FASTQ data
  • Decompression took ~1 hour on a local machine
  • Feature extraction took ~4 hours for the train set and ~2 hours for the test set

From these FASTQ files, I extracted 94 biologically relevant features, covering:

  • Read counts, GC content
  • Sequence complexity
  • Quality score distributions
  • k-mer frequencies
  • Entropy and compositional patterns

These features capture the microbial community structure and sequencing quality necessary for body-site classification.
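
To make the feature types concrete, here is a minimal sketch of three of the metrics listed above (GC content, Shannon entropy, and k-mer counts). The `reads` list stands in for sequences parsed out of a FASTQ file; the actual 94-feature extractor is not shown.

```python
# Sketch: a few of the per-read metrics described above.
# `reads` is a toy stand-in for sequences parsed from FASTQ.
import math
from collections import Counter

def gc_content(read: str) -> float:
    """Fraction of G/C bases in a read."""
    return sum(b in "GC" for b in read) / len(read)

def shannon_entropy(read: str) -> float:
    """Base-2 entropy of the nucleotide composition."""
    counts = Counter(read)
    n = len(read)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def kmer_freqs(read: str, k: int = 3) -> Counter:
    """Counts of overlapping k-mers in a read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

reads = ["ACGTACGTGG", "TTTTACGCGC"]  # toy example reads
sample_features = {
    "mean_gc": sum(gc_content(r) for r in reads) / len(reads),
    "mean_entropy": sum(shannon_entropy(r) for r in reads) / len(reads),
}
```

In the real pipeline, per-read values like these are aggregated (mean, variance, distribution statistics) into one feature vector per sample.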

🧬 Step 2 — Feature Engineering for Biological Signal Enhancement

Raw features rarely capture the nuances of microbial community structure. To improve separability, I engineered additional features derived from:

  • Nucleotide composition
  • k-mer diversity
  • Codon usage patterns
  • Non-linear interactions between sequence metrics

This expanded the feature set to 117 total features, providing a richer input space for downstream machine learning.
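
The kinds of derived features involved can be sketched as follows; the function names (`at_skew`, `kmer_diversity`, the cross-term) are illustrative, not the author's exact 117-feature definitions.

```python
# Sketch of engineered features of the kind described above.

def at_skew(seq: str) -> float:
    """(A - T) / (A + T): strand compositional bias."""
    a, t = seq.count("A"), seq.count("T")
    return (a - t) / (a + t) if a + t else 0.0

def kmer_diversity(seq: str, k: int = 3) -> int:
    """Number of distinct k-mers observed in the sequence."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def gc_entropy_interaction(gc: float, entropy: float) -> float:
    """A simple non-linear cross-term between two base metrics."""
    return gc * entropy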

🤖 Step 3 — Modeling With XGBoost + Optuna Hyperparameter Tuning

I performed Stratified 10-Fold cross-validation to ensure stable body-site performance and fair class representation.

Using Optuna, I optimised the key XGBoost hyperparameters:

  • learning_rate
  • max_depth
  • subsample
  • colsample_bytree
  • min_child_weight

CodeCarbon's EmissionsTracker was used throughout training to measure carbon and energy consumption—a step toward sustainable ML.

Results (Centralised Training):

  • Cross-Validation Log Loss: 0.038
  • Public Leaderboard: 0.0037
  • Private Leaderboard: 0.018

Each fold produced class probabilities, which were combined using a geometric mean ensemble for robustness.
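
A minimal sketch of that geometric-mean combination: each fold's class-probability matrix is averaged in log space, then rows are renormalised so the ensembled probabilities sum to 1.

```python
# Sketch: geometric-mean ensemble of per-fold class probabilities.
import numpy as np

def geometric_mean_ensemble(fold_probs):
    """fold_probs: list of (n_samples, n_classes) probability arrays."""
    clipped = [np.clip(p, 1e-15, 1.0) for p in fold_probs]  # avoid log(0)
    log_mean = np.mean([np.log(p) for p in clipped], axis=0)
    probs = np.exp(log_mean)
    return probs / probs.sum(axis=1, keepdims=True)  # renormalise rows
```

Compared with an arithmetic mean, the geometric mean penalises folds that are confidently wrong, which tends to make the ensemble more robust.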

🔬 Step 4 — SHAP Analysis: Understanding What the Model Learned

Interpreting ML in biology is critical. SHAP values helped uncover the sequence features most predictive of each microbiome body site:

Key Insights:

  • AT_skew and CT frequencies ranked high in global importance
  • Distinct k-mers dominated predictions per body site (e.g., ATT, CTC, TCT)
  • These patterns reveal biologically meaningful motifs characteristic of microbial habitats

This provided confidence that the model wasn't just performing well; it was learning real biological signals.

🔄 Step 5 — Cyclic Federated Learning (CFL) Extension

To simulate privacy-preserving, decentralised genomics scenarios, I implemented Cyclic Federated Learning (CFL)—a server-less alternative to FedAvg.

Why CFL?

  • Requires no central server
  • Ideal for distributed hospitals/labs
  • Maintains patient privacy
  • Each “client” updates the global model sequentially

Client Setup

Samples were grouped by metabolic phenotype:

  • Control
  • Crossover
  • Diabetic
  • Prediabetic

Each phenotype acted as a federated client, training the model locally before passing it to the next.

CFL Performance

  • Cross-Validation: 0.0281
  • Public Leaderboard: 0.0018
  • Private Leaderboard: 0.015

CFL matched, and in some cases exceeded, centralised performance, showcasing its potential for privacy-first microbiome ML.

📈 Final Thoughts

This project demonstrates the full lifecycle of machine learning for metagenomics, combining:

  • Large-scale genomic data processing
  • Feature engineering rooted in biological insight
  • Rigorous model validation and interpretability
  • And privacy-preserving Federated Learning

As metagenomic sequencing becomes more routine in healthcare, such pipelines will play a crucial role in diagnostics, personalised medicine, and biological discovery.

About Julius Mwangi

My name is Julius Mwangi, a research-driven data scientist who is always learning. I am deeply passionate about exploring how ML, AI, and statistical methods can be applied responsibly to solve real-world challenges. I may not know everything, but I love the process of discovering, experimenting, and building solutions that make a difference.
