Apparently, simple models trump sophisticated models in this competition, at least judging by the majority of my submissions, which came from logistic regression models.
Approach Outline
Thanks to @MuhammadQasimShabbeer for decompressing the files and sharing the Kaggle link.
Data Preprocessing
Data cleaning
- Removed incorrect bases (bases with Ns, if present)
- Removed bases with low quality scores (below 20) and reads with sequence lengths less than 50
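As a rough sketch of the cleaning step above (assuming reads come with per-base Phred quality scores; the function name and thresholds are illustrative):

```python
def clean_read(seq, quals, min_qual=20, min_len=50):
    """Drop bases called as N or with Phred quality below min_qual,
    then discard the read entirely if fewer than min_len bases remain."""
    kept = [b for b, q in zip(seq, quals) if b != "N" and q >= min_qual]
    cleaned = "".join(kept)
    return cleaned if len(cleaned) >= min_len else None

# A read left with fewer than 50 bases after filtering is discarded (returns None).
print(clean_read("ACGTN" * 8, [30] * 40))
```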
Feature Engineering
- Created 8-kmer sequences from each read using a sliding window with step 1 (no successive skips).
- Aggregated them into counts
- To prevent sample bias due to read length and depth, I normalised counts using the centred-log ratio method (CLR).
- Dimensionality reduction: I tried three methods: Partial Least Squares (PLS), PCA and SVD. For PCA, I kept the principal components that explained 95% of the variance in the data; for PLS I used 25 components, and for SVD about 200 components.
CLR = np.log1p(counts) - np.mean(np.log1p(counts), axis=1, keepdims=True)  # per-sample centring (rows = samples)
The total number of possible unique kmer sequences to generate is equal to 4**k. So, with k=8, I had about 66k unique kmers.
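A minimal sketch of the feature pipeline above (8-kmer counting with step 1, CLR normalisation per sample, then PCA keeping 95% of the variance); function names are illustrative, not the author's actual code:

```python
import numpy as np
from collections import Counter
from itertools import product
from sklearn.decomposition import PCA

K = 8
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**8 = 65536 kmers
IDX = {kmer: i for i, kmer in enumerate(VOCAB)}

def kmer_counts(reads, k=K):
    """Count every k-length substring (step 1) across a sample's reads."""
    counts = np.zeros(len(VOCAB))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in IDX:          # silently skips kmers containing N
                counts[IDX[kmer]] += 1
    return counts

def clr(counts):
    """Centred log-ratio per sample (rows = samples, columns = kmers)."""
    logs = np.log1p(counts)
    return logs - logs.mean(axis=1, keepdims=True)

# X = np.vstack([kmer_counts(sample_reads) for sample_reads in samples])
# X_red = PCA(n_components=0.95).fit_transform(clr(X))
```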
Modelling
- Logistic Regression (I used the one from Scikit-Learn, and also one implemented in Pytorch ).
- CV scores were between 0.012xx and 0.02. Public LB score: 0.002xx; private LB: 0.007xx.
- Unfortunately, I selected models from a different approach, which I will mention next.
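The modelling step might look roughly like this with scikit-learn (synthetic data stands in for the real reduced features; 4 classes as in this competition):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # stand-in for the reduced CLR features
y = rng.integers(0, 4, size=200)    # 4 classes

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
print(-scores.mean())  # mean CV log loss
```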
Other Approaches I tried
- Generated 6-kmer counts and applied the dimensionality reduction strategies outlined above, but performance was not as good as with 8-kmers.
- Generated 10-kmer counts but had to abandon this because it needed too much compute: feature extraction alone took about 8-10 hours.
Autoencoder
- Autoencoder from CLR normalised 8-kmer counts
- A PyTorch model with two dense layers and a logistic regression model
NB: CV scores were similar to the linear models', and these models performed better on the public LB, but much worse than the linear models on the private LB (these were my selected submissions).
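A sketch of the idea, with scikit-learn's MLPRegressor standing in for the author's two-layer PyTorch autoencoder: the network is trained to reconstruct its input through a bottleneck, and the bottleneck activations become features for logistic regression. Sizes here are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))   # stand-in for CLR-normalised 8-kmer counts
y = rng.integers(0, 4, size=300)

# Train the network to reconstruct its input through a 32-unit bottleneck.
ae = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
ae.fit(X, X)

# Manually compute the bottleneck (hidden-layer) activations: ReLU(X W + b).
hidden = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])

clf = LogisticRegression(max_iter=1000).fit(hidden, y)
```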
Word embeddings
- Another approach used a DNA2VEC embedding model, a word2vec-style model trained on 3- to 8-kmer gene sequences.
- A PyTorch model with two dense layers and a logistic regression model
- CV scores were also similar to the other approaches, but the private LB score was worse.
Final Thoughts
- In terms of runtime, extracting the kmer counts was much faster than using the word embeddings. The total runtime for feature extraction and generation was 4h 40m 11s on Kaggle, while that of word embeddings took 10h 43m 24s. Also, modelling runtime using simple models was much faster than the deep learning models.
- I also learnt a few things and tried many ideas and bits of code from this competition, notably the concept of federated learning, which I tried to implement from scratch with the help of Copilot (not sure whether the performance would be the same if implemented with the Flower library, but at least the code works, even though the performance is not great).
- I also tried some other normalisation methods: normalising kmer counts by the total kmer count in each sample, and Term Frequency-Inverse Document Frequency (TF-IDF) weighting to reduce sample and kmer-sequence bias.
- Using all kmer counts directly as features in the logistic regression models also performed better than the sophisticated models.
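The alternative normalisations mentioned above can be sketched as follows (TfidfTransformer treats each sample as a "document" and each kmer as a "term"); the toy matrix is illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[10, 0, 5],
                   [ 2, 3, 0]], dtype=float)  # rows = samples, cols = kmers

# 1) Relative abundance: divide by each sample's total kmer count.
rel = counts / counts.sum(axis=1, keepdims=True)

# 2) TF-IDF: down-weights kmers that appear in every sample.
tfidf = TfidfTransformer().fit_transform(counts).toarray()
```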
Amazing!! What was your score with the federated approach?
If I implemented the code correctly, then it really isn't performing well. I am getting a very high loss (1.xx), possibly due to the labels each client model sees. I used 4 clients, each representing one of the 4 individual classes.
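For reference, the setup described here can be sketched as a toy from-scratch FedAvg loop, with each client holding data from a single class (exactly the label skew that can cause a high loss); this is a sketch of the idea, not the actual code:

```python
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of multinomial logistic regression on a client."""
    logits = X @ w
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(w.shape[1])[y]
    return w - lr * X.T @ (p - onehot) / len(X)

rng = np.random.default_rng(0)
n_classes, n_feat = 4, 10
# Each of the 4 clients only holds samples of one class (label skew).
clients = [(rng.normal(loc=c, size=(50, n_feat)), np.full(50, c))
           for c in range(n_classes)]

w = np.zeros((n_feat, n_classes))
for _ in range(20):                        # FedAvg rounds
    local = [local_step(w, X, y) for X, y in clients]
    w = np.mean(local, axis=0)             # server averages client weights
```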
I see
What is yours 😂? Mine is ~0.02.
@Gozie, Honestly you did super well. I thought most guys with insanely low scores were post-processing. But I heard top 20 will receive a message if I'm not mistaken. Submit your code for review. You might do great big man😊.
@CodeJoe Thanks man. Well, not sure about the top 20 receiving a message. I didn't receive any email concerning it.
Oh interesting, my bad! When I read the rules, it made mention of the top 20 participants for review.
Genius 💯. We didn't consider data cleaning; we were also considering doing DNA2VEC but weren't able to. Can you please share the word-embeddings notebook if possible?
Hi,
This is the link to the DNA2VEC feature extraction notebook. Here
I must say, it took very long to complete: more than 12 hours for the train and test sets combined.
🙏🙏
Thanks for sharing. I didn't understand one thing:
what do you mean by N's? Do you mean NaNs?
Hi,
Not NaNs. During DNA sequencing, A, G, C and T bases are called as the fragments pass through the sequencer, and when the sequencer is confident of a call it assigns the base a quality score. When it isn't sure which base was called, it assigns an N instead. These "incorrect" bases aren't valid, since DNA has only four base types, so to keep them out of kmer generation I removed them by not including any kmer containing an N in the counts.
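In code terms, a sketch of the above: any kmer touching an N is simply never counted (k=4 here just to keep the example short).

```python
def valid_kmers(read, k=8):
    """Yield only kmers made entirely of A/C/G/T; kmers containing an N are skipped."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer

# The 4-mers overlapping the N in the middle are all dropped:
print(list(valid_kmers("ACGTNACGTACG", k=4)))
# ['ACGT', 'ACGT', 'CGTA', 'GTAC', 'TACG']
```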
ok!!, thanks a lot.