
MPEG-G Microbiome Classification Challenge

$5 000 USD
Completed (6 months ago)
Classification
Federated Learning
Python
Deep Learning
794 joined
83 active
Start: Jun 20, 2025
Close: Sep 15, 2025
Reveal: Sep 15, 2025
Freelance
Approach (Unofficial 2nd place)
Platform · 15 Sep 2025, 09:56 · 13

It seems that simple models trumped sophisticated ones in this competition, at least judging by the majority of my submissions, which came from logistic regression models.

Approach Outline

Thanks to @MuhammadQasimShabbeer for decompressing the files and sharing the Kaggle link.

Data Preprocessing

Data cleaning

  • Removed incorrect bases (bases called as N, if present)
  • Removed bases with quality scores below 20 and reads shorter than 50 bases
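The cleaning steps above could look roughly like this. A minimal sketch, not the author's actual code: `reads` and `quals` are hypothetical inputs holding the sequences and their per-base Phred quality scores.

```python
# Sketch of the cleaning step: drop N bases and low-quality bases,
# then discard reads that end up shorter than the minimum length.
# `reads`/`quals` are hypothetical inputs (sequences and per-base Phred scores).

def clean_reads(reads, quals, min_qual=20, min_len=50):
    cleaned = []
    for seq, q in zip(reads, quals):
        # Keep only valid DNA bases (no N) with quality >= min_qual
        kept = [b for b, s in zip(seq, q) if b in "ACGT" and s >= min_qual]
        if len(kept) >= min_len:  # drop reads shorter than 50 bases
            cleaned.append("".join(kept))
    return cleaned
```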

Feature Engineering

  1. Generated 8-kmer sequences using a sliding window with no skips between successive positions.
  2. Aggregated them into per-sample counts.
  3. To prevent sample bias from read length and depth, I normalised the counts using the centred log-ratio (CLR) method.
  4. Dimensionality reduction: I tried three methods: Partial Least Squares (PLS), PCA and SVD. For PCA, I kept the principal components that explained 95% of the variance; for PLS, I used 25 components; for SVD, about 200 components.
import numpy as np
# Centred log-ratio style normalisation of the log1p-transformed kmer counts
CLR = np.log1p(counts) - np.mean(np.log1p(counts), axis=0)

The total number of possible unique kmer sequences is 4**k, so with k=8 I had about 66k (4**8 = 65,536) unique kmers.
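A kmer count step like the one described above can be sketched as follows. This is an illustrative version, not the author's code; with k=8 the vocabulary size is 4**8 = 65,536.

```python
from collections import Counter
from itertools import product

K = 8
# All 4**K possible kmers over the DNA alphabet (65,536 for K=8)
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]

def kmer_counts(read, k=K):
    # Slide a window of length k over the read with stride 1 (no skips)
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))
```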

Modelling

  • Logistic Regression (I used the Scikit-Learn implementation, and also one written in PyTorch).
  • CV scores were between 0.012xx and 0.02. Public LB scores: 0.002xx; Private LB: 0.007xx.
  • Unfortunately, I selected models from a different approach, which I will describe next.
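A minimal sketch of this modelling setup with scikit-learn. The data here is synthetic, standing in for the CLR-normalised kmer features, and the metric is assumed to be log loss; all parameters are illustrative, not the author's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))   # stand-in for CLR-normalised kmer features
y = rng.integers(0, 4, size=200)  # 4 classes, as in the competition

clf = LogisticRegression(max_iter=1000)
# Assumed competition-style metric: log loss (lower is better)
scores = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
cv_loss = -scores.mean()
```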

Other Approaches I tried

  • Generated 6-kmer counts and applied the dimensionality reduction strategies outlined above, but performance was not as good as with 8-kmers.
  • Generated 10-kmer counts, but had to abandon them because they required a lot of computational resources; feature extraction alone took about 8-10 hours.

Autoencoder

  1. Autoencoder from CLR normalised 8-kmer counts
  2. A PyTorch model with two dense layers and a logistic regression model

NB: The autoencoder models had CV scores similar to the linear models and performed better on the public LB, but much worse than the linear models on the private LB (these were my selected submissions).
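The autoencoder idea can be sketched with scikit-learn's MLPRegressor as a stand-in for the author's PyTorch model: a network trained to reconstruct its input through a narrow bottleneck, whose bottleneck activations then feed a logistic regression. Layer sizes and data are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))  # stand-in for CLR-normalised kmer features

# Train X -> X with an 8-unit bottleneck; the bottleneck activations
# become compressed features for a downstream logistic regression.
ae = MLPRegressor(hidden_layer_sizes=(32, 8, 32), max_iter=500, random_state=0)
ae.fit(X, X)

def encode(model, X):
    # Manual forward pass through the first two (ReLU) layers to get
    # the 8-dimensional bottleneck representation
    h = np.maximum(X @ model.coefs_[0] + model.intercepts_[0], 0)
    return np.maximum(h @ model.coefs_[1] + model.intercepts_[1], 0)

Z = encode(ae, X)  # compressed features, one 8-dim vector per sample
```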

Word embeddings

  1. Another approach used a DNA2VEC embedding model, a word2vec model trained on gene-sequence kmers of lengths 3-8.
  2. A PyTorch model with two dense layers, and a logistic regression model.
  3. CV scores were again similar to the other approaches, but the Private LB was worse.
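The embedding-based feature extraction can be sketched as averaging per-kmer vectors over a read. Here a random embedding table stands in for trained DNA2VEC vectors; the real model would supply the lookup, and `read_vector` is a hypothetical helper, not the author's code.

```python
import numpy as np

K, DIM = 8, 100
rng = np.random.default_rng(0)

def read_vector(read, embed, dim=DIM, k=K):
    # Average the embedding vectors of all kmers found in the read;
    # return a zero vector if no kmer has an embedding.
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    vecs = [embed[m] for m in kmers if m in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Random stand-in embeddings; DNA2VEC would provide trained vectors instead.
embed = {"ACGTACGT": rng.normal(size=DIM), "CGTACGTA": rng.normal(size=DIM)}
v = read_vector("ACGTACGTA", embed)
```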

Final Thoughts

  • In terms of runtime, extracting the kmer counts was much faster than using the word embeddings. The total runtime for feature extraction and generation was 4h 40m 11s on Kaggle, while that of word embeddings took 10h 43m 24s. Also, modelling runtime using simple models was much faster than the deep learning models.
  • I also learned a few things and tried many ideas during this competition, notably the concept of federated learning, which I tried to implement from scratch with the help of Copilot. (I'm not sure the performance would match an implementation in the Flower library, but at least the code works, even though the performance is not great.)
  • I also tried some other normalisation methods: normalising kmer counts by the total kmer count in each sample, and Term Frequency-Inverse Document Frequency (TF-IDF) normalisation, to reduce sample and kmer-sequence bias.
  • Using all kmer counts directly as features in the logistic regression models also performed better than the sophisticated models.
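The TF-IDF normalisation mentioned above can be applied to a kmer count matrix with scikit-learn. A sketch with a toy count matrix, not the competition data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy kmer count matrix: rows are samples, columns are kmers.
counts = np.array([[3, 0, 1],
                   [2, 2, 0],
                   [0, 1, 4]])

# TF-IDF downweights kmers that occur in many samples and
# L2-normalises each sample's row, reducing depth and kmer bias.
tfidf = TfidfTransformer().fit_transform(counts).toarray()
```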

Discussion 13 answers
Koleshjr
Multimedia university of kenya

Amazing!! What was your score with the federated approach?

15 Sep 2025, 10:14
Upvotes 0
Freelance

If the code I wrote is correct, then it really isn't performing well. I'm seeing a very high loss (1.xx), possibly due to the labels each client model has: I used 4 clients representing the 4 individual classes.

Koleshjr
Multimedia university of kenya

I see

CodeJoe

What is yours 😂? Mine is ~0.02.

CodeJoe

@Gozie, honestly you did super well. I thought most guys with insanely low scores were post-processing. But I heard the top 20 will receive a message, if I'm not mistaken. Submit your code for review. You might do great, big man 😊.

15 Sep 2025, 16:14
Upvotes 0
Freelance

@CodeJoe Thanks, man. Well, I'm not sure about the top 20 receiving a message; I didn't receive any email about it.

CodeJoe

Oh interesting, my bad! When I read the rules, they mentioned the top 20 participants for review.

Knowledge_Seeker101
Freelance

Genius 💯. We didn't consider data cleaning, and we were also considering doing DNA2VEC but weren't able to. Can you please share the word-embeddings notebook, if possible?

15 Sep 2025, 21:13
Upvotes 1
Freelance

Hi,

This is the link to the DNA2VEC feature extraction notebook. Here

I must say, it took a very long time to complete: more than 12 hours for both the train and test sets.

CodeJoe

🙏🙏

Thanks for sharing. I didn't understand this part:

  • Removed incorrect bases (bases with Ns, if present)

What do you mean by Ns? Do you mean NaNs?

18 Sep 2025, 13:12
Upvotes 0
Freelance

Hi,

Not NaNs. During DNA sequencing, the A, G, C and T bases are called as the fragments pass through the sequencer, and when the sequencer is confident of a call it assigns the base a quality score. In cases where it isn't sure which base was called, it assigns an N instead. These "incorrect" bases aren't valid, since DNA has only four base types, so to keep them out of the kmer generation I removed them by not including them in the counts.

Ok!! Thanks a lot.