🩺 AI in Focus: Track 2 Solution

MPEG-G: Decoding the Dialogue


Gozie

Freelance

Track 2 Solution

Notebooks · 14 Jan 2026, 13:27 · 4

I participated in three tracks: 1, 2 and 5. Track 2 built mostly on the other two tracks and on the microbiome classification challenge.

Data Preparation

Removed invalid bases (Ns, if present).
Removed bases with quality scores below 20 and reads shorter than 50 bases.
Created 8-mers (kmers with k = 8), with no successive skips.
Aggregated them into counts.
Selected samples with matching cytokine levels.
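
The cleaning and counting steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the function names are hypothetical, reads are assumed to arrive as (sequence, Phred scores) pairs, filtering is assumed to be per base, and "no successive skips" is read here as a contiguous sliding window with a stride of 1. Note that dropping a base from the middle of a read joins its neighbours, so a real pipeline might trim or split reads instead.

```python
from collections import Counter

K = 8           # kmer length from the write-up
MIN_QUAL = 20   # Phred quality threshold from the write-up
MIN_LEN = 50    # minimum read length from the write-up


def clean_read(seq, quals):
    """Drop N bases and bases with Phred quality below MIN_QUAL.

    Assumption: filtering is per base. The original pipeline may have
    trimmed read ends or discarded whole reads instead.
    """
    kept = [b for b, q in zip(seq.upper(), quals) if b != "N" and q >= MIN_QUAL]
    return "".join(kept)


def count_kmers(reads, k=K):
    """Count contiguous kmers (stride 1) over all reads that pass the filters."""
    counts = Counter()
    for seq, quals in reads:
        cleaned = clean_read(seq, quals)
        if len(cleaned) < MIN_LEN:
            continue  # read is too short after cleaning
        for i in range(len(cleaned) - k + 1):
            counts[cleaned[i:i + k]] += 1
    return counts
```

Running `count_kmers` once per sample and stacking the resulting counters gives the sample-by-kmer count matrix that the later steps work on.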

Feature selection

Variance thresholding was applied in track 1 to filter out low-variance kmers, reducing the feature set by about 80% from 65,536 unique kmers.
Elastic net linear regression was used to keep only kmers with non-zero coefficients, reducing the set to 1,492 kmers.
In track 5, Spearman's correlation coefficient was used to select kmers associated with cytokines. Kmers that were statistically significant (p < 0.05) with an effect size >= 0.25 were carried into track 2, which trimmed the set further to 262 kmers.
The Centred Log-Ratio (CLR) method was used to normalise the 262 kmer counts.

CLR = log(1 + counts) − mean(log(1 + counts))
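
As a small sketch of the formula above, assuming the counts sit in a samples-by-kmers matrix and the mean is taken across kmers within each sample (the usual CLR convention; the write-up does not state the axis), the transform could look like this. The function name `clr` is just for illustration.

```python
import numpy as np


def clr(counts):
    """Centred log-ratio with a pseudocount of 1, applied per sample (row)."""
    counts = np.asarray(counts, dtype=float)
    log_counts = np.log1p(counts)  # log(1 + counts)
    # Subtract each sample's mean log-count so every row is centred on zero.
    return log_counts - log_counts.mean(axis=1, keepdims=True)
```

After the transform each row sums to zero, so the values describe a kmer's abundance relative to the other kmers in the same sample rather than its raw count.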

Microbe-Cytokine Association

A penalised (ridge) linear model was used for this. I chose a linear model for its interpretability (the coefficients) and the ridge penalty for its ability to handle multicollinearity among kmers.
To generate confidence scores and a ranking, the original data was resampled with a bootstrap. This was repeated 1,000 times to get a reliable estimate of each association.
To quantify the reliability of each kmer-cytokine association, I computed stability metrics across the bootstrap resamples to identify kmers or microbes whose effects are not due to sampling error or a spurious relationship. The metrics were sign stability, coefficient of variation (CV) stability, topN and scaled absolute mean coefficient.
The sign and CV stabilities were combined into a composite score that ranks the associations.
Sign stability measures how consistent the direction (negative or positive) of an association is. The higher the score, the more reliable the association.
CV stability measures the variability of an association, normalised by the mean coefficient value. The smaller the variability, the better the association. In other words, a stable kmer's effect on a cytokine changes little from one resample to the next.
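
The bootstrap and the two stability metrics could be sketched as below. This is an illustration under stated assumptions rather than the code from the repo: the ridge fit uses the closed-form solution with a hypothetical penalty `alpha=1.0`, the mapping of CV to a 0-1 score (`1 / (1 + CV)`) and the product used as the composite score are placeholders, since the write-up does not give those formulas, and topN and the scaled absolute mean coefficient are left out.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge coefficients on centred data (intercept not penalised)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    penalty = alpha * np.eye(X.shape[1])
    return np.linalg.solve(Xc.T @ Xc + penalty, Xc.T @ yc)


def bootstrap_stability(X, y, n_boot=1000, alpha=1.0, seed=0):
    """Refit ridge on bootstrap resamples and summarise coefficient stability."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n_samples, size=n_samples)  # rows drawn with replacement
        coefs[b] = ridge_fit(X[idx], y[idx], alpha)

    mean_coef = coefs.mean(axis=0)
    # Sign stability: share of resamples whose sign matches the sign of the mean.
    sign_stability = (np.sign(coefs) == np.sign(mean_coef)).mean(axis=0)
    # CV: spread across resamples relative to the absolute mean coefficient.
    cv = coefs.std(axis=0) / (np.abs(mean_coef) + 1e-12)
    cv_stability = 1.0 / (1.0 + cv)             # assumed mapping to (0, 1]
    composite = sign_stability * cv_stability   # hypothetical composite score
    return {
        "mean_coef": mean_coef,
        "sign_stability": sign_stability,
        "cv_stability": cv_stability,
        "composite": composite,
    }
```

Here `X` would be the CLR-normalised kmer matrix and `y` the levels of one cytokine, so the function is run once per cytokine. Sorting kmers by `composite` in descending order then gives a ranking in the spirit described above.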

More information (code scripts and results) can be found here

All thanks to the Zindi team and to the organisers of this challenge. It was nice to have participated.

Discussion 4 answers

CodeJoe

Wow Congrats big man! So much with just linear models. I have to take my math seriously😅

14 Jan 2026, 13:33

Upvotes 1

Gozie

Freelance

Thanks man. Yea, I had to go with the simplest method to balance interpretation and run time, and linear models felt very appropriate.

replied to CodeJoe · 14 Jan 2026, 13:38

Upvotes 1

CodeJoe

Honestly, I actually find that amazing. Indeed the methodology (features and analysis) matters more than the model. I am definitely starring the repo⭐!

replied to Gozie · 14 Jan 2026, 13:39

Upvotes 1

Gozie

Freelance

Thanks

replied to CodeJoe · 14 Jan 2026, 13:41

Upvotes 1
