I participated in three (3) tracks: 1, 2 and 5. Track 2 depended mostly on the other two tracks and the microbiome classification challenge.
Data Preparation
- Removed invalid bases (bases with Ns, if present)
- Removed bases with low quality scores (below 20) and reads with sequence lengths less than 50
- Created kmers of length 8 (8-mers), with no successive skips.
- Aggregated them into counts.
- Selected samples with matching cytokine levels.
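The preparation steps above can be sketched roughly as follows. This is only an illustration, not the code from my scripts: the function names (`clean_read`, `count_kmers`) and the input shape (a list of `(sequence, quality scores)` pairs) are assumptions, low-quality and N bases are simply dropped rather than trimmed with a dedicated tool, and "no successive skips" is read here as a sliding window that moves one base at a time.

```python
from collections import Counter

K = 8              # kmer length
MIN_QUALITY = 20   # Phred score threshold for keeping a base
MIN_LENGTH = 50    # reads shorter than this after cleaning are discarded


def clean_read(seq, quals):
    """Keep only A/C/G/T bases whose quality score is at least MIN_QUALITY."""
    return "".join(
        base for base, q in zip(seq.upper(), quals)
        if base in "ACGT" and q >= MIN_QUALITY
    )


def count_kmers(reads, k=K):
    """Count every overlapping kmer (window step of 1) across the kept reads."""
    counts = Counter()
    for seq, quals in reads:
        cleaned = clean_read(seq, quals)
        if len(cleaned) < MIN_LENGTH:
            continue  # read too short after cleaning
        for i in range(len(cleaned) - k + 1):
            counts[cleaned[i:i + k]] += 1
    return counts


# Toy input: one 60-base read that passes, one short read with Ns that is dropped.
reads = [
    ("ACGT" * 15, [30] * 60),
    ("ACGTN" * 4, [30] * 20),
]
kmer_counts = count_kmers(reads)
print(kmer_counts.most_common(2))
```

In practice one such count vector is built per sample, and the vectors are stacked into a sample-by-kmer table.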
Feature Selection
- Variance thresholding was applied in track 1 to filter out low-variance kmer sequences, reducing the feature set by about 80% from the initial 65,536 unique kmers (4^8, the number of possible 8-mers)
- Elastic net linear regression was then used to keep only kmers with non-zero coefficients, reducing the set to 1,492 kmers
- In track 5, Spearman's correlation coefficient was used to select kmers that associate with cytokines. Statistically significant kmers (p < 0.05) with an effect size >= 0.25 were selected for track 2. This further trimmed the set to 262 kmers.
- The Centred Log-Ratio normalisation method was used to normalise the 262 kmer counts.
CLR = log(1 + counts) − mean(log(1 + counts))
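As a small worked example of the formula above, here is a minimal sketch in plain Python. It assumes the mean is taken over the kmers within a single sample, which is the usual CLR convention but is not spelled out above; the function name `clr` is just for illustration.

```python
import math


def clr(counts):
    """Centred log-ratio of one sample's kmer counts, using log(1 + count)."""
    logs = [math.log(1 + c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Counts 0, 1, 3 give logs 0, ln 2, ln 4, whose mean is ln 2.
sample = [0, 1, 3]
print(clr(sample))  # approximately [-0.693, 0.0, 0.693]
```

The transformed values of a sample always sum to zero, so each value reads as abundance relative to that sample's average.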
Microbe-Cytokine Association
- A penalised (ridge) linear model was used for this. A linear model was chosen for its interpretability (coefficients), and the ridge penalty for its ability to handle multicollinearity among kmers.
- To generate confidence scores and a rank, a bootstrap method was used to create resamples of the original data. This was done 1,000 times to obtain a reliable estimate of each kmer-cytokine association.
- To quantify the reliability of the kmer-cytokine association, I computed stability metrics across the bootstrap resamples to identify kmers or microbes whose effects are unlikely to be due to sampling error or a spurious relationship. The metrics were sign stability, coefficient of variation (CV) stability, topN and scaled absolute mean coefficient.
- The sign and CV stabilities were used to compute a composite score that ranks each association.
- The sign stability measures how consistent the direction (negative or positive) of an association is. The higher the score, the more reliable the association.
- The CV stability measures the variability of an association, normalised by the mean coefficient value in each bootstrap. The smaller the variability, the more reliable the association. That is, the effect of a kmer on a cytokine does not change much between resamples.
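The bootstrap and the two main stability metrics can be sketched like this. It is a simplified illustration under several assumptions, not the code from my scripts: the ridge fit is a small closed-form solver without an intercept, `lam` is an arbitrary penalty value, sign stability is taken as the share of resamples that agree with the majority sign, and CV is the standard deviation of a coefficient across resamples divided by its absolute mean. The composite score, topN and scaled absolute mean coefficient are left out because their exact formulas are not given above.

```python
import random
import statistics


def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge without intercept: solve (X'X + lam*I) b = X'y."""
    n, p = len(X), len(X[0])
    # Augmented matrix [X'X + lam*I | X'y].
    A = [
        [sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
         for b in range(p)]
        + [sum(X[i][a] * y[i] for i in range(n))]
        for a in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[a][p] / A[a][a] for a in range(p)]


def bootstrap_stability(X, y, n_boot=1000, lam=1.0, seed=0):
    """Refit ridge on resampled rows and summarise each coefficient's stability."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    coefs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        coefs.append(ridge_fit([X[i] for i in idx], [y[i] for i in idx], lam))
    results = []
    for j in range(p):
        values = [c[j] for c in coefs]
        mean = statistics.fmean(values)
        positive = sum(v > 0 for v in values) / n_boot
        negative = sum(v < 0 for v in values) / n_boot
        cv = statistics.pstdev(values) / abs(mean) if mean != 0 else float("inf")
        results.append({
            "feature": j,
            "mean_coef": mean,
            "sign_stability": max(positive, negative),  # 1.0 = same sign every time
            "cv": cv,                                   # smaller = more stable
        })
    return results


# Synthetic data: feature 0 drives the response, feature 1 is pure noise.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
y = [2.0 * x0 + rng.gauss(0, 0.1) for x0, _ in X]
stability = bootstrap_stability(X, y, n_boot=200)
for row in stability:
    print(row)
```

On data like this, the feature with a real effect should keep the same sign in every resample and show a small CV, while the noise feature's sign tends to flip and its CV tends to be large.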
More information (code scripts and results) can be found here
All thanks to the Zindi team and to the organisers of this challenge. It was nice to have participated.
Wow Congrats big man! So much with just linear models. I have to take my math seriously😅
Thanks man. Yea, I had to go with the simplest method to balance interpretation and run time, and linear models felt very appropriate.
Honestly, I actually find that amazing. Indeed the methodology (features and analysis) matters more than the model. I am definitely starring the repo⭐!
Thanks