Apparently, simple models trump sophisticated models in this competition, at least judging by the majority of my submissions, which came from logistic regression models.
Approach Outline
Thanks to @MuhammadQasimShabbeer for decompressing the files and sharing the Kaggle link.
Data Preprocessing
Data cleaning
- Removed incorrect bases (bases with Ns, if present)
- Removed bases with low quality scores (below 20) and reads with sequence lengths less than 50
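As a rough sketch of the cleaning step above (assuming reads come with per-base Phred quality scores; the function name and thresholds are illustrative):

```python
def clean_read(seq, quals, min_qual=20, min_len=50):
    """Drop bases called as N or with Phred quality below min_qual,
    then discard the read entirely if fewer than min_len bases remain."""
    kept = [b for b, q in zip(seq, quals) if b != "N" and q >= min_qual]
    cleaned = "".join(kept)
    return cleaned if len(cleaned) >= min_len else None

# A read left with fewer than 50 bases after filtering is discarded (returns None).
print(clean_read("ACGTN" * 8, [30] * 40))
```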
Feature Engineering
- Created 8-kmer sequences from each read using a sliding window with step 1 (no successive skips).
- Aggregated them into counts
- To prevent sample bias due to read length and depth, I normalised counts using the centred-log ratio method (CLR).
- Dimensionality reduction: I tried three methods: Partial Least Squares (PLS), PCA and SVD. For PCA, I kept the principal components that explained 95% of the variance in the data; for PLS I used 25 components, and for SVD about 200 components.
CLR = np.log1p(counts) - np.mean(np.log1p(counts), axis=1, keepdims=True)  # per-sample centring (rows = samples)
The total number of possible unique kmer sequences to generate is equal to 4**k. So, with k=8, I had about 66k unique kmers.
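A minimal sketch of the feature pipeline above (8-kmer counting with step 1, CLR normalisation per sample, then PCA keeping 95% of the variance); function names are illustrative, not the author's actual code:

```python
import numpy as np
from collections import Counter
from itertools import product
from sklearn.decomposition import PCA

K = 8
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**8 = 65536 kmers
IDX = {kmer: i for i, kmer in enumerate(VOCAB)}

def kmer_counts(reads, k=K):
    """Count every k-length substring (step 1) across a sample's reads."""
    counts = np.zeros(len(VOCAB))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in IDX:          # silently skips kmers containing N
                counts[IDX[kmer]] += 1
    return counts

def clr(counts):
    """Centred log-ratio per sample (rows = samples, columns = kmers)."""
    logs = np.log1p(counts)
    return logs - logs.mean(axis=1, keepdims=True)

# X = np.vstack([kmer_counts(sample_reads) for sample_reads in samples])
# X_red = PCA(n_components=0.95).fit_transform(clr(X))
```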
Modelling
- Logistic Regression (I used the one from Scikit-Learn, and also one implemented in Pytorch ).
- CV scores were between 0.012xx and 0.02. Public LB score: 0.002xx; private LB: 0.007xx.
- Unfortunately, I selected models from a different approach, which I will mention next.
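The modelling step might look roughly like this with scikit-learn (synthetic data stands in for the real reduced features; 4 classes as in this competition):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # stand-in for the reduced CLR features
y = rng.integers(0, 4, size=200)    # 4 classes

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
print(-scores.mean())  # mean CV log loss
```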
Other Approaches I tried
- Generated 6-kmer counts and applied the dimensionality reduction strategies outlined above, but performance was not as good as with 8-kmers.
- Generated 10-kmer counts but had to abandon this because it needed too much compute: feature extraction alone took about 8-10 hours.
Autoencoder
- Autoencoder from CLR normalised 8-kmer counts
- A PyTorch model with two dense layers and a logistic regression model
NB: CV scores were similar to the linear models', and these models performed better on the public LB, but much worse than the linear models on the private LB (these were my selected submissions).
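A sketch of the idea, with scikit-learn's MLPRegressor standing in for the author's two-layer PyTorch autoencoder: the network is trained to reconstruct its input through a bottleneck, and the bottleneck activations become features for logistic regression. Sizes here are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))   # stand-in for CLR-normalised 8-kmer counts
y = rng.integers(0, 4, size=300)

# Train the network to reconstruct its input through a 32-unit bottleneck.
ae = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
ae.fit(X, X)

# Manually compute the bottleneck (hidden-layer) activations: ReLU(X W + b).
hidden = np.maximum(0, X @ ae.coefs_[0] + ae.intercepts_[0])

clf = LogisticRegression(max_iter=1000).fit(hidden, y)
```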
Word embeddings
- Another approach used a DNA2VEC embedding model, a word2vec-style model trained on 3- to 8-kmer gene sequences.
- A PyTorch model with two dense layers and a logistic regression model
- CV scores were also similar to the other approaches, but the private LB score was worse.
Final Thoughts
- In terms of runtime, extracting the kmer counts was much faster than using the word embeddings. The total runtime for feature extraction and generation was 4h 40m 11s on Kaggle, while that of word embeddings took 10h 43m 24s. Also, modelling runtime using simple models was much faster than the deep learning models.
- I also learnt a few things and tried many ideas and bits of code from this competition, notably the concept of federated learning, which I tried to implement from scratch with the help of Copilot (not sure whether the performance would be the same if implemented with the Flower library, but at least the code works, even though the performance is not great).
- I also tried some other normalisation methods: normalising kmer counts by the total kmer count in each sample, and Term Frequency-Inverse Document Frequency (TF-IDF) weighting to reduce sample and kmer-sequence bias.
- Using all kmer counts directly as features in the logistic regression models also performed better than the sophisticated models.
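The alternative normalisations mentioned above can be sketched as follows (TfidfTransformer treats each sample as a "document" and each kmer as a "term"); the toy matrix is illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[10, 0, 5],
                   [ 2, 3, 0]], dtype=float)  # rows = samples, cols = kmers

# 1) Relative abundance: divide by each sample's total kmer count.
rel = counts / counts.sum(axis=1, keepdims=True)

# 2) TF-IDF: down-weights kmers that appear in every sample.
tfidf = TfidfTransformer().fit_transform(counts).toarray()
```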
Amazing!! What was your score with the federated approach?
If I implemented the code correctly, then it really isn't performing well. I am getting a very high loss (1.xx), possibly due to the labels each client model sees. I used 4 clients, each representing one of the 4 individual classes.
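For reference, the setup described here can be sketched as a toy from-scratch FedAvg loop, with each client holding data from a single class (exactly the label skew that can cause a high loss); this is a sketch of the idea, not the actual code:

```python
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of multinomial logistic regression on a client."""
    logits = X @ w
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(w.shape[1])[y]
    return w - lr * X.T @ (p - onehot) / len(X)

rng = np.random.default_rng(0)
n_classes, n_feat = 4, 10
# Each of the 4 clients only holds samples of one class (label skew).
clients = [(rng.normal(loc=c, size=(50, n_feat)), np.full(50, c))
           for c in range(n_classes)]

w = np.zeros((n_feat, n_classes))
for _ in range(20):                        # FedAvg rounds
    local = [local_step(w, X, y) for X, y in clients]
    w = np.mean(local, axis=0)             # server averages client weights
```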
I see
What is yours 😂? Mine is ~0.02.
@Gozie, Honestly you did super well. I thought most guys with insanely low scores were post-processing. But I heard top 20 will receive a message if I'm not mistaken. Submit your code for review. You might do great big man😊.
@CodeJoe Thanks man. Well, not sure about the top 20 receiving a message. I didn't receive any email concerning it.
Oh interesting, my bad! When I read the rules, it made mention of the top 20 participants for review.
Genius 💯. We didn't consider data cleaning; we were also considering doing DNA2VEC but weren't able to. Can you please share the word-embeddings notebook if possible?
Hi,
This is the link to the DNA2VEC feature extraction notebook. Here
I must say, it took very long to complete: more than 12 hours for the train and test sets combined.
🙏🙏
Thanks for sharing. I didn't understand one thing:
what do you mean by N's? Do you mean NaNs?
Hi,
Not NaNs. During DNA sequencing, A, G, C and T bases are called as the fragments pass through the sequencer, and when the sequencer is confident of a call it assigns the base a quality score. When it isn't sure which base was called, it assigns an N instead. These "incorrect" bases aren't valid, since DNA has only four base types, so to keep them out of kmer generation I removed them by not including any kmer containing an N in the counts.
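In code terms, a sketch of the above: any kmer touching an N is simply never counted (k=4 here just to keep the example short).

```python
def valid_kmers(read, k=8):
    """Yield only kmers made entirely of A/C/G/T; kmers containing an N are skipped."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer

# The 4-mers overlapping the N in the middle are all dropped:
print(list(valid_kmers("ACGTNACGTACG", k=4)))
# ['ACGT', 'ACGT', 'CGTA', 'GTAC', 'TACG']
```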
ok!!, thanks a lot.