It’s no secret that what goes on inside many machine and deep learning models is a bit of, well, a secret. This can make the fine-tuning process tedious and sometimes even difficult, especially when figuring out which aspects of the model we need to improve.
This week we’ll hear from Brian Muhia (aka poppingtonic) on how he used a visualization technique (T-SNE) to gain a better understanding of the data from the Microsoft Rice Disease Classification Challenge. This not only offered insights into various aspects of the image data but also helped in understanding the model he designed, which in turn proved useful in improving its performance.
The Microsoft Rice Disease Classification Challenge introduced a dataset comprising RGB and RGNIR (RG-Near-infra-Red) images. This second image type increased the difficulty of the challenge to the point that all of the winning models worked with RGB only.
In this challenge we applied a res2next50 encoder that was first pre-trained with self-supervised learning through the SwAV algorithm, so that each RGB image and its corresponding RGNIR image are represented with the same weights. The encoder was then fine-tuned and self-distilled to classify the images, which produced a public test set score of 0.228678639 and a private score of 0.183386940. K-fold cross-validation was not used for this challenge result.
To better understand the impact of self-supervised pre-training on the problem of classifying each image type, we apply T-distributed Stochastic Neighbour Embedding (T-SNE) to the logits (predictions before applying softmax). We show how this method graphically provides some of the values of a confusion matrix by locating some incorrect predictions. We then render the visualization by overlaying the raw image on each data point, and note that, to this model, the RGNIR images do not appear to be inherently more difficult to categorize. We make no comparisons through sweeps, RGB-only models or RGNIR-only models. This is left to future work.
Goal of this Report
This report tries to explain a simple-to-understand method for visualizing the distribution of raw predictions from a vision classifier on a random sample of data in the validation set. We do this to, at a glance;
Combining data from multiple sensors seems to be a good way to increase the number of training set examples, which has a known positive effect on train/test performance, among other measures of generalization. Additional sensors are often deployed to capture different features from the baseline sensors, which may help to resolve their deficiencies.
Less well studied is the question of whether, when the additional sensors add noise or demand more representational capacity from the model, this reduces the model's ability to perform the task even on the baseline sensor data.
Methods & Analysis
This work is an example of post-hoc interpretability, which addresses the black-box nature of our models, where we either do not have access to their internal representations or choose to ignore the structure of the model whose behaviour we are trying to explain. This means that we only use the raw predictions and labels (0.0 = blast, 1.0 = brown, 2.0 = healthy) for each data point, ignoring the model’s layer structure, learned features, dimensionality, weights and biases.
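As a rough illustration of working only from raw predictions and labels, the sketch below turns logits into predicted classes and counts them in a confusion matrix. The helper name `confusion_from_logits` and the four example logit rows are made up for illustration; they are not taken from the challenge notebooks or results.

```python
import numpy as np

# Label encoding from the text: 0.0 = blast, 1.0 = brown, 2.0 = healthy.
CLASS_NAMES = ["blast", "brown", "healthy"]


def confusion_from_logits(logits, labels, n_classes=3):
    """Hypothetical helper: argmax the logits, then count (true, predicted) pairs."""
    preds = np.argmax(logits, axis=1)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # Rows index the true class, columns index the predicted class.
    np.add.at(cm, (labels.astype(int), preds), 1)
    return preds, cm


# Made-up logits for four validation images (not real model output).
logits = np.array([
    [2.0, 0.1, -1.0],   # predicted blast
    [0.3, 1.5, 0.2],    # predicted brown
    [0.1, 0.2, 3.0],    # predicted healthy
    [1.2, 1.0, -0.5],   # predicted blast, although labelled brown
])
labels = np.array([0.0, 1.0, 2.0, 1.0])

preds, cm = confusion_from_logits(logits, labels)
print(cm)
```

Off-diagonal entries of `cm` are the misclassifications; these are the points that would show up inside the wrong class region in a 2D embedding of the logits.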
This lets us use general-purpose methods for embedding and visualizing data, such as T-SNE. To plot a 2D image, we initialize the embedding with PCA, reduce to 2 components and apply perplexity=50. Note the overlaps, i.e. the presence of false positives in each class, indicating the need for k-fold cross-validation.
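The settings described above map naturally onto scikit-learn's `TSNE`, so a minimal sketch might look like the following. It assumes scikit-learn is the implementation (the text does not say which library was used) and runs on synthetic three-class logits rather than the real validation predictions.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for validation logits: 3 classes, 80 samples each.
# These numbers are illustrative only and are not the challenge data.
centers = np.array([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]])
labels = np.repeat(np.arange(3), 80)
logits = centers[labels] + rng.normal(scale=1.0, size=(labels.size, 3))

# PCA initialization, 2 output components, perplexity of 50, as in the text.
# random_state is an added assumption so the sketch is repeatable.
tsne = TSNE(n_components=2, init="pca", perplexity=50, random_state=0)
embedding = tsne.fit_transform(logits)

print(embedding.shape)
```

Each row of `embedding` is a 2D coordinate for one image. Scattering those coordinates coloured by `labels` gives the kind of plot in which overlapping regions hint at misclassified points.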
To show the effect that the image type had on classification, we overlay each data point with the raw image it represents. This follows related work by Karpathy and by Iwana et al., who use this methodology to produce informative visualizations with some explanatory value, although in this case the effect is more salient due to the two image types. We see where the RGNIR images tend to cluster in relation to their location in the global cluster regions in the chart above.
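One way to draw such an overlay is with matplotlib's `OffsetImage` and `AnnotationBbox`, which place a small thumbnail at each 2D coordinate. The sketch below is an assumption about how this could be done, not the code used for the report; `plot_thumbnails` is a hypothetical helper and the points and images are random placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.offsetbox import AnnotationBbox, OffsetImage


def plot_thumbnails(points, images, zoom=1.0):
    """Hypothetical helper: draw each image as a thumbnail at its 2D point."""
    fig, ax = plt.subplots(figsize=(8, 8))
    # Invisible scatter so the axis limits cover every point.
    ax.scatter(points[:, 0], points[:, 1], alpha=0.0)
    for (x, y), img in zip(points, images):
        thumb = OffsetImage(img, zoom=zoom)
        ax.add_artist(AnnotationBbox(thumb, (x, y), frameon=False))
    return fig, ax


rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))           # placeholder 2D embedding
images = rng.random(size=(20, 16, 16, 3))   # placeholder RGB thumbnails in [0, 1]

fig, ax = plot_thumbnails(points, images)
```

Calling `fig.savefig(...)` on the result writes the chart to disk. With real data, `points` would be the T-SNE output and `images` the downscaled RGB and RGNIR inputs, which makes it possible to see where each image type lands.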
Note the density of RGNIR images in the “tip” of the “blast” cluster (blue region in the first plot, scroll up then back), and in the bottom middle, indicating that while some RGNIR images were easy to correctly classify as “blast”, others were more easily confused with “brown” than they were with “healthy”.
Qualitatively, there appear to be more false-positive RGNIR images than not, which might indicate higher uncertainty or noise in the predictions due to conflicting sensor data. This might be an artifact of the data augmentation methods used to train SwAV and the classifier. The much heavier overlap between class regions near the centre of the plot, together with the presence of both image types there, indicates some confusion in the classification task.
In conclusion, the separation could be improved by applying readily available methods, and there is no a priori reason to expect the pre-training strategy to contribute to better separation of classes. It helps with representing the images more fairly, but not decisively for the classification problem.
All this work can be reproduced with the notebooks available here. The repository also has links to model weights: Rice Disease Classification through Self-Supervised Pre-training