What is your CV? For me, there is a big gap between CV and LB. I can affirm that the metric implementation is wrong. @Zindi
This is the second time I have joined a Zindi competition that uses the cross-entropy loss, and Zindi is calculating the metric the wrong way.
Instead of using the multiclass cross-entropy, Zindi is measuring (logloss(column1) + logloss(column2) + logloss(column3)) / 3. That is appropriate for multilabel classification, but we are facing a multiclass problem, so before calculating any metric a softmax layer should be applied to the output of the CNN. That means the sum of the predictions should be 1, which is not enforced when measuring the LB score.
Please check the implementation!
Hi rinnqd.
It is funny, but there are no right or wrong implementations of metrics. Only the challenge hosts define what metric they will use and how it should be implemented.
Speaking about softmax: we have a multilabel problem, since the rules mention that one image can have more than one label: "Some images may contain both stem and leaf rust".
Hey AZ,
This is not funny, I am not here to tell you some jokes. Please watch your words!
The metric was implemented in the wrong way last year during another competition. I think the same error is being made now.
Sorry mate, I didn't want to offend you. Was just saying that there is no right and wrong here
Nevermind. The problem here is not multilabel... You need to predict only one class. Let me explain more:
Let's suppose we have stem (dominant) and leaf rust in the same image. You need to predict 1 for stem and 0 for leaf rust, because we're asked to predict the most prominent class. You will find the justification below.
"The goal is to classify the image according to the type of wheat rust that appears most prominently in the image." You can find this in the evaluation page.
The problem here is multilabel. Please read the instructions and description of this competition. Don't spread panic.
Please read my comments above.
Can you please copy-paste here the sentence that mentions that the problem is multilabel?
The evaluation metric for this challenge is Log Loss.
Some images may contain both stem and leaf rust, there is always one type of rust that is more dominant than the other, i.e. you will not find images where both appear equally. The goal is to classify the image according to the type of wheat rust that appears most prominently in the image.
The values can be between 0 and 1, inclusive.
Log loss can be applied in 2 ways:
When we have multilabel classification, one image can have 3 independent labels. Example: [0,1,1], [1,0,1]; it can even be [0,0,0] or [1,1,1]. Here the 3 labels are independent. In this case we calculate (logloss(column1) + logloss(column2) + logloss(column3)) / 3. Every value can be in [0,1], and the sum of the 3 probabilities is in [0,3].
When we have multiclass classification, one image has only 1 correct label. Example: [1,0,0], [0,1,0] or [0,0,1]. Here the 3 labels are dependent: when the probability of one class is very high, the other classes have low probabilities, and the sum of the 3 probabilities is 1.
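To make the two computations concrete, here is a small sketch (toy numbers, not competition data) contrasting the multiclass log loss with the column-averaged version, using sklearn:

```python
import numpy as np
from sklearn.metrics import log_loss

# One-hot ground truth for 4 images, 3 classes: exactly one 1 per row (multiclass)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

# Predictions whose rows already sum to 1 (softmax-style output)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])

# Multiclass log loss: one score over the full probability matrix
multiclass = log_loss(y_true, y_pred)

# Column-averaged ("multilabel-style") log loss: binary log loss per class, averaged
per_column = np.mean([log_loss(y_true[:, i], y_pred[:, i]) for i in range(3)])

print(multiclass)  # -mean(log of the probability given to the true class)
print(per_column)  # a different number in general
```

On these toy numbers the two scores differ noticeably, which is why it matters which one the leaderboard uses.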
In our competition:
First, the training labels contain only one class, so the models are not learning to predict 2 classes for one image. If an image could contain 2 classes, the labeling should reflect that, but it does not: each image in the training set has only one label. The model will be unable to learn that 2 classes may be present in a single image.
Second, in the part you copied here, it is mentioned: "The goal is to classify the image according to the type of wheat rust that appears most prominently in the image." Please focus on "most prominently": it means you need to predict only one class, the most prominent one. The example shown in the table is wrong in this case, because the sum should be 1.
I am not new to machine learning or competitions, I am not sharing this because my models have failed ;), and I am not here to spread panic. I shared it because the example is wrong and the metric is not implemented in the right way.
Thank you
Personally I am new to this, but I also thought the example shown is wrong. I thought that if I classify the image as 0.8 leaf_rust, then the sum of the probabilities of it being stem_rust or healthy should be 0.2, but I can see the example adds up to more than 1.
People tend to interpret model outputs as probabilities, but they aren't necessarily dependent on each other - depending on your final layer, you may well get 'probabilities' of 0.9, 0.4 and 0.2. I tend to cop out and call them 'model confidences' - higher means the model thinks it's more likely that that's the right class, and we don't think too hard about it :)
For the example I've just given, they'd be scaled to sum to one during scoring. Or you could achieve this yourself with something like a SoftMax layer (which constrains them to sum to 1), or a final scaling step...
Hopefully someone with some more knowledge on this will chime in. There are definitely more rigorous statistical ways of thinking about what those numbers mean, and how we can interpret them.
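The two scaling options mentioned above can be sketched as follows; the 0.9/0.4/0.2 figures are just the toy confidences from the earlier example:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then exponentiate and normalize
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

raw = np.array([0.9, 0.4, 0.2])  # 'model confidences' that don't sum to 1

scaled = raw / raw.sum()  # simple rescaling so the row sums to 1
probs = softmax(raw)      # softmax: maps any real-valued scores to a distribution

print(scaled.sum(), probs.sum())  # both sum to 1.0
```

Note that the two options produce different distributions from the same inputs - simple rescaling preserves ratios, softmax does not.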
I agree with what @rinnqd said. As they stated, "The values can be between 0 and 1, inclusive". This setting is only possible for multilabel classification, which is not our case given that our image files are located in 3 separate folders. Multiclass classification imposes that the output of the last layer be a softmax ==> the values in a row will add up to 1, which can't be ensured if each value in the row can itself be anywhere between 0 and 1:
ID      leaf_rust  stem_rust  healthy_wheat
GVRCWM  0.63       0.98       0.21
8NRRD6  0.76       0.11       0.56
Regarding what you said, @johnowhitaker, about the model's output: in a classification task, the network has two possible kinds of output - probabilities (after a softmax) or raw scores.
In the former case your model's output should add up to 1. In the latter the values can be any real numbers, but not "between 0 and 1" as stated in the evaluation section of the competition.
So I think there is a need to clarify:
It's multiclass. The example shows the submission format - if you submit those numbers the error calculation will scale the values to sum up to 1, i.e. the values are normalized across the columns/classes. Zindi could change the example to show [0.7, 0.2, 0.1] for example, but the key thing is that you're also able to submit [0.63, 0.98, 0.21] and the error calculation will give you an appropriate score as well. What the example is trying to show (I assume) is the format/type - I think we're all reading too much into it :)
But what about images in the train set that appear in both the stem_rust and leaf_rust folders? (You can use imagehash to find repeated images across rust classes.) What target should be assigned to such images? In my opinion they would have a [0, 1, 1] target, which is not accepted by nn.CrossEntropyLoss.
I think @Johnowhitaker is right.
If you submit all values as 0.4 or all as 0.5, you will get the same score, 1.098612...
So it is clear that they divide every row by its sum. After that we get [1/3, 1/3, 1/3], and the score is exactly -ln(1/3) = 1.098612, which is what a multiclass log loss (for example, sklearn's) returns.
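This is easy to verify locally. Assuming row-wise normalization followed by sklearn-style multiclass log loss, any constant submission scores exactly -ln(1/3):

```python
import numpy as np
from sklearn.metrics import log_loss

# A constant submission: every cell 0.4 (0.5 would behave identically)
preds = np.full((9, 3), 0.4)

# Normalize each row by its sum -> every row becomes [1/3, 1/3, 1/3]
preds = preds / preds.sum(axis=1, keepdims=True)

# Any one-hot ground truth gives the same score
y_true = np.eye(3)[[0, 1, 2, 0, 1, 2, 0, 1, 2]]

print(log_loss(y_true, preds))  # 1.0986... == -ln(1/3)
```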
What about models whose predictions per row already sum to one - will they also be rescaled?
No, if a row has the sum 1, it stays the same. In general, Zindi divides every row by its sum before applying log loss.
We are going to look into this.
Thank you for your quick answer
Hmm, digging into this it's much less clear than I first thought - in most places I can find, cross-entropy and log loss are treated as the same thing, especially in the case of a loss function for classification. See http://wiki.fast.ai/index.php/Log_Loss for some clarification, specifically the final paragraph.
If you want to recreate the scoring of the LB, I've seen a fairly close correlation in my tests between the score given by fastai as valid_loss during training and the leaderboard score. You can calculate the score with `sklearn.metrics.log_loss(reference[classes], predictions[classes])` , which gives the same result as fastai's valid_loss. This is exactly the same as how Zindi calculates it as far as I can tell.
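As a sketch of that local scoring (the DataFrames and IDs below are made up for illustration; only the final `log_loss` call mirrors the snippet above):

```python
import pandas as pd
from sklearn.metrics import log_loss

classes = ['leaf_rust', 'stem_rust', 'healthy_wheat']

# Hypothetical one-hot reference labels, one row per image ID
reference = pd.DataFrame({'ID': ['a', 'b', 'c'],
                          'leaf_rust':     [1, 0, 0],
                          'stem_rust':     [0, 1, 0],
                          'healthy_wheat': [0, 0, 1]})

# Hypothetical predictions, aligned to the same ID order
predictions = pd.DataFrame({'ID': ['a', 'b', 'c'],
                            'leaf_rust':     [0.7, 0.2, 0.1],
                            'stem_rust':     [0.2, 0.6, 0.2],
                            'healthy_wheat': [0.1, 0.2, 0.7]})

score = log_loss(reference[classes], predictions[classes])
print(round(score, 4))  # 0.4081
```

Just make sure your rows are aligned by ID before scoring, as sklearn compares them positionally.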
More recently, scores on the leaderboard seem worse than your local scoring since they've removed duplicate images. Might want to do the same for your validation set.
Out of interest, can you explain what you think they should be doing metric-wise? Running all predictions through softmax before calculating log_loss?
Ah - lots of comments since I refreshed the page! This isn't multi-label - the 'class' is one and only one of the three options (the 'most dominant' in cases where both stem and leaf rust are present). Labels are always of the form [0, 0, 1] and not [1, 1, 0] for eg. It's interesting - you could potentially take a multi-label approach by re-labelling according to a first rough model, and then see if there's a way to scale predictions to favour the most likely/prominent fungus...
Running logloss(predictions, outputs), where predictions' shape is (n_samples, 3) and outputs' shape is (n_samples, 3) (in this case the sum of each row is 1 after softmax).
What I am thinking is that they use np.mean([logloss(predictions[:, i], outputs[:, i]) for i in range(3)]), which is wrong.
Each method returns a different score.
I'll check with them, but I'm almost certain they use the former.
Yup, double-checked against their reference file. Score is equal to log_loss(reference[classes], submission[classes]) not np.mean([log_loss(reference[c], submission[c]) for c in classes]).
do you work at Zindi?
I do some work with them, but I'm not a full-time employee :) All opinions etc are my own etc. Never had an excuse to peep at the backend before, so thanks! :)
I have a question
If I submit [0.6,0.5,0.8], what's the way they use to normalize their sum to 1?
The sum is 1.9, so it would be [0.6/1.9, 0.5/1.9, 0.8/1.9].
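In code, that normalization is just a division by the row sum:

```python
import numpy as np

row = np.array([0.6, 0.5, 0.8])
normalized = row / row.sum()  # row.sum() == 1.9
print(normalized)             # [0.6/1.9, 0.5/1.9, 0.8/1.9], roughly [0.316, 0.263, 0.421]
```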
Hi John, I need a bit of clarification. Does this mean we need to re-download the samples, since you said they have removed duplicates?
The duplicates in the test set won't be scored. You can leave them in (your predictions for them won't count for anything).
As for duplicates in the training data, you can get rid of them by removing files of identical size (for example).
+1 @rinnqd
@zindi, please comment on @alka's reply above.