What is your CV? For me, there is a big gap between CV and LB. I can affirm that the metric implementation is wrong. @Zindi
This is the second time I have joined a Zindi competition that uses the cross-entropy loss, and Zindi is calculating the metric the wrong way.
Instead of using the multiclass cross-entropy, Zindi is measuring (logloss(column1) + logloss(column2) + logloss(column3)) / 3. That is appropriate for multilabel classification, but we are facing a multiclass problem, so before calculating any metric a softmax layer should be applied to the output of the CNN. That means the sum of the predictions should be 1, which is not enforced when measuring the LB score.
Please check the implementation!
Hi rinnqd.
It is funny, but there are no right or wrong implementations of metrics. Only the challenge hosts define what metric they will use and how it should be implemented.
Speaking about softmax: we have a multilabel problem, since the rules mention that one image can have more than one label: "Some images may contain both stem and leaf rust".
Hey AZ,
This is not funny, I am not here to tell you some jokes. Please watch your words!
The metric was implemented in the wrong way last year during another competition. I think the same error is being made now.
Sorry mate, I didn't want to offend you. Was just saying that there is no right and wrong here
Nevermind. The problem here is not multilabel... You need to predict only one class. Let me explain more:
Let's suppose we have stem (dominant) and leaf rust in the same image. You need to predict 1 for stem and 0 for leaf rust, because we're asked to predict the most prominent class. You will find the justification below.
"The goal is to classify the image according to the type of wheat rust that appears most prominently in the image." You can find this in the evaluation page.
The problem here is multilabel. Please read the instructions and description of this competition. Don't spread panic.
Please read my comments above.
Can you please copy-paste here the sentence that mentions that the problem is multilabel?
The evaluation metric for this challenge is Log Loss.
Some images may contain both stem and leaf rust, there is always one type of rust that is more dominant than the other, i.e. you will not find images where both appear equally. The goal is to classify the image according to the type of wheat rust that appears most prominently in the image.
The values can be between 0 and 1, inclusive.
Log loss can be applied in 2 ways:
When we have multilabel classification, one image can have 3 independent labels. Example: [0,1,1], [1,0,1]; it can even be [0,0,0] or [1,1,1]. Here the 3 labels are independent. In this case we calculate (logloss(column1) + logloss(column2) + logloss(column3)) / 3. Every value can be in [0,1], and the sum of the 3 probabilities is in [0,3].
When we have multiclass classification, one image has only 1 correct label. Example: [1,0,0], [0,1,0] or [0,0,1]. Here the 3 labels are dependent: when the probability of one class is very high, the other classes have low probabilities, and the sum of the 3 probabilities is 1.
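To make the two computations concrete, here is a small sketch (toy numbers, not competition data) contrasting the multiclass log loss with the column-averaged version, using sklearn:

```python
import numpy as np
from sklearn.metrics import log_loss

# One-hot ground truth for 4 images, 3 classes: exactly one 1 per row (multiclass)
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

# Predictions whose rows already sum to 1 (softmax-style output)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.6, 0.3, 0.1]])

# Multiclass log loss: one score over the full probability matrix
multiclass = log_loss(y_true, y_pred)

# Column-averaged ("multilabel-style") log loss: binary log loss per class, averaged
per_column = np.mean([log_loss(y_true[:, i], y_pred[:, i]) for i in range(3)])

print(multiclass)  # -mean(log of the probability given to the true class)
print(per_column)  # a different number in general
```

On these toy numbers the two scores differ noticeably, which is why it matters which one the leaderboard uses.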
In our competition:
First, the training labels contain only one class, so the models are not learning to predict 2 classes for one image. If an image could contain 2 classes, the labeling should reflect that, but it does not: each image in the training set has only one label. The model will be unable to learn that 2 classes may be present in a single image.
Second, in the part you copied here, it is mentioned: "The goal is to classify the image according to the type of wheat rust that appears most prominently in the image." Please focus on "most prominently": it means you need to predict only one class, the most prominent one. The example shown in the table is wrong in this case, because the sum should be 1.
I am not new to machine learning or competitions, I am not sharing this because my models have failed ;), and I am not here to spread panic. I shared it because the example is wrong and the metric is not implemented in the right way.
Thank you
Personally I am new to this, but I also thought the example shown is wrong. I thought that if I classify the image as 0.8 leaf_rust, then the sum of the probabilities of it being stem_rust or healthy should be 0.2, but I can see the example adds up to more than 1.
People tend to interpret model outputs as probabilities, but they aren't necessarily dependent on each other - depending on your final layer, you may well get 'probabilities' of 0.9, 0.4 and 0.2. I tend to cop out and call them 'model confidences' - higher means the model thinks it's more likely that that's the right class, and we don't think too hard about it :)
For the example I've just given, they'd be scaled to sum to one during scoring. Or you could achieve this yourself with something like a SoftMax layer (which constrains them to sum to 1), or a final scaling step...
Hopefully someone with some more knowledge on this will chime in. There are definitely more rigorous statistical ways of thinking about what those numbers mean, and how we can interpret them.
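The two scaling options mentioned above can be sketched as follows; the 0.9/0.4/0.2 figures are just the toy confidences from the earlier example:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then exponentiate and normalize
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

raw = np.array([0.9, 0.4, 0.2])  # 'model confidences' that don't sum to 1

scaled = raw / raw.sum()  # simple rescaling so the row sums to 1
probs = softmax(raw)      # softmax: maps any real-valued scores to a distribution

print(scaled.sum(), probs.sum())  # both sum to 1.0
```

Note that the two options produce different distributions from the same inputs - simple rescaling preserves ratios, softmax does not.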
I agree with what @rinnqd said. As they stated, "The values can be between 0 and 1, inclusive". This setting is only possible for multilabel classification, which is not our case given that our image files are located in 3 separate folders. Multiclass classification imposes that the output of the last layer be a softmax ==> the values in a row will add up to 1, which can't be ensured if each value in the row can itself be anywhere between 0 and 1:
ID      leaf_rust  stem_rust  healthy_wheat
GVRCWM  0.63       0.98       0.21
8NRRD6  0.76       0.11       0.56
Regarding what you said, @johnowhitaker, about the model's output: in a classification task, the network has two possible kinds of output - probabilities (after a softmax) or raw scores.
In the former case your model's output should add up to 1. In the latter the values can be any real numbers, but not "between 0 and 1" as stated in the evaluation section of the competition.
So I think there is a need to clarify:
It's multiclass. The example shows the submission format - if you submit those numbers the error calculation will scale the values to sum up to 1, i.e. the values are normalized across the columns/classes. Zindi could change the example to show [0.7, 0.2, 0.1] for example, but the key thing is that you're also able to submit [0.63, 0.98, 0.21] and the error calculation will give you an appropriate score as well. What the example is trying to show (I assume) is the format/type - I think we're all reading too much into it :)
But what about images in the train set that appear in both the stem_rust and leaf_rust folders? (You can use imagehash to find repeated images across rust classes.) What target should be assigned to such images? In my opinion they would have a [0, 1, 1] target, which is not accepted by nn.CrossEntropyLoss.
I think @Johnowhitaker is right.
If you submit all values as 0.4 or all as 0.5, you will get the same score, 1.098612...
So it is clear that they divide every row by its sum. After that we get [1/3, 1/3, 1/3], and the score is exactly -ln(1/3) = 1.098612, which is what a multiclass log loss (for example, sklearn's) returns.
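This is easy to verify locally. Assuming row-wise normalization followed by sklearn-style multiclass log loss, any constant submission scores exactly -ln(1/3):

```python
import numpy as np
from sklearn.metrics import log_loss

# A constant submission: every cell 0.4 (0.5 would behave identically)
preds = np.full((9, 3), 0.4)

# Normalize each row by its sum -> every row becomes [1/3, 1/3, 1/3]
preds = preds / preds.sum(axis=1, keepdims=True)

# Any one-hot ground truth gives the same score
y_true = np.eye(3)[[0, 1, 2, 0, 1, 2, 0, 1, 2]]

print(log_loss(y_true, preds))  # 1.0986... == -ln(1/3)
```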
What about models whose predictions per row already sum to one - will they also be rescaled?
No, if a row has the sum 1, it stays the same. In general, Zindi divides every row by its sum before applying log loss.
We are going to look into this.
Thank you for your quick answer
Hmm, digging into this it's much less clear than I first thought - in most places I can find, cross-entropy and log loss are treated as the same thing, especially in the case of a loss function for classification. See http://wiki.fast.ai/index.php/Log_Loss for some clarification, specifically the final paragraph.
If you want to recreate the scoring of the LB, I've seen a fairly close correlation in my tests between the score given by fastai as valid_loss during training and the leaderboard score. You can calculate the score with `sklearn.metrics.log_loss(reference[classes], predictions[classes])` , which gives the same result as fastai's valid_loss. This is exactly the same as how Zindi calculates it as far as I can tell.
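As a sketch of that local scoring (the DataFrames and IDs below are made up for illustration; only the final `log_loss` call mirrors the snippet above):

```python
import pandas as pd
from sklearn.metrics import log_loss

classes = ['leaf_rust', 'stem_rust', 'healthy_wheat']

# Hypothetical one-hot reference labels, one row per image ID
reference = pd.DataFrame({'ID': ['a', 'b', 'c'],
                          'leaf_rust':     [1, 0, 0],
                          'stem_rust':     [0, 1, 0],
                          'healthy_wheat': [0, 0, 1]})

# Hypothetical predictions, aligned to the same ID order
predictions = pd.DataFrame({'ID': ['a', 'b', 'c'],
                            'leaf_rust':     [0.7, 0.2, 0.1],
                            'stem_rust':     [0.2, 0.6, 0.2],
                            'healthy_wheat': [0.1, 0.2, 0.7]})

score = log_loss(reference[classes], predictions[classes])
print(round(score, 4))  # 0.4081
```

Just make sure your rows are aligned by ID before scoring, as sklearn compares them positionally.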
More recently, scores on the leaderboard seem worse than your local scoring since they've removed duplicate images. Might want to do the same for your validation set.
Out of interest, can you explain what you think they should be doing metric-wise? Running all predictions through softmax before calculating log_loss?
Ah - lots of comments since I refreshed the page! This isn't multi-label - the 'class' is one and only one of the three options (the 'most dominant' in cases where both stem and leaf rust are present). Labels are always of the form [0, 0, 1] and not [1, 1, 0] for eg. It's interesting - you could potentially take a multi-label approach by re-labelling according to a first rough model, and then see if there's a way to scale predictions to favour the most likely/prominent fungus...
Running logloss(predictions, outputs), where predictions' shape is (n_samples, 3) and outputs' shape is (n_samples, 3) (in this case the sum of each row is 1 after softmax).
What I am thinking is that they use np.mean([logloss(predictions[:, i], outputs[:, i]) for i in range(3)]), which is wrong.
Each method returns a different score.
I'll check with them, but I'm almost certain they use the former.
Yup, double-checked against their reference file. Score is equal to log_loss(reference[classes], submission[classes]) not np.mean([log_loss(reference[c], submission[c]) for c in classes]).
do you work at Zindi?
I do some work with them, but I'm not a full-time employee :) All opinions etc are my own etc. Never had an excuse to peep at the backend before, so thanks! :)
I have a question
If I submit [0.6,0.5,0.8], what's the way they use to normalize their sum to 1?
The sum is 1.9, so it would be [0.6/1.9, 0.5/1.9, 0.8/1.9].
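In code, that normalization is just a division by the row sum:

```python
import numpy as np

row = np.array([0.6, 0.5, 0.8])
normalized = row / row.sum()  # row.sum() == 1.9
print(normalized)             # [0.6/1.9, 0.5/1.9, 0.8/1.9], roughly [0.316, 0.263, 0.421]
```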
Hi John, I need a bit of clarification. Does this mean we need to re-download the samples, since you said they have removed duplicates?
The duplicates in the test set won't be scored. You can leave them in (your predictions for them won't count for anything).
As for duplicates in the training data, you can get rid of them by removing files of identical size (for example).
+1 @rinnqd
@zindi, please comment on @alka's reply above.