ICLR Workshop Challenge #1: CGIAR Computer Vision for Crop Disease
$7,000 USD
Identify wheat rust in images from Ethiopia and Tanzania, and win a trip to present your work at ICLR 2020 in Addis Ababa.
29 January–15 March 2020 23:59
566 data scientists enrolled, 193 on the leaderboard
Log_loss of class predictions

Dear ZINDI and fellow contestants, please I have a question that needs clearing up.

It says that "The goal is to classify the image according to the type of wheat rust that appears most prominently in the image." and the log_loss function is being used as the evaluation metric.

So if there is a model prediction for healthy, leaf_rust, stem_rust of [0.21, 0.98, 0.63], is the log_loss function calculated based on the model's dominant class prediction (0.98) and the true dominant class, or based on all the model's predictions against the true predictions?

For example, say my model's output predictions for a single image for ['healthy','leaf_rust','stem_rust'] are [0.2,0.7,0.1] while the true predictions are [0.8,0.4,0.3]. Is the log loss calculated as:

a) log_loss(0.8,0.2) #in this case, only the deviation between the model's predicted dominant class and the true dominant class is taken into account

b) log_loss(0.8,0.2) + log_loss(0.7,0.4) + log_loss(0.3,0.1) #here, all the predictions for all the classes of the image are taken into account

There is only one true class for each image. The logloss is calculated against your prediction for that class; essentially every prediction is log_loss(1,[your prediction for the ground truth class])

In your example, predictions for ['healthy','leaf_rust','stem_rust'] being [0.2,0.7,0.1]:

if the true class was leaf_rust, it's the log loss for a true-class prediction of 0.7, which is a loss of about 0.3567

if the true class was stem_rust, it's the log loss for a true-class prediction of 0.1, which is a loss of about 2.3026
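To make that arithmetic concrete, here is a minimal Python sketch of the per-image calculation described above. The helper name `single_image_log_loss` is made up for illustration; it is not part of scikit-learn or of Zindi's scoring code, and it assumes the probabilities already sum to 1 and that natural logs are used.

```python
import math

CLASSES = ["healthy", "leaf_rust", "stem_rust"]

def single_image_log_loss(true_class, probs):
    """Negative natural log of the probability assigned to the true class."""
    p_true = probs[CLASSES.index(true_class)]
    return -math.log(p_true)

preds = [0.2, 0.7, 0.1]
for name in CLASSES:
    # Only the probability given to `name` matters if `name` is the true class.
    print(name, round(single_image_log_loss(name, preds), 4))
# healthy 1.6094
# leaf_rust 0.3567
# stem_rust 2.3026
```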

Thank you for this reply, but I just did a little tweaking of the model predictions in light of what you said, and the result I got contradicts your assumption. Could someone else, or perhaps the ZINDI organizers, kindly comment on this please?

Keep in mind, probabilities will be scaled so that the total is 1. In a couple of the examples you gave, you have a trio of predictions that adds up to far greater than 1, such as [0.21, 0.98, 0.63]. You can't be 98% confident of one class and 63% confident of another. Typical logloss measurement will conveniently scale each of those probabilities down, then go about measuring against the scaled prediction of the true class. Similarly, a cap is ordinarily used to prohibit infinite loss on an incorrect 0% probability.
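As a rough sketch of the scaling and capping just described (not Zindi's actual scorer, which I haven't seen), the function below rescales one row of raw scores so it sums to 1, then caps the true-class probability away from 0 and 1 before taking the log. The function name and the cap value `eps=1e-15` are assumptions chosen for illustration.

```python
import math

def scaled_clipped_loss(true_index, raw_scores, eps=1e-15):
    """Loss for one image after rescaling the row and capping the probability.

    `eps` is an assumed cap; real implementations may use a different value.
    """
    total = sum(raw_scores)
    probs = [s / total for s in raw_scores]              # rescale so the row sums to 1
    p_true = min(max(probs[true_index], eps), 1 - eps)   # keep away from exactly 0 or 1
    return -math.log(p_true)

raw = [0.21, 0.98, 0.63]  # sums to 1.82, so 0.98 becomes 0.98 / 1.82

# If leaf_rust (index 1) is the true class, the loss is -log(0.98 / 1.82).
print(scaled_clipped_loss(1, raw))

# A confident wrong answer gives a large but finite loss thanks to the cap.
print(scaled_clipped_loss(0, [0.0, 1.0, 0.0]))
```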

And I didn't mention, but it's the average of all individual losses, whereas in your original question you have a sum. It's a bit hard to follow the examples you have because few of them actually fit inside 100%. But it's neither A nor B. There is no such thing as a dominant class. It's merely whatever probability you assigned to the true class. For an image of a healthy plant, you'll get the same logloss if you predict ['healthy','leaf_rust','stem_rust'] as [0.4, 0.3, 0.3] as you will if you predict [0.4, 0.59, 0.01]. It's the logloss of 0.4 (which is ~0.9163).

You're right that Zindi may have an alternative implementation, but this is the standard: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

from sklearn.metrics import log_loss

log_loss(["a","b","c"],[[0.4,0.35,0.25],[0.4,0.35,0.25],[0.4,0.35,0.25]]) ## 1.1174690724975747

log_loss(["a","b","c"],[[0.4,0.59,0.01],[0.4,0.35,0.25],[0.4,0.35,0.25]]) ## 1.1174690724975747

So the dominant class changed in the first prediction set with no effect on the score, since p(b) isn't used in the calculation for that, as "a" is the answer.

You can work it out by hand if you want; here is the average of the negative logs of the true-class probabilities (1.A, 2.B, 3.C):

from math import log

(-log(0.4) + -log(0.35) + -log(0.25))/3 ## 1.1174690724975747

And just to show that the class matters, using an example where the true-class predictions aren't identical, here is the result if the altered class were the correct one, so the 1.B, 2.A, 3.C probabilities are used:

log_loss(["b","a","c"],[[0.4,0.59,0.01],[0.4,0.35,0.25],[0.4,0.35,0.25]]) ## 0.9434059450254725

(-log(0.59) + -log(0.4) + -log(0.25))/3 ## 0.9434059450254725

If you consider the 'true' labels to be of the form [0, 1, 0] and your predictions to be [0.3, 0.99, 0.67] for each image, the easiest way to calculate a score is `log_loss(y_true, y_pred)`. You could also do `log_loss(y_true.flatten(), y_pred.flatten())`. Importantly, it's taking into account all predictions for all classes, not just considering the dominant class.

Given a dataframe 'preds' that looks like the sample submission, and one 'reference' that has the correct classes encoded in the same way, you can use `log_loss(reference[classes], preds[classes])` (provided they're in the same order!)
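On the "same order" caveat: one way to sidestep ordering problems is to match rows by image ID before scoring. The sketch below does this in plain Python with dictionaries instead of dataframes. The IDs, values, and the `mean_log_loss` helper are hypothetical, and it assumes one true class per image and probabilities that already sum to 1.

```python
import math

CLASSES = ["healthy", "leaf_rust", "stem_rust"]

# Hypothetical IDs and values, for illustration only.
reference = {"img_1": "healthy", "img_2": "leaf_rust", "img_3": "stem_rust"}
preds = {  # deliberately listed in a different order than `reference`
    "img_3": [0.4, 0.35, 0.25],
    "img_1": [0.4, 0.59, 0.01],
    "img_2": [0.4, 0.35, 0.25],
}

def mean_log_loss(reference, preds):
    """Average of -log(p) for the probability given to each image's true class."""
    losses = []
    for image_id, true_class in reference.items():
        p_true = preds[image_id][CLASSES.index(true_class)]  # look up by ID, not position
        losses.append(-math.log(p_true))
    return sum(losses) / len(losses)

score = mean_log_loss(reference, preds)
print(score)  # about 1.11747, the same value as the scikit-learn example above
```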