The Evil Loss - Log Loss
Connect · 8 Jul 2020, 14:46 · edited 38 minutes later · 27

Seeing @Zindi using log_loss continuously across a variety of competitions, I would like to raise an issue here. Log loss is not a good metric for a competition, and it is highly unstable. Nor does it lend itself to any kind of real-life interpretation. For binary and/or multi-class classification competitions I suggest AUC or F1 scores, as they are much better metrics.

The issue is that log loss penalizes wrong predictions very heavily, which is a good characteristic for a loss function but undesirable for a metric. For a metric we need to see how both correct and incorrect predictions behave in order to judge the usability of the model.
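To make the asymmetry concrete, here is a quick sketch of the per-sample binary log loss, -(y*log(p) + (1-y)*log(1-p)); sample_log_loss is just an illustrative helper, not a library function:

import numpy as np

def sample_log_loss(y, p):
    # per-sample binary log loss for true label y and predicted probability p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(sample_log_loss(1, 0.95))  # confident and correct -> ~0.05
print(sample_log_loss(1, 0.50))  # fence-sitting -> ~0.69
print(sample_log_loss(1, 0.05))  # confident and wrong -> ~3.00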

Let's understand the evil of log loss through a binary classification example.

true is a randomly generated array of 0s and 1s, representing the actual labels of 100 samples.

import numpy as np
from sklearn.metrics import log_loss

true = np.random.choice([0, 1], size=100).astype('float')

Part A:

Here we flip the labels of the first 3 values, so the accuracy is 97/100 = 97%.

And the log loss is about 1.03 (sklearn clips the hard 0/1 predictions internally, otherwise it would be infinite).

pred = true.copy()
pred[0:3] = 1 - true[0:3]   # flip the first 3 labels: 3 fully confident mistakes
log_loss(true, pred)        # ~1.03

Part B:

Here we set all predictions to 0.5 (no ML required :D) and the log loss is 0.69.

preds = np.ones(100) * 0.5  # predict 0.5 everywhere
log_loss(true, preds)       # -log(0.5) = 0.69

So even if your ML model is more than 90% accurate, someone with, say, 70% (or even lower) accuracy could have a better log loss. This is a very trivial example; I suggest everyone, including @Zindi and @Johnowhitaker, to please try this and understand the peril this metric brings. And stop using log_loss as soon as possible.
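To make that claim concrete, here is a rough sketch (not part of the original example) of a hedged 70%-accurate model beating the overconfident 97%-accurate one on log loss:

import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
true = rng.choice([0, 1], size=100).astype('float')

# 97% accurate, but the 3 mistakes are made with 100% confidence
overconfident = true.copy()
overconfident[0:3] = 1 - true[0:3]

# 70% accurate, but every prediction is hedged at 0.7/0.3
hedged = np.where(true == 1, 0.7, 0.3)
wrong = rng.choice(100, size=30, replace=False)
hedged[wrong] = 1 - hedged[wrong]

print(log_loss(true, overconfident))  # ~1.0, dominated by the 3 confident errors
print(log_loss(true, hedged))         # ~0.61, better despite 27 points less accuracy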

P.S.: My example was as simple as possible, but I hope the idea is clear to the Zindi community. Many of us are already aware of this. And since the Zimnat Insurance Recommendation Competition has only just started, rescoring it with a much better metric like AUC or F1 would be highly appreciated.

Any more ideas/discussions are welcome. Let's make the Zindi community better together :)

Discussion (27 answers)

Couldn't agree more! I think the AUC metric would be appropriate for the Zimnat competition.

8 Jul 2020, 14:56
Upvotes 0

Thanks @Haytheem for your support :)

marching_learning
Nostalgic Mathematics

Agreed. We should use AUC or the F1 score.

8 Jul 2020, 14:57
Upvotes 0

I think this is a good discussion to have. There are always upsides and downsides to any metric. I'll come back with some more thoughts later, but one thing I wanted to raise right now:

In your example, the 'good' model is making three predictions with 100% confidence that are wrong. Log loss really penalizes confidence in wrong answers. If you take that same set of predictions but clip them, you suddenly get a much better log loss score:

pred = pred.clip(0.05, 0.95)  # never more than 95% confident either way
log_loss(true, pred)

> log loss of 0.14, muuuuuch better than 0.5s everywhere.

A good model, trained with the metric in mind, can output predictions that get a good score by taking the dynamics into account.

I have back-to-back meetings now but would love to chat more. From past contests where this has come up we've seen that accuracy and log_loss are very closely related - generally a good model does well, a bad model less well. Obviously it's annoying if someone comes and does some metric hacking to get a better score without focusing on getting a better model, so I see that side.

I like the idea of AUC. Look forward to everyone else's inputs.

8 Jul 2020, 15:02
Upvotes 0

Yes, the main issue is metric hacking, or to put it in better words, metric optimization or post-processing. Post-processing can help a great deal and can be a big disadvantage for those who are not using it. For example, in the Zimnat Insurance weekend challenge we clipped our predictions to 0.18 (yes, we did np.clip(preds, 0, 0.18), so the maximum value of our predictions was 0.18) and jumped from below 10th position to 3rd. AUC is one of the best metrics for a competition: you work with probabilities, and it's very unlikely that two people have exactly the same AUC, whereas with F1 it's more likely that scores collide. I would love to hear more of your opinions after your meetings.
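For anyone curious what that kind of post-processing looks like, here is a hypothetical sketch of the clip search (best_clip is an illustrative helper, and the validation data is made up):

import numpy as np
from sklearn.metrics import log_loss

def best_clip(y_val, p_val, grid=np.linspace(0.05, 0.95, 19)):
    # try each upper clip bound on a validation set, keep the best log loss
    scores = [(log_loss(y_val, np.clip(p_val, 0, hi)), hi) for hi in grid]
    return min(scores)

# synthetic overconfident model: 100 true positives and 50 false positives,
# all predicted at 0.99
rng = np.random.default_rng(42)
y_val = np.zeros(1000)
y_val[:100] = 1
p_val = np.where(y_val == 1, 0.99, 0.02)
p_val[rng.choice(np.where(y_val == 0)[0], size=50, replace=False)] = 0.99

print(best_clip(y_val, p_val))  # optimum near 100/150 = 0.67, not 0.99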

Great point @Johnowhitaker. There is absolutely no reason to throw away log loss, which is the mathematically optimal loss function for classification problems!

@jcatanz, again, there's a difference between a loss function and a metric. Log loss is ideal as a loss function, but not as a metric.

I agree with Nikhil's opinion here. The area under the PR curve could also be considered for binary classification problems. But still, AUC or the F1 score would be a better indicator of a strong classifier.

Thanks, Rajat, for the support.

In my opinion, Nikhil is right, but the metric in a competition depends solely on the company hosting it and what they actually want to do with the model. For example, if the model goes wrong it may be necessary to check why it went wrong; in insurance it is imperative that wrong values be penalized more than usual, as there isn't any room for mistakes.

8 Jul 2020, 15:22
Upvotes 0

If you have noticed, @Sravan121, sklearn's log_loss has a parameter called eps with a default value of 1e-15, so that again decides how heavily a prediction should be penalized. And by wrong, I mean both false positives and false negatives; in most cases the organization could do away with one, as the other would be more valuable.
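A tiny illustration of that eps clipping, using the eps argument as it existed in the sklearn of that era (it has since been deprecated, so newer versions may reject it):

import numpy as np
from sklearn.metrics import log_loss

y = np.array([1.0])
p = np.array([0.0])  # a 100%-confident wrong prediction
print(log_loss(y, p, eps=1e-15, labels=[0, 1]))  # -log(1e-15) ~ 34.5
print(log_loss(y, p, eps=1e-3, labels=[0, 1]))   # -log(1e-3) ~ 6.9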

Yeah, true, but organisations don't want you to think like that about false positives and false negatives; in an ideal scenario they probably don't want to be held accountable. I work in health insurance, and a false positive here can lead to someone not getting insurance for a surgery, so heavy penalisation does make a lot of sense. Again, that's an ideal scenario; there will always be human auditors. But my point is that log loss isn't a bad metric: it should be used carefully, and Zindi should refrain from using it as much as they do.

National polytechnic school of algiers

I believe it really depends on whether it's more important to predict ones than zeros. But in my opinion, AUC would have been a much better metric.

However, the accuracy logic is flawed: this dataset is imbalanced, so you can't really compare accuracy with log loss or AUC and so on. Besides, accuracy is a count-based metric, whereas AUC and log loss use probabilities to compute the score.

8 Jul 2020, 15:25
Upvotes 0
> However, the accuracy logic is flawed: this dataset is imbalanced, so you can't really compare accuracy with log loss or AUC and so on.

I agree, the dataset is imbalanced. Again, I made the example as simple as possible for everyone to understand; you can try f1_score and AUC on the same example, and you will get my point :). People understand accuracy better, so I chose to use it :)
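For anyone who wants to run that comparison, a quick sketch reusing the two prediction sets from the example (AUC and accuracy both rank the model well above the 0.5 baseline, while log loss does the opposite; F1 behaves like accuracy here):

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score

true = np.random.choice([0, 1], size=100).astype('float')

pred = true.copy()
pred[0:3] = 1 - true[0:3]      # the 97%-accurate model
baseline = np.ones(100) * 0.5  # the "no ML" 0.5-everywhere submission

print(log_loss(true, pred), log_loss(true, baseline))           # ~1.0 vs 0.69
print(roc_auc_score(true, pred), roc_auc_score(true, baseline)) # ~0.97 vs 0.5
print(accuracy_score(true, pred > 0.5))                         # 0.97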

While your statement is somewhat correct, I would ask you not to be so categorical about the log loss metric. Log loss fundamentally forbids confidence in wrong predictions! Imagine you're building a system that predicts whether a person suffers from HIV: you want your model to place wrong "ones" nowhere, and if it does, your metric must be harsh on it!

Usually, the metric choice is up to the organizers and depends a lot on their use case. Even if log loss everywhere is surely a bad idea, log loss nowhere isn't any better!

8 Jul 2020, 16:01
Upvotes 0

Yes @zkiller, the choice of log loss as a metric should depend on the competition. But using it for every competition isn't a good idea. In an insurance recommendation challenge, for instance, getting more 1s would be favourable in my view.

I always raise the same discussion in many competitions, but no one understands what I am saying. PLEASE!!!!! LOGLOSS IS A STUPID METRIC!

I left Zindi for this reason

@rinnqd thanks for the support here. My point is that it's a good loss function, but not a good metric :D

I agree with the thought, but AUC is insensitive to meaningful misorderings, is a relative measure, and does not incentivize well-calibrated probabilities. This is why log loss is used as an evaluation metric for binary classifiers when well-calibrated probabilities are important.
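A small sketch of that calibration point, with my own toy numbers: two models that rank the samples identically get the same AUC, but log loss notices when the probabilities are badly calibrated.

import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

true = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
calibrated    = np.array([0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90])
miscalibrated = np.array([0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45])

# both separate the classes perfectly, so AUC = 1.0 for both
print(roc_auc_score(true, calibrated), roc_auc_score(true, miscalibrated))
# but only log loss notices that the second model calls every positive
# "probably negative" (all of its scores sit below 0.5)
print(log_loss(true, calibrated), log_loss(true, miscalibrated))  # ~0.30 vs ~0.59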

8 Jul 2020, 18:18
Upvotes 0

Thanks all - we'll take this up internally. I don't think we can change the metric on this one since it's launched and going, but for future comps/clients, we can push the clients to go for something like AUC or F1 where it makes sense.

I think @Sravan121 hit the nail on the head with the point that in any problem like this there need to be some humans thinking hard about what a model's outputs mean, and how the choice of metric affects that.

We shall see what the big bosses say but I suspect that looking at the list of top competitors here sharing ideas might be a good argument in favour of some new metrics to hack ;)

Thanks @devnikhilmishra for bringing this up, and everyone for all the input - it's really cool having such a large community of experts here on Zindi helping us learn :)

8 Jul 2020, 18:22
Upvotes 0

Exactly, I agree with @Sravan121 here: the choice of metric should be in accordance with what the outcome of the problem should be. And thank you @Johnowhitaker for taking the time to engage in this discussion and review it :)

Vidya Jyothi Institute of Technology

Agreed! I was in a hackathon where the average log loss on the LB across all people was 0.73 or something... Keeping everything at 0.5 I could get to 0.69 without any model... So... I support!!

Same here @saikrithik, we clipped at 0.18 in a competition :D

Good point Nikhil. +1

I would like to add my 2 cents here.

In real-life problems, the evaluation metric is always a big challenge, because the business doesn't understand metrics like AUC or log loss; goals are set with improvements to accuracy, recall, and precision numbers in mind. That being said, setting a hackathon evaluation metric is a challenge in itself, since the idea is to provide the best way to evaluate a large audience. A detailed analysis is definitely needed to set up the evaluation metric for each hackathon.

As far as my experience goes, AUC works well for binary classification problems. It is independent of threshold values and powerful in the case of imbalanced data. Log loss is also a good metric, but I feel it works better for multi-class classification problems.

8 Jul 2020, 23:11
Upvotes 0

Thanks Abdul for sharing your valuable insights regarding log loss and AUC. A lot of this was observed from your notebook :)