The description of the target value `extent` says it lies in the range 0-100, in increments of 10%.
> **Ancillary data:** For each growth stage, the damage types and their `extent` are provided, with the extent given as a percentage (%) in 10% increments.
Therefore, from a classification point of view, it seems we have 11 classes (0, 10, 20, ..., 100). It might appear easier for the model to tackle the problem this way, since we are only dealing with eleven discrete labels instead of a continuous range (regression).
However, from what I have tried so far, a simple regression model works far better than a simple classification one. I assume this might be due to the imbalanced nature of the dataset, or to the test data containing values outside the training extent range (as in this example given to us):
| ID | extent |
| --- | --- |
| L1095F00009C01S00200Rp01978 | 56 |
| L1095F00009C01S00200Rp09218 | 48 |
What do you think?
Because MSE is sensitive to the numeric distance between prediction and target, while cross-entropy is not.
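A toy numeric sketch of that point (the labels, probabilities, and true value here are made up for illustration, not taken from the competition):

```python
import numpy as np

# Suppose the true extent is 40, i.e. class index 4 of [0, 10, ..., 100].
labels = np.arange(0, 101, 10)
true_idx = 4

# Regression view: the squared error grows with the numeric distance.
se_pred_50 = (50 - labels[true_idx]) ** 2    # off by 10 -> 100
se_pred_100 = (100 - labels[true_idx]) ** 2  # off by 60 -> 3600

# Classification view: cross-entropy only reads the probability assigned
# to the true class, so confidently predicting "50" or "100" costs the same.
def cross_entropy(probs, idx):
    return -np.log(probs[idx])

probs_say_50 = np.full(11, 0.02)
probs_say_50[5] = 0.8    # 80% confidence on extent 50
probs_say_100 = np.full(11, 0.02)
probs_say_100[10] = 0.8  # 80% confidence on extent 100

print(se_pred_50, se_pred_100)                 # 100 3600
print(cross_entropy(probs_say_50, true_idx))   # identical...
print(cross_entropy(probs_say_100, true_idx))  # ...to this
```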
Hey, it makes sense that the better choice here would be regression. Take a simple example where your model is struggling over whether an image's extent should be 40 or 50, with both equally probable. How do you decide which one to go with? An obvious solution is to take the middle ground (45), and that is natively what a regression approach does.
Someone could also go with a classification approach and then use the probabilities to output a single value, i.e. `np.sum(probabilities * labels)`, with labels being `[0, 10, ..., 100]`.
I haven't tried the latter, but it could be a good compromise.
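The expected-value trick above can be sketched as follows (a minimal example, not anyone's actual competition code; `expected_extent` is a hypothetical helper name):

```python
import numpy as np

labels = np.arange(0, 101, 10)  # the 11 possible extent values

def expected_extent(probabilities):
    """Collapse the 11 class probabilities into one value via the expectation."""
    return float(np.sum(probabilities * labels))

# The model torn between 40 and 50 with equal probability:
probs = np.zeros(11)
probs[4] = probs[5] = 0.5
print(expected_extent(probs))  # -> 45.0, the "middle ground"
```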
Hi, thanks for your input on this. Looking at the `extent` column in the training dataset, it only contains percentages in 10% increments, so every instance has exactly one of the labels `[0, 10, ..., 100]`.
So, regarding the model struggling between 40 and 50, may I ask in which cases you think this phenomenon might happen?
Yeah, I agree that every instance has one specific label. But the metric being RMSE (not log loss or any other classification metric), combined with the gradual 10% increments, makes it even more punishing when you predict the wrong extent.
I have seen a few instances where the image shows obvious drought damage, but the labelled extent is 0. In such cases, predicting anything higher than 0 results in a relatively high penalty.
I have also seen a few cases (a lot, actually) where the extent is very low (say 30) but the model's estimate of the damage is about 70 (and frankly, in some of those cases I believed the model to be right, for the simple reason that there wasn't a single healthy plant in those images).
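To put rough numbers on how punishing that is, here is a hypothetical mini-batch where a single "obvious damage but extent 0" image ruins an otherwise perfect score (the values are invented for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Nine perfect predictions plus one image labelled 0 that the model,
# seeing visible damage, scores as 40.
y_true = [0] + [50] * 9
y_pred = [40] + [50] * 9
print(rmse(y_true, y_pred))  # that one 40-point miss alone gives RMSE ~ 12.65
```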
Oh, interesting! You are correct.
Are you using all train images for training, or are you leaving some out based on your analysis?
I'm using all images for now, but removing some images *might* help.
Did you get the RMSE score of 10 without any post-processing using the metadata given in the dataset?
I think all submissions with scores >= 9 can be achieved without any use of the "damage type" column. I doubt that the higher-ranked solutions didn't use the leak, though.
Take into consideration that you can use the leak to get a high score while hiding your real score.
Hmm, I too got an RMSE of 10 without the damage type, and I will most probably reach a score of 9 too, but I don't think there is much scope for pushing it further. It's a humble request to all top-10 participants to at least tell us whether they used the damage type in their solution in any way.
Well, the final accepted top-10 solutions are based on the scores on the full test set (currently the score is computed on only 20-30% of the test set, so a shake-up is expected).
If all of these top-10 solutions select submissions which use the leak, they will all be disqualified. Given the time everyone invested in this competition, I prefer not to believe that they would risk it all for nothing.
Anyway, I think it's better if the organizers check the top 25-30 solutions instead of the top 10.
I agree with you on every point, but I am not expecting much of a shake-up, as CV and LB seem to be correlated and the test set seems to be randomly sampled from the full dataset.
@Nayal_17 Yeah, my current score is without any post-processing. As @hasan_n says, you can achieve a score of 9.x without any post-processing or use of the leak.
I wouldn't trust any score lower than that :)