Makerere Fall Armyworm Crop Challenge
Can you determine if maize crops have been affected by the fall armyworm pest?
Prize
$1 000 USD
Time
Ended 8 months ago
Participants
156 active · 671 enrolled
Helping
Uganda
Good for beginners
Classification
Computer Vision
Agriculture
Tie break resolution method
Platform · 16 Apr 2022, 10:28 · 8

It seems like many participants will score AUC = 1 by the time the competition ends. How will the tie be resolved?

Discussion · 8 answers

Seconding this comment!

16 Apr 2022, 18:47
Upvotes 0

I agree...

The dataset is way too easy; it didn't even take 5 minutes.

18 Apr 2022, 06:16
Upvotes 0

I disagree. It might just be a coincidence, but I find it weird that most leaderboard scores have an AUC of 1. This would suggest that, on the chunk of the test dataset used for the public leaderboard, participants managed to produce exact 0 and 1 predictions. Is it possible that the backend AUC calculation is faulty? @amyflorida626

The predictions submitted are probabilities, so 0.9, 0.99 and 0.999 are very different values when the AUC is calculated. Also, probabilities are not to be rounded, per @Zindi's policy (a small sketch of why rounding matters follows the quoted rule):

If the error metric requires probabilities to be submitted, do not set thresholds (or round your probabilities) to improve your place on the leaderboard. In order to ensure that the client receives the best solution Zindi will need the raw probabilities. This will allow the clients to set thresholds to their own needs.
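As a side note, a minimal sketch (toy numbers, not from this competition) of why rounding matters: AUC compares how the scores rank the examples, and collapsing well-ranked probabilities to hard 0/1 labels throws that ranking information away and can lower the score.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
probs = np.array([0.10, 0.40, 0.60, 0.55, 0.80, 0.95])  # imperfect but mostly well-ranked

print(roc_auc_score(y_true, probs))                       # ~0.89 with raw probabilities
print(roc_auc_score(y_true, (probs >= 0.5).astype(int)))  # ~0.83 after rounding to 0/1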

Keep in mind that we are seeing the public leaderboard, what does the private leaderboard hold in store?

Hence several participants can have the same public leaderboard score but very different private leaderboard scores, so there is no need to worry.

However, in the (less likely) event that the private leaderboard scores are also identical, @Zindi has this covered in the following rule:

If two solutions earn identical scores on the leaderboard, the tiebreaker will be the date and time in which the submission was made (the earlier solution will win).

PS: Check out the competition rules for details

I disagree with these points... I'll explain why as best I can.

Firstly, at no point does the description of this particular challenge state that rounding model outputs is forbidden, nor that raw probabilities are even required. In fact, I'd argue that the phrase:

"Where 1 indicates that the image has been affected by a fall armyworm and 0 if it has not been affected"

...indicates that raw probabilities are discouraged. Therefore the generic Zindi rule, "If the error metric requires probabilities to be submitted", likely does not apply here.

Moreover, there is nothing discouraging the use of models which can only output binary classifications, further lending weight to the argument that these AUC scores of exactly 1 are perfectly logical and the underlying AUC code is fine.

The last bit on this point is that, even from personal experience, I've found that models which output continuous target probabilities can converge very easily to giving effectively binary [0, 1] outputs when trained on this data.
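A minimal sketch of that last point (toy labels, not the competition data): if the outputs have saturated to 0.0 and 1.0 and every prediction is correct, the AUC is exactly 1 with nothing faulty in the scoring code.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0])
preds = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])  # saturated sigmoid outputs, all correct

print(roc_auc_score(y_true, preds))  # 1.0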

The next point:

"Keep in mind that we are seeing the public leaderboard, what does the private leaderboard hold in store?"

Yeah, this is fair enough... but it relies on the competition organisers not disclosing that the AUC score is only a public-facing metric and that under the hood they're using some other metric. However, given my point above about AUC scores of 1 being genuine and transparently perfect results, there is no possible other metric under the hood that could differentiate the results... a perfect classifier is perfect, no matter how you change the metric.

It would also rely on the statement in the info page:

'The evaluation metric for this competition is Area Under the Curve (AUC).'

...being a very bold lie.

The last point:

"If two solutions earn identical scores on the leaderboard, the tiebreaker will be the date and time in which the submission was made (the earlier solution will win)."

Yes, this is explicitly stated in the rules... but come on, how is this in any way fair? There are people on the leaderboard who joined after the current 1st-ranked competitor (no offence to the person in first... we're all just very jealous right now!) and achieved the top score in fewer attempts. Why not a tiebreaker based on the difference between the time a participant joined and the time they submitted their top result? The current rule penalises those joining late and goes against the spirit of the competition, which is really to generate the best, most interesting and diverse set of model solutions... something that won't happen if people see there's no point in even joining because they can't win.

Just my two cents on the matter... please don't hate me, I haven't eaten yet today :')

A number of the points you raised are very solid; however, you might want to rethink some of them. If anyone agrees with me on the first go, I am probably saying something off.

Look at this evaluation section, particularly the sample submission file, and give it some thought.

"""

Evaluation

The evaluation metric for this competition is Area Under the Curve (AUC).

For every row in the dataset, submission files should contain 2 columns: ID and Target.

Where 1 indicates that the image has been affected by a fall armyworm and 0 if it has not been affected.

Your submission file should look like this (numbers to show format only):

Image_ID             Target
ID_D9ONL553           0.13
ID_263YTILY           0.87

"""

How different is AUC from Accuracy?

AUC is used when we wish to work with probabilities, so that we have a sense of the level of confidence of each prediction; intuitively, 0.99 is more trustworthy than 0.90. Accuracy is used when we wish to work with actual predictions (0 or 1 in this case, based on some threshold or strategy; we commonly use np.argmax()).
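A minimal sketch of the difference (toy labels and probabilities): accuracy needs an explicit thresholding step, such as np.argmax over class scores, while AUC is computed directly on the probabilities.

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y_true = np.array([0, 0, 1, 1])
probs = np.array([0.20, 0.45, 0.55, 0.90])  # P(class == 1)

auc = roc_auc_score(y_true, probs)          # 1.0 -- the ranking is already perfect
hard = (probs >= 0.5).astype(int)           # thresholding step that accuracy requires
acc = accuracy_score(y_true, hard)          # 1.0 at this threshold

# With two-column class scores, np.argmax performs the same 0.5 thresholding:
scores = np.column_stack([1 - probs, probs])
assert np.array_equal(np.argmax(scores, axis=1), hard)
print(auc, acc)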

I am happy to discuss this further.

Enjoy your breakfast bro, thanks for pointing this out; I should get breakfast too :-)

You're quite right! I guess what it really comes down to is whether the sample submission file is a literal example of what the contest organisers are after, or whether '(numbers to show format only)' should be taken as carrying more weight. Basically @Zindi need to clear this up! :')

Because, unfortunately, when it comes to an AUC score of 1, to my knowledge there's no way of differentiating classifier scores at that point, unless, like you say (correct me if I'm misinterpreting you here), there's some thresholding going on privately in the scoring code and the scores are in fact different because everyone's been submitting probabilities rather than binary classifications.

One way to determine that would be for everyone with top scores to declare here whether their outputs are completely, 100% binary... and whether that is due to user-defined thresholding (danger territory, as you showed with the Zindi rules... but again ambiguous in this competition) or because the models simply produce that sort of output without any thresholding. In the latter case we return to the original problem of everyone genuinely having perfect models and the AUC scores matching both publicly and privately.

Please correct me if I'm wrong or talking garbage! And definitely eat, it makes everything better :')

I have probably made a number of mistakes, such as "...therefore 0.9, 0.99, 0.999 are very different when AUC is calculated", which is not always true.
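A minimal sketch of that correction (toy numbers): AUC depends only on how the scores rank the examples, so monotone rescalings of the probabilities, including squeezing them all towards 1, leave it unchanged.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])
probs = np.array([0.2, 0.7, 0.4, 0.6, 0.9])

print(roc_auc_score(y_true, probs))             # ~0.83
print(roc_auc_score(y_true, probs ** 10))       # same value -- ranking unchanged
print(roc_auc_score(y_true, 0.9 + probs / 10))  # same value -- squeezed towards 1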

The only tiebreakers are the private leaderboard and submission time, unless something in the competition is updated in the coming days.