Hi everyone,
I wanted to open up a discussion about the outcome of this competition and whether the winning solutions truly align with the original intent of the challenge.
The competition was framed as a Computer Vision challenge to estimate cassava root volume from underground scanning images. However, some of the top-ranking solutions, including the second-place submission, did not use image data at all. Instead, they relied purely on metadata and tabular features like genotype averages, plant numbers, and layer differences (Delta = End - Start).
If the best-performing models didn’t actually analyze the images, a few questions come to mind:
If a solution only uses metadata and ignores the images, how useful will it be for farmers and researchers who need to estimate root volume from actual underground scans? In real-world settings, new cassava varieties and scanning conditions may not have reliable metadata available. Wouldn't a purely vision-based approach be more generalizable?
Some participants pointed out that using RMSE as the evaluation metric might have favored tabular models, as it is highly sensitive to outliers. Would using MAE (Mean Absolute Error) or a different metric have encouraged more image-based approaches?
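To make the outlier-sensitivity point concrete, here is a toy illustration with made-up numbers (not from the actual competition data): a single badly missed outlier inflates RMSE far more than MAE, which is why RMSE can push competitors toward models that chase the tail of the distribution.

```python
import numpy as np

# Hypothetical root-volume targets with one large outlier, and a
# predictor that is close on typical plants but misses the outlier badly.
y_true = np.array([10.0, 12.0, 11.0, 9.0, 100.0])
y_pred = np.array([11.0, 11.0, 10.0, 10.0, 40.0])

errors = y_true - y_pred              # [-1, 1, 1, -1, 60]
mae = np.mean(np.abs(errors))         # (1+1+1+1+60)/5 = 12.8
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((1+1+1+1+3600)/5) ~= 26.85

print(f"MAE  = {mae:.2f}")   # 12.80
print(f"RMSE = {rmse:.2f}")  # 26.85
```

Four near-perfect predictions barely register in either metric, but the one 60-unit miss roughly doubles the RMSE relative to the MAE because the error is squared before averaging.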
While I respect the skill and efficiency of the winning solutions, I wonder if they truly serve the purpose of the challenge. Should a challenge marketed as a Computer Vision competition reward solutions that don’t actually use images?
👉 To the best of my knowledge, this is how I see it, though I may be wrong on some points. I mean no disrespect to my fellow challengers, the organizers, or anyone else; these are just some thoughts I wanted to share.
I don't think so. To me, it is a disaster like the Smart Energy Supply Scheduling for Green Telecom Challenge. The metric should have been MAE. If they wanted an image-only model, they shouldn't have provided the metadata, or they should have clearly stated that in the rules.
In my opinion, the best solutions should take both the images and the tabular data into account when estimating root volume. Quote 1: "The estimation of root volume should take into account the left and right images, since these represent parts of the full image. The full image can be segmented to identify the roots of the individual plants in it. Following this, you should then carry on with volume estimation."
Quote 2: "The PlantNumber should primarily be used as a reference to check the output of your segmentation, since the first step of this challenge is identifying the different plants in a given image. As previously mentioned, the values Start and End are merely suggestions, and their use is not mandatory. No other data than what is provided is required for this challenge."
The point of this challenge is to solve a problem: non-destructive yield estimation for cassava farmers. A solution that simply takes metadata doesn't help much at all -- as a farmer I will likely have large blocks of a single variety (maybe only a single variety everywhere) and I will not be looking at GPR data to come up with "start" and "end" layers for some model.
So a tabular model using only the metadata will just tell me the same number for each of my plants and that does nothing for me in terms of yield prediction -- at best it becomes a look-up table. It will also be very wrong a lot of the time because weather and other conditions won't be accounted for.
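To make the "look-up table" point concrete, here is a minimal sketch of what a metadata-only baseline effectively reduces to (the column names and values are hypothetical, for illustration only): a per-genotype average that returns the same number for every plant of a variety.

```python
import pandas as pd

# Hypothetical training metadata: volume labels keyed only by genotype.
train = pd.DataFrame({
    "Genotype": ["A", "A", "B", "B", "B"],
    "RootVolume": [10.0, 12.0, 20.0, 22.0, 24.0],
})

# The "model" is just the per-genotype mean, i.e. a look-up table.
lookup = train.groupby("Genotype")["RootVolume"].mean()

# A farmer with a whole field of genotype "A" gets one identical number
# back for every single plant -- no per-plant information at all.
field = pd.DataFrame({"Genotype": ["A"] * 4})
field["PredictedVolume"] = field["Genotype"].map(lookup)
print(field["PredictedVolume"].tolist())  # [11.0, 11.0, 11.0, 11.0]
```

Whatever the actual winning pipelines did, anything in this family gives zero per-plant signal for a single-variety field, which is the scenario most smallholder cassava farmers are in.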
When it comes to AgTech, limited data is the norm. Plants take a long time to grow, there are a lot of different variables, and a lot of variance for anything you're trying to model. It's the reality.
I think the only thing that would have been nice is to have the full set of data for all plants in the given scans. I believe there were only 255 unique plants provided in the training data, but across 98 scans there should have been 686 total plants (7 plants per scan × 98 scans). And because it's difficult to home in on where exactly a plant sits within the length of a scan, it becomes difficult to chop the 98 scans down to just the 255 labeled plants (so the 431 plants we don't have volumes for become noise in the detection/estimation problem).
So yes to your point #1 but not in the sense that you should expect 10x or 100x more data.
No to #2, that's impossible.
Also kind of yes and no to #3. It might have been a good idea for the organizers to set a baseline score that any viable solution must beat, with that baseline defined by a tabular-only model (or to simply state that solutions must not be tabular-only). From the perspective of CGIAR setting up this competition to get the desired outcome, it is likely a failure if the result is just a useless tabular model.
RMSE vs MAE shouldn't necessarily matter. We're all subject to the same constraints, so we all equally suffer or benefit from the way a competition is set up (this also applies to the approaches used).
I for one am kicking myself for not submitting more often: I was seeing decent test-split results from vision-based solutions, but I assumed I was way off because they weren't below a score of 1! Lesson learned!
Just wanted to add this to make sure my previous comment isn't misinterpreted...
The winners of this competition as-is deserve the respective prizes. I don't mean to diminish their accomplishments with the word "useless". Even if a tabular solution isn't readily applicable there is value in knowing that's the best the participants could do as a community.
It would be great to see CGIAR and Zindi run a second version of this because I do think the signals are there in the vision data. There's a lot you can do with GPR preprocessing but we'd need more information on the radar itself and the data collection protocols.
To the best of my knowledge, this is how I see it, though I may be mistaken on some points. I mean no disrespect to my fellow challengers, the organizers, or anyone else; these were simply some thoughts I wanted to share.