Hi everyone, I'm sharing my solution with you.
For this challenge I worked locally and didn't use a GPU. The whole pipeline (training and inference) runs in one minute.
Given the small size of the dataset, I decided to use a very light model. The only features I used (5 in total) were: Delta (End − Start) and target-mean encodings of Genotype, PlantNumber, Stage, and Side.
The model is a simple XGBoost regressor trained with folds stratified on Genotype.
The environment is Python 3.10.13.
The code snippet is provided below:
import time

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold
start = time.time()
CATCOLS = ["Stage","Genotype","Side","PlantNumber"]
NUMCOLS = ["Start","End","Delta"]
ID = "ID"
TGT = "RootVolume"
PATH = "Your Path Here..."
tr = pd.read_csv(f"{PATH}/Train.csv")
te = pd.read_csv(f"{PATH}/Test.csv")
tr["train"] = 1
te["train"] = 0
tr["Delta"] = tr["End"] - tr["Start"]
te["Delta"] = te["End"] - te["Start"]
data = pd.concat([tr,te])
# Target-mean encodings computed on the concatenated frame
# (test rows have a NaN target, so they don't contribute to the means)
data["x_geno"] = data.groupby("Genotype")[TGT].transform("mean")
data["x_plant"] = data.groupby("PlantNumber")[TGT].transform("mean")
data["x_stage"] = data.groupby("Stage")[TGT].transform("mean")
data["x_side"] = data.groupby("Side")[TGT].transform("mean")
data["x_bl"] = 0.8 * data["x_geno"] + 0.2 * data["x_plant"]  # blend (not in the final feature list)
tr = data.loc[data["train"] == 1].copy().reset_index(drop=True)
te = data.loc[data["train"] == 0].copy().reset_index(drop=True)
FE = ["Delta"] + ["x_geno","x_plant","x_stage","x_side"]
X = tr[FE].values
Xe = te[FE].values
y = tr[TGT].values
grp = tr["Genotype"].values
NFOLDS = 10
skf = StratifiedKFold(n_splits=NFOLDS)
FOLDS = list(skf.split(X,grp))
oof = np.zeros(y.shape)
pe = 0.0
for idx in range(NFOLDS):
    tr_idx, val_idx = FOLDS[idx]
    clf = xgb.XGBRegressor(max_depth=4, n_estimators=80, learning_rate=0.025)
    clf.fit(X[tr_idx], y[tr_idx])
    oof[val_idx] = clf.predict(X[val_idx])
    pe += clf.predict(Xe) / NFOLDS  # average test predictions across folds
    print("FOLD:", idx)
oof = np.round(oof, 2)
CV = mean_squared_error(y, oof, squared=False)  # squared=False -> RMSE
print("CV RMSE:", CV)
sub = te[[ID]].copy()
sub[TGT] = pe
sub.to_csv(f"{PATH}/submission.csv", index=False)
end = time.time()
print(f"Elapsed Time: {end - start} seconds")
Wow! First of all, thanks for sharing your approach, but I'm just wondering whether a solution that doesn't use the images at all is still valuable to the client. This is mainly a computer vision challenge: "Can you estimate cassava root volume from underground scanning images?"
Anyway, I just wanted to share my opinion on this.
I think there is signal in the images, because the Delta feature uses the images indirectly. A bigger dataset would have allowed us to really assess the importance of the images.
@PUBG You took the words right out of my mouth. Anyway, let's see what the clients consider their preferred solution. The onus is on them.
This is also a concern of mine. However, so far it's been shown that predictions made using the images are inferior to those without (most definitely due to the small sample size), so any model that incorporates the images would be less valuable to the client. Also, as @wizzard stated, the tabular data contains some features linked to the images, and those seem to yield better results. In the end, though, the decision is left to Zindi and the clients.
Thanks for sharing, nice work keeping it as simple as possible. Maybe I misunderstood the challenge, as it appeared to be a computer vision or image-processing task, but you didn't use the images and your model succeeded, so very cool work.
@wizzard well played.
Amazing. One question: did you have strong conviction in the cross-validation scores you got with this approach? My teammate had such a model initially, but in the end it was difficult to justify selecting it given the CV of the models with image features.
Not really, but I chose it thinking it might be successful on the private data. The size of the dataset didn't reward big models that used the images.
Your intuition paid off. Congrats!
Congratulations on your victory! Your solution is well deserved. However, it seems the host would be happier if an image-based solution ranked high 😂
Hi @Wizzard,
First of all, congratulations on taking part in the challenge. How does your solution take the provided images into account? Recall that the objective of the challenge was primarily to use computer vision techniques to estimate the volume from the images, so it is a two-step process.
Hello @Joel. One of the challenges was to choose the appropriate depth range for the images. I tried many, but finally I stuck with the proposed start and end layers in the provided metadata. My idea is simple: the higher the delta (end − start), the higher the volume should be, since the plant is assumed to take up more space. So I only used that feature. We all experienced that models which used extensive image features (object detection, feature maps from conv nets, etc.) didn't generalize well on the private leaderboard. Though it is legitimate to enforce the use of the images, I will always stick with the more efficient model; in my case, I think it uses the images indirectly. To me, this is a predictive challenge, and we should not forget that one of the primary goals is to generalize to unseen data.
I think the flaw is in the structure of the competition: the size of the data meant that trying to increase your private score with computer vision techniques was effectively overfitting. Also, there's no way a pure computer vision model would beat the baseline of predicting the average volume per genotype. The only way it could do so is if the images were used together with the tabular data, and since that scores worse than the tabular data alone, the images were effectively noise. The question is: should the winning solution go to the team that incorporated the noise and was best at ignoring it? Plus, we were able to beat the baseline score because we added features linked to the images, so the images were used, albeit indirectly.
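For concreteness, here is a minimal sketch of that genotype-mean baseline. The column names (Genotype, RootVolume) are taken from the snippet above, but the tiny frames below are made-up stand-ins for Train.csv and Test.csv:

```python
import pandas as pd

# Hypothetical tiny frames standing in for Train.csv / Test.csv
tr = pd.DataFrame({"Genotype": ["A", "A", "B"], "RootVolume": [2.0, 4.0, 10.0]})
te = pd.DataFrame({"Genotype": ["A", "B"]})

# Baseline: predict each genotype's mean RootVolume,
# falling back to the global mean for genotypes unseen in training.
geno_mean = tr.groupby("Genotype")["RootVolume"].mean()
pred = te["Genotype"].map(geno_mean).fillna(tr["RootVolume"].mean())
print(pred.tolist())  # [3.0, 10.0]
```

Any image-based model would first have to beat this one-liner to justify its extra complexity.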
Hi @AJoel,
My solution is purely based on the radar data from the left and right side images of the plant, with a CV of 0.90 MAE (I used the MAE score for generalization purposes, since the dataset is small), and I got 1.40042721 on the LB (182nd rank).
Maybe the problem is the evaluation metric: root mean squared error is sensitive to outliers.
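To illustrate the point with made-up residuals (not competition data): a single outlier inflates RMSE far more than MAE.

```python
import numpy as np

# Toy predictions: four small errors plus one large outlier.
y_true = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
y_pred = np.array([1.1, 0.9, 1.1, 0.9, 6.0])  # last point is the outlier

err = y_pred - y_true
mae = np.mean(np.abs(err))        # (4 * 0.1 + 5.0) / 5 = 1.08
rmse = np.sqrt(np.mean(err ** 2))  # sqrt((4 * 0.01 + 25) / 5) ~= 2.24
print(mae, rmse)  # RMSE is roughly double the MAE here
```

Squaring the errors means the one bad prediction dominates the score, which is exactly the sensitivity being discussed.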
I agree, it should have been MAE since the beginning.
My model is very image-based, LOL.
That's an outside-the-box solution. Appreciate you sharing it!