
Sasol Customer Retention Recruitment Competition

Helping South Africa
R10 000 ZAR
Challenge completed ~2 years ago
Prediction
Job Opportunity
253 joined
56 active
Start: Oct 05, 23
Close: Nov 26, 23
Reveal: Nov 26, 23
Hyperparameter tuning
Data · 23 Nov 2023, 04:30 · 6

I have picked up that there is an issue of overfitting in this challenge: the higher the training accuracy, the worse the model performs. Has anyone had a similar experience?

With that being said, how does one know that they have built a model robust enough to handle the test data, if the metric used to select the best model results in worse scores on the leaderboard?

Discussion · 6 answers
skaak
Ferra Solutions

Check and ensure you are not contaminating your validation data. This is easy to do. If you, for example, standardise features like so:

z = [ x - avg(X) ] / std(X)

and you use the full dataset X, which includes the validation set, to calculate avg and std, then you have a (in this case relatively harmless) degree of contamination, since you are looking into the validation set when you construct features. This is really to show how easy it is to let something slip in; if you see a significant impact, then the leak (if it is indeed such a leak) is probably a bit bigger.
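
A minimal sketch of the leak-free order of operations, assuming scikit-learn; X and y below are synthetic stand-ins for the competition data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # placeholder feature matrix
y = rng.integers(0, 2, size=1000)  # placeholder binary target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)  # avg and std come from the training split only
X_val_z = scaler.transform(X_val)          # reuse those statistics; never refit on validation data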

23 Nov 2023, 06:14
Upvotes 0
University of Johannesburg

I am also experiencing the same challenge: high accuracy and F1 scores, with relatively small differences between training and validation, yet the model performs worse on the leaderboard. However, I had split the data into training and validation sets first, and only afterwards standardised the features using the Z-score approach. Thanks @skaak for your input.

23 Nov 2023, 09:16
Upvotes 0

@skaak won't RobustScaler be better for scaling, given that the continuous variables in this case all have outliers?

23 Nov 2023, 12:03
Upvotes 0
skaak
Ferra Solutions

Well ... it depends

Note that I am not suggesting you use z; I am using it to show how easy it is to contaminate your validation set.

The transformation, if any, also depends on the model you use. A NN, for example, works well when the data is homogeneous, e.g. pixels in an image, all between 0 and 255. A random forest, by contrast, can handle very heterogeneous data, and scaling will have little impact on it.
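
To make that concrete, a toy comparison of the two scalers (illustrative values only):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# one extreme outlier among otherwise small values
x = np.array([[1.0], [2.0], [3.0], [4.0], [500.0]])

# StandardScaler: the outlier inflates the mean and std, squashing the inliers together
print(StandardScaler().fit_transform(x).ravel())

# RobustScaler: centres on the median and scales by the IQR, so the inliers keep their spread
print(RobustScaler().fit_transform(x).ravel())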

wuuthraad

"make_pipeline", anyone? It helps with the data-leakage problems you might have inadvertently mixed into your preprocessing, e.g. with RobustScaler (which reduces the impact of outliers on the dataset). Optimizing your model for the "best features" doesn't help when the fuel is subpar. Revisit some of the preprocessing and FE steps you took and see if they can be improved... then just use XGBoost. XGBoost is all you need.

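A sketch of the pipeline idea, assuming scikit-learn and xgboost are installed; the data below is synthetic, so substitute your own features:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))    # placeholder feature matrix
y = rng.integers(0, 2, size=500)  # placeholder binary target

# the scaler is refitted inside each CV fold, so the held-out fold
# never leaks into the preprocessing statistics
model = make_pipeline(RobustScaler(), XGBClassifier(eval_metric="logloss"))
print(cross_val_score(model, X, y, cv=5, scoring="f1"))
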
23 Nov 2023, 17:25
Upvotes 0

Yeah, I'm on my 5th iteration of model building. The first 4 iterations were focused on parameter tuning and getting the best-performing model...

only to find that the best-performing model is XGB on raw data, without any tuning or cleaning... which tells me I have been approaching the data processing with the wrong thought process.