
Sasol Customer Retention Recruitment Competition

Helping South Africa
R10 000 ZAR
Challenge completed ~2 years ago
Prediction
Job Opportunity
253 joined
56 active
Start: Oct 05, 23
Close: Nov 26, 23
Reveal: Nov 26, 23
Hyperparameter tuning
Data · 23 Nov 2023, 04:30 · 6

I have picked up that there is an issue of overfitting in this challenge: the higher the training accuracy, the worse the model performs. Has anyone had a similar experience?

With that being said, how does one know that they have built a model robust enough to handle the test data, if the metric used to select the best model results in worse scores on the leaderboard?

Discussion · 6 answers
skaak
Ferra Solutions

Check and ensure you are not contaminating your validation data. This is easy to do. If you, for example, standardise features like so:

z = [ x - avg(X) ] / std(X)

and you use the full dataset X, which includes the validation set, to calculate avg and std, then you have a (in this case relatively harmless) degree of contamination, since you are looking into the validation set when you construct features. This is really to show how easy it is to let something slip in; if you see a significant impact, then the leak (if it is indeed such a leak) is probably a bit bigger.
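
A minimal sketch of the leak-free order of operations, assuming scikit-learn; X and y below are synthetic stand-ins for the competition data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))     # placeholder feature matrix
y = rng.integers(0, 2, size=1000)  # placeholder binary target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)  # avg and std come from the training split only
X_val_z = scaler.transform(X_val)          # reuse those statistics; never refit on validation data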

23 Nov 2023, 06:14
Upvotes 0
University of Johannesburg

I am also experiencing the same challenge: high accuracy and F1 scores, with relatively small differences between training and validation, yet the model performs worse on the leaderboard. However, I had split the data into training and validation sets first, and only afterwards standardised the features using the Z-score approach. Thanks @skaak for your input.

23 Nov 2023, 09:16
Upvotes 0

@skaak won't RobustScaler be better for scaling, given that the continuous variables in this case all have outliers?

23 Nov 2023, 12:03
Upvotes 0
skaak
Ferra Solutions

Well ... it depends

Note that I am not suggesting you use z; I am using it to show how easy it is to contaminate your validation set.

The transformation, if any, also depends on the model you use. A NN, for example, works well when the data is homogeneous, e.g. pixels in an image, all between 0 and 255. A random forest, by contrast, can handle very heterogeneous data, and scaling will have little impact on it.
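
To make that concrete, a toy comparison of the two scalers (illustrative values only):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# one extreme outlier among otherwise small values
x = np.array([[1.0], [2.0], [3.0], [4.0], [500.0]])

# StandardScaler: the outlier inflates the mean and std, squashing the inliers together
print(StandardScaler().fit_transform(x).ravel())

# RobustScaler: centres on the median and scales by the IQR, so the inliers keep their spread
print(RobustScaler().fit_transform(x).ravel())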

wuuthraad

"make_pipeline", anyone? It helps with the data-leakage problems you might have inadvertently mixed into your preprocessing, e.g. with RobustScaler (which reduces the impact of outliers on the dataset). Optimizing your model for the "best features" doesn't help when the fuel is subpar. Revisit some of the preprocessing and FE steps you took and see if they can be improved... then just use XGBoost. XGBoost is all you need.

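A sketch of the pipeline idea, assuming scikit-learn and xgboost are installed; the data below is synthetic, so substitute your own features:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))    # placeholder feature matrix
y = rng.integers(0, 2, size=500)  # placeholder binary target

# the scaler is refitted inside each CV fold, so the held-out fold
# never leaks into the preprocessing statistics
model = make_pipeline(RobustScaler(), XGBClassifier(eval_metric="logloss"))
print(cross_val_score(model, X, y, cv=5, scoring="f1"))
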
23 Nov 2023, 17:25
Upvotes 0

Yeah, I'm on my 5th iteration of model building. The first 4 iterations were focused on parameter tuning and getting the best-performing model...

only to find that the best-performing model is XGB on raw data, without any tuning or cleaning... which tells me I have been approaching the data processing with the wrong thought process.