I have been wondering if anybody is happy to share some thoughts around what they did/are doing regarding data cleaning/preparation.
I know that this may be a bit cheeky to ask re a live competition, but I am a total beginner and would really appreciate any guidance. Also happy if it makes more sense to wait till the competition is over before sharing this sort of stuff.
Thanks in advance,
Simon!! Nice to see you do this one ..
It is tricky, you have lots and lots of data for a binary classification.
Also, not wanting to give anything away (yet) makes it difficult to discuss, but I think a lot hinges on the technique you select for this, e.g. neural nets or gradient boosting as in the starter. I note that the starter (section outliers) discusses some approaches you could try, both for dealing with outliers and for transforming the data. Have you tried any of those?
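To make that a bit more concrete without giving anything away, here is a minimal sketch of two common options: clipping to percentiles and a log transform. This is not taken from the starter; the synthetic feature and the 1%/99% cut-offs are just assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical skewed feature with two extreme values (not competition data)
x = np.concatenate([rng.lognormal(mean=0.0, sigma=1.0, size=1000), [500.0, 1200.0]])

# Option 1: clip (winsorise) to the 1st and 99th percentiles
lo, hi = np.percentile(x, [1, 99])
x_clipped = np.clip(x, lo, hi)

# Option 2: log transform, which compresses the long right tail (needs x >= 0)
x_log = np.log1p(x)

print(x.max(), x_clipped.max(), x_log.max())
```

Which of these helps, if either, depends a lot on the model, so it is worth checking against your validation score.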
Cool to see you too!
I am using a gradient booster and I guess my data cleaning knowledge comes from the introductory Neural Net courses I have done.
What this has shown me is that I need to put a lot more work into understanding the requirements/theory behind the different ML models.
It's a fascinating and never-ending rabbit hole that just makes me appreciate how good you top guys are!!!
Ok - if you use GBM you have fewer worries than with NN ... here is a really basic example (just copied and pasted from kaggle notebook https://www.kaggle.com/code/lilyelizabethjohn/standardization-using-standardscaler )
    # Standardization
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train_std = sc.fit_transform(X_train)
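This will standardise all your features, i.e. rescale each one to zero mean and unit variance (often loosely called normalising). For a GBM it will not have much impact I think, since tree splits depend on the ordering of the values rather than their scale, but you can run it and see if it makes a difference.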
You may see in EDA or starter that the variables are on very different scales, so this will put them all on the same scale.
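One thing the snippet does not show is what to do with the test data. A minimal sketch, assuming made-up data with two features on very different scales (the arrays here are hypothetical, not the competition data): fit the scaler on the training set only and reuse it on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X_train = np.column_stack([rng.normal(0, 1, 200), rng.normal(5000, 1000, 200)])
X_test = np.column_stack([rng.normal(0, 1, 50), rng.normal(5000, 1000, 50)])

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)  # learn mean/std from the training data only
X_test_std = sc.transform(X_test)        # reuse the same mean/std on the test data

# Each training column should now have mean ~0 and std ~1
print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
```

The test columns will only be approximately centred, because they are scaled with the training statistics; that is expected.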
fwiw this is more about showing how to process data than prepping for GBM, as this won't have much impact, but if you do this then you can in the same vein do others that will change the way a GBM fits the data ...
MG like always, thanks for sharing so freely!!
I have a confession: I got into AI/machine learning to do deep learning / neural net stuff (for some reason, it just feels more like magic to me)!!
Sadly I come from a strong commercial background so lots of tabular data, and the more I study the more I realise deep learning isn't great for tabular data.
Am going to take a look into some of the deep learning toolsets for tabular data and see if that helps.
Or start looking for non-tabular data competitions....
Hi Simon, it is nice that you got a model going. I know you prefer NN stuff, which is why I asked, but I'd still suggest trying to add that standardisation to your pipeline. Even if it does not change the score, it is good to have it in your pipeline and later you can refine it a bit.
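Something along these lines would do as a starting point. It is only a sketch: I am assuming scikit-learn's GradientBoostingClassifier in place of whatever booster you actually use, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, standing in for the real training set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted on the training split only, then applied before the model
model = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=0))
model.fit(X_train, y_train)

score = model.score(X_valid, y_valid)
print(f"validation accuracy: {score:.3f}")
```

Because the scaler lives inside the pipeline, swapping it for another transform later, or swapping the model for an NN, is a one-line change.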
ok cool I will!