
Sasol Customer Retention Recruitment Competition

Helping South Africa
R10 000 ZAR
Challenge completed ~2 years ago
Prediction
Job Opportunity
253 joined
56 active
Start: Oct 05, 23
Close: Nov 26, 23
Reveal: Nov 26, 23
skaak
Ferra Solutions
Model solutions and approaches
Platform · 27 Nov 2023, 06:22 · 43

Hi - what a comp, what a journey. This was a short one, with a lot of it during the RWC, which offered great inspiration for this. And, as in the RWC ... small margins ...

My approach: this begs for a GBM solution. Eyeballing the data it is immediately clear - lots of missing values, categoricals and heterogeneous numericals. Either GBM or you venture into the impute jungle. I stayed out of the latter, although, after some comments by @Satti_Tareg, I did try a few imputation approaches, but to no avail, so I quickly switched back to GBM and simply used the data containing missing values again.
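For what it's worth, a minimal sketch of that kind of "no impute" GBM setup - the file name and Target column are assumptions, not necessarily the exact competition files:

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# Assumed file / column names - adjust to the actual competition data.
train = pd.read_csv("Train.csv")
target = train.pop("Target")

# CatBoost handles NaNs in numeric columns natively and takes raw string categoricals,
# so no imputation or one-hot encoding is needed. Categorical NaNs do need a placeholder.
cat_cols = train.select_dtypes(exclude="number").columns.tolist()
train[cat_cols] = train[cat_cols].fillna("missing")

X_tr, X_va, y_tr, y_va = train_test_split(
    train, target, test_size=0.2, stratify=target, random_state=42
)

model = CatBoostClassifier(iterations=1000, eval_metric="F1", verbose=200)
model.fit(X_tr, y_tr, cat_features=cat_cols, eval_set=(X_va, y_va))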

At some stage I combined a few animals of the GBM zoo - cat (my personal favourite, sorry @wuuthraad) and light and xgboost and more - but later, as I started to tweak the hyperparameters, I threw out the rest and only used cat. It was simply too much management to tweak and keep track of all of them.

Tweaking hyperparameters was needed here, but still had little impact. The biggest tweak, I reckon, is to account for the imbalanced target values; adjusting class weights as well as using thresholds helped with that. Otherwise I did not tweak too much and tried to keep those tweaks light and manageable, often reverting back to defaults. To some extent I used this comp as a way to explore those tweaks a bit.
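A rough sketch of those two imbalance tweaks (class weights plus a tuned decision threshold), continuing from the split in the sketch above; the weight and threshold values are purely illustrative, not the ones actually used:

import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

# Upweight the rarer class (the 3.0 is illustrative - tune it on CV).
model = CatBoostClassifier(iterations=1000, class_weights=[1.0, 3.0], verbose=False)
model.fit(X_tr, y_tr, cat_features=cat_cols, eval_set=(X_va, y_va))

# Instead of the default 0.5 cut-off, pick the probability threshold that maximises F1
# on the validation fold.
proba = model.predict_proba(X_va)[:, 1]
thresholds = np.linspace(0.1, 0.9, 81)
scores = [f1_score(y_va, (proba >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"best threshold {best_t:.2f}, F1 {max(scores):.4f}")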

The biggest thing, of course, is the data and the features one extracts from it. Here I am quite eager to discuss with @wuuthraad what he cooked up. I tried so many things and oscillated between using heaps of features and a heavily culled model. Both had benefits and both at times seemed the right approach. My final model had around 300 features! Yeah - let's not leave any feature behind ... was that Churchill? Or Rassie?

Without domain knowledge or guidance (as some begged for in the forums) the feature engineering became an exercise that closely resembled rolling dice ... I used feature importance from time to time but also tried to, you know, make good choices here.

A lot of the hyperparameter tweaking was, in fact, to accommodate the explosion in features. Depth and regularisation and estimators all became important as the features grew. So I suppose @wuuthraad wins this argument (not that I ever disagreed) and that features, not fancy modelling, give good results. The fancy modelling is simply required to cater for the features.

I tried so many things I did not mention here. I'll mention one: I consistently had better CV than LB performance, so at some stage I tried very hard to distinguish between the train and test datasets. They were so similar that I started to (and still do) suspect the data is simulated rather than real, and I had to abandon that approach.
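One hedged way to make that train-vs-test comparison concrete is adversarial validation: label the train rows 0 and the test rows 1 and see whether a classifier can separate them. An AUC near 0.5 means the two sets are essentially indistinguishable, which matches the experience above. A sketch, with file and column names assumed:

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("Train.csv").drop(columns=["Target"])
test = pd.read_csv("Test.csv")

both = pd.concat([train, test], ignore_index=True)
is_test = [0] * len(train) + [1] * len(test)

cat_cols = both.select_dtypes(exclude="number").columns.tolist()
both[cat_cols] = both[cat_cols].fillna("missing")

clf = CatBoostClassifier(iterations=300, cat_features=cat_cols, verbose=False)
auc = cross_val_score(clf, both, is_test, cv=3, scoring="roc_auc").mean()
print(f"adversarial AUC: {auc:.3f} (close to 0.5 means train and test look the same)")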

Well, that's it. Any comments or others who want to share their approach / experience? As requested by @MakalaMabotja I'll set up an informal coffee meeting to discuss this a bit also and post the (open) meeting details here.

Discussion (43 answers)

Thank you @skaak, and I agree with your sentiment around FE (and with @wuuthraad's comments). I was consistently getting ~0.97 accuracy & F1 score but 0.55 LB scores until I used a simple median impute for the missing data, and that's when I breached the 0.60 mark.

My initial approach was to build 5 models (logistic regression from statsmodels & sklearn, RF, XGB, MLP classifier (sklearn) and NN (tensorflow)) to explore the concepts further, as I have relatively little experience in the data science field, and more importantly to get weights & feature importances to see if I could formulate an impute strategy around them. I eventually landed on a groupby strategy using region, tenure, regularity & cluster (I introduced a KMeans feature using Amount, Frequency, Tenure, Regularity & revenue to group similar customers together). This proved to be a step in the right direction; however, to make the most of it I needed an impute strategy using an XGB regressor for the missing numerical values & a classifier for the categorical features. All models built after this point were hitting 0.68 in the testing phase.
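For reference, a rough sketch of that cluster-then-impute idea; the lowercase column names are guesses at the actual spellings, the cluster count is arbitrary, and train is the training DataFrame:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cluster_cols = ["amount", "frequency", "tenure", "regularity", "revenue"]

X_num = train[cluster_cols].fillna(train[cluster_cols].median())  # KMeans cannot take NaNs
X_num = StandardScaler().fit_transform(X_num)

km = KMeans(n_clusters=8, n_init=10, random_state=42)
train["cluster"] = km.fit_predict(X_num)

# The cluster label can then drive a group-wise impute, e.g. a per-group median
# (an XGB regressor/classifier per column is the heavier version described above).
grp_median = train.groupby(["region", "tenure", "cluster"])["amount"].transform("median")
train["amount"] = train["amount"].fillna(grp_median)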

*It's funny how the basics actually make such a significant change compared to the time- & resource-consuming hyperparameter tuning exercises I went through (one even took 2 full days)*

If I were to highlight the biggest impact/takeaway for me, I'd say:

1. Simple is best, I found myself going back to basics more often than not

2. EDAs are a lifeline, and understanding the various plots helps a lot in understanding what preparation techniques are required, especially with a lack of domain knowledge

3. There's a million ways to skin a cat (no pun intended)

27 Nov 2023, 06:52
Upvotes 2
skaak
Ferra Solutions

lol ... yeah I like that, won't tell my cat about it though.

While I did not find joy with impute, it is really nice to hear about models built around that. I also tried clusters at some stage and got a bit of a lift, but, since my data had missing values, I had to dance around that so much I eventually discarded the clusters. But I think that would be a real nice and natural way to add features here.

The nice thing with impute, if you get it right, is that it allows you to try out many more models, as you also did. It also allows you to do some nice, explicit modelling around region. Region had by far the biggest impact and I tried to exploit that a bit, but given my GBM "constraint" I did not really find a way to make it work harder, except to stretch some features around it. The problem is that this introduced co-linearity and I was not sure if it helped or damaged the final result ...

Oh well, nice, thanks for sharing!

wuuthraad

" the road to mastery is through the basics. wax on, wax off" - Mr Miyagi

Thank you @skaak for the update! This was my first official competition; I really enjoyed it. I would like to know how you dealt with the size of the data. I could only use 100 000 rows; anything bigger took really long to run. The final model, though, I trained on the entire dataset, which took one to two hours because I was using StackingClassifier to bundle a number of models.

27 Nov 2023, 07:43
Upvotes 1
skaak
Ferra Solutions

fwiw I used mostly KFold, but, right at the end, used stratified KFold with stratification based on region. I think, given the huge size of this, you could sample (as you did e.g. 100k rows) and still get a very good model, as long as you stratified on region and perhaps also on target.
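A small sketch of what that region-stratified split could look like; train and target stand for the feature DataFrame and label Series, and the region column name is an assumption:

from sklearn.model_selection import StratifiedKFold

# Stratify on region and target together so every fold sees the same mix of both.
strat_key = train["region"].astype(str) + "_" + target.astype(str)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(train, strat_key)):
    X_tr, X_va = train.iloc[tr_idx], train.iloc[va_idx]
    y_tr, y_va = target.iloc[tr_idx], target.iloc[va_idx]
    # ... fit one model per fold here ...
    print(f"fold {fold}: {len(tr_idx)} train / {len(va_idx)} validation rows")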

That said, waiting a few hours for a model to complete is pretty much business as usual. I think my final model took ~10 hours ... no matter, just let it run through the night (or through Sunday lunch e.g. yum).

I have old hardware but lots of RAM, so I can handle this type of thing. But I also recently got myself a fancy GPU system - this was one of the first comps I did on it. I think the GPU card alone is more than 10k. But, if I get the 10k, it is definitely going into the piggy bank for the next GPU.

Come on Sasol, we worked so very very hard on this, 10k is too small. Please sponsor some of these competitors with souped-up, GPU-enabled machines. Really ... Sasol, when you evaluate the model, if you think you are getting value, why not reward with a proper GPU machine (price ~50k and indispensable for these comps nowadays)? Even the bokke have MTN and FNB behind them ... we really need you here.

skaak
Ferra Solutions

... and, I hasten to add ... wow! You did really well, especially for a first comp. Scary - this comp shows anybody who cares that we have skills aplenty. Wooooow - congrats my friend, e-x-c-e-l-l-e-n-t!!!!

Stacking here is also excellent, but you have to do it right; then your score should increase going from public to private. Note many of the top guys have better private than public scores - I guess many did that through some kind of ensembling.

Thank you. lol, yeah that 10k is going to help.

Yeah this was also my first comp, I had previously done project work here and there but never had to deal with such messy data...

I actually wanted to try to build an ensemble with LightGBM & XGB models trained on the unfilled data, an XGB model based on the cleaned & filled data, and then a voting system to get the best estimate from the 3; however my skill level (or hardware for that matter) isn't at that level yet, so I'm still researching and I'll just do it for curiosity's sake.

But a huge well done @SboneloN

Rorisang

Thanks for reaching out. It will be an honour to sit and discuss the solutions with you all. I couldn't train on Colab (I always ran out of RAM). On my PC it would take the whole day to train a stacked classifier, and sometimes the PC would run out of energy and I would lose my work (energy = electricity :) ). I couldn't even run a voting classifier (e.g. 0.5-0.5, 0.7-0.3, etc.) using a sample size of 100000. My best accuracy on the training data was 0.9236... . A great experience, but impossible to solve the problem sufficiently without domain knowledge and proper hardware/cloud resources (expensive). Since the competition is over, are we allowed to share the notebooks? We could use GitHub.

27 Nov 2023, 08:14
Upvotes 1
skaak
Ferra Solutions

Yeah look at that - your stacking gave you better private than public!

Colab ... a nice option on paper but it never really worked for me, probably because I use the cheapo option. But afaik they don't yet support SA, otherwise I would have bought an account; now I have my fancy GPU so no worry. And a nice big backup system to keep the lights on when Eskom fails us. My GPU adds a considerable load to the system fwiw, but at least in summer there is enough sun; might struggle in winter.

You can share the notebook, but I'd suggest just waiting until Zindi finalises and seals the LB.

How much RAM in your machine?

Rorisang

I have to get a backup too. In terms of the GPU, I only have an MX450. What are you using? PC RAM is 16 GB. I also use the free version of Colab thus the difficulties.

Thanks for the advice. I will wait till the LB is finalized.

skaak
Ferra Solutions

Ok, 16G is a bit small for this type of thing. My very old (2017) machine has 32 that I added myself, and at some stage I could do any comp with that. My fancy new GPU machine has 128. I ordered 256 but when I saw the price I had to scale back a bit ... the GPU itself is a 4070 if I remember correctly, but it has only 12G of GPU RAM and I've had quite a few keras models that would not fit in that. So I am saving for the next one and trying to convince my wife it is a good idea to invest in GPU RAM ...

It's a games machine - the rep told me it is best value for money, which I think is true, but it comes with fancy spinning lights. The bling grows on you after a while.

GPU here made a big difference for me. I tried on CPU only and did some prelim stuff like that, but the difference is so big as to be impractical to use CPU for this one.

Rorisang

I will have to upgrade to 32 GB seeing I have the competitive coding bug now. Good luck in getting the extra funds to invest in the GPU RAM :)

These games machines seem to be good at this. Will look into them as well.

wuuthraad

My solution can be broken down into 5 steps ... simple and straightforward.

Step 1 (Load the dataset)

The data ... is not large (by modern standards), but if you use outdated methods it will increase the time taken to gain meaningful insights. Use these tools and conversions:

A - Convert the CSV file to a Parquet file and you will get a significant increase in speed, up to 50x faster; read this blog post on Parquet files (https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705)

B - Use the RAPIDS accelerated framework library (specifically cuDF) to gain the power of a GPU (UNLIMITED POWER!) when loading huge datasets. The link: https://rapids.ai/

I personally used cuDF in the early stages, then just kept the loading and preprocessing of the data on the CPU ... I didn't see a need for accelerated libraries for this part of my process. I later used other accelerated libraries from RAPIDS, which I will touch on later.
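A quick sketch of the two speed-ups in step 1 (the one-off Parquet conversion and the optional cuDF load); file names are placeholders and the Parquet write needs pyarrow or fastparquet installed:

import pandas as pd

# One-off conversion: Parquet is columnar and compressed, so later reads are much faster.
pd.read_csv("Train.csv").to_parquet("Train.parquet")

# Fast reload with pandas ...
train = pd.read_parquet("Train.parquet")

# ... or, on a machine with a CUDA GPU and RAPIDS installed, the GPU-backed equivalent:
# import cudf
# train = cudf.read_parquet("Train.parquet")   # near drop-in replacement for the pandas call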

Step 2 (preprocessing of data for EDA)

- For the second step I am just using basic transformations to convert some categorical values to numeric and looking at the description of the data (dataset.describe() and dataset.info()) - nothing major.

Step 3 (EDA)

- Exploratory data analysis ... This was an interesting one, mainly due to an interesting find I unearthed when plotting a correlation heatmap: there is a correlation between [[top_pack, region and target]], which is what I mainly used for FE later on. Details can be seen in the discussion https://zindi.africa/competitions/sasol-customer-retention-recruitment-competition/discussions/19089

Step 4 (Feature Engineering)

The 4th and penultimate step, which was the most fun IMO, was the feature engineering. I will distill some of the information so I do not end up writing too much. From 'top_pack' I was able to engineer a feature capturing the duration of the product - was it "one day, two days ..." - the end result being a new column with the number of days. The main reason for the interest and due diligence in [top_pack] was its overall correlation with the target. I dropped 'mrg'; it was useless in terms of helping with FE. Filling missing values in specific columns with the median helped improve my score ever so slightly - @Satti_Tareq's post helped with that part (https://zindi.africa/competitions/sasol-customer-retention-recruitment-competition/discussions/19059). Later I performed aggregations with my newly created features, mainly on amount and tenure, and I found something interesting: if you multiply 'arpu_segment' * 3 you get ~ the same value as revenue ... it was not the exact amount, so I calculated the difference (dataset['arpu_segment']*3 - dataset['revenue']). I did other steps but I won't mention them ... to avoid writing a light novel.
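A hedged sketch of two of those step-4 features - the day count pulled out of 'top_pack' and the arpu_segment * 3 vs revenue residual. The regex is a guess at how the pack names are written, and dataset is the DataFrame name used above:

import pandas as pd

# Extract a number followed by "day"/"days" from the pack description, e.g. "... 7 days".
dataset["pack_days"] = (
    dataset["top_pack"].str.extract(r"(\d+)\s*[Dd]ay", expand=False).astype(float)
)

# arpu_segment * 3 is roughly equal to revenue, so keep the (small) difference as a feature.
dataset["arpu3_minus_revenue"] = dataset["arpu_segment"] * 3 - dataset["revenue"]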

Step 5 (model training and evaluation)

- Ahhh yes, @skaak and @Rorisang, I completely disagree with waiting HOURS for your model to train. I literally trained most of my models in under 10 mins, some in less than a minute - using (read! https://rapids.ai/) of course. Specifically I was using the accelerated version of XGBoost, which was all I needed. I used cuML, which is an accelerated ML library similar to sklearn (read! https://rapids.ai/), where some of the most "popular" models are GPU-accelerated; instead of waiting hours you can train on the entire dataset in less than a minute, which is what happened for me with RandomForestClassifier. I fitted the entire dataset and less than a minute later the training was done. @skaak, @Rorisang, and everyone in general: I advise you to utilise https://rapids.ai/.

BTW, I was using Google Colab Pro with high-RAM capabilities - yes, $9.99 well spent. Why Colab? I still have PTSD from when Eskom ruined my GPU way back when.

Back to the model: I tried creating a SuperLearner (https://machinelearningmastery.com/super-learner-ensemble-in-python/) with LogisticRegression, RandomForestClassifier and XGBoost. It did reasonably well, but keeping track of all the metrics was a bit too tedious, so I just stuck to the accelerated (RAPIDS GPU) version of XGBoost, which I then incorporated into sklearn's CalibratedClassifierCV (https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html) with cv=5. That, in a nutshell, is what I did ... there are improvements to be made on the FE side of things. Fine-tuning model parameters was never my main focus; getting informative features was. That being said, I did not do any hyperparameter tuning, but when I do, I mainly use https://optuna.org/
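For what it's worth, a minimal sketch of that last modelling step - a GPU XGBoost wrapped in sklearn's CalibratedClassifierCV with cv=5. Parameters are illustrative, the features are assumed to already be numeric/encoded, and X_train / y_train / X_valid stand for the prepared splits:

from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

# GPU-accelerated XGBoost (recent versions use device="cuda"; older ones use
# tree_method="gpu_hist" instead).
xgb = XGBClassifier(
    n_estimators=500,
    tree_method="hist",
    device="cuda",
    eval_metric="logloss",
)

# Wrap the booster in 5-fold probability calibration, as described above.
calibrated = CalibratedClassifierCV(xgb, cv=5)
calibrated.fit(X_train, y_train)
pred = calibrated.predict(X_valid)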

27 Nov 2023, 08:45
Upvotes 5
skaak
Ferra Solutions

?

Did you really do all that? Wow @wuuthraad, impressive as always. Looking forward to the rest. This is the thing with you - it always feels as if you are two steps ahead with the tech, and it has this guilty-I-must-do-that-also kind of feeling to it, you know; why throw resources at it if you can do it smarter?

Yeah, I probably should read that article and learn those things you mention ... thanks, now I have real motivation to do it. And my wife will support it, as it will stave off the imminent GPU upgrade a bit ...

wuuthraad

Can you see it now @skaak?

@skaak and @Rorisang, simplify your life guys, use RAPIDS. Hahaha, I am touched that you two were training your models and waiting for over an HOUR! Yes, there are exceptions, but this is not one of them.

I was just FEing to the moon.

wuuthraad

Also ... one of the steps I forgot to mention: I used "undersampling" when training my model to get a ~ more even distribution of the target variable.
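A one-liner version of that undersampling step, assuming imbalanced-learn is installed and X / y are the feature frame and label Series (random undersampling just drops majority-class rows, so the training set does get smaller):

from imblearn.under_sampling import RandomUnderSampler

# Drop majority-class rows until the two classes are the same size.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(y_res.value_counts())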

skaak
Ferra Solutions

He he, I thought you edited it out a bit while the LB is still open. Nope, just see a dot.

Wait wait wait, there it is now.

Wow, you are not sparing the details ... I have to thank you, kind sir, that is nice.

I've used h2o in the past but it added too much complexity and I stay away from such overhead, but you know, if it (rapids) is as good as you describe then it is worth it.

"Undersampling" e.g. select same amount of + and - and thus undersample the many -? So it made the sample smaller? Is that not why you had good performance then?

Anyhow, will leave these for the live session, but definitely will check out rapids.

Initially I used lots of CalibratedClassifier, hoping it would improve results, but eventually ditched it for the time saving and it did not really make a difference as far as I could tell.

skaak
Ferra Solutions

Woooooow thanks again.

I stayed away from optuna - it is too easy to overfit (I think) - but I do a few very small grid searches on hyperpars. Perhaps one day I will have to use optuna ... I tried RF. I think any tree-based method would work well here, but given the time, I had to cancel it eventually. If you could do RF with RAPIDS, that certainly is very impressive. Wow, need to read! it. RF + GBM I think would nail this one.

wuuthraad

I'll share my notebook when @Zindi gives the all clear. Knowledge is meant to be shared.

Rorisang

@wuuthraad, I heard of Rapids (only knew of rapyds) today from the discussion you had with @skaak. I will investigate it. Thanks for the links.

Rorisang

Grid searches will definitely work but take a long time. What do you think of random search instead?

Rorisang

I oversampled instead so as to have 'more' data, but both approaches tackle the same problem.

skaak
Ferra Solutions

Random search - depends but sometimes can be the right thing to do, and, given the complexity of reality, any model is a type of random search (on a philosophical level ...)

The defaults are often good, and set by experts. To deviate you need some good reason, but it is always worth testing a simple grid around the defaults and seeing the impact, or deviating from the defaults to regularise the model more heavily.
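As an illustration of that "small grid around the defaults" idea (CatBoost's defaults are depth=6 and l2_leaf_reg=3, so the grid just brackets them; features and the X_tr / y_tr split are assumed already prepared):

from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

# Deliberately tiny grid bracketing the library defaults.
grid = {
    "depth": [5, 6, 7],
    "l2_leaf_reg": [1, 3, 10],
}
search = GridSearchCV(
    CatBoostClassifier(iterations=500, verbose=False),
    param_grid=grid,
    scoring="f1",
    cv=3,
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)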

The nice thing here is that there was lots of data ... sure, a modern problem will easily have even more, but you did not have trouble sampling or any sort of degrees-of-freedom problem.

On the issue of oversampling and undersampling ... I did both, as training and testing (on the entire dataset) gave me a more generalised model, so if I got an F1 score in training then I'm almost certain that's the score I will get on the LB. That saves having to burn submissions and time trying to figure out whether the changes made were impactful.

For the oversampling I used a threshold of 0.7 and 0.6 for undersampling, and passed it through pipelines to get a more mixed dataset.
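For anyone curious, a minimal sketch of such an over-then-under pipeline with imbalanced-learn. The 0.7 / 0.6 values above don't map one-to-one onto imblearn's sampling_strategy ratios (the under-sampling ratio must be at least the ratio reached after over-sampling), so the numbers here are illustrative; LGBMClassifier stands in for the model, and SMOTE assumes fully numeric, imputed features:

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from lightgbm import LGBMClassifier

# Oversample the minority class partway, then undersample the majority the rest of the way.
pipe = ImbPipeline(steps=[
    ("over", SMOTE(sampling_strategy=0.8, random_state=42)),
    ("under", RandomUnderSampler(sampling_strategy=1.0, random_state=42)),
    ("model", LGBMClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)
# Evaluate on an untouched hold-out, e.g. f1_score(y_valid, pipe.predict(X_valid)).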

The downside is that I could not build an accurate model due to class imbalance (60% +; 40% -), though LightGBM can handle imbalances well. I was able to get the F1 score to 0.701; however it was only 0.689 on the LB due to low precision - still a huge improvement from the 0.680 I was at.

I'm really thankful for the tip on rapids... I'm out here using a 6GB RAM GPU with 12GB RAM disk looking at @skaak's conversation with complete envy

Rorisang

Same here. I can only envy @skaak too. I was doing all the standard preprocessing steps so that I wouldn't have to question the methodology. So I did the over/undersampling just to cover my bases.

skaak
Ferra Solutions

As my kid's piano teacher used to say: Don't blame the equipment.

I've desperately needed a GPU for years, only recently got it, and now don't know what I did without it - in fact I want two more ...

But without GPU, you can still build a decent model. This one in particular, just "normal" data, no images or DNA or video or audio. And with stuff @wuuthraad mentioned or using e.g. HDF5 you can get around the mem limitation. Models like GBM will anyhow sample the data and not use all of it, especially if the data is quite big.

Then there is always stuff like colab or kaggle ...

Ag wat, I feel your pain! Come on Sasol ... and other hosts and sponsors ... we need equipment desperately.

MICADEE
LAHASCOM

@skaak Even though all non-South Africans were later disqualified at the ending stage of the competition, I must still say that this was one of the easiest competitions I have taken part in. I got a winning score by my 4th sub already. Probing my approach further only made me upload more subs until almost 5-6 days to go. Ensembling two models, LGB & CAT, did the magic: CAT with 0.69996 on the public LB and LGB with 0.69991 on the public LB, blended with weights 0.60*CAT + 0.40*LGB, for an overall winning score of 0.70045xx on the private LB. But quite unfortunately, non-South-African participants were disqualified about 1 to 2 days before the end of this amazing competition. What worked best: the feature engineering techniques, the encoding technique, and the model implementation strategy around f1_score.
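A tiny sketch of what such a weighted blend looks like in code (whether the blend was on probabilities or on hard labels isn't stated above, so probability blending and the 0.5 cut-off are assumptions; cat_model and lgb_model stand for the fitted CatBoost and LightGBM models):

# Weighted blend of positive-class probabilities: 0.60 * CAT + 0.40 * LGB.
p_cat = cat_model.predict_proba(X_test)[:, 1]
p_lgb = lgb_model.predict_proba(X_test)[:, 1]

blend = 0.60 * p_cat + 0.40 * p_lgb
pred = (blend >= 0.5).astype(int)   # the cut-off may well have been tuned for F1 instead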

Model time used: only LGB takes almost one hour to run; CAT takes only a few minutes.

Cheers !!! and Congrats to all the winners.

27 Nov 2023, 17:51
Upvotes 0
skaak
Ferra Solutions

Wow @MICADEE thanks for sharing. Only now do I realise the non-SAers have been removed. Nice ensemble ... as I mentioned earlier, you need this to lift the private score. My best private was 0.70059... but I did not select it. I got (based on the current LB) a winning candidate at around sub 50 (I used *all* 100 subs) with my first 0.699.. sub (private = 0.7002...).

Oh well, but let's not swap fish tales ... yes, it would be nice if this was completely open, and of course a lot more difficult.

I struggled quite a bit to get a decent score, and my final and first models are not that different, except for features, so I have to wonder how you did it so quickly? How did you generate features for this one? You make it sound easy ... I had to burn subs and time and effort to generate them ... any insights you may want to share?

skaak
Ferra Solutions

...fwiw even towards the end I was still experimenting with features, trying to find the one that would make it all click. I think there was not one, but several you had to have ... but please, any insights would be valuable ...

MICADEE
LAHASCOM

@skaak ... Smiles ... Awesome!!! On how I did it quickly: it's a result of devoting my time to feature engineering from the beginning, without taking modelling into consideration at that moment. I put everything I know into the feature engineering part, as I knew it would be very hard to rerun this F.E. notebook all over again. In fact, my main target was to break the record of achieving 0.7xxx on the public LB (which is very much possible), but unfortunately I ran out of memory when I implemented all my ideas on the F.E. side. Thus, I had to drop a few of these F.E. ideas and only kept the ones that would fit within my Colab memory. In fact, I wanted to pay for Colab Pro in the first place, but once it was mentioned that the project was strictly meant for South Africans, I had to think twice. The following are the steps taken in my F.E., among other ideas that were excluded due to the memory shortage of free Colab:

1. Re-categorizing feature "tenure".

2. Aggregation of a few numerical features,

e.g. agg_columns: 'revenue', 'Amount', 'arpu_segment', 'data_volume', 'on_net', grouped by cat_cols like 'region', 'tenure', using statistics {'25%': 'p25', '50%': 'p50', '75%': 'p75'}.

3. Feature interactions.

4. Label encoding strategy:

from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder

for col in tqdm(df_train.columns.drop('Target')):
    if df_train[col].dtype == 'O':
        # Categorical column: fill missing with a sentinel, label-encode train + test
        # together so both share the same mapping, then mark as 'category' dtype.
        df_train[col] = df_train[col].fillna('unseen_category')
        df_test[col] = df_test[col].fillna('unseen_category')
        le = LabelEncoder()
        le.fit(list(df_train[col]) + list(df_test[col]))
        df_train[col] = le.transform(df_train[col])
        df_test[col] = le.transform(df_test[col])
        df_train[col] = df_train[col].astype('category')
        df_test[col] = df_test[col].astype('category')
    else:
        # Numerical column: fill missing with a sentinel value the trees can isolate.
        df_train[col] = df_train[col].fillna(-99999)
        df_test[col] = df_test[col].fillna(-99999)

print('Missing data in train: {:.5f}%'.format(df_train.isnull().sum().sum() / (df_train.shape[0] * df_train.shape[1]) * 100))
print('Missing data in test: {:.5f}%'.format(df_test.isnull().sum().sum() / (df_test.shape[0] * df_test.shape[1]) * 100))

# X is the feature matrix used downstream (train without the target).
num_cols = X.select_dtypes(include='number').columns.to_list()    # numerical features
cat_cols = X.select_dtypes(exclude='number').columns.to_list()    # categorical features

NOTE: Possibility of attaining 0.70xxxx on the public LB:

By adding the categorical column "top_pack" to the set of categorical columns used in my step 2 above, one can easily get the job done, but the memory shortage issue didn't allow this. Colab Pro would have been worth paying for, had participants other than South Africans been allowed. This is just a brief overview of a few things done on this great project.

NOTE 2: I didn't even use a GPU in my modelling, with both LGBM & CatBoost used. Total time taken for the two models ~ 1h 55 mins.

Cheers !!!

skaak
Ferra Solutions

Hi - thanks for sharing, I appreciate it. Here I think you had to handle top_pack correctly to get a good score ... it was one of the things that gave me a bit of a lift after being stuck for a while. I initially tried to simplify top_pack, as there are so many possibilities, but you had to try them all (and then throw away, based on feature importance, the ones that did not contribute - so a bit of both).

Oh well - I tried to do the feature interactions right here, with a mathematical approach of course. So I looked at products and ratios of important features, and also tried to understand features and, based on that, combine them in products or ratios or logs etc. I also did some feature stretching, which added just a little bit, but it was significant nonetheless. I was very wary of it, though, as it introduces (more) multi-colinearity in an already very correlated feature set. So I tested it rigorously and it still worked, but I think it would definitely contaminate the feature importances a bit.

At some stage I added new features by recording the direction in which each leaf node is reached. This is a funky way to generate new, non-linear features I read about, but it did not help at all, so I returned to my mathematics.

This was lots of fun - in a way it is simple as you also describe. I think if you sort of do default encoding and GBM correctly you should get ~0.67, but then to go from there to 0.7 (I only reached 0.6999 on public) was quite a journey, but, of course, also a nice one.

skaak
Ferra Solutions

btw how many features did you have in the end?

MICADEE
LAHASCOM

@skaak Great !!! 👍. As for the number of features generated: yeah, 266 features in total.

skaak
Ferra Solutions

Yeah, there was a time I thought you were one and the same ... you did really well for a first-time comp, and look at the boost you got going to private, based on the stacking I guess.

Nice - what was your threshold?

28 Nov 2023, 10:42
Upvotes 0

We need to do something with that name handle lol. Well done SboneloM. I was also worried about the time limit - is there a limit?

28 Nov 2023, 10:58
Upvotes 0
skaak
Ferra Solutions

lol ... perhaps one day you meet and you also look the same ... then you know the matrix glitched ...

Nope, no time limit - your model can take days to complete, and sometimes it does. But go for a fast model; then you can try more things.

How did you do the stacking? Mind sharing details for a look-see? Reason I ask is your score dropped a tiny bit going to private.

I saw that drop. Here is my code:

# I used these 3 models for stacking. I used grid search to tune the parameters;
# only a few were tuned, which resulted in a slight improvement.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from imblearn.pipeline import Pipeline as ImbPipeline

# param_RF / param_XGBC / param_lgm hold the grid-search results; preprocessor,
# over_sampling and under_sampling are defined earlier in the notebook.
RandomForest = RandomForestClassifier(random_state=42,
                                      n_estimators=param_RF["n_estimators"],
                                      max_depth=param_RF["max_depth"])

XGBC = XGBClassifier(n_estimators=100, random_state=42,
                     min_child_weight=param_XGBC["min_child_weight"],
                     subsample=param_XGBC["subsample"])

lgmClassifier = LGBMClassifier(random_state=42,
                               n_estimators=param_lgm["n_estimators"],
                               num_leaves=param_lgm["num_leaves"])

estimators = [('RandomForest', RandomForest),
              ('XGBC', XGBC),
              ('lgmClassifier', lgmClassifier)]

# This is my final stacking classifier (logistic regression as the meta-learner).
logisticRegression = LogisticRegression()
sclf = StackingClassifier(estimators=estimators, final_estimator=logisticRegression, cv=5)

models = {}
models['RandomForest'] = RandomForest
models['lgmClassifier'] = lgmClassifier
models['XGBC'] = XGBC
models['sclf'] = sclf

my_model = {}

# I used a pipeline to process and fit the data, with SMOTE to deal with the class imbalance.
for model_name, model in models.items():
    my_model[model_name] = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('over', over_sampling),
        ('under', under_sampling),
        ('model', model)
    ])
    my_model[model_name] = my_model[model_name].fit(X_train, y_train)
    prediction = my_model[model_name].predict(X_valid)
    # Get the error rate
    print("F1 Score {} :".format(model_name), f1_score(y_valid, prediction))

The score I got from the sample for stacking is 0.6987973760932944.

# I fitted this model on the entire dataset and made predictions

skaak
Ferra Solutions

Ok ... hmmm, just a few observations. It must have been torture waiting for the RF to finish. Also, it seems you create a stacking classifier and fit both that and the individual models - I guess just to test them, so I think it is fine.

I see you combine them using logistic regression. Here you could also just assign arbitrary weights (e.g. equal, or perhaps slightly more to the better models, as @MICADEE did) or even just take the average of the models.

But have to say I am curious. I tried RF a few times but abandoned it - simply takes too long. What f1 did you get from the RF?

Yes - for the final model I did not fit the individual models; this was just for testing.

It took longer when I was fitting the entire dataset. I like the way @MICADEE did it. I tried different models as the final estimator, but the logistic model was the one giving me better results. These are the scores I got for the individual models: F1 Score RandomForest: 0.6941705587268638, F1 Score lgmClassifier: 0.6951476793248945, F1 Score XGBC: 0.6951814712929275.

I think the reason it dropped was because I did not use all the features. I guess I could have been patient and trained the models on the entire dataset to see if the score changed.

skaak
Ferra Solutions

Well my friend, as they say: hindsight is a perfect science. No, you did well, and at the time you made the decisions that were the right ones then.

Let's look forward to the next journey!

I just saw one of my earlier submissions: I scored 0.698767193 on the public LB and 0.699673245 on private.

29 Nov 2023, 10:06
Upvotes 0
MasterDipp

Well done on coming first place.

4 Dec 2023, 01:04
Upvotes 0