💰 Data Talk: Train Data Formation

Zimnat Insurance Recommendation Challenge

Helping Zimbabwe

$5 000 USD

Completed (almost 6 years ago)

Skills you will learn

Prediction

Collaborative Filtering

1784 joined

612 active


Start

Jul 01, 20

Sep 13, 20

Reveal

Sep 13, 20

Sanjay_Arvind

Train Data Formation

Notebooks · 27 Aug 2020, 12:58 · 6

Which is the best way to prepare the data?

1. Melt the product columns and treat it as a binary-classification problem.

2. Duplicate the row for each product the customer has bought, remove that product from the row, and use it as the y_label, similar to how the organiser prepared the test_data.

Please share your thoughts. Thanks in advance.
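The two options can be sketched in plain Python as follows. This is only an illustration on toy data: the column names (`ID`, `age`, `prod_a`, `prod_b`, `prod_c`) and the helper names `melt_to_binary` and `hold_one_out` are hypothetical and are not the competition schema or the organiser's code.

```python
# Toy data: column names and values are hypothetical, not the competition schema.
PRODUCTS = ["prod_a", "prod_b", "prod_c"]

customers = [
    {"ID": "c1", "age": 34, "prod_a": 1, "prod_b": 1, "prod_c": 0},
    {"ID": "c2", "age": 51, "prod_a": 0, "prod_b": 1, "prod_c": 1},
]


def melt_to_binary(rows, products):
    """Option 1: one row per (customer, product) pair with a 0/1 target."""
    out = []
    for row in rows:
        # Keep only the non-product columns as features.
        features = {k: v for k, v in row.items() if k not in products}
        for p in products:
            out.append({**features, "product": p, "target": row[p]})
    return out


def hold_one_out(rows, products):
    """Option 2: for each owned product, hide it and use it as the label."""
    out = []
    for row in rows:
        for p in products:
            if row[p] == 1:
                masked = dict(row)      # copy so the original row is untouched
                masked[p] = 0           # remove the product from the features
                masked["y_label"] = p   # the removed product becomes the target
                out.append(masked)
    return out


binary_rows = melt_to_binary(customers, PRODUCTS)
multiclass_rows = hold_one_out(customers, PRODUCTS)
```

In this sketch option 1 yields one row per customer-product pair (6 rows here) and drops the other product flags from the features, while option 2 yields one row per owned product (4 rows here) and keeps the remaining products as features. Which of the two scores better is exactly the open question in this thread.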

Discussion 6 answers

mathsgrinds

Hi Sanjay, I have tried both and have not had much luck. I made it a binary problem for each product, ran a logit model, and got a score of about 0.06X. I then fitted a multinomial model (as you described, by splitting the data into more rows) and got nearly the same score! I don't think I'm doing as well as some :) since my best score is around 0.06X. I'm amazed that some got 0.03X. I've tried different models (using some of the variables, all of the variables, or a forward selection process). I'm just not having the same luck. I must be missing something.

27 Aug 2020, 13:36 (edited 1 minute later)


Sanjay_Arvind

Thanks for your comment

For me, CV scores are better for the binary-classification problem.

replied to mathsgrinds · 27 Aug 2020, 14:19


darrel

I got a log loss of 0.04 using a single model. Here are some tips:

1. Encoding your categorical variables - Most models can only use numeric data. Beware of encoding techniques that greatly expand the feature space, as this hurts most models (see: curse of dimensionality). How you encode the categorical variables affects your model's performance.

2. Feature engineering - Create additional features from the dataset.

3. Hyperparameter tuning - It's not always sufficient to use the default parameters for your model. You need to optimize the parameters based on what gives the best results on your cross-validation or development set. K-fold cross-validation is useful here, along with RandomizedSearchCV, for example.

4. Bias and variance - If your model performs poorly on both the training and test sets, it is probably not complex enough; try a different model or lower the regularization parameter. If it does well on the training set but not on the test set, try increasing regularization.

5. Imbalanced classes - Some models perform poorly when the classes are imbalanced. Try oversampling or undersampling, and score your models with metrics that handle imbalanced datasets well. For example, accuracy is a poor metric if your classes are imbalanced.

6. Error analysis - Split the data into training and test sets and look at the precision and recall for each predicted product. Is your model doing better for some products than for others? Use this as a basis for deciding next steps.
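As a rough illustration of the last two tips, the sketch below computes accuracy alongside per-product precision and recall on a deliberately imbalanced toy example. The labels, the product names `prod_a` and `prod_b`, and the helper functions are hypothetical, and treating an undefined precision or recall as 0.0 is an assumption made for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the predicted product matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one product; 0.0 when the ratio is undefined."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical held-out labels: 9 of 10 rows are "prod_a".
y_true = ["prod_a"] * 9 + ["prod_b"]
# A degenerate model that always predicts the majority product.
y_pred = ["prod_a"] * 10

acc = accuracy(y_true, y_pred)
report = {
    label: precision_recall(y_true, y_pred, label)
    for label in sorted(set(y_true))
}

print(f"accuracy = {acc:.2f}")
for label, (precision, recall) in report.items():
    print(f"{label}: precision = {precision:.2f}, recall = {recall:.2f}")
```

Here the always-majority model reaches 0.90 accuracy while never predicting `prod_b`, which only shows up once the metrics are broken down per product.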

replied to mathsgrinds · 29 Aug 2020, 10:14
