Which is the best way to prepare the data?
1. Melt the product columns and treat it as a binary classification problem.
2. Duplicate the rows for every product the customer has bought, removing one product per row and making that product the y_label. This is similar to how the organiser prepared the test_data.
Please share your inputs. Thanks in advance.
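For anyone unsure what the two options look like in practice, here is a minimal pandas sketch. The toy data and the column names (customer_id, age, prod_a, prod_b, prod_c) are made up for illustration and are not the competition's actual schema; the loop for option 2 is only one possible reading of how the rows could be duplicated.

```python
import pandas as pd

# Hypothetical toy data: one row per customer, one 0/1 column per product.
df = pd.DataFrame({
    "customer_id": [1, 2],
    "age": [34, 51],
    "prod_a": [1, 0],
    "prod_b": [1, 1],
    "prod_c": [0, 1],
})
product_cols = ["prod_a", "prod_b", "prod_c"]

# Option 1: melt to one row per (customer, product) pair with a 0/1 target.
binary_df = df.melt(
    id_vars=["customer_id", "age"],
    value_vars=product_cols,
    var_name="product",
    value_name="bought",
)

# Option 2: one row per purchased product. The held-out product is zeroed
# in the features and becomes the multiclass label (y_label).
rows = []
for _, row in df.iterrows():
    for prod in product_cols:
        if row[prod] == 1:
            features = row.to_dict()
            features[prod] = 0          # hide the product being predicted
            features["y_label"] = prod  # target for this duplicated row
            rows.append(features)
multiclass_df = pd.DataFrame(rows)

print(binary_df)
print(multiclass_df)
```

Option 1 gives a single binary target that one model can learn with the product as a feature, while option 2 keeps the other purchased products as features and asks a multiclass model to predict the hidden one.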
Hi Sanjay, I have tried both and have not had much luck. I set it up as a binary problem for each product, ran a logit model, and got a score of about 0.06X. I then tried a multinomial model (as you said, by splitting the data into more rows) and got nearly the same score! I don't think I'm doing as well as some others :) since my best score is around 0.06X. I'm amazed that some got 0.03X. I've tried different models (using some of the variables, all of the variables, or a forward selection process), but I'm just not having the same luck. I must be missing something.
Thanks for your comment.
For me, the CV scores are better for the binary classification problem.
I got a log loss of 0.04 using a single model. Here are some tips:
1. Encoding your categorical variables - Most models can only use numeric data. Beware of encoding techniques that greatly expand the feature space, since that hurts most models (see: curse of dimensionality). How you encode the categorical variables affects your model's performance.
2. Feature engineering - Create additional features from the dataset.
3. Hyperparameter tuning - It's not always sufficient to use the default parameters for your model. Optimize the parameters based on what gives the best results on your cross-validation or development set. K-fold cross-validation is useful here, along with RandomizedSearchCV, for example.
4. Bias and variance - If your model performs poorly on both the training and test sets, it is probably not complex enough (underfitting); try a different model or reduce the regularization strength. If it does well on the training set but not on the test set (overfitting), try increasing regularization.
5. Imbalanced classes - Some models perform poorly when the classes are imbalanced. Try oversampling or undersampling techniques, and score your models with metrics that handle imbalanced datasets well; e.g., accuracy is a poor metric if your classes are imbalanced.
6. Error analysis - Split the data into training and test sets and look at the precision and recall for each predicted product. Is your model doing better on some products than on others? Use this as a basis for deciding next steps.
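To make the encoding and tuning tips more concrete, here is a rough scikit-learn sketch. It is only an illustration under assumed names: the synthetic data, the segment and age columns, and the choice of LogisticRegression are all hypothetical, not the setup behind any score mentioned in this thread.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data with one categorical and one numeric feature.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "segment": rng.choice(["a", "b", "c"], size=n),
    "age": rng.integers(18, 80, size=n),
})
# Imbalanced binary target (a minority of rows are positive).
y = ((X["segment"] == "a") & (X["age"] > 40)).astype(int)

# Tip 1: encode categoricals; one-hot is fine for low-cardinality columns.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
    ("num", StandardScaler(), ["age"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tip 3: random search over the regularization strength with stratified
# K-fold CV, scored by log loss (scikit-learn maximizes, hence the "neg_").
search = RandomizedSearchCV(
    model,
    param_distributions={"clf__C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    n_iter=4,
    scoring="neg_log_loss",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

best_log_loss = -search.best_score_
print(search.best_params_, best_log_loss)
```

Keeping the encoder inside the Pipeline means it is refit on each training fold, so the cross-validated log loss is not inflated by information from the validation fold.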
Good tips... Thanks
Could anyone please give me a clue on how to achieve this? I have been having problems with this part, and Google is not giving me good results.
Awesome tips, @darrel