I noticed that the education and rental mortgage classes each have only two rows in the training dataset, so I cannot fit my data with stratified or K-fold cross-validation. What is your opinion, and how would you solve this?
If we use the train set to predict the missing target on the extra data and then train a model on those predictions, there is a big chance of training the model on wrong labels.
I used StratifiedShuffleSplit and it worked. StratifiedKFold will also work if you don't set shuffle=True, since shuffling creates random folds each time, and education and rental mortgage might both end up in just one fold. Also try encoding the target class; that will help.
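To make the suggestion above concrete, here is a minimal sketch assuming scikit-learn and NumPy are installed. The target values, class counts, and variable names are made up for illustration (only "education" and "rental mortgage" with two rows each come from the thread), so treat it as a starting point rather than the exact setup used in the competition.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder

# Hypothetical target: two frequent classes plus two classes with only two rows each.
y_raw = np.array(
    ["car"] * 20 + ["home"] * 16 + ["education"] * 2 + ["rental mortgage"] * 2
)
X = np.arange(len(y_raw)).reshape(-1, 1)  # placeholder feature matrix

# Encode the string labels as integers, as suggested in the post above.
encoder = LabelEncoder()
y = encoder.fit_transform(y_raw)

# Each split is drawn independently, stratified on the encoded target.
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
splits = list(sss.split(X, y))

for train_idx, test_idx in splits:
    # Per-class row counts in the train and test parts of this split.
    print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```

With a 50/50 split, each two-row class contributes one row to the train part and one to the test part. A smaller test_size can leave a rare class out of the test part entirely, so it is worth printing the counts as above.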
This might not be entirely right, though. Try using the train set to predict the missing target on the extra data before you do the main work. This might help.
Thanks, I will look into that.
Yes, that is my observation. I am using train_test_split.
You can use StratifiedKFold; it will work for you and is better suited to our case.
There are two categories in the target variable that occur only twice.
Could you please clarify?
If you check the target variable distribution, Health and Rent mortgage each have only 2 rows in the training data.
I believe this is because the dataset is imbalanced, which can be addressed with SMOTE, downsampling, or oversampling.
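As a rough illustration of the oversampling option, here is a standard-library sketch of plain random oversampling (duplicating minority rows). It is not SMOTE, which builds synthetic rows from nearest neighbours and may not apply as-is to a class with only two rows. The function name random_oversample and the toy labels are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        # Sample with replacement; k is 0 for the largest class.
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Hypothetical imbalanced target with two classes of only two rows each.
labels = ["car"] * 20 + ["home"] * 16 + ["education"] * 2 + ["rental mortgage"] * 2
rows = list(range(len(labels)))  # placeholder rows

bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))
```

If you go this route, oversample only the training part after splitting, so that copies of the same row do not end up in both train and validation.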