The Zimnat Insurance Assurance Challenge by #ZindiWeekendz
$300 USD
Predict when an insurance policy will lapse in Zimbabwe
314 data scientists enrolled, 97 on the leaderboard
Insurance · Prediction · Structured
Africa
22 May 12:00 (60 hours)
Difficulties sorting and merging the datasets
published 22 May 2020, 22:52

I am having trouble merging the datasets so that the final dataset I use to train my algorithm has only unique policy IDs, just as in the training set. In the client dataset, for example, policy IDs occur multiple times, and the same is true of the policy and payment datasets. Any tips? Thanks in advance :-)

Hey, try this:

df = df.drop_duplicates(subset='Policy ID')  # keeps the first row per Policy ID by default

Thanks for your reply! However, then a lot of information will just be removed, right? And all that information could be of interest! Also, your command simply keeps an arbitrary row, right?

If you use pandas, use .groupby('Policy ID').agg(...) and then merge the datasets together on Policy ID.
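For example, a minimal sketch (the payments dataframe and the Amount column are just placeholders for whatever is in the actual files):

import pandas as pd

# Collapse the payment history to one row per policy, then attach it to train.
payments_agg = payments.groupby('Policy ID').agg({'Amount': 'sum'}).reset_index()
train = train.merge(payments_agg, on='Policy ID', how='left')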

Hi all,

I have a question: how do I get Lapse = 0, since all the Lapse values in train are 1?

import numpy as np

# Rows where both Lapse and Lapse Year are missing ("?") get 0; everything else gets 1.
train['Lapse'] = np.where((train.Lapse == "?") & (train['Lapse Year'] == "?"), 0, 1)

This should work
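As a quick sanity check afterwards, you can confirm that both classes are now present:

train['Lapse'].value_counts()  # should show counts for both 0 and 1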

Thanks for your reply! What agg option do you use then? You can't sum the sex variable or take its mean, for example, I guess... So what argument do you use?

Use 'first' or 'last' for the text variables.
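Something like this, for example (the column names are illustrative, not from the actual client file):

# Numeric columns get a numeric aggregate; text columns keep one value per policy.
client_agg = client.groupby('Policy ID').agg({
    'Age': 'mean',    # numeric example
    'Sex': 'first',   # text example: take the first occurrence
}).reset_index()
train = train.merge(client_agg, on='Policy ID', how='left')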