If you are doing EDA, you have probably noticed that many duplicated rows still exist. Here is how:
The ID is not duplicated; only the features are. If you run the code below you will see it.
data[(data['Policy Start Date'] == '2010-02-11') & (data['Policy End Date'] == '2011-02-10') & (data['ProductName'] == 'CVTP') & (data['Age'] == 25)]
Car_Category should be truck for all of them.
If you look at the Iveco rows with red / red & white, they are duplicated.
Interestingly, if we take one pair of duplicated rows, one row says the target is 0 while the other says it is 1.
Because of this, the algorithm cannot tell which one to believe.
So we need to delete these duplicated rows, as they make sense neither from a real-world nor from a statistical point of view.
So @Zindi, could you tell us which of the duplicated rows to delete: the ones with target 0 or the ones with target 1?
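For anyone who wants to check the whole training set rather than one hand-picked filter, here is a minimal sketch of how the conflicting duplicates could be listed. It assumes the ID column is called 'ID' and the label column is called 'target' (neither name is confirmed in this thread, so adjust them to the real file), and the small frame at the bottom is made-up toy data used only to show the call.

```python
import pandas as pd

def find_conflicting_duplicates(df, id_col="ID", target_col="target"):
    """Return rows whose features are identical but whose targets disagree.

    id_col and target_col are assumed names; change them to match the data.
    """
    feature_cols = [c for c in df.columns if c not in (id_col, target_col)]
    # Build one string key per row from all feature values.
    # Casting to str makes missing values compare equal to each other.
    key = df[feature_cols].astype(str).apply("|".join, axis=1)
    # Count distinct target values inside each group of identical features.
    n_targets = df[target_col].groupby(key).transform("nunique")
    return df[n_targets > 1]

# Toy data (made up for illustration, not taken from the competition files).
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "Age": [25, 25, 30, 30, 40],
    "ProductName": ["CVTP", "CVTP", "Car Classic", "Car Classic", "CVTP"],
    "target": [0, 1, 0, 0, 1],
})
conflicts = find_conflicting_duplicates(toy)
print(conflicts)
```

On the toy frame only rows a and b come back: c and d are also duplicated, but their targets agree, so they are not a labelling conflict.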
When we are this close to the end of the competition, with 2 weeks remaining, no changes should be made to the data. Whatever it is, we need to be content with it, as it is the same for everyone.
They can extend the competition if the data still contains duplicated rows, though. I mean, the data has been difficult to crack as it is.
If you're really sure there are duplicated rows, you can try keeping each target [0 or 1] separately to test which yields the better result with your model. I'm sure removing duplicated training rows is not against the rules of the competition.
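A rough sketch of that experiment, in case it helps: build one cleaned training set per strategy and score your model on each with your usual cross-validation. The helper name resolve_conflicts, the column names 'ID' and 'target', and the toy frame are all assumptions made for illustration, and the "drop" option (removing both rows of a conflicting pair) is an extra variant not mentioned above.

```python
import pandas as pd

def resolve_conflicts(df, keep, id_col="ID", target_col="target"):
    """Resolve feature-duplicates whose targets disagree.

    keep=0 or keep=1 keeps only that target among conflicting rows;
    keep="drop" removes every conflicting row. Column names are assumed.
    """
    feature_cols = [c for c in df.columns if c not in (id_col, target_col)]
    # One string key per row built from all feature values.
    key = df[feature_cols].astype(str).apply("|".join, axis=1)
    # True for rows that belong to a group with more than one target value.
    conflicted = df[target_col].groupby(key).transform("nunique") > 1
    if keep == "drop":
        return df[~conflicted]
    # Keep all unconflicted rows plus the conflicting rows with the chosen target.
    return df[~conflicted | (df[target_col] == keep)]

# Toy data (made up): rows 1 and 2 share features but disagree on the target.
toy = pd.DataFrame({"ID": [1, 2, 3], "Age": [25, 25, 40], "target": [0, 1, 1]})
variants = {name: resolve_conflicts(toy, name) for name in (0, 1, "drop")}
for name, frame in variants.items():
    print(name, list(frame["ID"]))
```

Each variant can then be passed to the same training and validation routine; whichever gives the better validation score is the one to keep. Note that the validation folds should be built the same way for every variant so the scores stay comparable.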