
AutoInland Vehicle Insurance Claim Challenge

Helping Nigeria
$1 000 USD
Completed (over 4 years ago)
Prediction
1606 joined
632 active
Start: Mar 26, 21
Close: Jun 27, 21
Reveal: Jun 27, 21
Duplicated Rows
Notebooks · 14 Jun 2021, 04:45 · 3

If you are doing EDA, you have probably noticed that many duplicated rows still exist. Here is how:

The ID is not duplicated; only the features are. If you run the code below, you will see it.

data[(data['Policy Start Date'] == '2010-02-11') & (data['Policy End Date'] == '2011-02-10') &  (data['ProductName'] == 'CVTP') & (data['Age'] == 25)]

Car_Category should be Truck for all of the returned rows.

If you look at the Iveco rows with Red / Red & White, they are duplicated.
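
For anyone who wants to check this across the whole table instead of one filter at a time, here is a rough sketch. It runs on a tiny made-up frame, not the competition data, and assumes the label column is called `target`; the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
data = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "Age": [25, 25, 25, 40],
    "ProductName": ["CVTP", "CVTP", "CVTP", "OTHER"],
    "target": [0, 1, 0, 0],
})

# Compare rows on the features only: ID is unique by construction and
# the label is what we want to inspect afterwards.
feature_cols = [c for c in data.columns if c not in ("ID", "target")]

# keep=False marks every member of a duplicated group, not just the repeats.
dup_mask = data.duplicated(subset=feature_cols, keep=False)
dup_rows = data[dup_mask].sort_values(feature_cols)
print(dup_rows)
```

On the real file you would load the train CSV into `data` first; `feature_cols` is then everything except the ID and the label.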

Interestingly, if we take a pair of duplicated rows, one of them says the target is 0 while the other says it is 1.

Because of this, the algorithm cannot tell which one to believe.

For this reason we need to delete these duplicated rows, as they make sense neither in the real world nor statistically.
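
One possible way to do the deletion, sketched on toy data: flag every group of identical feature rows that carries more than one distinct label, drop those groups, and collapse the remaining exact repeats. The column name `target` and the values are assumptions, and dropping both sides of a conflict is only one option.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
train = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "Age": [25, 25, 30, 30, 40],
    "ProductName": ["CVTP", "CVTP", "CVTP", "CVTP", "CVTP"],
    "target": [0, 1, 0, 0, 1],
})

feature_cols = [c for c in train.columns if c not in ("ID", "target")]

# Number of distinct labels within each group of identical features.
# dropna=False keeps groups whose features contain missing values.
n_labels = train.groupby(feature_cols, dropna=False)["target"].transform("nunique")
conflicting = n_labels > 1

# Drop the contradictory groups, then keep one row per remaining group.
cleaned = train[~conflicting].drop_duplicates(subset=feature_cols)
print(cleaned)
```

Here rows `a` and `b` disagree on the label and are both removed, while `c` and `d` agree and are collapsed into one row.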

So @Zindi, could you tell us which ones to delete: the duplicated rows with target 0 or those with target 1?

Discussion 3 answers

When we are this close to the end of the competition, with 2 weeks remaining, no changes should be made to the data. Whatever it is, we need to be content with it, as it's the same for everyone.

14 Jun 2021, 07:36
Upvotes 0
University of lagos

They can extend the competition if the data still contains duplicated rows, though. I mean, the data has been difficult to crack as it is.

University of lagos

If you're really sure there are duplicated rows, you can try keeping each target [0 or 1] separately to test which yields the better result with your model. I'm sure removing duplicate train values is not against the rules of the competition.
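
A rough sketch of that experiment, on toy data: build two versions of the training set, one that keeps the 0-labelled row in every contradictory group and one that keeps the 1-labelled row. The helper `resolve_conflicts` and the column name `target` are hypothetical names used only for illustration.

```python
import pandas as pd

def resolve_conflicts(df, feature_cols, keep_label, target_col="target"):
    """Keep only rows labelled keep_label inside groups of identical
    features that carry both labels; leave other groups untouched."""
    n_labels = df.groupby(feature_cols, dropna=False)[target_col].transform("nunique")
    conflicting = n_labels > 1
    keep = ~conflicting | (df[target_col] == keep_label)
    # After resolution each group has a single label, so one row is enough.
    return df[keep].drop_duplicates(subset=feature_cols)

# Toy stand-in for the training data (values are made up for illustration).
train = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "Age": [25, 25, 30, 30, 40],
    "ProductName": ["CVTP", "CVTP", "CVTP", "CVTP", "CVTP"],
    "target": [0, 1, 0, 0, 1],
})
feature_cols = ["Age", "ProductName"]

train_keep0 = resolve_conflicts(train, feature_cols, keep_label=0)
train_keep1 = resolve_conflicts(train, feature_cols, keep_label=1)
print(train_keep0)
print(train_keep1)
```

You would then fit the same model on each version and compare the scores on the same validation split.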

14 Jun 2021, 09:52
Upvotes 0