
AutoInland Vehicle Insurance Claim Challenge

Helping Nigeria

$1 000 USD

Completed (almost 5 years ago)


Start

Mar 26, 21

Jun 27, 21

Reveal

Jun 27, 21

AIBoss

Duplicated Rows

Notebooks · 14 Jun 2021, 04:45 · 3

If you are doing EDA, you have probably noticed that many duplicated rows still exist. Here is how:

The ID is not duplicated; only the features are. If you run the code below, you will see it.

data[(data['Policy Start Date'] == '2010-02-11') & (data['Policy End Date'] == '2011-02-10') & (data['ProductName'] == 'CVTP') & (data['Age'] == 25)]
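The filter above inspects one group by hand. A more general check could look like the sketch below. It assumes the identifier column is named `ID` and the label column `target`; neither name is confirmed in this post, and the toy frame is made up purely for illustration.

```python
import pandas as pd

def find_feature_duplicates(data, id_col="ID", target_col="target"):
    """Return rows whose feature values repeat in at least one other row.

    id_col and target_col are assumed column names, not confirmed by the dataset.
    """
    # Compare on features only: the ID is unique and the label may differ.
    feature_cols = [c for c in data.columns if c not in (id_col, target_col)]
    # keep=False marks every member of a duplicated group, not just the repeats.
    mask = data.duplicated(subset=feature_cols, keep=False)
    return data[mask].sort_values(feature_cols)

# Made-up toy data, only to show the behaviour.
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "Policy Start Date": ["2010-02-11", "2010-02-11", "2010-03-01", "2010-02-11"],
    "ProductName": ["CVTP", "CVTP", "Car Classic", "CVTP"],
    "Age": [25, 25, 40, 31],
    "target": [0, 1, 0, 0],
})
dups = find_feature_duplicates(toy)
print(dups)
```

Here rows `a` and `b` are returned because they share every feature value even though their IDs differ.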

Car_Category should be Truck for all of these rows.

If you look at the Iveco rows with colour red / red & white, they are duplicated.

Interestingly, if we take one pair of duplicated rows, one row says the target is 0 while the other says it is 1.

Because of this, the algorithm has no way of knowing which one to believe.

We therefore need to delete these duplicated rows, as they make sense neither in the real world nor statistically.
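To see how many duplicated groups actually disagree on the label, something like the following sketch may help. It assumes columns named `ID` and `target` (not confirmed here), uses a made-up toy frame, and relies on `groupby(..., dropna=False)`, which needs a reasonably recent pandas version.

```python
import pandas as pd

def find_label_conflicts(data, id_col="ID", target_col="target"):
    """Return rows belonging to feature-identical groups with more than one label.

    id_col and target_col are assumed column names.
    """
    feature_cols = [c for c in data.columns if c not in (id_col, target_col)]
    # For every row, count the distinct labels inside its group of identical features.
    n_labels = data.groupby(feature_cols, dropna=False)[target_col].transform("nunique")
    return data[n_labels > 1]

# Made-up toy data: a/b conflict, d/e are duplicates that agree, c is unique.
toy_conflicts = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "ProductName": ["CVTP", "CVTP", "Car Classic", "Muuve", "Muuve"],
    "Age": [25, 25, 40, 31, 31],
    "target": [0, 1, 0, 1, 1],
})
conflicts = find_label_conflicts(toy_conflicts)
print(conflicts)
```

Only `a` and `b` are flagged; `d` and `e` are duplicates too, but they carry the same label, so they do not contradict each other.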

So @Zindi, could you tell us which of the duplicated rows to delete: the ones with target 0 or the ones with target 1?

Discussion 3 answers

ravinder

When we are this close to the end of the competition, with two weeks left, no changes should be made to the data. Whatever it is, we need to be content with it, as it is the same for everyone.

14 Jun 2021, 07:36

Upvotes 0

kolatimiDave

University of Lagos

They can extend the competition if the data still contains duplicated rows, though. I mean, the data has been difficult to crack so far.

replied to ravinder · 14 Jun 2021, 09:49

Upvotes 0

kolatimiDave

University of Lagos

If you're really sure there are duplicated rows, you can try keeping each target (0 or 1) separately to test which yields the better result with your model. I'm sure removing duplicate training rows is not against the rules of the competition.
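One way to set up that comparison is sketched below. It builds two versions of the training data, one preferring target 0 and one preferring target 1 within each duplicated group. The column names `ID` and `target`, the helper name `resolve_duplicates`, and the toy data are all assumptions for illustration.

```python
import pandas as pd

def resolve_duplicates(data, keep_target, id_col="ID", target_col="target"):
    """Keep one row per feature-identical group, preferring rows labelled keep_target.

    Hypothetical helper; id_col and target_col are assumed column names.
    """
    feature_cols = [c for c in data.columns if c not in (id_col, target_col)]
    # Put the preferred label first so drop_duplicates(keep="first") retains it.
    ordered = data.sort_values(target_col, ascending=(keep_target == 0), kind="mergesort")
    deduped = ordered.drop_duplicates(subset=feature_cols, keep="first")
    # Restore the original row order.
    return deduped.sort_index()

# Made-up toy data: a and b share features but disagree on the label.
toy_train = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "ProductName": ["CVTP", "CVTP", "Car Classic"],
    "Age": [25, 25, 40],
    "target": [0, 1, 0],
})
train_keep0 = resolve_duplicates(toy_train, keep_target=0)
train_keep1 = resolve_duplicates(toy_train, keep_target=1)
print(train_keep0)
print(train_keep1)
```

Each variant can then be fed to the same model and validation split, and the two scores compared. Only the training rows are touched here.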

14 Jun 2021, 09:52

Upvotes 0
