🩺 Data Talk: Duplicated ID in Train data

SUA Outsmarting Outbreaks Challenge


sys_ts__


Data · 24 Dec 2024, 16:36 · 9

In Train.csv I found more than 4000 rows that share their ID with at least one other row. Each duplicated ID appears between 2 and 8 times. Rows in the same ID group also have the same month, year, latitude and longitude, but the problem is that their Total values differ.
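A minimal sketch of how the duplicated groups described above could be located and summarised with pandas. The toy frame stands in for Train.csv: the `ID` and `Total` column names come from this thread, while the IDs and values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for Train.csv (hypothetical IDs and values).
df_train = pd.DataFrame({
    "ID": ["ID_a_10_2021_Malaria", "ID_a_10_2021_Malaria", "ID_b_10_2021_Malaria",
           "ID_c_1_2022_Diarrhea", "ID_c_1_2022_Diarrhea", "ID_c_1_2022_Diarrhea"],
    "Total": [0, 12, 5, 3, 3, 7],
})

# keep=False flags every member of a duplicated group, not only the repeats.
dup_rows = df_train[df_train.duplicated(subset="ID", keep=False)]

# Per-ID summary: group size and number of distinct Total values.
summary = dup_rows.groupby("ID")["Total"].agg(n_rows="size", n_distinct_totals="nunique")

# IDs whose rows disagree on Total are the ones that need an aggregation rule.
conflicting_ids = summary.index[summary["n_distinct_totals"] > 1].tolist()
print(summary)
```

Groups where `n_distinct_totals` is 1 are plain repeats and can be dropped safely; only the conflicting IDs need a decision about Total.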

If I want to keep only one row per ID, how should I handle the Total column? Should I take the largest Total in each group, or should I sum the Total values within each group?

Thanks.

Discussion 9 answers

Koleshjr

Multimedia university of kenya

You also have the Disease column, which you should consider.

24 Dec 2024, 16:46

Upvotes 0

sys_ts__

Yes, they have the same disease value too.

This is confusing. There are many groups of rows with the same ID, Disease, Month, Year, Latitude, and Longitude values but different Total values.

Here is one you can check:

df_train[df_train.ID == 'ID_00cd8292-dd85-4fa3-8148-9592e88a1651_10_2021_Malaria']

replied to Koleshjr · 24 Dec 2024, 16:52

Upvotes 0

Koleshjr

Multimedia university of kenya

I guess you should sum all of them, since they belong to the same group.

replied to sys_ts__ · 24 Dec 2024, 16:58

Upvotes 0

sys_ts__

Ok, thanks

replied to Koleshjr · 24 Dec 2024, 17:03

Upvotes 0

Koleshjr

Multimedia university of kenya

what did you end up doing?

replied to sys_ts__ · 25 Dec 2024, 11:33

Upvotes 0

sys_ts__

I'm still trying out some aggregation methods (sum/total, mean, median) and looking for the most effective one. It would be great if Zindi gave a more detailed explanation of this.
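For anyone comparing the same options, here is a rough sketch of collapsing each ID group to one row under different aggregations. It assumes the non-Total columns are identical within a group, as reported earlier in the thread, so it keeps the first value for those. The `dedupe` helper and the toy data are hypothetical.

```python
import pandas as pd

# Toy stand-in for Train.csv (hypothetical IDs and values).
df_train = pd.DataFrame({
    "ID": ["ID_a", "ID_a", "ID_b", "ID_c", "ID_c", "ID_c"],
    "Disease": ["Malaria", "Malaria", "Malaria", "Diarrhea", "Diarrhea", "Diarrhea"],
    "Total": [0, 12, 5, 3, 3, 7],
})

# Side-by-side view of what each candidate rule would produce per ID.
candidates = df_train.groupby("ID")["Total"].agg(["sum", "max", "mean", "median"])
print(candidates)


def dedupe(df: pd.DataFrame, how: str) -> pd.DataFrame:
    """Collapse each ID to a single row, aggregating Total with `how`."""
    # Assumption: every other column is constant within an ID group.
    agg = {col: "first" for col in df.columns if col not in ("ID", "Total")}
    agg["Total"] = how
    return df.groupby("ID", as_index=False).agg(agg)


deduped_sum = dedupe(df_train, "sum")
deduped_max = dedupe(df_train, "max")
```

Which rule works best is an empirical question; each deduplicated frame can be run through the same validation split and scored against the others.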

replied to Koleshjr · 26 Dec 2024, 01:09

Upvotes 0

Koleshjr

Multimedia university of kenya

oh nice

replied to sys_ts__ · 26 Dec 2024, 10:45

Upvotes 0

brandenkmurray

Would be great to get a response from @Zindi on this. Is this a mistake that was made when producing the dataset?

13 Jan 2025, 02:00

Upvotes 2

just_one_more_epoch

Yes. For 'Disease == "Diarrhea" and Location == "ID_00cd8292-dd85-4fa3-8148-9592e88a1651"' I see both zero and positive Total values for the same location-disease-date combinations. Not sure, but it is probably better to take the non-zero values for now.
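One possible way to express the "prefer non-zero" idea in pandas is sketched below. It assumes Total is a non-negative count, so a group maximum above zero means the group has at least one positive row. The toy data is made up. Groups that contain only zeros are left untouched.

```python
import pandas as pd

# Toy stand-in for Train.csv (hypothetical IDs and values).
df_train = pd.DataFrame({
    "ID": ["ID_a", "ID_a", "ID_b", "ID_b", "ID_c"],
    "Total": [0, 12, 0, 0, 4],
})

# True for every row whose ID group has at least one positive Total.
has_positive = df_train.groupby("ID")["Total"].transform("max") > 0

# A zero row is dropped only when its group also holds a positive value.
redundant_zero = (df_train["Total"] == 0) & has_positive
filtered = df_train[~redundant_zero]
print(filtered)
```

Groups that still have several rows after this filter would need one of the aggregation rules discussed above.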

replied to brandenkmurray · 31 Jan 2025, 11:49

Upvotes 0
