
SUA Outsmarting Outbreaks Challenge

Helping Tanzania, United Republic of
$12 500 USD + AWS credits
Completed (over 1 year ago)
Prediction
810 joined
390 active
Start
Dec 06, 24
Close
Jan 31, 25
Reveal
Feb 01, 25
Duplicated ID in Train data
Data · 24 Dec 2024, 16:36 · 9

In Train.csv I found more than 4000 duplicate rows, meaning rows that share an ID with at least one other row. Each duplicated ID appears between 2 and 8 times, and all rows in a group have the same month, year, latitude and longitude. The problem is that rows within the same ID group have different Total values.

If I want to keep only one row per ID, how should I handle the Total column? Should I take the largest Total in each group, or the sum of Total across each group?

thanks.
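For anyone who wants to reproduce the count, here is a minimal sketch of how the duplicated IDs and their conflicting Total values could be listed. It runs on a small invented frame standing in for Train.csv; the ID and Total column names come from this thread, while the toy values and the `summary` / `conflicting` names are only illustrative.

```python
import pandas as pd

# Toy stand-in for Train.csv: column names follow the thread, values are invented.
df_train = pd.DataFrame({
    "ID": ["a_1_2021_Malaria", "a_1_2021_Malaria", "b_1_2021_Malaria",
           "c_2_2021_Cholera", "c_2_2021_Cholera"],
    "Total": [5, 12, 3, 7, 7],
})

# Every row whose ID occurs more than once (keep=False marks all copies).
dupes = df_train[df_train.duplicated("ID", keep=False)]

# Per-ID summary: how many rows, and how many distinct Total values.
summary = dupes.groupby("ID")["Total"].agg(n_rows="size", n_distinct_totals="nunique")

# IDs whose duplicate rows disagree on Total.
conflicting = summary[summary["n_distinct_totals"] > 1]
print(conflicting)
```

On the real file, `len(dupes)` would give the number of duplicate rows and `summary["n_rows"].describe()` the range of group sizes.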

Discussion 9 answers
Koleshjr
Multimedia university of kenya

You also have the Disease column, which you should consider.

24 Dec 2024, 16:46
Upvotes 0

Yes, they have the same disease value too.

This is confusing: many groups of rows share the same ID, Disease, Month, Year, Latitude, and Longitude values but have different Total values.

You can check one of them:

df_train[df_train.ID == 'ID_00cd8292-dd85-4fa3-8148-9592e88a1651_10_2021_Malaria']
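To check this across all IDs rather than one at a time, something like the following sketch could work. It assumes the columns are named exactly Disease, Month, Year, Latitude and Longitude (the real casing in Train.csv may differ), and it uses invented toy data in which one ID is deliberately inconsistent.

```python
import pandas as pd

# Toy data with assumed column names; ID "b" has two different Month values.
df_train = pd.DataFrame({
    "ID": ["a", "a", "b", "b"],
    "Disease": ["Malaria", "Malaria", "Cholera", "Cholera"],
    "Month": [10, 10, 3, 4],
    "Year": [2021, 2021, 2021, 2021],
    "Latitude": [-6.1, -6.1, -3.4, -3.4],
    "Longitude": [35.7, 35.7, 36.7, 36.7],
    "Total": [1, 5, 2, 2],
})

key_cols = ["Disease", "Month", "Year", "Latitude", "Longitude"]

# Number of distinct values of each key column within each ID.
n_variants = df_train.groupby("ID")[key_cols].nunique()

# IDs where any key column takes more than one value.
inconsistent = n_variants[(n_variants > 1).any(axis=1)]
print(inconsistent)
```

If `inconsistent` comes back empty on the real data, the rows within each ID differ only in Total.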

Koleshjr
Multimedia university of kenya

I guess you should take the sum of all of them, since they belong to the same group.

Koleshjr
Multimedia university of kenya

what did you end up doing?

I'm still trying out some aggregation methods (sum, total, mean, median), and I'm still looking for the most effective one. It would be great if there was a more detailed explanation from Zindi regarding this.
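As a rough sketch of how these candidates could be compared side by side, the snippet below collapses each ID to one row under several aggregations. The `dedupe` helper is hypothetical, the data is invented, and max is included only because it was one of the options raised in the original question. Which aggregation is actually right can only be judged by validation score or by an answer from the organisers.

```python
import pandas as pd

# Toy stand-in for Train.csv; values are invented.
df_train = pd.DataFrame({
    "ID": ["x", "x", "x", "y"],
    "Disease": ["Malaria", "Malaria", "Malaria", "Cholera"],
    "Total": [0, 4, 8, 3],
})

# One row per ID, one column per candidate aggregation of Total.
candidates = df_train.groupby("ID")["Total"].agg(["sum", "mean", "median", "max"])
print(candidates)


def dedupe(df, how):
    """Collapse duplicate IDs: aggregate Total with `how`, keep the first
    value of every other column (assumed identical within an ID)."""
    other_cols = [c for c in df.columns if c not in ("ID", "Total")]
    spec = {c: "first" for c in other_cols}
    spec["Total"] = how
    return df.groupby("ID", as_index=False).agg(spec)


train_sum = dedupe(df_train, "sum")
train_max = dedupe(df_train, "max")
```

Each deduplicated frame can then be fed to the same model and cross-validation split to see which one scores best.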

Koleshjr
Multimedia university of kenya

oh nice

Would be great to get a response from @Zindi on this. Is this a mistake that was made when producing the dataset?

13 Jan 2025, 02:00
Upvotes 2

Yes. For Disease == "Diarrhea" and Location == "ID_00cd8292-dd85-4fa3-8148-9592e88a1651" I see both zero and positive Total values for the same location-disease-date combination. Not sure, but it is probably better to take the non-zero values for now.
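One way this "prefer non-zero" idea could be sketched is shown below. It assumes Total is never negative, keeps groups that contain only zeros (so genuine zero counts are not dropped), and uses max purely as a placeholder for whatever aggregation is chosen when several positive values remain. The data is invented.

```python
import pandas as pd

# Toy data: "p" mixes zeros with a positive value, "q" is all zeros, "r" is unique.
df_train = pd.DataFrame({
    "ID": ["p", "p", "p", "q", "q", "r"],
    "Total": [0, 6, 0, 0, 0, 2],
})

# True for every row whose ID has at least one positive Total (assumes Total >= 0).
has_positive = df_train.groupby("ID")["Total"].transform("max") > 0

# Drop zero rows only inside groups that also contain a positive value.
keep = (df_train["Total"] > 0) | ~has_positive
filtered = df_train[keep]

# Remaining duplicates still need some aggregation; max is a placeholder choice.
deduped = filtered.groupby("ID", as_index=False)["Total"].max()
print(deduped)
```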