In Train.csv I found more than 4000 rows that share an ID with at least one other row, with 2 to 8 rows per duplicated ID. Rows that share an ID also have the same month, year, latitude and longitude, but the problem is that their Total values differ.
If I want to keep only one row for each ID, how should I handle the Total column? Should I take the largest Total in each group, or the sum of Total across the rows that share an ID?
Thanks.
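A minimal sketch of how the duplicated IDs and their group sizes could be counted with pandas. The toy frame below stands in for Train.csv and its values are made up; the column names `ID` and `Total` are assumed to match the real file.

```python
import pandas as pd

# Toy stand-in for Train.csv; rows and values are invented for illustration.
df_train = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "Total": [0, 5, 3, 2, 2, 7],
})

# keep=False marks every row whose ID occurs more than once,
# not just the second and later occurrences.
dup_rows = df_train[df_train.duplicated("ID", keep=False)]

# Number of rows for each duplicated ID.
group_sizes = dup_rows.groupby("ID").size()

print(len(dup_rows), "rows share an ID with another row")
print(group_sizes.min(), "to", group_sizes.max(), "rows per duplicated ID")
```

On the real data, replacing the toy frame with `pd.read_csv("Train.csv")` should reproduce the counts described above, assuming the file loads without extra options.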
You also have the Disease column, which you should take into account.
Yes, they have the same disease value too.
This is confusing: many groups of rows share the same ID, Disease, Month, Year, Latitude, and Longitude values but have different Total values.
One example you can check:
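One way to confirm that Total is the only column that varies within an ID is to count distinct values per column in each group. This is a sketch on made-up rows; the column names are assumed to match Train.csv.

```python
import pandas as pd

# Toy stand-in for Train.csv; rows and values are invented for illustration.
df_train = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c"],
    "Disease": ["Malaria", "Malaria", "Cholera", "Diarrhea", "Diarrhea"],
    "Month": [10, 10, 1, 3, 3],
    "Year": [2021, 2021, 2020, 2022, 2022],
    "Total": [0, 5, 3, 2, 7],
})

# Number of distinct values each column takes within each ID group.
distinct = df_train.groupby("ID").nunique()

# Columns that take more than one value inside at least one ID group.
varying_cols = [col for col in distinct.columns if (distinct[col] > 1).any()]
print(varying_cols)
```

If the list contains anything other than `Total` on the real data, the duplicated rows differ in more than the target and deserve a closer look before aggregating.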
df_train[df_train.ID == 'ID_00cd8292-dd85-4fa3-8148-9592e88a1651_10_2021_Malaria']
I guess you should take the sum of all of them, since they belong to the same group.
Ok, thanks
what did you end up doing?
I'm still trying out some aggregation methods (sum, total, mean, median), and I'm still looking for the most effective one. It would be great if there was a more detailed explanation from Zindi regarding this.
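A minimal sketch of how the candidate aggregates could be compared side by side, then one of them used to collapse the data to one row per ID. The toy rows are made up, the column names are assumed, `max` is included because taking the largest value was one of the options raised earlier, and choosing `sum` for the final frame is only an example, not a recommendation.

```python
import pandas as pd

# Toy stand-in for Train.csv; rows and values are invented for illustration.
df_train = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c", "c"],
    "Disease": ["Malaria", "Malaria", "Cholera", "Diarrhea", "Diarrhea", "Diarrhea"],
    "Total": [0, 5, 3, 2, 2, 7],
})

# One row per ID with each candidate aggregate of Total in its own column.
candidates = df_train.groupby("ID")["Total"].agg(["sum", "max", "mean", "median"])
print(candidates)

# Collapse to one row per ID: aggregate Total, and take the first value of every
# other column (safe only if those columns are identical within a group).
agg_spec = {col: "first" for col in df_train.columns if col not in ("ID", "Total")}
agg_spec["Total"] = "sum"  # swap in "max", "mean" or "median" to try the others
deduped = df_train.groupby("ID", as_index=False).agg(agg_spec)
print(deduped)
```

Each variant of `deduped` can then be fed to the same validation setup to see which aggregate scores best.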
oh nice
Would be great to get a response from @Zindi on this. Is this a mistake that was made when producing the dataset?
Yes. For Disease == "Diarrhea" and Location == "ID_00cd8292-dd85-4fa3-8148-9592e88a1651" I see both zero and positive values for the same location-disease-date combinations. Not sure, but it is probably better to take the non-zero values for now.
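A sketch of the "take the non-zero values" idea, assuming Total is never negative: drop a zero row only when another row with the same ID has a positive Total. The toy rows are made up and the column names are assumed.

```python
import pandas as pd

# Toy stand-in for Train.csv; rows and values are invented for illustration.
df_train = pd.DataFrame({
    "ID": ["a", "a", "b", "c", "c"],
    "Total": [0, 5, 0, 2, 7],
})

# True for every row whose ID group contains at least one positive Total
# (relies on the assumption that Total is non-negative).
group_has_positive = df_train.groupby("ID")["Total"].transform("max") > 0

# A zero is treated as redundant only when its group also has a positive value;
# IDs whose only value is zero are kept unchanged.
is_redundant_zero = (df_train["Total"] == 0) & group_has_positive
filtered = df_train[~is_redundant_zero]
print(filtered)
```

IDs with several different positive values still have more than one row after this filter, so an aggregation step is still needed for them.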