I noticed that there are duplicate records in both the training and test datasets. Duplicates in the training set are straightforward to handle by removing them, but duplicates in the test set are more surprising. The rows are identical once the ID column is excluded from the comparison. Do these duplicates exist by design, or were they unintentionally introduced during data preparation?
Data Conflict: Multiple records are identical apart from their ID, yet they show contradictory values for adopted_within_07_days, adopted_within_90_days, or adopted_within_120_days.
Clarification: Does "adoption within x days" refer to adoption of the specific topics covered in a training session, rather than a one-time milestone triggered by the initial training session?
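For anyone who wants to reproduce the checks described above, here is a rough sketch in pandas. It assumes the identifier column is called `ID` (adjust to the real name), and the `topic` and `region` columns in the toy frame are made up purely for illustration. It counts rows that are identical once the ID is ignored, and separately counts groups of rows that share all features but disagree on at least one adoption label.

```python
import pandas as pd

# Target columns named in the post; a test file may contain none of them.
LABEL_COLS = [
    "adopted_within_07_days",
    "adopted_within_90_days",
    "adopted_within_120_days",
]


def duplicate_report(df, id_col="ID", label_cols=LABEL_COLS):
    """Summarise duplicates and label conflicts, ignoring the ID column.

    `id_col` is an assumed column name, not taken from the dataset docs.
    """
    label_cols = [c for c in label_cols if c in df.columns]
    feature_cols = [c for c in df.columns if c != id_col and c not in label_cols]
    compare_cols = feature_cols + label_cols

    # Rows that match another row on everything except the ID.
    duplicate_rows = int(df.duplicated(subset=compare_cols, keep=False).sum())

    # Feature groups that carry more than one distinct label combination.
    conflicting_groups = 0
    if label_cols:
        combos_per_group = (
            df.drop_duplicates(subset=compare_cols)
            .groupby(feature_cols, dropna=False)
            .size()
        )
        conflicting_groups = int((combos_per_group > 1).sum())

    return {
        "rows": len(df),
        "duplicate_rows": duplicate_rows,
        "conflicting_groups": conflicting_groups,
    }


# Toy data (hypothetical columns and values) to show the shape of the output.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "topic": ["a", "a", "a", "b", "c"],
        "region": ["x", "x", "x", "y", "z"],
        "adopted_within_07_days": [0, 0, 1, 0, 1],
    }
)
report = duplicate_report(toy)
print(report)
```

Running the same function on the train and test files separately gives comparable counts. On a file without label columns, only `duplicate_rows` is meaningful.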
Yes, that's a fantastic observation, and it's true. I did a time-sensitive train/validation split and saw duplicates as follows: a) Train: 4054 / 7163 b) Val: 4098 / 6373 c) Test: 4322 / 5621