Data Talk: Duplicates

DigiCow Farmer Training Adoption Challenge

Helping Kenya

€8 250 EUR

Completed (4 months ago)

Skills you will learn

Data analysis

Classification

901 joined

377 active


Start

Jan 28, 26

Mar 01, 26

Reveal

Mar 02, 26

KinsAI

Duplicates

Data · 6 Feb 2026, 16:04 · 2

I noticed duplicate records in both the training and test datasets. Duplicates in the training set are straightforward to handle by removing them, but duplicates in the test set are more surprising. The rows are identical once the ID column is excluded. Are these duplicates there by design, or were they introduced unintentionally during data preparation?
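For anyone who wants to reproduce the check, here is a minimal sketch in pandas. It assumes the identifier column is literally named "ID"; the toy frame and its "county" and "topic" columns are made up for illustration and are not the real competition fields.

```python
import pandas as pd


def duplicate_mask(df: pd.DataFrame, id_col: str = "ID") -> pd.Series:
    """Flag every row that has at least one twin once id_col is ignored."""
    feature_cols = [c for c in df.columns if c != id_col]
    # keep=False marks all members of a duplicate group, not only the repeats
    return df.duplicated(subset=feature_cols, keep=False)


# Toy data with hypothetical columns (assumption: not the real schema)
toy = pd.DataFrame(
    {
        "ID": ["a", "b", "c", "d"],
        "county": ["Nakuru", "Nakuru", "Kiambu", "Nakuru"],
        "topic": ["feeding", "feeding", "feeding", "breeding"],
    }
)

mask = duplicate_mask(toy)
print(f"{int(mask.sum())} of {len(toy)} rows are duplicated when ID is ignored")
```

Running the same function on the train and test files separately gives the per-file counts. Using keep="first" instead of keep=False would count only the extra copies.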

Discussion 2 answers

KinsAI

Data Conflict: Multiple records are identical apart from their ID, yet they carry contradictory adopted_within_07_days, adopted_within_90_days, or adopted_within_120_days labels.

Clarification: Does "adoption within x days" refer to adoption of the specific topics covered in the training, rather than a one-time milestone triggered by the initial training session?

6 Feb 2026, 17:22

Upvotes 1
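The label conflict described in this answer can be surfaced with a group-by over the non-ID, non-target columns. This is only a sketch: the three target names come from the post, while the "ID" column name and the toy feature columns are assumptions.

```python
import pandas as pd

TARGETS = [
    "adopted_within_07_days",
    "adopted_within_90_days",
    "adopted_within_120_days",
]


def conflicting_groups(
    df: pd.DataFrame, id_col: str = "ID", targets: list = TARGETS
) -> pd.DataFrame:
    """Return feature groups whose rows disagree on at least one target."""
    feature_cols = [c for c in df.columns if c != id_col and c not in targets]
    # Number of distinct values per target inside each group of identical features
    distinct = df.groupby(feature_cols, dropna=False)[targets].nunique()
    # A group conflicts if any target takes more than one value
    return distinct[(distinct > 1).any(axis=1)]


# Toy data with hypothetical feature columns (assumption: not the real schema)
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "county": ["Nakuru", "Nakuru", "Kiambu", "Kiambu"],
        "topic": ["feeding", "feeding", "breeding", "breeding"],
        "adopted_within_07_days": [0, 1, 0, 0],
        "adopted_within_90_days": [1, 1, 0, 0],
        "adopted_within_120_days": [1, 1, 0, 0],
    }
)

conflicts = conflicting_groups(toy)
print(conflicts)
```

In the toy frame only the Nakuru/feeding pair disagrees, and only on adopted_within_07_days. How to treat such groups (drop them, keep the majority label, or average) is a modelling choice that the thread does not settle.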

natarajanlalgudi

Yes, that's a fantastic observation, and it's true. I did a time-sensitive train/validation split and saw duplicates as follows:
a) Train: 4054 / 7163
b) Val: 4098 / 6373
c) Test: 4322 / 5621

replied to KinsAI · 15 Feb 2026, 04:04

Upvotes 0
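A rough sketch of how per-split counts like the ones in the reply above could be produced. Several things here are assumptions: the reply does not say which date column or cutoff was used, so "training_date" and the cutoff are hypothetical, and the figures are read as duplicated rows over total rows, which the reply does not state explicitly.

```python
import pandas as pd


def time_split(df: pd.DataFrame, date_col: str, cutoff: str):
    """Rows before the cutoff go to train, rows on or after it to validation."""
    dates = pd.to_datetime(df[date_col])
    is_train = dates < pd.Timestamp(cutoff)
    return df[is_train], df[~is_train]


def duplicate_count(df: pd.DataFrame, id_col: str = "ID") -> int:
    """Count rows that have at least one twin once id_col is ignored."""
    cols = [c for c in df.columns if c != id_col]
    return int(df.duplicated(subset=cols, keep=False).sum())


# Toy data; "training_date" and "topic" are hypothetical column names
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "training_date": [
            "2024-01-05", "2024-01-05", "2024-02-10", "2024-03-01", "2024-03-01",
        ],
        "topic": ["feeding", "feeding", "breeding", "health", "health"],
    }
)

train, val = time_split(toy, "training_date", cutoff="2024-03-01")
train_dups = duplicate_count(train)
val_dups = duplicate_count(val)
print(f"Train: {train_dups} / {len(train)}")
print(f"Val: {val_dups} / {len(val)}")
```

Because duplicates are counted within each split, a record whose only twin lands on the other side of the cutoff is not counted here. Checking for twins across train and validation is a separate leakage test.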
