I wanted to raise a concern regarding Prior.csv, to see if I am misinterpreting the data structure or whether there is still a target-leakage issue.
We know we have an overlap between Test and Prior: 3,526 out of 5,621 test farmers also appear in the prior data.
While the records in Prior.csv are earlier in time for these overlapping farmers (meaning this isn't strictly future leakage), Prior.csv includes all three target variables fully populated. My main concern is that a non-trivial number of Test.csv rows map back to farmers who already have a positive adoption recorded in Prior.csv.
Here is the breakdown of overlapping test rows that already show prior positive adoption:
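(For anyone who wants to reproduce the counts, here is roughly how I computed them. The column names `farmer_id` and `adopted_within_*` are my assumptions about the schema, so adjust to the actual headers.)

```python
import pandas as pd

def prior_positive_breakdown(prior: pd.DataFrame, test: pd.DataFrame, targets):
    """For each target column, count test rows whose farmer already has a
    positive label somewhere in the prior data. Column names are assumed."""
    out = {}
    for target in targets:
        # Farmers with at least one prior positive for this target.
        positive_ids = set(prior.loc[prior[target] == 1, "farmer_id"])
        out[target] = int(test["farmer_id"].isin(positive_ids).sum())
    return out
```

Usage would be something like `prior_positive_breakdown(pd.read_csv("Prior.csv"), pd.read_csv("Test.csv"), ["adopted_within_90"])`.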
If a farmer has already adopted the practice in the prior data, wouldn't predicting their adoption again in the test set be considered target leakage (assuming adoption is a one-time event)?
Let me answer according to my understanding:
"All these features are historical and do not use any future information, so there is no leakage. Go ahead and add this code to your notebook, then run it to see if it improves your score. Good luck!"
Your response is a bit confusing!
As I already mentioned, I know the data in Prior is historical. My point is about target contamination. If a farmer already adopted the practice in the past, and adoption is a one-time event, then we already know their exact label for the test set.
If I am missing something, please point me to where this logic breaks down.
The competition defines adoption per training, not as a permanent state. If a farmer adopted in the past, it doesn’t guarantee they will adopt again after a new training.
Proof: In the train data, you have farmers with prior history but adopted = 0 for some trainings. So adoption is not permanent, and using Prior data for features is safe and not leakage.
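You can verify this claim yourself with a quick check for farmers who have a positive adoption in Prior.csv but adopted = 0 for a later training in Train.csv. This is only a sketch; `farmer_id` and `adopted` are assumed column names:

```python
import pandas as pd

def non_permanent_adopters(prior: pd.DataFrame, train: pd.DataFrame) -> set:
    """Farmers who adopted after an earlier training (in prior) yet show
    adopted = 0 for some training in train, i.e. adoption is not permanent."""
    prior_positive = set(prior.loc[prior["adopted"] == 1, "farmer_id"])
    later_zero = set(train.loc[train["adopted"] == 0, "farmer_id"])
    return prior_positive & later_zero
```

If this set is non-empty on the real data, a past positive does not pin down the test label.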
In Prior.csv, many farmers have trainings 0–2 days apart. That means at the time of a new training, you would not yet know whether the previous training was “adopted_within_90/120 days” (and often not even 7 days).
But your features (e.g., prior_*_sum, ward/topic/trainer temporal rates) use all prior outcomes immediately, regardless of whether the outcome window has elapsed.
So I was thinking we make outcomes “available” only after the horizon passes: define known_day = training_day + horizon, and merge stats on known_day rather than training_day.
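A minimal sketch of that idea for a per-farmer prior_adopt_sum, using a backward as-of merge so each training only sees outcomes whose window has already closed. The column names and the single 90-day horizon are my assumptions (the real pipeline would repeat this per target horizon and per ward/topic/trainer):

```python
import pandas as pd

HORIZON = 90  # assumed outcome window; the actual horizons are 7/90/120 days

def leak_free_prior_sum(prior: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Attach a per-farmer running adoption count that only uses outcomes
    whose horizon has elapsed by the current training_day."""
    prior = prior.copy()
    prior["known_day"] = prior["training_day"] + HORIZON
    prior = prior.sort_values("known_day")
    # Running count per farmer, in the order outcomes become known.
    prior["prior_adopt_sum"] = prior.groupby("farmer_id")["adopted"].cumsum()
    current = current.sort_values("training_day")
    # Latest stat with known_day <= training_day, matched within each farmer.
    merged = pd.merge_asof(
        current,
        prior[["farmer_id", "known_day", "prior_adopt_sum"]],
        left_on="training_day",
        right_on="known_day",
        by="farmer_id",
        direction="backward",
    )
    # No outcome known yet -> count of 0.
    merged["prior_adopt_sum"] = merged["prior_adopt_sum"].fillna(0).astype(int)
    return merged
```

With this, a training 50 days after a prior one sees none of that prior outcome, while a training 95 days later does, which is exactly the information a model would have at prediction time.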
Nice explanations, thank you @ML_Wizzard
I see. Thanks for the explanation @ML_Wizzard