DigiCow Farmer Training Adoption Challenge

Helping Kenya
€8 250 EUR
Under code review
Data analysis
Classification
895 joined
388 active
Start: Jan 28, 2026
Close: Mar 01, 2026
Reveal: Mar 02, 2026
Abdallah_Abra
How is this not a target leakage?
21 Feb 2026, 15:20 · 7

I want to raise a concern about Prior.csv. Either I am misinterpreting the data structure, or there is still a target leakage issue.

We know we have an overlap between Test and Prior: 3,526 out of 5,621 test farmers also appear in the prior data.

While the records in Prior.csv are earlier in time for these overlapping farmers (meaning this isn't strictly future leakage), Prior.csv includes all three target variables fully populated. My main concern is that a non-trivial number of Test.csv rows map back to farmers who already have a positive adoption recorded in Prior.csv.

Here is the breakdown of overlapping test rows that already show prior positive adoption:

  • adopted_within_07_days: 187 test rows
  • adopted_within_90_days: 270 test rows
  • adopted_within_120_days: 310 test rows
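
A breakdown like the one above can be reproduced with a short pandas check. This is only a sketch: the farmer identifier column name `farmer_id` and the helper `count_prior_positive_rows` are assumptions for illustration, and the toy frames stand in for the real Test.csv and Prior.csv.

```python
import pandas as pd

TARGETS = [
    "adopted_within_07_days",
    "adopted_within_90_days",
    "adopted_within_120_days",
]


def count_prior_positive_rows(test, prior, id_col="farmer_id", targets=TARGETS):
    """Count test rows whose farmer has at least one positive label in prior.

    `id_col` is a hypothetical column name; adjust it to the real schema.
    """
    # Per farmer: 1 if any prior training was adopted for that target.
    ever_adopted = prior.groupby(id_col)[targets].max()
    counts = {}
    for target in targets:
        positive_ids = ever_adopted.index[ever_adopted[target] == 1]
        counts[target] = int(test[id_col].isin(positive_ids).sum())
    return counts


# Toy data standing in for pd.read_csv("Prior.csv") / pd.read_csv("Test.csv").
prior = pd.DataFrame({
    "farmer_id": ["a", "a", "b", "c"],
    "adopted_within_07_days": [0, 1, 0, 0],
    "adopted_within_90_days": [0, 1, 1, 0],
    "adopted_within_120_days": [0, 1, 1, 0],
})
test = pd.DataFrame({"farmer_id": ["a", "b", "b", "d"]})

counts = count_prior_positive_rows(test, prior)
print(counts)
```

The count is per test row, not per farmer, so a farmer who appears twice in the test set is counted twice.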

If a farmer has already adopted the practice in the prior data, wouldn't predicting their adoption again in the test set be considered target leakage (assuming adoption is a one-time event)?

Discussion 7 answers

Let me answer according to my understanding.

"All these features are historical and do not use any future information, so there is no leakage. Go ahead and add this code to your notebook, then run it to see if it improves your score. Good luck!"

21 Feb 2026, 19:10
Upvotes 0
Abdallah_Abra

Your response is a bit confusing!

As I already mentioned, I know the data in Prior is historical. My point is about target contamination. If a farmer already adopted the practice in the past, and adoption is a one-time event, then we already know their exact label for the test set.

If I am missing something, please point me to where this logic breaks down.

The competition defines adoption as per training, not a permanent state. If a farmer adopted in the past, it doesn’t guarantee they will adopt again after a new training.

Proof: In the train data, you have farmers with prior history but adopted = 0 for some trainings. So adoption is not permanent, and using Prior data for features is safe and not leakage.
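
The claim above can be checked directly. This is a rough sketch: it assumes a `farmer_id` column (hypothetical name), that the train data carries the same target columns, and that every Prior record precedes the train record for the same farmer. The helper name is made up for illustration.

```python
import pandas as pd


def outcomes_after_prior_adoption(train, prior, target, id_col="farmer_id"):
    """For farmers with a positive `target` in prior, tally their train labels."""
    prior_positive_ids = set(prior.loc[prior[target] == 1, id_col])
    subset = train[train[id_col].isin(prior_positive_ids)]
    return {
        "not_adopted": int((subset[target] == 0).sum()),
        "adopted": int((subset[target] == 1).sum()),
    }


# Toy data standing in for Prior.csv and the train file.
prior = pd.DataFrame({
    "farmer_id": ["a", "b", "c"],
    "adopted_within_07_days": [1, 1, 0],
})
train = pd.DataFrame({
    "farmer_id": ["a", "a", "b", "c"],
    "adopted_within_07_days": [0, 1, 0, 1],
})

tally = outcomes_after_prior_adoption(train, prior, "adopted_within_07_days")
print(tally)
```

A non-zero `not_adopted` count would support the view that a past adoption does not fix the label for a later training.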

21 Feb 2026, 20:16
Upvotes 0
ML_Wizzard
Nasarawa State University

In Prior.csv, many farmers have trainings 0–2 days apart. That means at the time of a new training, you would not yet know whether the previous training was “adopted_within_90/120 days” (and often not even 7 days).

But your features (e.g., prior_*_sum, ward/topic/trainer temporal rates) use all prior outcomes immediately, regardless of whether the outcome window has elapsed.
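
To see how often trainings are that close together, the gap to each farmer's previous training can be computed as below. This is a sketch: `farmer_id` and `training_day` are assumed column names, and `training_day` is assumed to be parseable as a date.

```python
import pandas as pd


def days_since_previous_training(df, id_col="farmer_id", day_col="training_day"):
    """Days between each training and the same farmer's previous training.

    Returns NaN for a farmer's first training. Column names are assumptions.
    """
    ordered = df.sort_values([id_col, day_col], kind="stable")
    gaps = ordered.groupby(id_col)[day_col].diff().dt.days
    return gaps.reindex(df.index)


# Toy data standing in for Prior.csv.
prior = pd.DataFrame({
    "farmer_id": ["a", "a", "a", "a", "b"],
    "training_day": pd.to_datetime(
        ["2025-01-01", "2025-01-02", "2025-01-02", "2025-03-01", "2025-01-05"]
    ),
})

gaps = days_since_previous_training(prior)
n_within_two_days = int((gaps <= 2).sum())
print(gaps.tolist(), n_within_two_days)
```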

22 Feb 2026, 00:19
Upvotes 1
ML_Wizzard
Nasarawa State University

So I was thinking we make outcomes “available” only after the horizon passes:

  • For 7-day labels: a prior training’s 7-day outcome becomes usable at training_day + 7
  • For 90-day labels: usable at training_day + 90
  • For 120-day labels: usable at training_day + 120

So you merge the stats on known_day (training_day plus the label horizon), not on training_day.
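
One way to sketch the known_day idea for per-farmer sums is shown below. It is not a reference implementation: `farmer_id`, the datetime column `training_day`, the output column names, and the helper `add_known_prior_sums` are all assumptions, and the self-merge is chosen for readability rather than speed.

```python
import pandas as pd

# Label horizon in days for each target, as described above.
HORIZONS = {
    "adopted_within_07_days": 7,
    "adopted_within_90_days": 90,
    "adopted_within_120_days": 120,
}


def add_known_prior_sums(df, id_col="farmer_id", day_col="training_day",
                         horizons=HORIZONS):
    """Add per-farmer sums of past outcomes that were already observable.

    An outcome counts only if known_day = training_day + horizon is on or
    before the current row's training_day.
    """
    out = df.reset_index(drop=True).copy()
    out["row_id"] = out.index
    for target, days in horizons.items():
        hist = out[[id_col, day_col, target]].copy()
        hist["known_day"] = hist[day_col] + pd.Timedelta(days=days)
        hist = hist.drop(columns=day_col)
        # Pair every row with every training of the same farmer.
        pairs = out[["row_id", id_col, day_col]].merge(hist, on=id_col, how="left")
        # Keep only outcomes whose window had elapsed by the current training.
        usable = pairs[pairs["known_day"] <= pairs[day_col]]
        sums = usable.groupby("row_id")[target].sum()
        out[f"prior_{target}_known_sum"] = (
            out["row_id"].map(sums).fillna(0).astype(int)
        )
    return out.drop(columns="row_id")


# Toy history for a single farmer.
history = pd.DataFrame({
    "farmer_id": ["a", "a", "a", "a"],
    "training_day": pd.to_datetime(
        ["2025-01-01", "2025-01-03", "2025-01-10", "2025-05-05"]
    ),
    "adopted_within_07_days": [1, 0, 1, 0],
    "adopted_within_90_days": [1, 0, 1, 0],
    "adopted_within_120_days": [1, 0, 1, 0],
})

featured = add_known_prior_sums(history)
print(featured.filter(like="known_sum"))
```

In the toy data, the training on 2025-01-03 sees no usable outcome from 2025-01-01, because even the 7-day window has not closed yet. The same availability filter would presumably apply to the ward, topic, and trainer rates, with the grouping key swapped.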

Nice explanation, thank you @ML_Wizzard.

Abdallah_Abra

I see. Thanks for the explanation @ML_Wizzard