
DigiCow Farmer Training Adoption Challenge
Helping Kenya · €8 250 · Under code review
Data analysis · Classification
897 joined · 386 active
Start: Jan 28, 26 · Close: Mar 01, 26 · Reveal: Mar 02, 26
Data Leakage?
Data · 31 Jan 2026, 13:10 · 4

I am writing to seek clarification regarding the temporal boundary of the features provided in the training and test datasets. The competition objective is to predict the probability of adoption within 7 days of a farmer's first training, using information available at the time of training.

Upon reviewing the data dictionary, I am concerned that several features may incorporate information from a future window (data occurring after the 7-day target period). If these features were calculated at the end of the data collection period rather than at the 7-day cutoff, they may constitute Target Leakage, allowing models to 'cheat' by observing post-event behavior.

Could you please confirm whether the following features are strictly constrained to observations made on or before Day 7, or whether they include data from the full 30-day, 60-day, or total duration of the study?

Features in Question:

- num_trainings_30d: Does this count trainings that occurred between Day 8 and Day 30?
- num_trainings_60d: Does this count trainings that occurred between Day 8 and Day 60?
- num_total_trainings: Does this include all sessions regardless of date?
- num_repeat_trainings: Does this count sessions attended after the initial 7-day adoption window?
- num_unique_trainers: Was this calculated based on trainers met after the adoption window?
- days_to_second_training: If the second training occurred on Day 10, is this value available to the model, despite being 'future' information relative to the Day 7 target?

If these features contain information from after the 7-day window, they would not be available in a real-world deployment scenario. Clarification on whether we are permitted to use these for the final submission or if they should be treated as leakage would be greatly appreciated.
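In the meantime, my plan is to rebuild the suspect aggregates myself from the raw training events, keeping only observations on or before Day 7. A minimal sketch, assuming a hypothetical event log (the column names and toy rows below are my own, not the competition's actual schema):

```python
import pandas as pd

# Hypothetical raw event log (assumed schema, not the competition data):
# one row per training session, with the offset in days from the
# farmer's first training.
events = pd.DataFrame({
    "farmer_id":  [1, 1, 1, 2, 2],
    "trainer_id": ["a", "a", "b", "c", "c"],
    "days_since_first_training": [0, 3, 10, 0, 45],
})

# Keep only sessions observable on or before Day 7, the target cutoff.
safe = events[events["days_since_first_training"] <= 7]

# Leakage-free versions of the windowed aggregates.
safe_features = safe.groupby("farmer_id").agg(
    num_trainings_7d=("days_since_first_training", "size"),
    num_unique_trainers=("trainer_id", "nunique"),
)
print(safe_features)
```

With the toy rows above, farmer 1's Day-10 session and farmer 2's Day-45 session drop out of the counts entirely, which is exactly the behaviour a deployed model would see.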

Discussion · 4 answers
MediumChungus

I believe the target is whether or not training was **adopted** within 7 days. The features that we have are related to receiving trainings, but that does not tell us about whether the farmer actually adopted the training into their practices.

So IMO, I don't think we have to worry about leakage.

1 Feb 2026, 08:13
Upvotes 0

I believe it would be target leakage, given that we won't have the values of some of these features at the point when we are trying to predict which farmers adopted the training within a week. By the time we have the values of the listed features, it will already be decided whether the training was adopted or not.

AJoel
Zindi

Hi @Satoshi, thank you for raising this concern. I have reviewed the data. The initial target was adoption after 120 days, so some of these features, such as num_trainings_xd, made perfect sense since they were all within the target timeframe. However, I also want to point out that training does not imply adoption. I have revised the targets and updated the data. I will make a formal post on the topic.

1 Feb 2026, 12:35
Upvotes 0

From how these features are named and structured, it looks like many of them are aggregated over fixed windows (30d, 60d, total) relative to the farmer’s entire observation period, not strictly capped at the 7-day adoption window. If that’s the case, then yes — features like num_trainings_30d, num_trainings_60d, num_total_trainings, num_repeat_trainings, and even num_unique_trainers would implicitly include post–Day 7 behavior and would be unavailable at prediction time in a real deployment.

days_to_second_training is the biggest red flag for me. If the second training happens on Day 10, the model would be learning something that by definition occurs after the target window, which is classic target leakage.
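A quick probe anyone can run (assuming the column really ships under this name; the toy values below are invented): any value of days_to_second_training greater than 7 is, by construction, information from after the target window.

```python
import pandas as pd

# Toy stand-in for the provided training table (values are made up).
train = pd.DataFrame({
    "days_to_second_training": [3.0, 10.0, 21.0, 5.0, None],
})

# Rows where the "second training" postdates the Day-7 target window:
# a model using this column for those rows is reading the future.
# (NaN, i.e. no second training observed, compares as False here.)
leaky = train["days_to_second_training"] > 7
print(int(leaky.sum()))
```

If that count is nonzero on the real data, the column cannot be reproduced at prediction time as-is and should be dropped or re-derived.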

bitlife

2 Feb 2026, 01:18
Upvotes 1