I wanted to raise a concern regarding Prior.csv, to see if I am misinterpreting the data structure or whether there is still a target-leakage issue.
We know we have an overlap between Test and Prior: 3,526 out of 5,621 test farmers also appear in the prior data.
While the records in Prior.csv are earlier in time for these overlapping farmers (meaning this isn't strictly future leakage), Prior.csv includes all three target variables fully populated. My main concern is that a non-trivial number of Test.csv rows map back to farmers who already have a positive adoption recorded in Prior.csv.
Here is the breakdown of overlapping test rows that already show prior positive adoption:
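(For anyone who wants to reproduce the counts, here is roughly how I computed them. The column names `farmer_id` and `adopted_within_*` are my assumptions about the schema, so adjust to the actual headers.)

```python
import pandas as pd

def prior_positive_breakdown(prior: pd.DataFrame, test: pd.DataFrame, targets):
    """For each target column, count test rows whose farmer already has a
    positive label somewhere in the prior data. Column names are assumed."""
    out = {}
    for target in targets:
        # Farmers with at least one prior positive for this target.
        positive_ids = set(prior.loc[prior[target] == 1, "farmer_id"])
        out[target] = int(test["farmer_id"].isin(positive_ids).sum())
    return out
```

Usage would be something like `prior_positive_breakdown(pd.read_csv("Prior.csv"), pd.read_csv("Test.csv"), ["adopted_within_90"])`.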
If a farmer has already adopted the practice in the prior data, wouldn't predicting their adoption again in the test set be considered target leakage (assuming adoption is a one-time event)?
Let me answer according to my understanding:
"All these features are historical and do not use any future information, so there is no leakage. Go ahead and add this code to your notebook, then run it to see if it improves your score. Good luck!"
Your response is a bit confusing!
As I already mentioned, I know the data in Prior is historical. My point is about target contamination. If a farmer already adopted the practice in the past, and adoption is a one-time event, then we already know their exact label for the test set.
If I am missing something, please point me to where this logic breaks down.
The competition defines adoption per training, not as a permanent state. If a farmer adopted in the past, it doesn’t guarantee they will adopt again after a new training.
Proof: In the train data, you have farmers with prior history but adopted = 0 for some trainings. So adoption is not permanent, and using Prior data for features is safe and not leakage.
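You can verify this claim yourself with a quick check for farmers who have a positive adoption in Prior.csv but adopted = 0 for a later training in Train.csv. This is only a sketch; `farmer_id` and `adopted` are assumed column names:

```python
import pandas as pd

def non_permanent_adopters(prior: pd.DataFrame, train: pd.DataFrame) -> set:
    """Farmers who adopted after an earlier training (in prior) yet show
    adopted = 0 for some training in train, i.e. adoption is not permanent."""
    prior_positive = set(prior.loc[prior["adopted"] == 1, "farmer_id"])
    later_zero = set(train.loc[train["adopted"] == 0, "farmer_id"])
    return prior_positive & later_zero
```

If this set is non-empty on the real data, a past positive does not pin down the test label.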
In Prior.csv, many farmers have trainings 0–2 days apart. That means at the time of a new training, you would not yet know whether the previous training was “adopted_within_90/120 days” (and often not even 7 days).
But your features (e.g., prior_*_sum, ward/topic/trainer temporal rates) use all prior outcomes immediately, regardless of whether the outcome window has elapsed.
So I was thinking we make outcomes “available” only after the horizon passes: define known_day = training_day + horizon, and merge stats on known_day rather than training_day.
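A minimal sketch of that idea for a per-farmer prior_adopt_sum, using a backward as-of merge so each training only sees outcomes whose window has already closed. The column names and the single 90-day horizon are my assumptions (the real pipeline would repeat this per target horizon and per ward/topic/trainer):

```python
import pandas as pd

HORIZON = 90  # assumed outcome window; the actual horizons are 7/90/120 days

def leak_free_prior_sum(prior: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    """Attach a per-farmer running adoption count that only uses outcomes
    whose horizon has elapsed by the current training_day."""
    prior = prior.copy()
    prior["known_day"] = prior["training_day"] + HORIZON
    prior = prior.sort_values("known_day")
    # Running count per farmer, in the order outcomes become known.
    prior["prior_adopt_sum"] = prior.groupby("farmer_id")["adopted"].cumsum()
    current = current.sort_values("training_day")
    # Latest stat with known_day <= training_day, matched within each farmer.
    merged = pd.merge_asof(
        current,
        prior[["farmer_id", "known_day", "prior_adopt_sum"]],
        left_on="training_day",
        right_on="known_day",
        by="farmer_id",
        direction="backward",
    )
    # No outcome known yet -> count of 0.
    merged["prior_adopt_sum"] = merged["prior_adopt_sum"].fillna(0).astype(int)
    return merged
```

With this, a training 50 days after a prior one sees none of that prior outcome, while a training 95 days later does, which is exactly the information a model would have at prediction time.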
Nice explanations, thank you @ML_Wizzard
I see. Thanks for the explanation @ML_Wizzard