Mine keeps fluctuating. When I focus on improving CV, I end up getting worse results on the LB.
This data trains very well, but something changes on the test set; there is a huge gap between CV and LB, and this scenario is appearing for all of us.
CV 89.58 VERSUS LB 70.76
Which CV do you use?
There is a huge gap that I think we need to consider.
Yes, one needs to consider this huge gap, of course. It will be tackled in due course.
Still using StratifiedKFold to experiment for now, though.
I think it's due to "inherent" data leakage... since customers can be in both train and test sets... I have the same issue: 15 points more for CV
I tried using GroupKFold to handle that, but it wasn't of much help.
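For what it's worth, a minimal sketch of that idea: grouping folds by customer_id so no customer ends up on both the train and validation side of a fold. The data here is synthetic and the column name is just a placeholder for whatever the real files use.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in: 12 rows belonging to 4 customers (3 rows each).
customer_id = np.repeat([101, 102, 103, 104], 3)
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

gkf = GroupKFold(n_splits=4)
for fold, (tr_idx, va_idx) in enumerate(gkf.split(X, y, groups=customer_id)):
    tr_customers = set(customer_id[tr_idx])
    va_customers = set(customer_id[va_idx])
    # No customer should appear in both the train and validation fold.
    assert tr_customers.isdisjoint(va_customers)
    print(f"Fold {fold}: validation customers = {sorted(va_customers)}")
```

Of course, if the leak is at the loan level rather than the customer level, grouping by customer alone may not be enough, which could explain why it didn't help much.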
me too :(
Since the test performance was poor, something must be very different in the test set compared to the train.
Yes, we need to find the cause of this.
When you look at the test and train split, it looks like a random split except for the Ghana case, so we can expect CV and LB to have at least moderate correlation.
I wouldn't be quick to conclude data leakage. If data leakage in test set, performance should improve, not get worse. I haven't made any submissions yet, but it is an interesting dataset for sure. And as @Jaw22 pointed out in a related post, it does require some 'skills' and domain knowledge.
We ensured that no customers were split over train and test, they are either in one or the other.
Thanks @Amy_Bray if I understand them correctly the data leakage they are talking about is in their local cross validation but not the train, tests given?
ah! you know me so well Kolesh, you know how I always panic when the word leak is used :")
😅
I kind of stand by my statement... isn't customer 253952 in both sets? With loans at different times in both sets?
Hello @Amy_Bray,
Am I misunderstanding something? I thought that some customer_ids appear in both the train and test datasets. Could you please clarify your statement: "We ensured that no customers were split over train and test; they are either in one or the other"?
Thanks
@amy_gray. Hope you are well. Still waiting on your response.
Yeah... this actually corroborates the issue further. I believe @Amy_Bray @Zindi can take their time to have a second look at this. Cheers!!!
@halsted, just wanted to say you're right. There is data leakage. Not a significant amount, considering the dataset size, but it is there. Even with the logic that the unique_id is a combination of three column names, it still missed a crucial consideration that would have prevented this.
Fold 1 F1 Score: 0.9065
Fold 2 F1 Score: 0.9022
Fold 3 F1 Score: 0.9091
Fold 4 F1 Score: 0.9116
Fold 5 F1 Score: 0.9187
Average F1 Score: 0.9096
It gives a 0.68 score.
What a gap. I looked at the data for a long time and couldn't find any data leakage.
I think this is expected, no?
The test set contains data from a region (Ghana) that is largely absent from the training set (Kenya). This introduces a domain shift, as the distributions of features (e.g., customer behaviors, preferences, or market conditions) might differ significantly between the two regions. So the model has to generalize across diverse regions
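One quick way to see this shift is to compare the region distribution of train vs test. A sketch with synthetic data; the shares below just mirror the numbers quoted in this thread, and the real column name (e.g. country) is an assumption to be checked against the actual files.

```python
from collections import Counter

# Synthetic stand-in for the region column of each split.
train_regions = ["Kenya"] * 97 + ["Ghana"] * 3
test_regions = ["Kenya"] * 81 + ["Ghana"] * 19

def region_share(regions):
    """Return each region's share of the rows as a fraction."""
    counts = Counter(regions)
    total = len(regions)
    return {r: c / total for r, c in counts.items()}

print("train:", region_share(train_regions))
print("test: ", region_share(test_regions))
# A region that is rare in train but common in test signals domain shift.
```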
81% of the test set is from Kenya, so we should expect good correlation between CV and LB; otherwise there is some leak in our CV, or the public test set is so small that it can't evaluate our models well so far.
Oh I see, I will correct my previous discussion. But the fact that we have a completely different customer base from a different region makes CV/LB correlation a bit difficult, don't you think?
Somewhat, but you can see there is more than a 20% gap.
fair
More than a 20% gap, because the effect is not linear.
Customer 253952 is in both sets, no?
I noticed that customers can appear in both datasets, which aligns with the concept of repeat customers. This suggests that the same customer can take out multiple loans at different times no?
that ... AND... that there's data for a customer split into both train and test, no?
But customer id by itself isn't the unique key here, right? The unique key is the combination of customer id, loan id, and lender id, if I'm not wrong.
So the bad thing is if both sets share this combination?
@Koleshjr, you are correct: a customer id is not a unique feature to identify a record with. It is rather the combination of customer id, loan id, and lender id.
Even with that, there is data leakage. To be precise, 920 samples. Not a significant amount, but it is definitely there.
So I'm trying to understand how there is data leakage; I believe it was stated that a customer id can appear multiple times but with a different lender id.
I used the code below to check if a unique combination of customer id, loan id and lender id appears in both datasets
What do you guys think?
# train and test are assumed to be pandas DataFrames loaded elsewhere.
# Combine customer_id, loan_id, and lender_id to create a unique identifier for each entry
train_set = set(zip(train['customer_id'], train['tbl_loan_id'], train['lender_id']))
test_set = set(zip(test['customer_id'], test['tbl_loan_id'], test['lender_id']))
common_entries = train_set.intersection(test_set)
if common_entries:
    print(f"Found {len(common_entries)} common entries between train and test datasets:")
    for entry in common_entries:
        print(entry)
else:
    print("No common entries found between train and test datasets.")
All the ids are unique; the leakage is caused by a logic error.
What's that @da_? Logic error?
That's my way of describing not considering all possibilities. In this case, all loans appear to be syndicated. Therefore, if you know the outcome of one part of the loan in the training set, you can infer the outcome for the other part in the test set.
In reality, a customer takes out one loan at a time, and it may or may not be spread across different lenders. I assume the default risk is computed on a per-loan basis and not per lender. I would also assume repayment is made towards the entire loan and not to a specific lender. If that's the case, for any given loan, the chances of default are the same across all lenders. The current train/test split logic did not consider this, and it results in some customers having the same loan id in both train and test.
There aren't too many of them to ruin the dataset, so I'm not too worried about it.
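Based on that reasoning, the extra check, beyond the full (customer, loan, lender) triple, is whether any tbl_loan_id alone appears in both splits: those are the syndicated-loan rows whose outcome leaks. A sketch with toy rows standing in for the real files:

```python
# Toy stand-ins: each tuple is (customer_id, tbl_loan_id, lender_id).
train_rows = [(1, 500, 10), (1, 501, 10), (2, 502, 11)]
test_rows = [(1, 500, 12), (3, 503, 11)]  # loan 500 reappears with another lender

# The triple-key check finds no overlap...
triples_shared = set(train_rows) & set(test_rows)
print("shared triples:", len(triples_shared))  # 0

# ...but the loan ids alone do overlap, which is the leak described above.
train_loans = {loan for _, loan, _ in train_rows}
test_loans = {loan for _, loan, _ in test_rows}
shared_loans = train_loans & test_loans
print("shared loan ids:", sorted(shared_loans))  # [500]
```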
I need help.
I'm getting 88 on my cv but on the lb it keeps evaluating to 4
Are you sure your data is free from leakage, especially if you are using target encoding?
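On target encoding: the usual way to avoid leaking the target into its own row is to compute the encoding out-of-fold, so each row's encoded value comes only from other rows. A minimal sketch with made-up column names, not this competition's actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "lender_id": [10, 10, 10, 11, 11, 11, 12, 12],
    "target":    [1,  0,  1,  0,  0,  1,  1,  1],
})

global_mean = df["target"].mean()
df["lender_te"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for tr_idx, va_idx in kf.split(df):
    # Means computed on the training part of the fold only...
    fold_means = df.iloc[tr_idx].groupby("lender_id")["target"].mean()
    # ...then mapped onto the held-out rows, never onto themselves.
    df.loc[df.index[va_idx], "lender_te"] = (
        df.iloc[va_idx]["lender_id"].map(fold_means)
    )

# Categories unseen in a fold fall back to the global mean.
df["lender_te"] = df["lender_te"].fillna(global_mean)
print(df)
```

Encoding with a plain groupby over the whole train set, by contrast, lets each row see its own label, which inflates CV without helping the LB.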
CV 0.92, LB 0.65.
Which model are you working on?
How is your current CV-LB looking now, @Yisakberhanu? Your current LB is super cool, 0.74+. I can't break 0.72 CV / 0.69 LB.
I am still trying to find stability; it's so shocking.