Mine keeps fluctuating. When I focus on improving CV, I end up getting worse results on the LB.
This data trains very well, but something changes on the test set; there is a huge gap between CV and LB, and this scenario is appearing for all of us.
CV 89.58 VERSUS LB 70.76
Which CV do you use?
There is a huge gap that I think we need to consider.
Yes, one needs to consider this huge gap, of course. It will be tackled in due course.
Still using StratifiedKFold to experiment for now, though.
I think it's due to "inherent" data leakage... since customers can be in both train and test sets... I have the same issue: 15 points more for CV
I tried using GroupKFold to handle that, but it wasn't of much help.
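For what it's worth, a minimal sketch of that idea: grouping folds by customer_id so no customer ends up on both the train and validation side of a fold. The data here is synthetic and the column name is just a placeholder for whatever the real files use.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic stand-in: 12 rows belonging to 4 customers (3 rows each).
customer_id = np.repeat([101, 102, 103, 104], 3)
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

gkf = GroupKFold(n_splits=4)
for fold, (tr_idx, va_idx) in enumerate(gkf.split(X, y, groups=customer_id)):
    tr_customers = set(customer_id[tr_idx])
    va_customers = set(customer_id[va_idx])
    # No customer should appear in both the train and validation fold.
    assert tr_customers.isdisjoint(va_customers)
    print(f"Fold {fold}: validation customers = {sorted(va_customers)}")
```

Of course, if the leak is at the loan level rather than the customer level, grouping by customer alone may not be enough, which could explain why it didn't help much.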
me too :(
Since the test performance was poor, something must be very different in the test set compared to the train.
Yes, we need to find the cause of this.
When you look at the test and train split, it looks like a random split except for the Ghana case, so we can expect CV and LB to have at least moderate correlation.
I wouldn't be quick to conclude data leakage. If data leakage in test set, performance should improve, not get worse. I haven't made any submissions yet, but it is an interesting dataset for sure. And as @Jaw22 pointed out in a related post, it does require some 'skills' and domain knowledge.
We ensured that no customers were split over train and test, they are either in one or the other.
Thanks @Amy_Bray if I understand them correctly the data leakage they are talking about is in their local cross validation but not the train, tests given?
ah! you know me so well Kolesh, you know how I always panic when the word leak is used :")
😅
I kind of stand by my statement... isn't customer 253952 in both sets? With loans at different times in both sets?
Hello @Amy_Bray,
Am I misunderstanding something? I thought that some customer_ids appear in both the train and test datasets. Could you please clarify your statement: "We ensured that no customers were split over train and test; they are either in one or the other"?
Thanks
@amy_gray. Hope you are well. Still waiting on your response.
Yeah... this actually corroborates the issue further. I believe @Amy_Bray @Zindi can take their time to have a second look at this. Cheers!!!
@halsted, just wanted to say you're right. There is data leakage. Not a significant amount, considering the dataset size, but it is there. Even with the logic that the unique_id is a combination of three column names, it still missed a crucial consideration that would have prevented this.
Fold 1 F1 Score: 0.9065
Fold 2 F1 Score: 0.9022
Fold 3 F1 Score: 0.9091
Fold 4 F1 Score: 0.9116
Fold 5 F1 Score: 0.9187
Average F1 Score: 0.9096
It gives a 0.68 score.
What a gap. I looked at the data for a long time and couldn't find any data leakage.
I think this is expected, no?
The test set contains data from a region (Ghana) that is largely absent from the training set (Kenya). This introduces a domain shift, as the distributions of features (e.g., customer behaviors, preferences, or market conditions) might differ significantly between the two regions. So the model has to generalize across diverse regions
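One quick way to see this shift is to compare the region distribution of train vs test. A sketch with synthetic data; the shares below just mirror the numbers quoted in this thread, and the real column name (e.g. country) is an assumption to be checked against the actual files.

```python
from collections import Counter

# Synthetic stand-in for the region column of each split.
train_regions = ["Kenya"] * 97 + ["Ghana"] * 3
test_regions = ["Kenya"] * 81 + ["Ghana"] * 19

def region_share(regions):
    """Return each region's share of the rows as a fraction."""
    counts = Counter(regions)
    total = len(regions)
    return {r: c / total for r, c in counts.items()}

print("train:", region_share(train_regions))
print("test: ", region_share(test_regions))
# A region that is rare in train but common in test signals domain shift.
```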
81% of the test set is from Kenya, so we should expect good correlation between CV and LB; otherwise there is some leak in our CV, or the public test set is so small that it can't evaluate our models well so far.
Oh I see, I will correct my previous discussion. But the fact that we have a completely different customer base from a different region makes CV/LB correlation a bit difficult, don't you think?
Somewhat, but you can see there is more than a 20% gap.
fair
More than a 20% gap, because the effect is not linear.
Customer 253952 is in both sets, no?
I noticed that customers can appear in both datasets, which aligns with the concept of repeat customers. This suggests that the same customer can take out multiple loans at different times no?
that ... AND... that there's data for a customer split into both train and test, no?
But customer id by itself isn't the unique key here, right? The unique key is the combination of customer id, loan id, and lender id, if I'm not wrong.
So the bad thing is if both sets share this combination?
@Koleshjr, you are correct: a customer id is not a unique feature to identify a record with. It is rather the combination of customer id, loan id, and lender id.
Even with that, there is data leakage. To be precise, 920 samples. Not a significant amount, but it is definitely there.
So I'm trying to understand how there is data leakage; I believe it was stated that a customer id can appear multiple times but with a different lender id.
I used the code below to check if a unique combination of customer id, loan id and lender id appears in both datasets
What do you guys think?
# train and test are assumed to be pandas DataFrames loaded elsewhere.
# Combine customer_id, loan_id, and lender_id to create a unique identifier for each entry
train_set = set(zip(train['customer_id'], train['tbl_loan_id'], train['lender_id']))
test_set = set(zip(test['customer_id'], test['tbl_loan_id'], test['lender_id']))
common_entries = train_set.intersection(test_set)
if common_entries:
    print(f"Found {len(common_entries)} common entries between train and test datasets:")
    for entry in common_entries:
        print(entry)
else:
    print("No common entries found between train and test datasets.")
All the ids are unique; the leakage is caused by a logic error.
What's that @da_? Logic error?
That's my way of describing not considering all possibilities. In this case, all loans appear to be syndicated. Therefore, if you know the outcome of one part of the loan in the training set, you can infer the outcome for the other part in the test set.
In reality, a customer takes out one loan at a time, and it may or may not be spread across different lenders. I assume the default risk is computed on a per-loan basis and not per lender. I would also assume repayment is made towards the entire loan and not to a specific lender. If that's the case, for any given loan, the chances of default are the same across all lenders. The current train/test split logic did not consider this, and it results in some customers having the same loan id in both train and test.
There aren't too many of them to ruin the dataset, so I'm not too worried about it.
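Based on that reasoning, the extra check, beyond the full (customer, loan, lender) triple, is whether any tbl_loan_id alone appears in both splits: those are the syndicated-loan rows whose outcome leaks. A sketch with toy rows standing in for the real files:

```python
# Toy stand-ins: each tuple is (customer_id, tbl_loan_id, lender_id).
train_rows = [(1, 500, 10), (1, 501, 10), (2, 502, 11)]
test_rows = [(1, 500, 12), (3, 503, 11)]  # loan 500 reappears with another lender

# The triple-key check finds no overlap...
triples_shared = set(train_rows) & set(test_rows)
print("shared triples:", len(triples_shared))  # 0

# ...but the loan ids alone do overlap, which is the leak described above.
train_loans = {loan for _, loan, _ in train_rows}
test_loans = {loan for _, loan, _ in test_rows}
shared_loans = train_loans & test_loans
print("shared loan ids:", sorted(shared_loans))  # [500]
```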
I need help.
I'm getting 88 on my cv but on the lb it keeps evaluating to 4
Are you sure your data is free from leakage, especially if you are using target encoding?
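On target encoding: the usual way to avoid leaking the target into its own row is to compute the encoding out-of-fold, so each row's encoded value comes only from other rows. A minimal sketch with made-up column names, not this competition's actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "lender_id": [10, 10, 10, 11, 11, 11, 12, 12],
    "target":    [1,  0,  1,  0,  0,  1,  1,  1],
})

global_mean = df["target"].mean()
df["lender_te"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for tr_idx, va_idx in kf.split(df):
    # Means computed on the training part of the fold only...
    fold_means = df.iloc[tr_idx].groupby("lender_id")["target"].mean()
    # ...then mapped onto the held-out rows, never onto themselves.
    df.loc[df.index[va_idx], "lender_te"] = (
        df.iloc[va_idx]["lender_id"].map(fold_means)
    )

# Categories unseen in a fold fall back to the global mean.
df["lender_te"] = df["lender_te"].fillna(global_mean)
print(df)
```

Encoding with a plain groupby over the whole train set, by contrast, lets each row see its own label, which inflates CV without helping the LB.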
CV 0.92, LB 0.65.
Which model are you working on?
How is your current CV-LB looking now, @Yisakberhanu? Your current LB is super cool, 0.74+. I can't break 0.72 CV / 0.69 LB.
I am still trying to find stability; it's so shocking.