
African Credit Scoring Challenge

Helping Africa
$5 000 USD
Completed (~1 year ago)
1967 joined
1022 active
Start: Nov 29, 24
Close: Jan 12, 25
Reveal: Jan 13, 25
Yisakberhanu
wachemo university
Cv ~LB
Notebooks · 4 Dec 2024, 14:11 · 47

Hi guys, how is your CV vs LB correlation? Mine: CV 78, LB 65.

Discussion 47 answers
analyst

Mine keeps fluctuating. When I focus on improving CV, I end up getting worse results on the LB.

4 Dec 2024, 14:45
Upvotes 1
Yisakberhanu
wachemo university

The models do very well on the training data, but something changes on the test set: there is a huge gap between CV and LB. This is happening to all of us.

MICADEE
LAHASCOM

CV 89.58 VERSUS LB 70.76

4 Dec 2024, 14:47
Upvotes 1

Which CV do you use?

Yisakberhanu
wachemo university

There is a huge gap, which I think we need to consider.

MICADEE
LAHASCOM

Yes. One needs to consider this huge gap, of course, and it will be tackled in due course.

MICADEE
LAHASCOM

Still using StratifiedKFold to experiment for now, though.

I think it's due to "inherent" data leakage... since customers can be in both train and test sets... I have the same issue: 15 points more for CV

4 Dec 2024, 15:15
Upvotes 3
analyst

I tried using GroupKFold to handle that, but it wasn't of much help.

Since the test performance was poor, something must be very different in the test set compared to the train.
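For readers who have not tried it, here is a minimal sketch of what a customer-grouped split looks like with scikit-learn's GroupKFold. It assumes the grouping column is customer_id; the toy arrays X, y and customer_id below are made-up placeholders, not competition data. It only shows that no customer ends up on both sides of a fold. As noted above, this alone did not close the CV/LB gap.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy stand-in data (not from the competition): 8 loans from 4 customers.
customer_id = np.array([101, 101, 102, 102, 103, 103, 104, 104])
X = np.arange(len(customer_id)).reshape(-1, 1)  # placeholder features
y = np.array([0, 1, 0, 0, 1, 0, 0, 1])          # placeholder labels

overlaps = []  # number of customers shared by train and validation, per fold
gkf = GroupKFold(n_splits=4)
for fold, (train_idx, valid_idx) in enumerate(gkf.split(X, y, groups=customer_id), start=1):
    train_customers = {int(c) for c in customer_id[train_idx]}
    valid_customers = {int(c) for c in customer_id[valid_idx]}
    overlaps.append(len(train_customers & valid_customers))
    print(f"Fold {fold}: validation customers {sorted(valid_customers)}")

print("Shared customers per fold:", overlaps)
```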

Yisakberhanu
wachemo university

Yes we need to find the cause of

Yisakberhanu
wachemo university

When you look at the train and test split, it looks like a random split except for the Ghana case, so we can expect CV and LB to have at least a moderate correlation.

I wouldn't be quick to conclude data leakage. If there were data leakage into the test set, performance should improve, not get worse. I haven't made any submissions yet, but it is an interesting dataset for sure. And as @Jaw22 pointed out in a related post, it does require some 'skills' and domain knowledge.

Amy_Bray
Zindi

We ensured that no customers were split over train and test, they are either in one or the other.

Koleshjr
Multimedia university of kenya

Thanks @Amy_Bray. If I understand them correctly, the data leakage they are talking about is in their local cross-validation, not in the train and test sets provided?

Amy_Bray
Zindi

ah! you know me so well Kolesh, you know how I always panic when the word leak is used :")

Koleshjr
Multimedia university of kenya

😅

I kind of stand by my statement... isn't customer 253952 in both sets, with loans at different times in each?

Juliuss
Freelance

Hello @Amy_Bray,

Am I misunderstanding something? I thought that some customer_ids appear in both the train and test datasets. Could you please clarify your statement: "We ensured that no customers were split over train and test; they are either in one or the other"?

Thanks
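A quick way to settle whether any customer_id appears in both files is a set intersection on that single column. The sketch below is only an illustration: shared_values is a hypothetical helper name, and the two small tables are made-up stand-ins. Plain dicts of lists are used so the snippet runs without extra libraries; the same function should also work with pandas DataFrames, since it only indexes a column and iterates over it.

```python
def shared_values(train, test, column):
    """Return the values of `column` that occur in both tables."""
    return set(train[column]) & set(test[column])

# Made-up stand-in tables; only the column name customer_id comes from the thread.
train = {"customer_id": [253952, 111, 222]}
test = {"customer_id": [253952, 333]}

shared_customers = shared_values(train, test, "customer_id")
print(f"{len(shared_customers)} customer_id value(s) appear in both sets")
```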

@amy_gray, hope you are well. Still waiting on your response.

MICADEE
LAHASCOM

Yeah... This actually confirms the issue further. I believe @Amy_Bray @Zindi can take their time to have a second look at this. Cheers!

@halsted, just wanted to say you are right. There is data leakage. Not a significant amount, considering the size of the dataset, but it is there. Even with the logic that the unique_id is a combination of three columns, it still missed a crucial consideration that would have prevented this.

Fold 1 F1 Score: 0.9065

Fold 2 F1 Score: 0.9022

Fold 3 F1 Score: 0.9091

Fold 4 F1 Score: 0.9116

Fold 5 F1 Score: 0.9187

Average F1 Score: 0.9096

It gives a score of 0.68 on the leaderboard.
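To put those numbers side by side, the short calculation below averages the five fold scores and compares them with the 0.68 figure, which is assumed here to be the leaderboard score. The folds differ from each other by less than 0.02, so the gap of roughly 0.23 is far larger than fold-to-fold noise.

```python
from statistics import mean

fold_f1 = [0.9065, 0.9022, 0.9091, 0.9116, 0.9187]  # fold scores quoted above
lb_score = 0.68  # assumed to be the leaderboard score

cv_mean = mean(fold_f1)
fold_spread = max(fold_f1) - min(fold_f1)  # variation between folds
gap = cv_mean - lb_score                   # difference between CV and LB

print(f"Mean CV F1: {cv_mean:.4f}")
print(f"Spread across folds: {fold_spread:.4f}")
print(f"CV minus LB: {gap:.4f}")
```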

5 Dec 2024, 02:37
Upvotes 3
Yisakberhanu
wachemo university

What a gap! I looked at the data for a long time and couldn't find any data leakage.

Koleshjr
Multimedia university of kenya

I think this is expected, no?

The test set contains data from a region (Ghana) that is largely absent from the training set (Kenya). This introduces a domain shift, as the distributions of features (e.g., customer behaviors, preferences, or market conditions) might differ significantly between the two regions. So the model has to generalize across diverse regions

5 Dec 2024, 06:38
Upvotes 3
Yisakberhanu
wachemo university

81% of the test set is from Kenya, so we should expect good correlation between CV and LB. Otherwise, either there is some data leakage in our CV, or the public test data is so small that our models so far do not predict it well.
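A per-country share like this can be reproduced with a simple count. The sketch below is illustrative only: value_shares is a hypothetical helper, the list of countries is made up, and the name of the country column in the real files (for example country_id) is an assumption. With pandas, test[column].value_counts(normalize=True) gives the same kind of result.

```python
from collections import Counter

def value_shares(values):
    """Return each distinct value's share of the total number of rows."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Made-up stand-in for the country column of the test set.
test_countries = ["Kenya"] * 4 + ["Ghana"]

shares = value_shares(test_countries)
for country, share in sorted(shares.items()):
    print(f"{country}: {share:.0%}")
```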

Koleshjr
Multimedia university of kenya

Oh, I see. I will correct my previous discussion, but the fact that we have a completely different customer base from a different region makes CV/LB correlation a bit difficult, don't you think?

Yisakberhanu
wachemo university

Somewhat, but you can see there is a gap of more than 20%.

Koleshjr
Multimedia university of kenya

fair

More than 20% gap because the effect is not linear.

253952 - in both sets as a customer, no?

6 Dec 2024, 03:10
Upvotes 0
Koleshjr
Multimedia university of kenya

I noticed that customers can appear in both datasets, which aligns with the concept of repeat customers. This suggests that the same customer can take out multiple loans at different times no?

that ... AND... that there's data for a customer split into both train and test, no?

Koleshjr
Multimedia university of kenya

But the customer in itself isn't the unique feature here, right? The unique feature is a combination of customer id, loan id and lender id, if I'm not wrong.

so the bad thing is if both sets share this combination ?

AJoel
Zindi

@Koleshjr, you are correct: a customer id is not a unique feature to identify a record with. Rather, it is a combination of customer id, loan id and lender id.

Even with that, there is data leakage. To be precise, 920 samples. Not a significant amount, but it is definitely there.

AI_Maven
University of Benin

So I'm trying to understand how there is data leakage. I believe it was stated that a customer id can appear multiple times, but with different lender ids.

I used the code below to check if a unique combination of customer id, loan id and lender id appears in both datasets

What do you guys think?

# Combine customer_id, tbl_loan_id, and lender_id to create a unique identifier for each entry.
# Assumes train and test are pandas DataFrames that have already been loaded.
train_set = set(zip(train['customer_id'], train['tbl_loan_id'], train['lender_id']))
test_set = set(zip(test['customer_id'], test['tbl_loan_id'], test['lender_id']))

# Entries whose full (customer, loan, lender) key appears in both datasets
common_entries = train_set.intersection(test_set)

if common_entries:
    print(f"Found {len(common_entries)} common entries between train and test datasets:")
    for entry in common_entries:
        print(entry)
else:
    print("No common entries found between train and test datasets.")

All the ids are unique; the leakage is caused by a logic error.

What's that @da_? Logic error?

That's my way of describing not considering all possibilities. In this case, all loans appear to be syndicated. Therefore, if you know the outcome of one part of the loan in the training set, you can infer the outcome for the other part in the test set.

In reality, a customer will take out one loan at a time, and it may or may not be spread across different lenders. I assume the default risk is computed on a per-loan basis and not per lender. I would also assume repayment is made towards the entire loan and not to a specific lender. If that's the case, then for any given loan, the chance of default is the same across lenders. The current train/test split logic did not consider this, which results in some customers having the same loan id in both train and test.

There aren't too many of them to ruin the dataset, so I'm not too worried about it.
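The point about loans split across lenders can be checked by dropping lender_id from the key and intersecting on (customer_id, tbl_loan_id) only. The sketch below assumes the column names used in the check posted earlier in the thread; key_set is a hypothetical helper and the two small tables are made up for illustration. In the toy data, the full three-column key shows no overlap while the loan-level key does.

```python
def key_set(table, columns):
    """Build the set of row keys formed from the given columns."""
    return set(zip(*(table[c] for c in columns)))

# Made-up example: loan 10 of customer 1 is split across lenders 7 and 8.
train = {"customer_id": [1, 2], "tbl_loan_id": [10, 20], "lender_id": [7, 7]}
test = {"customer_id": [1, 3], "tbl_loan_id": [10, 30], "lender_id": [8, 7]}

full_key = ["customer_id", "tbl_loan_id", "lender_id"]
loan_key = ["customer_id", "tbl_loan_id"]

full_overlap = key_set(train, full_key) & key_set(test, full_key)
loan_overlap = key_set(train, loan_key) & key_set(test, loan_key)

print(f"Overlap on the full key: {len(full_overlap)}")
print(f"Overlap on the loan-level key: {len(loan_overlap)}")
```

If the loan-level overlap is non-empty on the real files, one option is to group local CV folds by loan rather than by row, so that parts of the same loan never land on both sides of a fold.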

I need help.

I'm getting 88 on my cv but on the lb it keeps evaluating to 4

6 Dec 2024, 04:54
Upvotes 0
Yisakberhanu
wachemo university

Are you sure your pipeline is free from data leakage, especially if you are using target encoding?
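For context on why target encoding can leak: if a category's mean target is computed over all rows, each row's own label ends up inside its feature. A common remedy is out-of-fold encoding, where each row is encoded using only the labels from other folds. The sketch below is a minimal pure-Python illustration under stated assumptions: oof_target_encode is a hypothetical helper, the folds are assigned round-robin for simplicity, and the toy categories and targets are made up.

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=5):
    """Encode each row with its category's mean target from the other folds.

    Assumes n_folds >= 2. Rows whose category is unseen outside their own
    fold fall back to the overall out-of-fold mean.
    """
    n = len(categories)
    fold_of = [i % n_folds for i in range(n)]  # simple round-robin folds
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total_sum, total_count = 0.0, 0
        # Accumulate statistics from rows outside the current fold only.
        for i in range(n):
            if fold_of[i] != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total_sum += targets[i]
                total_count += 1
        fallback = total_sum / total_count if total_count else 0.0
        # Encode rows inside the current fold with those statistics.
        for i in range(n):
            if fold_of[i] == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded

# Made-up toy data: the first row has target 1 but never sees its own label.
encoded = oof_target_encode(["a", "a", "b", "b"], [1, 0, 1, 1], n_folds=2)
print(encoded)
```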

crossentropy
Federal university of Technology, Akure

CV .92 Lb .65

6 Dec 2024, 09:05
Upvotes 0
Yisakberhanu
wachemo university

Which model are you working on?

Juliuss
Freelance

How is your current CV-LB looking now, @Yisakberhanu? Your current LB is super cool: 0.74+. I can't break 0.72 CV / 0.69 LB.

26 Dec 2024, 10:41
Upvotes 1
Yisakberhanu
wachemo university

I am still trying to find stability. It is so shocking.