# Fraction of the loan funded by this lender, and that lender's share of the repayment
df['calc_Lender_portion_Funded'] = df['Amount_Funded_By_Lender'] / df['Total_Amount']
df['calc_Lender_portion_to_be_repaid'] = df['calc_Lender_portion_Funded'] * df['Total_Amount_to_Repay']
   calc_Lender_portion_to_be_repaid   Lender_portion_to_be_repaid
0                            120.85                        121.00
1                           7793.70                       7794.00
2                           1428.40                       1428.00
Why did you round the values in the 'Lender_portion_to_be_repaid' column?
You can replace it with the actual values if you feel that will give a better score. I also realized that.
Replacing them with the actual values makes things worse. I wonder what I'm doing wrong; I can't even break past 0.69, and God knows how much I've tried!
Hmm, let me help you out.
You can use hyperparameter tuning to reach 70, but in my experience hyperparameter tuning alone won't take you past 70 to even 71. Let me give you some parameters that can get you to 70 with LightGBM:
best_params1 = {
    'booster': 'lightgbm',  # from the tuning setup; not a native LightGBM constructor argument
    'n_estimators': 500,
    'max_depth': 8,
    'learning_rate': 0.06487257646412693,
    'num_leaves': 60,
    'feature_fraction': 0.673436396881704,
    'bagging_fraction': 0.987922773302477,
    'lambda_l1': 0.21968694469084882,
    'lambda_l2': 0.9887865080734871,
}
And these are the features:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Combine datasets for consistent feature engineering
data = pd.concat([train, test]).reset_index(drop=True)

# Convert date columns to datetime and extract temporal features
date_cols = ['disbursement_date', 'due_date']
for col in date_cols:
    data[col] = pd.to_datetime(data[col], errors='coerce')
    # Extract month, day, year
    data[col + '_month'] = data[col].dt.month
    data[col + '_day'] = data[col].dt.day
    data[col + '_year'] = data[col].dt.year

# Calculate loan term and weekday features
data['loan_term_days'] = (data['due_date'] - data['disbursement_date']).dt.days
data['disbursement_weekday'] = data['disbursement_date'].dt.weekday
data['due_weekday'] = data['due_date'].dt.weekday

# Create financial ratios and transformations
data['repayment_ratio'] = data['Total_Amount_to_Repay'] / data['Total_Amount']
data['log_Total_Amount'] = np.log1p(data['Total_Amount'])

# Handle categorical variables
cat_cols = data.select_dtypes(include='object').columns

# One-hot encoding for loan type
data = pd.get_dummies(data, columns=['loan_type'], prefix='loan_type', drop_first=False)
loan_type_cols = [col for col in data.columns if col.startswith('loan_type_')]
data[loan_type_cols] = data[loan_type_cols].astype(int)

# Label encoding for the other categorical columns
# (fit on the combined data so train and test share the same codes)
for col in [c for c in cat_cols if c not in ['loan_type', 'ID']]:
    data[col] = LabelEncoder().fit_transform(data[col])
# Split back into train and test
train_df = data[data['ID'].isin(train['ID'].unique())]
test_df = data[data['ID'].isin(test['ID'].unique())]
# Define features for modeling
features_for_modelling = [col for col in train_df.columns if col not in date_cols + ['ID', 'target', 'country_id', 'customer_id', 'lender_id' ]]
print(f"The shape of train_df is: {train_df.shape}")
print(f"The shape of test_df is: {test_df.shape}")
print(f"The shape of train is: {train.shape}")
print(f"The shape of test is: {test.shape}")
print(f"The features for modelling are:\n{features_for_modelling}")
I hope this helps.
This is so insightful! Let me try this out and see.
Sure, good luck. I'm still searching for ways to push above 71; I got stuck.
Good luck to you too, man. I'm sure the top guys are doing crazy POST PROCESSING, no doubt about it! You can cautiously try that as well if you have any ideas. How does your CV correlate with the LB?
89 CV. But wait, how would you do post-processing here?
Threshold tuning, ensembling the outputs of many ML models, maybe some ideas around the Ghana predictions...?
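For instance, threshold tuning on a validation split could look like this (just a sketch; val_probs and y_val come from whatever holdout you use, and F1 is my assumed metric):

import numpy as np
from sklearn.metrics import f1_score

# Sweep candidate cutoffs and keep the one that maximizes validation F1
thresholds = np.linspace(0.1, 0.9, 81)
scores = [f1_score(y_val, (val_probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"Best threshold: {best_t:.2f}, F1: {max(scores):.4f}")

# Apply the tuned cutoff to the test predictions instead of the default 0.5
test_preds = (model.predict_proba(test_df[features_for_modelling])[:, 1] >= best_t).astype(int)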
Oh okay. Interesting. I'll try that
Oh sorry, I didn't see the last part. Did you say you want some ideas around the Ghanaian predictions?
No, I meant that they could be trying out some tricks on the Ghana predictions, since those have a different distribution than the train data. Do you believe something can be done there?
Yeah you have a point. I will try that.
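One way to quantify that shift before acting on it (my suggestion, not something from this thread) is adversarial validation: train a classifier to separate train rows from test rows, and if it does much better than chance, the distributions really do differ:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Label train rows 0 and test rows 1, then measure how separable they are
adv_X = pd.concat([train_df[features_for_modelling], test_df[features_for_modelling]])
adv_y = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]

auc = cross_val_score(lgb.LGBMClassifier(random_state=42), adv_X, adv_y,
                      cv=3, scoring='roc_auc').mean()
print(f"Adversarial AUC: {auc:.3f}")  # ~0.5 means similar distributions; near 1.0 means a strong shift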
How do you decide when to use hyperparameters like these, grid search?

'learning_rate': 0.06487257646412693,
'num_leaves': 60,
'feature_fraction': 0.673436396881704,
'bagging_fraction': 0.987922773302477,
'lambda_l1': 0.21968694469084882,
'lambda_l2': 0.9887865080734871
To be frank, you'll have to play with it for a while. lambda_l1 and lambda_l2 are regularization terms that help prevent overfitting.
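For what it's worth, long decimals like 0.06487257646412693 usually come out of a random or Bayesian search (e.g. Optuna) rather than a plain grid. Here is a minimal sketch of that kind of search; the ranges, the F1 scoring, and the X/y names are my assumptions:

import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Sample one candidate configuration per trial
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 200, 1000),
        'max_depth': trial.suggest_int('max_depth', 4, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 20, 120),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.5, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.5, 1.0),
        'bagging_freq': 1,  # bagging_fraction only takes effect when bagging_freq > 0
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-3, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-3, 10.0, log=True),
    }
    model = lgb.LGBMClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=3, scoring='f1').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)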
In my experience LightGBM also just performs better than XGBoost here.