
Xente Fraud Detection Challenge

Helping Uganda
$4,500 USD
Completed (over 6 years ago)
Classification
2031 joined · 545 active
Start: May 20, 2019
Close: Sep 22, 2019
Reveal: Sep 23, 2019
Mugisha_
Starter_code[python]
Notebooks · 21 Sep 2019, 14:37 · 5

Here is Python starter code for those who may want to explore further. Not much is done in it, but it scores LB [0.76363..] using XGBoost (no sampling techniques used, only CV).

Be mindful to change the paths that feed the data into the notebook in case you're running your own version of it.

https://github.com/steph-en-m/competitive-programming/tree/master/fraud_detection

Happy learning, and I'm also open to feedback via GitHub issues on the repo.

Discussion · 5 answers

Nice one bro.

I've been stuck at 70% using logistic regression for a long time.

I am open to forming a team with you, if you don't mind.

Email: bammy2050@gmail.

21 Sep 2019, 14:43
Upvotes 0
Mugisha_

I wouldn't mind but teams are unfortunately not enabled for this particular competition.

Thanks for sharing.

Just noticed this: in a couple of places you use the LabelEncoder,

#label encoding columns
columns = train.columns.tolist()[1:11]
test_columns = test.columns.tolist()[1:11]

le = LabelEncoder()
for each in columns:
    train[each] = le.fit_transform(train[each])

for column in test_columns:
    test[column] = le.fit_transform(test[column])

Here you fit and transform the train and test sets separately, which is not a good idea. You want to learn the label encodings only from the train data and apply them to the new test data (using the "transform" operation). A few categorical fields have different levels in the train and test data, so you will need code to handle new levels in the test set; one way is regrouping levels occurring in < 1% of rows into a new category. An additional point with the label encoder: some features have several hundred levels, so you also add sparsity.
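The fit-on-train, transform-on-test pattern described above can be sketched as follows; the data frames, column name, and `encode_column` helper are all hypothetical stand-ins for the competition data, and the 1% rare-level threshold is the one the comment suggests:

```python
import pandas as pd

# Toy frames standing in for the competition data (hypothetical values).
train = pd.DataFrame({"ChannelId": ["web", "app", "web", "ussd", "web", "app"]})
test = pd.DataFrame({"ChannelId": ["app", "pos", "web"]})  # "pos" never seen in train

def encode_column(train_col, test_col, rare_frac=0.01):
    # Regroup levels rarer than rare_frac of train rows into "__rare__",
    # and send test levels never seen in train to the same bucket.
    freq = train_col.value_counts(normalize=True)
    keep = set(freq[freq >= rare_frac].index)
    train_grouped = train_col.where(train_col.isin(keep), "__rare__")
    test_grouped = test_col.where(test_col.isin(keep), "__rare__")
    # Build the integer mapping from train only, then apply it to both sets.
    mapping = {level: code for code, level in enumerate(sorted(train_grouped.unique()))}
    if "__rare__" not in mapping:  # unseen test levels still need a code
        mapping["__rare__"] = len(mapping)
    return train_grouped.map(mapping), test_grouped.map(mapping)

train_enc, test_enc = encode_column(train["ChannelId"], test["ChannelId"])
```

The key point is that `mapping` is built from the train column alone, so the test set cannot influence the encoding, and any unseen test level falls into the `__rare__` bucket instead of crashing a `transform` call.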

The same applies to combining the train and test data to generate dummies or do other feature engineering (in real life you will never have access to the test data for this step): it introduces leakage, as the model learns from data it's not supposed to see. Unfortunately, combining train and test data is quite common practice in Kaggle and other competitions, but it's not one you would use in real life.

len_train = len(train)
new_df = pd.concat([train, test], sort=False)

#getting categorical dummies
categorical_columns = ["ProviderId", "ProductCategory", "ProductId", "ChannelId"]
new_df = pd.get_dummies(new_df, columns=categorical_columns)
new_df.head()
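A leak-free alternative to the concat-then-dummies snippet above is to build the dummy columns from train only and then align the test frame to them; this is a sketch with toy data and hypothetical levels, not the notebook's actual frames:

```python
import pandas as pd

# Toy stand-ins; the "pos" level appears only in test.
train = pd.DataFrame({"ChannelId": ["web", "app", "web"], "Amount": [10, 5, 8]})
test = pd.DataFrame({"ChannelId": ["app", "pos"], "Amount": [7, 3]})

categorical_columns = ["ChannelId"]
train_d = pd.get_dummies(train, columns=categorical_columns)
test_d = pd.get_dummies(test, columns=categorical_columns)
# Reindex test to the train columns: unseen test levels ("pos") are dropped,
# and train levels missing from test are filled with 0 -- no information
# flows from test into the feature space the model sees.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
```

After the `reindex`, both frames share the same columns in the same order, so the model is trained and scored on a feature space defined entirely by the train data.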

Mugisha_

Thanks a lot for noting that, @kpaillard. I greatly appreciate your feedback and will take it into consideration.

Mugisha_

I've added my final solution notebook (0.75555 on the private LB, a 0.12 gap from the top-ranked solution) to the GitHub repo, so feel free to check it out. Be sure to go through the README:

https://github.com/steph-en-m/competitive-programming/tree/develop/fraud_detection