
Xente Fraud Detection Challenge

Helping Uganda
$4,500 USD
Completed (over 6 years ago)
Classification
2031 joined · 545 active
Start: May 20, 2019
Close: Sep 22, 2019
Reveal: Sep 23, 2019
Mugisha_
Starter_code[python]
Notebooks · 21 Sep 2019, 14:37 · 5

Here is Python starter code for those who may want to explore further. Not much is done in it, but it scores LB [0.76363..] using XGBoost (no sampling techniques used, only CV).

Be mindful to change the paths that feed the data into the notebook in case you're running your own version of it.

https://github.com/steph-en-m/competitive-programming/tree/master/fraud_detection

Happy learning, and I'm also open to feedback via GitHub issues on the repo.

Discussion · 5 answers

Nice one bro.

I've been stuck at 70% using logistic regression for a long time.

I am open to forming a team with you, if you don't mind.

Email: bammy2050@gmail.

21 Sep 2019, 14:43
Upvotes 0
Mugisha_

I wouldn't mind but teams are unfortunately not enabled for this particular competition.

Thanks for sharing.

Just noticed this: in a couple of places you use the LabelEncoder,

#label encoding columns
columns = train.columns.tolist()[1:11]
test_columns = test.columns.tolist()[1:11]

le = LabelEncoder()
for each in columns:
    train[each] = le.fit_transform(train[each])

for column in test_columns:
    test[column] = le.fit_transform(test[column])

Here you fit and transform the train and test sets separately, which is not a good idea. You want to learn the label encodings only from the train data and apply them to the new test data (using the "transform" operation). A few categorical fields have different levels in the train and test data, so you will need code to handle new levels in the test set; one way is regrouping levels occurring in < 1% of rows into a new category. An additional point with the label encoder: some features have several hundred levels, so you also add sparsity.
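The fit-on-train, transform-on-test pattern described above can be sketched as follows; the data frames, column name, and `encode_column` helper are all hypothetical stand-ins for the competition data, and the 1% rare-level threshold is the one the comment suggests:

```python
import pandas as pd

# Toy frames standing in for the competition data (hypothetical values).
train = pd.DataFrame({"ChannelId": ["web", "app", "web", "ussd", "web", "app"]})
test = pd.DataFrame({"ChannelId": ["app", "pos", "web"]})  # "pos" never seen in train

def encode_column(train_col, test_col, rare_frac=0.01):
    # Regroup levels rarer than rare_frac of train rows into "__rare__",
    # and send test levels never seen in train to the same bucket.
    freq = train_col.value_counts(normalize=True)
    keep = set(freq[freq >= rare_frac].index)
    train_grouped = train_col.where(train_col.isin(keep), "__rare__")
    test_grouped = test_col.where(test_col.isin(keep), "__rare__")
    # Build the integer mapping from train only, then apply it to both sets.
    mapping = {level: code for code, level in enumerate(sorted(train_grouped.unique()))}
    if "__rare__" not in mapping:  # unseen test levels still need a code
        mapping["__rare__"] = len(mapping)
    return train_grouped.map(mapping), test_grouped.map(mapping)

train_enc, test_enc = encode_column(train["ChannelId"], test["ChannelId"])
```

The key point is that `mapping` is built from the train column alone, so the test set cannot influence the encoding, and any unseen test level falls into the `__rare__` bucket instead of crashing a `transform` call.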

The same applies to combining the train and test data to generate dummies or do other feature engineering (in real life you will never have access to the test data for this step): it introduces leakage, as the model learns from data it's not supposed to see. Unfortunately, combining train and test data is quite common practice in Kaggle and other competitions, but it's not one you would use in real life.

len_train = len(train)
new_df = pd.concat([train, test], sort=False)

#getting categorical dummies
categorical_columns = ["ProviderId", "ProductCategory", "ProductId", "ChannelId"]
new_df = pd.get_dummies(new_df, columns=categorical_columns)
new_df.head()
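A leak-free alternative to the concat-then-dummies snippet above is to build the dummy columns from train only and then align the test frame to them; this is a sketch with toy data and hypothetical levels, not the notebook's actual frames:

```python
import pandas as pd

# Toy stand-ins; the "pos" level appears only in test.
train = pd.DataFrame({"ChannelId": ["web", "app", "web"], "Amount": [10, 5, 8]})
test = pd.DataFrame({"ChannelId": ["app", "pos"], "Amount": [7, 3]})

categorical_columns = ["ChannelId"]
train_d = pd.get_dummies(train, columns=categorical_columns)
test_d = pd.get_dummies(test, columns=categorical_columns)
# Reindex test to the train columns: unseen test levels ("pos") are dropped,
# and train levels missing from test are filled with 0 -- no information
# flows from test into the feature space the model sees.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
```

After the `reindex`, both frames share the same columns in the same order, so the model is trained and scored on a feature space defined entirely by the train data.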

Mugisha_

Thanks a lot for noting that, @kpaillard. I greatly appreciate your feedback and will take it into consideration.

Mugisha_

I've added my final solution notebook (0.75555 on the private LB, a 0.12 gap from the top-ranked solution) to the GitHub repo, so feel free to check it out. Be sure to go through the README:

https://github.com/steph-en-m/competitive-programming/tree/develop/fraud_detection