Data Science Nigeria Challenge #1: Loan Default Prediction
Knowledge
A learning competition designed for DSN Bootcamp 2018
746 data scientists enrolled, 146 on the leaderboard
Financial ServicesPredictionStructured
11 October 2018

There are 3 different datasets for both train and test.

Note that the sample submission has 2 outcomes- good (1) or bad (0).

a) Demographic data (traindemographics.csv)

  • customerid (Primary key used to merge to other data)
  • birthdate (date of birth of the customer)
  • bank_account_type (type of primary bank account)
  • longitude_gps
  • latitude_gps
  • bank_name_clients (name of the bank)
  • bank_branch_clients (location of the branch - not compulsory - so missing in a lot of the cases)
  • employment_status_clients (type of employment that customer has)
  • level_of_education_clients (highest level of education)

b) Performance data (trainperf.csv) : This is the repeat loan that the customer has taken for which we need to predict the performance of. Basically, we need to predict whether this loan would default given all previous loans and demographics of a customer.

  • customerid (Primary key used to merge to other data)
  • systemloanid (The id associated with the particular loan. The same customerId can have multiple systemloanid’s for each loan he/she has taken out)
  • loannumber (The number of the loan that you have to predict)
  • approveddate (Date that loan was approved)
  • creationdate (Date that loan application was created)
  • loanamount (Loan value taken)
  • totaldue (Total repayment required to settle the loan - this is the capital loan value disbursed +interest and fees)
  • termdays (Term of loan)
  • referredby (customerId of the customer that referred this person - is missing, then not referred)
  • good_bad_flag (good = settled loan on time; bad = did not settled loan on time) - this is the target variable that we need to predict

c) Previous loans data (trainprevloans.csv) : This dataset contains all previous loans that the customer had prior to the loan above that we want to predict the performance of. Each loan will have a different systemloanid, but the same customerid for each customer.

  • customerid (Primary key used to merge to other data)
  • systemloanid (The id associated with the particular loan. The same customerId can have multiple systemloanid’s for each loan he/she has taken out)
  • loannumber (The number of the loan that you have to predict)
  • approveddate (Date that loan was approved)
  • creationdate (Date that loan application was created)
  • loanamount (Date that loan application was created)
  • totaldue (Total repayment required to settle the loan - this is the capital loan value disbursed +interest and fees) termdays (Term of loan)
  • closeddate (Date that the loan was settled)
  • referredby (customerId of the customer that referred this person - is missing, then not refrerred)
  • firstduedate (Date of first payment due in cases where the term is longer than 30 days. So in the case where the term is 60+ days - then there are multiple monthly payments due - and this dates reflects the date of the first payment)
  • firstrepaiddate (Actual date that he/she paid the first payment as defined above)