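1. What is Lending Club?
2. Decision trees and Random Forest Classification
3. Data used
4. Exploratory data analysis
5. Setting up Data for the Model
6. Build the Model and evaluate the performance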
1. What is Lending Club?
Lending Club is a peer-to-peer lending company headquartered in San Francisco.
The company connects people who need money (Borrowers) with people who have money (Investors) through its online marketplace. Investors, who are looking for a solid return on their investment, purchase Notes, which are fractions of loans. Borrowers, who need loans for reasons such as consolidating debt, improving a home or making a major purchase, can apply by creating an account on Lendingclub.com and submitting a loan application that states the amount they need. Lending Club screens the borrowers, facilitates the transaction and services the loan. Borrowers repay the loan by making monthly payments to Lending Club.
2. Decision trees and Random Forest Classification:
Decision trees and random forest classifiers help us assign observations to categories, for example whether a customer has made a purchase (Yes or No) or the gender of a person (Male or Female). In this project we will try to predict whether or not a borrower repays the loan. The goal of a decision tree model is to predict the value of the target variable from several input variables. Put simply, a decision tree algorithm asks a question about one attribute of an observation. Each answer leads to a follow-up question, until the tree reaches a conclusion about the class of that observation.
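The question-and-answer process can be sketched as a small hand-written tree. This is only an illustration: the function name `predict_repaid` and the thresholds (a FICO score of 700, two inquiries, zero delinquencies) are made up for the example and are not learned from the Lending Club data. A real decision tree chooses its questions and thresholds from the training data.

```python
def predict_repaid(borrower):
    """Walk a tiny hand-written decision tree and return a class label."""
    # First question: is the FICO score at least 700? (made-up threshold)
    if borrower["fico"] >= 700:
        # Follow-up question for high-FICO borrowers: few recent inquiries?
        if borrower["inq.last.6mths"] <= 2:
            return "repaid"
        return "not repaid"
    # Follow-up question for low-FICO borrowers: any recent delinquencies?
    if borrower["delinq.2yrs"] == 0:
        return "repaid"
    return "not repaid"


# Hypothetical applicant, not a real record.
applicant = {"fico": 720, "inq.last.6mths": 1, "delinq.2yrs": 0}
print(predict_repaid(applicant))  # prints "repaid"
```

Each `if` is one question, and each `return` is a leaf where the tree reaches its conclusion.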
Random forest classification is an ensemble algorithm that combines many decision trees. Instead of fitting a single decision tree, we fit many trees, and together they form a random forest. We start by drawing a random sample of observations from the training data. Then we build a decision tree on that sample (in the standard algorithm, each split also considers only a random subset of the input variables). Next, we choose the number of trees (Ntrees) we want to build and repeat the previous steps for each tree. For any new observation, each of the Ntrees trees votes for a category, and the observation is assigned to the category that wins the majority of votes.
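To make the sample, build, repeat and vote steps concrete, here is a minimal pure-Python sketch. It is a simplification under stated assumptions: each "tree" is a one-question stump that splits a single randomly chosen feature at its mean, the helper names (`fit_stump`, `fit_forest`, `forest_predict`) are hypothetical, and the six rows are made-up toy values rather than Lending Club records. A library implementation grows full trees and is what we would use in practice.

```python
import random
from collections import Counter


def majority(labels, default):
    """Return the most common label, or `default` if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(rows, feature):
    """Fit a one-question tree: split `feature` at its mean value."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    overall = majority([r["label"] for r in rows], None)
    left = majority([r["label"] for r in rows if r[feature] <= threshold], overall)
    right = majority([r["label"] for r in rows if r[feature] > threshold], overall)
    return feature, threshold, left, right


def stump_predict(stump, row):
    feature, threshold, left, right = stump
    return left if row[feature] <= threshold else right


def fit_forest(rows, features, n_trees, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # random sample, with replacement
        feature = rng.choice(features)             # random feature for this tree
        forest.append(fit_stump(sample, feature))
    return forest


def forest_predict(forest, row):
    """Each tree votes; the majority category wins."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up toy rows: label 1 = repaid, 0 = not repaid.
rows = [
    {"fico": 640, "int.rate": 0.16, "label": 0},
    {"fico": 660, "int.rate": 0.15, "label": 0},
    {"fico": 680, "int.rate": 0.14, "label": 0},
    {"fico": 720, "int.rate": 0.11, "label": 1},
    {"fico": 740, "int.rate": 0.10, "label": 1},
    {"fico": 760, "int.rate": 0.09, "label": 1},
]

forest = fit_forest(rows, ["fico", "int.rate"], n_trees=5)
prediction = forest_predict(forest, {"fico": 750, "int.rate": 0.095})
print(prediction)
```

Using an odd number of trees avoids tied votes in a two-class problem.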
3. Data used:
We will use lending data from 2007-2010 to classify and predict whether or not the borrower paid back their loan in full.
The data is publicly available on lendingclub.com. Here is what the columns represent:
a. credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
b. purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
c. int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
d. installment: The monthly installments owed by the borrower if the loan is funded.
e. log.annual.inc: The natural log of the self-reported annual income of the borrower.
f. dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
g. fico: The FICO credit score of the borrower.
h. days.with.cr.line: The number of days the borrower has had a credit line.
i. revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
j. revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
k. inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
l. delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
m. pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
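Two of these columns store transformed values, which is easy to forget when reading the data. The short sketch below shows the encoding described for int.rate and log.annual.inc; the income and rate values are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

annual_income = 60000    # hypothetical self-reported income
interest_percent = 11    # hypothetical interest rate of 11%

# int.rate is stored as a proportion, so 11% becomes 0.11.
int_rate = interest_percent / 100

# log.annual.inc is the natural log of annual income.
log_annual_inc = math.log(annual_income)

# Exponentiating the stored value recovers the original income.
recovered_income = math.exp(log_annual_inc)

print(int_rate, round(log_annual_inc, 3), round(recovered_income))
```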
4. Exploratory data analysis:
After performing some exploratory data analysis, we observed that:
a. People with a low FICO score tend not to meet the credit underwriting criteria of Lending Club.
b. The majority of borrowers are still in the process of repaying their loans.
c. Debt consolidation is a popular reason for taking out a loan.
d. As the FICO score increases, the borrower's credit is better and the interest rate on the loan is lower.
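Checks of this kind can be done with a few pandas calls. The sketch below is not the analysis itself: the `loans` DataFrame holds six made-up rows that were chosen to mimic the patterns listed above, so it only demonstrates the calls, not the findings. On the real data the same calls would be run on the loaded file.

```python
import pandas as pd

# Made-up toy rows, NOT real Lending Club records.
loans = pd.DataFrame({
    "credit.policy": [1, 1, 1, 0, 0, 1],
    "purpose": ["debt_consolidation", "credit_card", "debt_consolidation",
                "all_other", "debt_consolidation", "educational"],
    "int.rate": [0.10, 0.11, 0.09, 0.16, 0.15, 0.12],
    "fico": [740, 720, 760, 640, 660, 700],
})

# Average FICO score for borrowers who do (1) and do not (0) meet the policy.
mean_fico_by_policy = loans.groupby("credit.policy")["fico"].mean()

# How often each loan purpose occurs.
purpose_counts = loans["purpose"].value_counts()

# Correlation between FICO score and interest rate (negative means
# higher scores go with lower rates).
fico_rate_corr = loans["fico"].corr(loans["int.rate"])

print(mean_fico_by_policy)
print(purpose_counts)
print(fico_rate_corr)
```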
5. Setting up Data for the Model:
As there are some categorical features in the data (the purpose column), we will use pandas' ability to create dummy variables so that scikit-learn can work with them. We will then use scikit-learn to split the data into a train set and a test set. We build the model on the train set and evaluate its performance on the test set. A common choice for this split is 70:30.
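A minimal sketch of these two steps follows, using `pd.get_dummies` and `train_test_split`. Several things here are assumptions: the ten rows are made-up toy values, the target column name `not.fully.paid` does not appear in the column list above and is used only for illustration, and `drop_first=True` and `random_state=101` are choices made for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy rows, NOT real Lending Club records.
loans = pd.DataFrame({
    "purpose": ["debt_consolidation", "credit_card", "educational",
                "debt_consolidation", "all_other", "credit_card",
                "small_business", "debt_consolidation", "major_purchase",
                "all_other"],
    "fico": [740, 720, 690, 660, 640, 700, 680, 750, 710, 670],
    "int.rate": [0.10, 0.11, 0.13, 0.15, 0.16, 0.12, 0.14, 0.09, 0.11, 0.15],
    "not.fully.paid": [0, 0, 1, 1, 1, 0, 1, 0, 0, 1],  # assumed target name
})

# Replace the categorical column with 0/1 dummy columns.
# drop_first=True drops one category, which the others imply.
final_data = pd.get_dummies(loans, columns=["purpose"], drop_first=True)

X = final_data.drop("not.fully.paid", axis=1)
y = final_data["not.fully.paid"]

# 70:30 split; random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

print(X_train.shape, X_test.shape)
```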
6. Build the Model and evaluate the performance:
We first build the decision tree model using DecisionTreeClassifier, imported from sklearn.tree. We fit the classifier on the train data and predict the results for the test data. We then try an ensemble method, the random forest classifier, to get better prediction accuracy. Ensemble methods combine several models to produce a better prediction than a single model usually gives. As we have seen above, random forest classification uses multiple decision trees to improve prediction accuracy.
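The fit, predict and evaluate steps look roughly like the sketch below. To keep it self-contained it uses synthetic data from `make_classification` as a stand-in for the loan data, so the accuracies it prints say nothing about the Lending Club results. The settings `n_estimators=100` and `random_state=0` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared loan features and target.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Single decision tree: fit on train data, predict on test data.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
tree_pred = tree.predict(X_test)

# Random forest: the same steps, with many trees voting.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

tree_acc = accuracy_score(y_test, tree_pred)
forest_acc = accuracy_score(y_test, forest_pred)
print("decision tree accuracy:", tree_acc)
print("random forest accuracy:", forest_acc)
```

Accuracy is only one measure. If the classes are imbalanced, `classification_report` and `confusion_matrix` from `sklearn.metrics` give a fuller picture of how each class is predicted.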