Loan Data Analysis and Visualization using Lending Club Data

LendingClub Corp (“LC”) is the first and largest online peer-to-peer (“P2P”) platform facilitating the lending and borrowing of unsecured loans ranging from $1,000 to $35,000. Aiming to offer lower transaction costs than other financial intermediaries, LendingClub staged the largest IPO in the tech sector in 2014.

This project analyzes the personal loan payment dataset of LendingClub, available on Kaggle.com, to better understand the best borrower profile for investors.

The dataset covers extensive borrower-side information that was originally available to lenders when they made their investment choices. By segmenting the loan dataset into finished cases and currently outstanding loans, this project breaks down the composition of the default cases and examines the correlation among indicators. The goal is to provide investors and borrowers, as well as LendingClub, with additional insights regarding investment opportunities and contingent loan collection advice. (Please note that, for visual clarity and simpler diagrams, this project re-coded some categories with few or no observations.)

II. Data Analysis

II.1 Interest Rate Vs. Number of Approved Cases

Figure 2. Time Series Plot of Approved Loans Count

We can regard the interest rate charged at loan issuance as the cost that borrowers incur, and the number of approved cases as an indicator of demand. By rough eyeballing, the time series of the average interest rate and of the number of approved loans track each other quite closely. The exceptions are the plunge in interest rates in late 2007, attributed to a venture capital injection, and the fluctuations in the number of approved cases around 2015 (Figure 2), attributed to managerial scandals.

Figure 3. Scatterplot of Interest Rate and Approved Loan Counts

Therefore, it comes as no surprise that a scatter plot of interest rates against the number of approved cases over the period shows a positive relationship: all else being equal, increasing demand drives up prices.
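As a rough illustration of the comparison behind Figures 2 and 3, the sketch below groups loans by issue month, computes the approved count and the average interest rate per month, and then measures how the two series move together with a Pearson correlation. The records, the field names `issue_month` and `int_rate`, and the helper functions are hypothetical stand-ins, not the project's actual code or data.

```python
from collections import defaultdict
from math import sqrt

# Toy records standing in for the loan file; field names and values are
# illustrative assumptions, not taken from the real dataset.
loans = [
    {"issue_month": "2012-01", "int_rate": 11.0},
    {"issue_month": "2012-01", "int_rate": 13.0},
    {"issue_month": "2012-02", "int_rate": 12.0},
    {"issue_month": "2012-02", "int_rate": 13.0},
    {"issue_month": "2012-02", "int_rate": 14.0},
    {"issue_month": "2012-03", "int_rate": 13.0},
    {"issue_month": "2012-03", "int_rate": 14.0},
    {"issue_month": "2012-03", "int_rate": 15.0},
    {"issue_month": "2012-03", "int_rate": 16.0},
]


def monthly_summary(records):
    """Return {month: (approved_count, average_interest_rate)}."""
    rates = defaultdict(list)
    for rec in records:
        rates[rec["issue_month"]].append(rec["int_rate"])
    return {m: (len(r), sum(r) / len(r)) for m, r in sorted(rates.items())}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


summary = monthly_summary(loans)
counts = [count for count, _ in summary.values()]
avg_rates = [rate for _, rate in summary.values()]
r = pearson(counts, avg_rates)
print(summary)
print(f"correlation between monthly count and average rate: {r:.3f}")
```

A correlation close to 1 in this toy series only mirrors the positive relationship described above; it says nothing about the direction of causality.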

II.2 Sample Default Indicator Breakdown

This section briefly discusses two of the indicators as an example of the richness of the dataset: Home Ownership Types and the borrower's rating grade.

As can be inferred from Figure 4, the stack of counts under the 'Fully Paid' status is much higher than the one under 'Default'. Fortunately for LendingClub investors, then, most of them recovered their funds along with the agreed interest. The histogram also shows relatively more applicants with mortgaged or rented homes than applicants who own their homes outright.

Figure 5. Default Ratios on Borrower's Grade

Figure 6. Default Ratios on Borrower's Home Ownership

As can be seen from the graph above, there is no relationship between the type of home ownership and the default rate: on closer examination of the default ratio by home ownership type, the historical probabilities of default are almost identical. This means that applicants with different housing types are about equally likely to default. Rating grade, on the other hand, has a more direct relationship to default: the probability of default increases stepwise as we move down the borrowers' rating grades.
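The default ratios behind Figures 5 and 6 amount to a grouped proportion: for each category, the number of defaulted loans divided by the number of finished loans. The following is a minimal sketch under that assumption; the twelve toy loans, the category labels, and the function `default_ratio_by` are hypothetical and were chosen only to reproduce the qualitative pattern described above.

```python
from collections import Counter

# Toy finished loans as (grade, home_ownership, loan_status) tuples.
# Values are invented for illustration only.
finished_loans = [
    ("A", "MORTGAGE", "Fully Paid"), ("A", "RENT", "Fully Paid"),
    ("A", "OWN", "Fully Paid"), ("A", "MORTGAGE", "Default"),
    ("C", "MORTGAGE", "Fully Paid"), ("C", "RENT", "Default"),
    ("C", "OWN", "Default"), ("C", "RENT", "Fully Paid"),
    ("E", "MORTGAGE", "Default"), ("E", "RENT", "Default"),
    ("E", "OWN", "Default"), ("E", "OWN", "Fully Paid"),
]

FIELDS = {"grade": 0, "home_ownership": 1}


def default_ratio_by(records, field):
    """Share of loans with status 'Default' within each category of `field`."""
    idx = FIELDS[field]
    totals, defaults = Counter(), Counter()
    for rec in records:
        totals[rec[idx]] += 1
        if rec[2] == "Default":
            defaults[rec[idx]] += 1
    return {cat: defaults[cat] / totals[cat] for cat in sorted(totals)}


by_grade = default_ratio_by(finished_loans, "grade")
by_home = default_ratio_by(finished_loans, "home_ownership")
print(by_grade)  # ratio rises from grade A to grade E in this toy sample
print(by_home)   # ratio is the same for every ownership type here
```

Comparing ratios rather than raw counts matters here, because the ownership categories contain very different numbers of applicants.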

II.3 Interest Rate Vs. Default Rate

Figure 7. Average Interest Rate by Month

Figure 8. Spatial Plot for Average Interest Rate

Since interest rates are calculated from an applicant's profile, interest rate plots are a good indication of the quality of the applicant pool. As can be seen above, average observed interest rates differ by month, year, and geography. The lowest average interest rates occurred in July and November and the highest in June. Applicants from Idaho, Iowa, and Maine received much lower rates on average than those from Indiana and Tennessee.

Figure 9. Spatial Plot for Default Rate

Figure 10. Scatterplot for Default Rate and Interest Rate

Interestingly, the color shading for average default rate by state is pretty much the opposite of the shading for interest rate. Plotting the two together in a scatter plot with a fitted linear model (LM) line shows a clear positive relation, comparable to a risk premium that rises to compensate for risk.
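The fitted line in Figure 10 can be thought of as an ordinary least squares fit of the state-level default rate on the state-level average interest rate. The sketch below computes the slope and intercept by hand under that assumption; the four state-level points and their labels `S1` to `S4` are made up for illustration and do not correspond to real states.

```python
# Hypothetical state-level aggregates: (average interest rate in %, default rate).
state_points = {
    "S1": (11.0, 0.04),
    "S2": (12.0, 0.05),
    "S3": (13.0, 0.07),
    "S4": (14.0, 0.08),
}


def fit_line(points):
    """Ordinary least squares fit y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(list(state_points.values()))
print(f"default_rate ~ {intercept:.3f} + {slope:.3f} * avg_int_rate")
```

A positive slope is what the risk-premium reading predicts: states whose borrowers are charged more on average also default more often.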

II.4 An Example of Expected Loss Prediction

Last but not least, to demonstrate the predictive power of the dataset, this section applies logistic regression to estimate the expected loss, using the segment of loans whose status is listed as 'Current'.

The expected loss is defined by the following equation:

$$EL_i = \sum_{j \in S_i} p_j \,(A_j - R_j)$$

where $EL_i$ is the expected loss for state $i$, $S_i$ is the set of current loans in state $i$, $p_j$ is the probability of default of loan $j$, and $A_j - R_j$ is the payment gap, defined as the difference between the total amount of the loan $A_j$ and the amount already paid $R_j$ at a specific point in time.
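To make the definition concrete, the following sketch sums probability of default times payment gap over the current loans of each state. The state labels, the field names, and the numbers are hypothetical; in the project the probabilities would come from the fitted logistic regression rather than being typed in.

```python
from collections import defaultdict

# Toy 'Current' loans; every value here is an illustrative assumption.
current_loans = [
    {"state": "X", "p_default": 0.25, "loan_total": 10000.0, "paid": 4000.0},
    {"state": "X", "p_default": 0.50, "loan_total": 8000.0, "paid": 6000.0},
    {"state": "Y", "p_default": 0.125, "loan_total": 4000.0, "paid": 0.0},
]


def expected_loss_by_state(loans):
    """Sum of p_default * (loan_total - paid) over the loans of each state."""
    losses = defaultdict(float)
    for loan in loans:
        payment_gap = loan["loan_total"] - loan["paid"]
        losses[loan["state"]] += loan["p_default"] * payment_gap
    return dict(losses)


expected_loss = expected_loss_by_state(current_loans)
print(expected_loss)
```

Because the loss is a sum over loans, states with many large outstanding loans can rank high even when their individual default probabilities are moderate.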

The probability of default is obtained by a matrix transformation: the feature matrix is multiplied by the parameters estimated from a training set, and the result is mapped to a probability by the logistic function. The variables are annual income, funded amount, home ownership, borrower's grade, and installment amount. The logit probability cutoff is set at 0.7 for visualization purposes. The results, based on the model assumptions, show that California, Texas, New York, and Florida carry the heaviest risk of large losses, whereas the Midwest states present a much more optimistic loan payment expectation.
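As a sketch of how a fitted logit model turns borrower features into a probability, the code below forms the linear predictor from a set of coefficients and applies the logistic function, then compares the result with the 0.7 cutoff. The coefficient values, the feature encodings (such as `grade_rank` and `home_rent`), and the reading of the cutoff as a flagging threshold are all assumptions made for illustration; they are not the estimates or the exact procedure used in this project.

```python
from math import exp

CUTOFF = 0.7  # cutoff quoted in the text; its use as a flag is an assumption

# Hypothetical coefficients, NOT estimated from the LendingClub data.
INTERCEPT = -2.0
COEFFICIENTS = {
    "annual_inc_10k": -0.1,   # annual income in units of $10,000
    "funded_amnt_10k": 0.2,   # funded amount in units of $10,000
    "installment_100": 0.1,   # installment in units of $100
    "home_rent": 0.3,         # 1 if the borrower rents, else 0
    "grade_rank": 0.5,        # 1 for the best grade, larger for lower grades
}


def default_probability(features):
    """Logistic function applied to intercept + sum(coefficient * feature)."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value
                        for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))


borrowers = [
    {"annual_inc_10k": 5, "funded_amnt_10k": 1, "installment_100": 3,
     "home_rent": 1, "grade_rank": 4},
    {"annual_inc_10k": 3, "funded_amnt_10k": 1, "installment_100": 3,
     "home_rent": 1, "grade_rank": 6},
]

probabilities = [default_probability(b) for b in borrowers]
flagged = [p >= CUTOFF for p in probabilities]
print(probabilities, flagged)
```

With real data the coefficients would be estimated on the finished loans and then applied to the 'Current' segment, so that the resulting probabilities can feed the expected loss calculation.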

Figure 11. Expected Loss Preview

III. Conclusion:

The project uses visualization to analyze LendingClub’s loan applicants and extends to an application of logit regression for future loss estimation. I find that applicants' traits are associated with quite different default probabilities; in particular, the probability of default goes up stepwise as the rating grade falls.

In addition, average interest rates differ considerably across states and over time, and serve as a good indicator of the quality of the applicant pool. Lastly, the expected loss on currently outstanding loans is much higher in California, Texas, New York, and Florida, which suggests that more resources should be allotted to loan collection and to screening new applications in these states.