How to Use Multiple Linear Regression for Loan Analysis
May 2016
Much has been written about the increasing importance of data analysis in lending. While it shouldn’t replace years of industry experience, such an approach can provide very useful insights when used in conjunction with prior knowledge for loan analysis.

One important tool that the data-minded lender ought to be familiar with is multiple linear regression, a method of statistical analysis that quantifies the effect a set of variables has on a target variable. It can help answer questions such as which borrower attributes affect loan delinquency. Using this tool we can even estimate how many additional days delinquent each variable contributes.

We will explain more on this later, but keep in mind that regression analysis can help us determine and quantify predictors of things we are interested in measuring.

If these ideas seem foreign to you, have no fear. This article will serve as a beginner’s tutorial to multiple linear regression. We will analyze some simple methods, as well as basic tools to get you started in the ever-growing world of financial analytics. Feel free to follow along with the code and data set that can be downloaded by filling out the form at the end of the article.

#### Get the Right Tools

The first step in performing proper analysis is acquiring the right tools. Your first inclination might be to use Microsoft Excel to analyze your data. While Excel is certainly useful for many things, a number of other resources are better equipped to handle advanced loan analysis. One very popular platform is a statistical computing package simply named R.[i] Because it is free and open-source, professionals from statisticians to lending analysts find it convenient and useful.

Before we start crunching numbers it is important to think about how each variable is recorded. In general, there are three types we want to keep in mind.

• Continuous – Data that can be counted, ordered or measured, as whole numbers or decimals. Examples include Interest Rate in the format (.06) or Original Credit Score in the format (702).
• Binary – Data that is either true or false, such as Is Delinquent in the format (1) or (0).
• Categorical – Data that can be sorted into groups, such as City in the format (New York).

Because of their numeric nature, regression models handle continuous variables very well. However, they have trouble with binary or categorical text variables. So in order for our analysis to work we will either need to let R know these variables are binary/categorical, or manually insert numbers. For example, ‘Is Delinquent’ could be re-formatted so that if it is delinquent we input 1 and if it is not delinquent we input 0.
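As a minimal sketch of the two options just described, the snippet below recodes a text field as 1/0 by hand and marks a text column as categorical with as.factor(). The three-row data frame and the "Yes"/"No" coding are assumptions made up for illustration; they are not taken from the article's sample file.

```r
# Hypothetical mini data set; values are invented for illustration only.
loans <- data.frame(
  Is.Delinquent = c("Yes", "No", "Yes"),
  City = c("New York", "Westland", "Westland"),
  stringsAsFactors = FALSE
)

# Option 1: manually insert numbers (1 = delinquent, 0 = not delinquent).
loans$Is.Delinquent <- ifelse(loans$Is.Delinquent == "Yes", 1, 0)

# Option 2: let R know the column is categorical.
loans$City <- as.factor(loans$City)

# str() reports how each column is now stored (num, Factor, ...).
str(loans)
```

Once a column is a factor, R's modeling functions create the underlying 0/1 indicator columns for each category automatically.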

Because we want to know how much each variable contributes to delinquency, and not necessarily delinquency status, we will ignore the ‘Is Delinquent’ variable. Furthermore, we will want to make sure that R knows that ‘City’ is a categorical variable. To do this, use the as.factor() function in R[iii].

The second step in our analysis is to ask the right questions. It will be difficult to come to meaningful conclusions if we don’t first ask ourselves meaningful questions.

Let’s continue with the example above; let’s say we are interested in identifying and quantifying what attributes contribute to delinquency. We track data on dozens of loan and borrower attributes; it is likely that some are good indicators of delinquency while most are not. How well does credit score predict delinquency? Loan to value? City?

Let’s take a look at our data set [ii]. You will notice that we have information on 10 auto loans, each with a field indicating the number of days past due (‘Days.Delinquent’). Moreover, each loan record contains information for interest rate, credit score, LTV, etc.

Once we know what questions we are trying to answer, our next step is to utilize R. Let’s open R and read in our data using the read.csv() function. (R’s built-in help page, ?read.csv, explains the function further.)

Before any inference can be made, we first need to feel comfortable with our data set. To do this we will plot our X variables against days delinquent one by one. Plotting LTV against days delinquent and adding a best-fit line (figure 1) shows a linear relationship; we are off to a good start. We can follow this process for any variable we see fit. You may remember this from your introductory stat course in college; this is called simple linear regression.
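The read-and-plot step can be sketched roughly as follows. Because the article's sample file is not reproduced here, the snippet first writes a small made-up CSV to a temporary file; the ten values and the column name LTV are assumptions for illustration, while ‘Days.Delinquent’ follows the field name mentioned above.

```r
# Hypothetical stand-in for the sample file: ten invented auto loans.
sample_loans <- data.frame(
  LTV = c(0.85, 0.95, 1.10, 0.80, 1.05, 0.90, 1.20, 0.75, 1.00, 1.15),
  Days.Delinquent = c(20, 15, 95, 0, 40, 5, 120, 0, 30, 85)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_loans, csv_path, row.names = FALSE)

# Read the data in, as we would with the real file.
loans <- read.csv(csv_path)

# Simple linear regression: one X (LTV) and one Y (days delinquent).
ltv_fit <- lm(Days.Delinquent ~ LTV, data = loans)

# Scatter plot of the raw points with the best-fit line drawn on top.
plot(loans$LTV, loans$Days.Delinquent,
     xlab = "LTV", ylab = "Days delinquent")
abline(ltv_fit)
```

With a real file, only the read.csv() call and the column names would need to change; the same plot() and abline() pair can be repeated for each candidate X variable.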

#### Pick Potential Variables

We don’t want to overthink this step. All we want to do here is determine which variables could potentially contribute to delinquency. In our example we will eliminate the following variables:

• ‘Loan.Type.Description’: Each loan in our sample is an auto loan
• ‘Is.Delinquent’: This is just another measure of delinquency
• ‘Origination.Date’: This would require formatting beyond the scope of this article

Eliminating these three, we are left with the remaining covariates to test in our model.
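One way to drop those three columns is sketched below. The column names come from the list above; the four rows of data, the date format, and the helper names drop_cols and loans_model are made up for illustration.

```r
# Hypothetical rows; only the column names follow the article.
loans <- data.frame(
  Loan.Type.Description = rep("Auto", 4),
  Is.Delinquent = c(1, 0, 1, 0),
  Origination.Date = c("2015-01-15", "2015-02-03", "2015-03-22", "2015-04-10"),
  Original.Credit.Score = c(580, 702, 610, 720),
  City = as.factor(c("Westland", "Canton", "Livonia", "Canton")),
  Days.Delinquent = c(95, 0, 40, 0)
)

# Columns we decided not to test.
drop_cols <- c("Loan.Type.Description", "Is.Delinquent", "Origination.Date")

# Keep every column whose name is not in drop_cols.
loans_model <- loans[, !(names(loans) %in% drop_cols)]

names(loans_model)
```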

#### Fit Our Model

Our goal here will be to create a model that best ‘fits’ the data. Remember in figure 1 we drew a ‘best-fit’ line? In that case we drew a line that came as close as possible to all points on our scatter plot. This was easy to visualize because there was only one X (LTV) and one Y (days delinquent). This visualization becomes impossible as we add more variables (X’s), but the principle is the same. We want to draw a line (create a linear equation) that best ‘fits’ our data points.

In R the code is quite simple. Coming up with the best model, however, can be tricky. Statistics is a highly mathematical discipline, but it often requires an artist’s touch. Model fitting is one of those areas.

Let’s first fit the full model, incorporating all candidate variables, with R’s lm() function. We now have a model, but we need to determine if it is statistically significant. In other words, let’s see if this model is a good way to predict delinquency.

The summary() function prints the model’s coefficient table and fit statistics. There is a lot of information to process in that output.[iv] For the purposes of this article we will keep things simple. We see that the p-value (Pr(>|t|)) of each variable is rather large.[v] There is likely a lot of noise from a few insignificant variables that is washing out the true indicators of delinquency.
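A minimal sketch of fitting and summarizing the full model might look like the following. The ten loans are invented for illustration, and the column names Interest.Rate and LTV as well as the cities other than Westland are assumptions, so the printed numbers will not match the article's results.

```r
# Hypothetical sample: ten made-up auto loans (illustration only).
loans <- data.frame(
  Interest.Rate = c(0.06, 0.08, 0.12, 0.05, 0.10, 0.07, 0.13, 0.04, 0.09, 0.11),
  Original.Credit.Score = c(702, 650, 580, 720, 610, 690, 560, 740, 630, 600),
  LTV = c(0.85, 0.95, 1.10, 0.80, 1.05, 0.90, 1.20, 0.75, 1.00, 1.15),
  City = as.factor(c("Westland", "Livonia", "Westland", "Canton", "Livonia",
                     "Canton", "Westland", "Livonia", "Canton", "Westland")),
  Days.Delinquent = c(20, 15, 95, 0, 40, 5, 120, 0, 30, 85)
)

# "~ ." means: regress Days.Delinquent on every other column.
full_model <- lm(Days.Delinquent ~ ., data = loans)

# Coefficient estimates, standard errors, t-values and p-values (Pr(>|t|)).
print(summary(full_model))
```

Because City is a factor, R reports one coefficient per city other than the baseline level (here the alphabetically first one), which is why a name such as CityWestland appears in the output.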

Again, this part of regression analysis is more an art than science. That being said, we will show at least one approach to help us fit a better model.

A stepwise approach can help us find a better model. For simplicity, it is enough to understand that this form of the step() function begins with the full model and removes variables one at a time, keeping a removal only when it improves the fit criterion (AIC), until no further removal helps. It does not test every possible combination of variables, but it is a very useful and simple tool.

Comparing the summary of our new model, which we call backward, with the full model, we can see a few obvious differences:

1. We have restricted the model to two variables, ‘Original.Credit.Score’, and ‘CityWestland.’
2. We see that the adjusted R-squared figure in our new model is slightly higher[v].
3. All variables show a significant p-value.
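The stepwise selection described above can be sketched as follows. The data are the same kind of invented ten-loan sample (column names Interest.Rate and LTV and the non-Westland cities are assumptions), so which variables survive here says nothing about the article's actual result.

```r
# Hypothetical sample: ten made-up auto loans (illustration only).
loans <- data.frame(
  Interest.Rate = c(0.06, 0.08, 0.12, 0.05, 0.10, 0.07, 0.13, 0.04, 0.09, 0.11),
  Original.Credit.Score = c(702, 650, 580, 720, 610, 690, 560, 740, 630, 600),
  LTV = c(0.85, 0.95, 1.10, 0.80, 1.05, 0.90, 1.20, 0.75, 1.00, 1.15),
  City = as.factor(c("Westland", "Livonia", "Westland", "Canton", "Livonia",
                     "Canton", "Westland", "Livonia", "Canton", "Westland")),
  Days.Delinquent = c(20, 15, 95, 0, 40, 5, 120, 0, 30, 85)
)

# Start from the full model with every candidate variable.
full_model <- lm(Days.Delinquent ~ ., data = loans)

# Backward elimination: drop one term at a time while AIC keeps improving.
# trace = 0 suppresses the step-by-step log.
backward <- step(full_model, direction = "backward", trace = 0)

print(summary(backward))

# Compare adjusted R-squared before and after selection.
summary(full_model)$adj.r.squared
summary(backward)$adj.r.squared
```

Setting trace = 1 (the default) prints each elimination step, which can be helpful for seeing why a given variable was dropped.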

#### Interpret Coefficients

Now that we have a reasonable model and it is giving us output, what does it all mean? Remember that we are trying to quantify the effect each X has on Y. We can feel good that credit score and city affect delinquency. The simplest way to read the coefficient estimates in the model summary is to say, “All else constant, as X increases by 1, Y changes by…” and plug in the listed coefficient estimate.

• Credit Score: As credit score increases by one point, we can expect days delinquent to decrease by 0.73.
• City: Categorical values are a bit different. Here we say that if the loan is originated in Westland, we would expect it to be around 56 days more delinquent than a comparable loan from the baseline city.
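To make the two readings above concrete, here is a small worked calculation using the coefficient estimates quoted in the bullets. The 50-point comparison is a hypothetical scenario, and since the intercept is not quoted in the text, the sketch computes only differences between loans, not absolute predictions.

```r
# Coefficient estimates as quoted in the bullets above.
coef_credit_score <- -0.73  # days delinquent per one-point increase in score
coef_westland <- 56         # approximate extra days for a Westland loan

# Expected change for a borrower with a 50-point higher credit score,
# all else constant: 50 * -0.73, i.e. about 36.5 fewer days delinquent.
change_50_points <- 50 * coef_credit_score

# A Westland loan with a 50-point higher score, compared with a
# baseline-city loan at the lower score: about 56 - 36.5 = 19.5 more days.
net_change <- coef_westland + change_50_points

change_50_points
net_change
```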

Finally, we can conclude that these are likely the two best predictors of delinquency.

#### Conclusion[vi]

In conclusion, using multiple linear regression for loan analysis can help identify and quantify patterns in our data. In this small example, we identified the two strongest predictors of delinquency and estimated how much each contributes. If you would like to practice this more on your own, you can download the practice data by filling out the form below. Enjoy!

[i] Mac users visit this link. http://cran.r-project.org/bin/macosx/
[ii] Sample data set generated on 10 fictitious loans for the purposes of this tutorial.
[iii] The str() function shows how each variable in our data set is formatted. Int corresponds to continuous, while factor corresponds to categorical.
[iv] For more information on interpreting R’s linear model summary output see http://blog.yhathq.com/posts/r-lm-summary.html
[v] Generally, a p-value less than .05 means the given independent variable (X) significantly contributes to the response variable (Y). For simplicity, we will accept this assumption.
[vi] It is also very important to verify model assumptions. Please see attached document for guidance on analysis of residuals.