
Credit Approval Data Set – Predicting Credit Approval Using Logistic Regression and Matching Predictions to the Dataset

When someone applies for credit, it is unfair to reject applicants who duly qualify, and it can be detrimental to the company to wrongly accept those who do not. Both mistakes become more likely if we make such decisions based on gut feeling.

So how do we use machine learning to increase our chances of selecting the right people who qualify for credit, and of screening out those who are not yet qualified, using the handful of attributes we know about those applicants?

 

In this tutorial, we will be using the Credit Approval Data Set from the UCI Machine Learning Repository. The data can be downloaded from here: Credit Approval Data Set

 

Now let’s read in our dataset and proceed with model building. We will be using logistic regression in R, but there are a host of other algorithms you can use.


 

Let’s read in the credit data
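A minimal sketch of this step, assuming the crx.data file has been downloaded from the UCI repository into the working directory (in that file, missing values are coded as "?"):

```r
# Read the raw data: no header row, "?" marks missing values.
# Columns are auto-named V1 through V16.
credit <- read.csv("crx.data", header = FALSE, na.strings = "?",
                   stringsAsFactors = TRUE)
```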

 

Let’s preview our credit data
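For example:

```r
# First few rows of the data
head(credit)
```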

 

 

It is good to have an idea of the class of each variable, its levels, and some sample data.
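A quick way to see this:

```r
# Class of each variable, factor levels, and sample values
str(credit)
```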

 

Let’s see the full column names in our credit data set
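For instance:

```r
names(credit)
```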

 

 

Let’s examine the distribution of the target variable, the one we are trying to predict. To view the distribution, we will convert its values to numeric. The target variable’s values are “+” and “-”, so in our conversion “+” = 1 and “-” = 0.
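One way to sketch this, building a numeric copy of the target just for viewing (the dataset itself is changed a little later):

```r
# "+" becomes 1 (approved), "-" becomes 0 (rejected)
target_numeric <- ifelse(credit$V16 == "+", 1, 0)
```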

 

Let’s see how our target variable now looks
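For example:

```r
table(target_numeric)
```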

 

 

Let’s now actually change the target variable to 0s and 1s in our dataset for the rest of the analysis
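A sketch of the recode:

```r
# Recode the target in place for the rest of the analysis
credit$V16 <- ifelse(credit$V16 == "+", 1, 0)
```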

 

Let’s check the distribution of each of the attributes of the credit data using histograms
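A sketch of how this might look with base R graphics:

```r
# Histogram for each numeric attribute (excluding the target V16)
num_cols <- setdiff(names(credit)[sapply(credit, is.numeric)], "V16")
par(mfrow = c(2, 3))
for (col in num_cols) hist(credit[[col]], main = col, xlab = col)
par(mfrow = c(1, 1))
```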

 

 

 

The variable V2 seems to have a bell-curve shape, but most of the numeric variables in our dataset are skewed to the right (their tails point to the right). This suggests we could preprocess or transform the data, using techniques like Box-Cox, to see if it enhances model performance after our first model is built. (However, we will not be covering that technique in this article.)

A closer look at variable V14, and particularly V15 (and also deducing from the summary statistics), indicates there might be some outliers. For instance, the average of the V15 variable is 1017.4 whereas the maximum is 100000.0, which is far from typical. There are advanced techniques to statistically detect outliers (not covered here), and we will not be removing these data points in this analysis.

Let’s take a closer look at V15 again
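For example:

```r
summary(credit$V15)
```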

 

 

 

Let’s look at the non-numeric variables to get a sense of how they are distributed

 

How many are there?
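A sketch:

```r
# Count the factor (non-numeric) columns
sum(sapply(credit, is.factor))
```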

 

 

What are they?
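For instance:

```r
names(credit)[sapply(credit, is.factor)]
```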

 

Let’s see their distribution. V16 was a factor variable that was converted to numeric earlier, but we will view it as part of this distribution.
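One way to draw these, again with base R graphics:

```r
# Bar plot for each factor variable, plus the recoded target V16
cat_cols <- c(names(credit)[sapply(credit, is.factor)], "V16")
par(mfrow = c(3, 4))
for (col in cat_cols) barplot(table(credit[[col]]), main = col)
par(mfrow = c(1, 1))
```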

 

Let’s examine the relationships between the numeric features of the applications.

 

Let’s check the correlations between the numeric features
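A sketch, dropping incomplete rows so the correlations can be computed before imputation:

```r
num_cols <- setdiff(names(credit)[sapply(credit, is.numeric)], "V16")
round(cor(credit[, num_cols], use = "complete.obs"), 2)
```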

 

 

Let’s get a pairwise visualization of the whole dataset.
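For example:

```r
# Pairwise scatter plots of the numeric features
pairs(credit[, num_cols])
```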

 

Let’s also view the summary statistics of the full dataset.
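Something like:

```r
summary(credit)
```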

 

 

What percentage of the dataset is missing values? If more than 10%, we should go and find more appropriate data. Will this percentage of missing data have an impact on our model? How do we handle missing data?
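A sketch that computes the share of rows with at least one missing value:

```r
mean(!complete.cases(credit)) * 100
```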

 

 

This shows about 5.36% missing data.

HANDLING MISSING VALUES

Impute the missing values in variable 1 (V1) with the most frequently occurring value, the MODE. The list of variables is small, so we will impute the missing values one after the other. Ideally, you would create a function to shorten the entire process.

Create a function to get the mode
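A minimal sketch of such a helper (the name get_mode is our own choice):

```r
# Return the most frequently occurring value of a vector, ignoring NAs
get_mode <- function(x) {
  x <- x[!is.na(x)]
  names(sort(table(x), decreasing = TRUE))[1]
}
```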

 

 

Impute the missing values with the mode
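Using the get_mode helper sketched above:

```r
credit$V1[is.na(credit$V1)] <- get_mode(credit$V1)
```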

 

Preview the imputation
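For example:

```r
table(credit$V1, useNA = "ifany")   # no NA column should remain
```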

 

 

Impute the missing values in variable V2 with the average
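A sketch:

```r
credit$V2[is.na(credit$V2)] <- mean(credit$V2, na.rm = TRUE)
```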

 

Preview the imputation
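Something like:

```r
summary(credit$V2)   # the NA count should now be gone
```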

 

 

Impute the missing values in V4 with the mode
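As before:

```r
credit$V4[is.na(credit$V4)] <- get_mode(credit$V4)
```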

 

 

Impute the missing values in V5 with the mode
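Likewise:

```r
credit$V5[is.na(credit$V5)] <- get_mode(credit$V5)
```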

 

 

V6: impute the missing values with the mode
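And again:

```r
credit$V6[is.na(credit$V6)] <- get_mode(credit$V6)
```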

 

 

Any missing values left?
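A quick check:

```r
colSums(is.na(credit))
```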

 

 

V7: impute the missing values with the mode
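Same pattern:

```r
credit$V7[is.na(credit$V7)] <- get_mode(credit$V7)
```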

 

 

V14: impute the missing values with the mode
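V14 is numeric, so the modal value (returned as a character string by our helper) needs converting back:

```r
credit$V14[is.na(credit$V14)] <- as.numeric(get_mode(credit$V14))
```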

 

 

Let’s load the mlbench and caret libraries. You can install them if they are not already installed.
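For example:

```r
# install.packages(c("mlbench", "caret"))   # if not already installed
library(mlbench)
library(caret)
```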

 

 

Split the data into train and test sets. Define a 70%/30% train/test split of the dataset.
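A sketch using caret’s createDataPartition (the seed value here is an arbitrary choice of ours, for reproducibility):

```r
set.seed(7)                      # assumed seed; any fixed value works
train_index <- createDataPartition(credit$V16, p = 0.70, list = FALSE)
train_data  <- credit[train_index, ]
test_data   <- credit[-train_index, ]
```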

 

Run logistic regression on the training dataset
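A minimal sketch with base R’s glm:

```r
# Logistic regression: binomial family with the logit link
model <- glm(V16 ~ ., data = train_data, family = binomial(link = "logit"))
```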

 

 

Let’s see the summary of our model results
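For example:

```r
summary(model)
```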

 

 

From the result, we can see that the variables V6x, V9t, V13p, and V15 are statistically significant, as they have p-values less than 0.05.

For instance, from the result we can see that having V9 = t (versus the baseline) increases the person’s log-odds of being approved for credit by 3.901 while holding the other variables constant (the coefficient of V9t is positive).

Now let’s check our deviances. The smaller the deviance, the better.

Our null deviance = 664.19. The null deviance measures how well the model performs with no predictor variables, accounting only for the intercept. Our residual deviance = 210.11, which indicates how well the model performs once the predictor variables are added.

We can see that the deviance drops substantially, which means our model performs better when we add the predictor variables. In other words, we can make better decisions on whether to approve a person for credit by considering these variables than by guessing without taking any significant variables into account.

Let’s see a table of deviances by running anova on our model
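A sketch:

```r
# Sequential analysis of deviance with chi-squared tests
anova(model, test = "Chisq")
```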

 

 

Taking note of the null deviance, it can be observed that as we add the predictor variables sequentially to the model (first to last, as they appear in the result), the deviance reduces (the model performs better). The deviance drops most at the variables the summary flagged as significant.

Let’s make predictions using the data we kept aside, which is the Test Data
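One way to sketch this, thresholding the predicted probabilities at 0.5:

```r
probs <- predict(model, newdata = test_data, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
```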

 

 

 

Let’s summarize the accuracy of the predictions
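For example, as a confusion matrix:

```r
table(Predicted = preds, Actual = test_data$V16)
```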

 

 

It can be seen that the correct predictions are on the diagonal and the wrong ones are on the “off-diagonal”

It can be seen that we predicted 70 to be “1” (TRUE, they should be approved) and they were indeed “1” (TRUE, correct that we should approve them). That is our TRUE POSITIVE (TP) count.

We also predicted 112 to be “0” (FALSE, that they should be rejected) and they were indeed “0” (FALSE; in the test dataset they were actually rejected), which is our TRUE NEGATIVE (TN) count.

And we predicted 10 to be “1” (TRUE, that they should be accepted) but that prediction was wrong (in the test dataset they were not accepted). That is our FALSE POSITIVE (FP) count.

And we predicted 15 to be “0” (FALSE, they should be rejected) but that was wrong, as they were “1” (TRUE, they were accepted in the test dataset). That is the FALSE NEGATIVE (FN) count.

Let’s check the overall accuracy of our model
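A sketch:

```r
mean(preds == test_data$V16)
```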

 

 

The overall accuracy of our model is 87.9%. This can be improved further with more advanced techniques and parameter tuning (but that will not be covered here).

Though we get 87.9% overall accuracy, it is always better to also check the recall (true positive rate, or sensitivity): Recall = TP / (TP + FN),

and also the precision: Precision = TP / (TP + FP).

It is also good to take a peek at the true negative rate, which is the specificity: Specificity = TN / (TN + FP).
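Plugging in the counts from the confusion matrix above:

```r
TP <- 70; TN <- 112; FP <- 10; FN <- 15
TP / (TP + FN)   # recall (sensitivity): ~0.824
TP / (TP + FP)   # precision:            ~0.875
TN / (TN + FP)   # specificity:          ~0.918
```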

Let’s match the predictions to the original dataset.
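One way to sketch this, attaching the predicted probabilities and labels to the held-out rows:

```r
results <- cbind(test_data, probability = probs, predicted = preds)
head(results)
```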

 

 

 


 

 

Let’s check the Area Under the Curve (AUC) and the ROC plot
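A sketch using the pROC package (an assumption on our part; ROCR is a common alternative):

```r
# install.packages("pROC")   # if not already installed
library(pROC)
roc_obj <- roc(test_data$V16, probs)
plot(roc_obj)
```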

 

 

 

 

 

 

 

The ROC curve rises well above the diagonal, toward an AUC close to 1, which is good.

Let’s check the area under the curve (AUC) value
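For example:

```r
auc(roc_obj)
```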

 

 

Hope this gives you a basis for predicting whether or not someone will be approved for credit based on some characteristics (variables) we know about them!

 

 
