When someone applies for credit, it is unfair to reject applicants who duly qualify, and it can be costly for the company to accept applicants who do not. Both mistakes are likely if we make such decisions based on gut feeling.
So how do we use machine learning to increase our chances of approving the right applicants, and of screening out those who are not yet qualified, using the information we know about them?
In this tutorial, we will be using the Credit Approval Data Set from the UCI Machine Learning Repository. The data can be downloaded here: Credit Approval Data Set
Now let’s read in our dataset and proceed with our model building. We will be using logistic regression in R, but there are a host of other algorithms you could use.
library(data.table)
Let’s read in the credit data
crx.data <- data.table(read.table("crx.data.txt", header = FALSE, sep = ",", na.strings = "?"))
Let’s preview our credit data
head(crx.data)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1: b 30.83 0.000 u g w v 1.25 t t 1 f g 202 0 +
## 2: a 58.67 4.460 u g q h 3.04 t t 6 f g 43 560 +
## 3: a 24.50 0.500 u g q h 1.50 t f 0 f g 280 824 +
## 4: b 27.83 1.540 u g w v 3.75 t t 5 t g 100 3 +
## 5: b 20.17 5.625 u g w v 1.71 t f 0 f s 120 0 +
## 6: b 32.08 4.000 u g m v 2.50 t f 0 t g 360 0 +
It is good to have an idea of the class of each variable, its levels, and some sample data
str(crx.data)
## Classes 'data.table' and 'data.frame': 690 obs. of 16 variables:
## $ V1 : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
## $ V2 : num 30.8 58.7 24.5 27.8 20.2 ...
## $ V3 : num 0 4.46 0.5 1.54 5.62 ...
## $ V4 : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
## $ V5 : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
## $ V6 : Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
## $ V7 : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
## $ V8 : num 1.25 3.04 1.5 3.75 1.71 ...
## $ V9 : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ V10: Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
## $ V11: int 1 6 0 5 0 0 0 0 0 0 ...
## $ V12: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
## $ V13: Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
## $ V14: int 202 43 280 100 120 360 164 80 180 52 ...
## $ V15: int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ V16: Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
Let’s see the full column names in our credit data set
names(crx.data)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" ## [12] "V12" "V13" "V14" "V15" "V16"
Let’s examine the distribution of the target variable, the one we are trying to predict. To view the distribution, we will convert its values to numeric. We saw that the target variable’s values were “+” and “-”; in our conversion, “+” = 1 and “-” = 0
hist(as.numeric(crx.data$V16)-1)
Let’s see how our target variable now looks
as.numeric(crx.data$V16)-1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0
## [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [316] 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [351] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [386] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [421] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [456] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [491] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
## [526] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## [561] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [596] 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
## [631] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [666] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Let’s now actually convert the target variable to 0s and 1s in our dataset for the rest of the analysis
crx.data$V16 <- as.numeric(crx.data$V16)-1
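Note that this conversion relies on the factor level order (“-” is level 1, “+” is level 2). If you prefer not to depend on level order, an explicit mapping does the same thing:

# Equivalent, but independent of the factor level order:
# crx.data$V16 <- ifelse(crx.data$V16 == "+", 1, 0)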
Let’s check the distribution of each of the numeric attributes of the credit data using summary statistics and histograms
numeric_data <- as.data.frame(crx.data[, c(2,3,8,11,14,15)])
summary(numeric_data)
## V2 V3 V8 V11
## Min. :13.75 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.:22.60 1st Qu.: 1.000 1st Qu.: 0.165 1st Qu.: 0.0
## Median :28.46 Median : 2.750 Median : 1.000 Median : 0.0
## Mean :31.57 Mean : 4.759 Mean : 2.223 Mean : 2.4
## 3rd Qu.:38.23 3rd Qu.: 7.207 3rd Qu.: 2.625 3rd Qu.: 3.0
## Max. :80.25 Max. :28.000 Max. :28.500 Max. :67.0
## NA's :12
## V14 V15
## Min. : 0 Min. : 0.0
## 1st Qu.: 75 1st Qu.: 0.0
## Median : 160 Median : 5.0
## Mean : 184 Mean : 1017.4
## 3rd Qu.: 276 3rd Qu.: 395.5
## Max. :2000 Max. :100000.0
## NA's :13
par(mfrow = c(2,3))
for (i in 1:6) {
  hist(numeric_data[,i], main = names(numeric_data)[i])
}
The variable V2 has a roughly bell-shaped curve, but most of the numeric variables in our dataset are skewed to the right (their tails point to the right). This suggests we could preprocess or transform the data using techniques like Box-Cox, to see if it enhances model performance after our first model is built. (We will not cover that technique in depth in this article, but a sketch follows.)
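If you want to try such a transformation, caret (loaded further below for the train/test split) offers a convenient preProcess step. This is only a sketch under assumptions: it is run on complete cases because the missing values are handled later, and it uses the Yeo-Johnson variant since Box-Cox requires strictly positive values while variables like V15 contain zeros.

library(caret)
# Sketch: estimate a skewness-reducing transformation on the numeric columns
pp <- preProcess(na.omit(numeric_data), method = c("YeoJohnson", "center", "scale"))
numeric_transformed <- predict(pp, na.omit(numeric_data))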
A closer look at variables V14 and particularly V15 (and also the summary statistics above) indicates there might be some outliers. For instance, the mean of V15 is 1017.4 whereas the maximum is 100000.0, which is far from typical. There are advanced techniques to detect outliers statistically (not covered here), and we will not be removing these data points in this analysis, but a quick rule of thumb is sketched below.
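As a quick illustration (not one of the advanced techniques), a common rule of thumb flags points lying more than 1.5 times the interquartile range above the third quartile:

# Sketch: flag V15 values beyond the upper Tukey fence (Q3 + 1.5 * IQR)
q <- quantile(crx.data$V15, probs = c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
sum(crx.data$V15 > upper_fence)  # how many points the rule would flag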
Let’s take a closer look at V15 again
par(mfrow = c(1,1))
hist(numeric_data[,"V15"])
# hist(log10(numeric_data[,"V15"]))
summary(numeric_data$V15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 5.0 1017.0 395.5 100000.0
Let’s look at the non-numeric variables to get a sense of how they are distributed
non_numeric <- as.data.frame(crx.data[, -c(2,3,8,11,14,15)])
How many are there?
dim(non_numeric)
## [1] 690 10
What are they?
names(non_numeric)
## [1] "V1" "V4" "V5" "V6" "V7" "V9" "V10" "V12" "V13" "V16"
Let’s see their distribution. V16 was a factor variable that we converted to numeric earlier, but we will still view it as part of this group
par(mfrow = c(2,5))
for (i in 1:10) {
  plot(non_numeric[,i], main = names(non_numeric)[i])
}
Let’s examine the relationships between the numeric features of the applications
plot(crx.data[, c(2,3,8,11,14,15)], col = crx.data$V16 + 1)  # + 1 so that class 0 is not drawn in the invisible background colour
Let’s check the correlations between the numeric features. Note that V2 and V14 still contain missing values at this point, so their correlations show up as NA (cor() would need use = "complete.obs" to ignore the incomplete rows); we handle the missing values later.
correlations <- cor(crx.data[, c(2,3,8,11,14,15)])
print(correlations)
## V2 V3 V8 V11 V14 V15
## V2 1 NA NA NA NA NA
## V3 NA 1.0000000 0.29890156 0.27120674 NA 0.12312115
## V8 NA 0.2989016 1.00000000 0.32232967 NA 0.05134493
## V11 NA 0.2712067 0.32232967 1.00000000 NA 0.06369244
## V14 NA NA NA NA 1 NA
## V15 NA 0.1231212 0.05134493 0.06369244 NA 1.00000000
Let’s get a pairwise visualization of the non-numeric part of the dataset
pairs(V16 ~ ., data = non_numeric, col = non_numeric$V16 + 1)
Let’s also get a full summary of the dataset
summary(crx.data)
## V1 V2 V3 V4 V5
## a :210 Min. :13.75 Min. : 0.000 l : 2 g :519
## b :468 1st Qu.:22.60 1st Qu.: 1.000 u :519 gg : 2
## NA's: 12 Median :28.46 Median : 2.750 y :163 p :163
## Mean :31.57 Mean : 4.759 NA's: 6 NA's: 6
## 3rd Qu.:38.23 3rd Qu.: 7.207
## Max. :80.25 Max. :28.000
## NA's :12
## V6 V7 V8 V9 V10
## c :137 v :399 Min. : 0.000 f:329 f:395
## q : 78 h :138 1st Qu.: 0.165 t:361 t:295
## w : 64 bb : 59 Median : 1.000
## i : 59 ff : 57 Mean : 2.223
## aa : 54 j : 8 3rd Qu.: 2.625
## (Other):289 (Other): 20 Max. :28.500
## NA's : 9 NA's : 9
## V11 V12 V13 V14 V15
## Min. : 0.0 f:374 g:625 Min. : 0 Min. : 0.0
## 1st Qu.: 0.0 t:316 p: 8 1st Qu.: 75 1st Qu.: 0.0
## Median : 0.0 s: 57 Median : 160 Median : 5.0
## Mean : 2.4 Mean : 184 Mean : 1017.4
## 3rd Qu.: 3.0 3rd Qu.: 276 3rd Qu.: 395.5
## Max. :67.0 Max. :2000 Max. :100000.0
## NA's :13
## V16
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4449
## 3rd Qu.:1.0000
## Max. :1.0000
##
What percentage of the rows are missing some VALUES? As a rule of thumb, if more than 10% were missing we should go and find more (and more appropriate) data. Will this percentage of missing data have an impact on our model? How do we handle missing data?
missing_pct <- sum(!complete.cases(crx.data)) / dim(crx.data)[1] * 100
missing_pct
## [1] 5.362319
This shows that about 5.36% of the rows contain at least one missing value
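To see which columns carry those missing values (V1, V2, V4, V5, V6, V7 and V14, judging from the summary above), a one-liner helps:

# Count the missing values per column
colSums(is.na(crx.data))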
HANDLING MISSING VALUES
Impute the missing values in variable 1 (V1) with the most frequently occurring value, the MODE. The variables are few, so we will impute the missing values one after the other. Ideally, you would create a function that shortens the entire process (see the sketch right after the mode function below).
Create a function to get the mode
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
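For reference, here is one possible shape of such a helper, a sketch only: it assumes, as we do manually below, that factor columns get the mode and numeric columns get the mean, and it uses data.table’s set() for the in-place assignment.

# Sketch of a generic imputer: mode for factors, mean for numeric columns
impute_all <- function(dt) {
  for (col in names(dt)) {
    rows <- which(is.na(dt[[col]]))
    if (length(rows) == 0) next
    fill <- if (is.numeric(dt[[col]])) {
      mean(dt[[col]], na.rm = TRUE)
    } else {
      getmode(dt[[col]])
    }
    set(dt, i = rows, j = col, value = fill)
  }
  dt
}
# e.g. imputed <- impute_all(copy(crx.data))  # copy() keeps the original intact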
v1.mode <- getmode(crx.data$V1)
Impute the missing values with the mode
crx.data[is.na(crx.data$V1), 1] <- v1.mode
Preview the imputation
table(crx.data$V1)
##
## a b
## 210 480
Impute the missing values in variable V2 with the average (mean)
crx.data[is.na(crx.data$V2), 2] <- mean(crx.data$V2, na.rm = TRUE)
Preview the imputation
summary(crx.data$V2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.75 22.67 28.62 31.57 37.71 80.25
Impute the missing values in V4 with the mode
crx.data[is.na(crx.data$V4), 4] <- getmode(crx.data$V4)
# preview the imputation
table(crx.data$V4)
##
## l u y
## 2 525 163
Impute the missing values in V5 with the mode
crx.data[is.na(crx.data$V5), 5] <- getmode(crx.data$V5)
# preview the imputation
table(crx.data$V5)
##
## g gg p
## 525 2 163
V6: impute the missing values with the mode
crx.data[is.na(crx.data$V6), 6] <- getmode(crx.data$V6)
# preview the imputation
table(crx.data$V6)
##
## aa c cc d e ff i j k m q r w x
## 54 146 41 30 25 53 59 10 51 38 78 3 64 38
Any missing values left?
table(is.na(crx.data$V6))
##
## FALSE
## 690
V7: impute the missing values with the mode
crx.data[is.na(crx.data$V7), 7] <- getmode(crx.data$V7)
# preview the imputation
table(crx.data$V7)
##
## bb dd ff h j n o v z
## 59 6 57 138 8 4 2 408 8
V14: impute the missing values with the (integer) mean
crx.data[is.na(crx.data$V14), 14] <- as.integer(mean(crx.data$V14, na.rm = TRUE))
# preview the imputation (note: R is case-sensitive, so V14 must be capitalized)
table(is.na(crx.data$V14))

##
## FALSE
## 690
Let’s load the mlbench and caret libraries. You can install them first if they are not already installed.
#install.packages("mlbench") #install.packages("caret") library(mlbench) library(caret)
## Loading required package: ggplot2
Split the data into train and test sets. We define a 70%/30% train/test split of the dataset.
set.seed(257)
trainIndex <- createDataPartition(crx.data$V16, p = 0.70, list = FALSE)
dataTrain <- crx.data[trainIndex,]
dataTest <- crx.data[-trainIndex,]
Run logistic regression on the training dataset
model <- glm(V16 ~.,family=binomial(link='logit'),data=dataTrain)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Let’s see the summary of our model results. (The warning above tells us that some training observations are predicted almost perfectly, a sign of quasi-complete separation; it also explains the huge standard errors on some of the coefficients below.)
summary(model)
##
## Call:
## glm(formula = V16 ~ ., family = binomial(link = "logit"), data = dataTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4383 -0.3272 -0.1221 0.4432 3.3254
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.243e+00 2.058e+03 -0.002 0.998355
## V1b -2.702e-01 3.618e-01 -0.747 0.455196
## V2 1.543e-02 1.524e-02 1.012 0.311384
## V3 -1.895e-02 3.193e-02 -0.594 0.552808
## V4u 1.338e-01 2.058e+03 0.000 0.999948
## V4y -3.783e-01 2.058e+03 0.000 0.999853
## V5gg NA NA NA NA
## V5p NA NA NA NA
## V6c 3.418e-01 5.888e-01 0.580 0.561654
## V6cc 1.250e+00 8.525e-01 1.466 0.142633
## V6d 6.549e-01 9.276e-01 0.706 0.480152
## V6e 2.194e+00 1.282e+00 1.711 0.087154 .
## V6ff -1.944e+01 1.455e+03 -0.013 0.989340
## V6i 3.286e-01 7.948e-01 0.413 0.679283
## V6j -1.753e+01 1.455e+03 -0.012 0.990392
## V6k -1.303e-01 7.723e-01 -0.169 0.866028
## V6m 3.100e-01 8.137e-01 0.381 0.703214
## V6q 6.079e-01 6.437e-01 0.944 0.344941
## V6r -1.399e+01 1.455e+03 -0.010 0.992329
## V6w 1.226e+00 6.897e-01 1.777 0.075571 .
## V6x 2.662e+00 9.430e-01 2.823 0.004762 **
## V7dd -3.426e-01 1.970e+00 -0.174 0.861909
## V7ff 1.865e+01 1.455e+03 0.013 0.989778
## V7h 1.237e+00 6.711e-01 1.844 0.065242 .
## V7j 1.900e+01 1.455e+03 0.013 0.989584
## V7n 3.458e+00 1.660e+00 2.083 0.037278 *
## V7o -1.259e+01 1.455e+03 -0.009 0.993098
## V7v 5.835e-01 6.088e-01 0.958 0.337841
## V7z -2.992e+00 1.861e+00 -1.607 0.107950
## V8 3.096e-02 5.223e-02 0.593 0.553275
## V9t 3.621e+00 3.997e-01 9.060 < 2e-16 ***
## V10t 9.237e-01 4.395e-01 2.102 0.035592 *
## V11 7.929e-02 6.318e-02 1.255 0.209516
## V12t -1.047e-01 3.268e-01 -0.321 0.748577
## V13p 4.057e+00 1.051e+00 3.861 0.000113 ***
## V13s 2.915e-01 6.023e-01 0.484 0.628408
## V14 -2.854e-03 9.985e-04 -2.859 0.004256 **
## V15 6.357e-04 2.190e-04 2.903 0.003699 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 667.84 on 482 degrees of freedom
## Residual deviance: 291.88 on 447 degrees of freedom
## AIC: 363.88
##
## Number of Fisher Scoring iterations: 14
From the result, we can see that the variables V6x, V7n, V9t, V10t, V13p, V14, and V15 are statistically significant, as they have p-values less than 0.05.
For instance, from the result we can see that V9 changing from “f” to “t” increases the log-odds of being approved for credit by about 3.62, holding the other variables constant (the coefficient of V9t is positive).
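Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read (a sketch):

# Odds ratios: exp(3.621) for V9t is roughly 37, i.e. V9 = "t" multiplies
# the odds of approval by about 37, other variables held constant
round(exp(coef(model)), 3)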
Now let’s check our deviances. The smaller the deviance value, the better.
Our null deviance = 667.84. The null deviance measures how well our model performs with no predictor variables at all, accounting only for the intercept. Our residual deviance = 291.88, which indicates how well our model performs once we add the predictor variables.
We can see that the deviance reduces, which means our model performs better when we add in our predictor variables. In other words, we can make better decisions on whether to approve a person for credit by considering these variables, rather than guessing without taking significant variables into account.
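This improvement can also be tested formally: the drop in deviance is approximately chi-squared distributed, with degrees of freedom equal to the number of estimated coefficients. A quick check (a sketch):

# p-value for the full model's improvement over the intercept-only model
with(model, pchisq(null.deviance - deviance,
                   df.null - df.residual,
                   lower.tail = FALSE))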
Let’s see a table of deviances by running anova on our model
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: V16
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 482 667.84
## V1 1 1.406 481 666.43 0.2356401
## V2 1 14.303 480 652.13 0.0001556 ***
## V3 1 10.101 479 642.03 0.0014819 **
## V4 2 20.973 477 621.05 2.791e-05 ***
## V5 0 0.000 477 621.05
## V6 13 71.731 464 549.32 3.847e-10 ***
## V7 8 7.839 456 541.48 0.4493155
## V8 1 11.281 455 530.20 0.0007831 ***
## V9 1 175.302 454 354.90 < 2.2e-16 ***
## V10 1 21.597 453 333.30 3.364e-06 ***
## V11 1 5.416 452 327.89 0.0199566 *
## V12 1 2.241 451 325.65 0.1343758
## V13 2 13.604 449 312.04 0.0011113 **
## V14 1 8.351 448 303.69 0.0038541 **
## V15 1 11.814 447 291.88 0.0005877 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Taking note of the null deviance, we can observe that as the predictor variables are added sequentially to the model (first to last, as they appear in the result), the deviance reduces and the model performs better. The largest single drop comes from V9, which reduces the deviance by about 175, consistent with V9’s strong significance in the model summary.
Let’s make predictions using the data we kept aside, which is the Test Data
probabilities <- predict(model, newdata = dataTest[,-16], type='response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
predictions <- ifelse(probabilities > 0.5,'1','0')
Let’s summarize the accuracy of the predictions
table(predictions, dataTest$V16)
##
## predictions 0 1
## 0 112 10
## 1 15 70
It can be seen that the correct predictions are on the diagonal and the wrong ones are on the “off-diagonal”
We predicted 70 to be “1” (TRUE, that they should be approved) and they were indeed “1” in the test dataset (correct to approve them). Those are our TRUE POSITIVES (TP = 70).
We also predicted 112 to be “0” (FALSE, that they should be rejected) and they were indeed “0” in the test dataset. Those are our TRUE NEGATIVES (TN = 112).
We predicted 15 to be “1” (that they should be accepted), but in the test dataset they were actually “0”. Those are our FALSE POSITIVES (FP = 15).
And we predicted 10 to be “0” (that they should be rejected), but in the test dataset they were actually “1”. Those are our FALSE NEGATIVES (FN = 10).
Let’s check the overall accuracy of our model
misclass_error <- mean(predictions != dataTest$V16)
print(paste('Accuracy', 1 - misclass_error))
## [1] "Accuracy 0.879227053140097"
The overall accuracy of our model is 87.9%. This can be improved further with other advanced techniques and parameter tuning (not covered here).
Even with 87.9% overall accuracy, it is always better to also check the recall (TRUE-POSITIVE rate, or sensitivity): Recall = TP / (TP + FN)
and also the precision: Precision = TP / (TP + FP)
And it is also good to take a peek at the TRUE-NEGATIVE rate (the specificity): Specificity = TN / (TN + FP). All three are computed below.
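Using the counts from the confusion matrix above (TP = 70, TN = 112, FP = 15, FN = 10), these rates work out as follows:

TP <- 70; TN <- 112; FP <- 15; FN <- 10
recall      <- TP / (TP + FN)   # 0.875  (true-positive rate / sensitivity)
precision   <- TP / (TP + FP)   # ~0.824
specificity <- TN / (TN + FP)   # ~0.882 (true-negative rate)
c(recall = recall, precision = precision, specificity = specificity)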
Let’s match the predictions to the original test dataset (we preview the first rows here; the full table is exported to CSV below)
copy_dataTest <- data.frame(dataTest)
dim(copy_dataTest)
## [1] 207 16
copy_dataTest_pred <- cbind(copy_dataTest, predictions)
head(copy_dataTest_pred)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 predictions
## 1 b 30.83000 0.000 u g w v 1.250 t t 1 f g 202 0 1 1
## 2 a 24.50000 0.500 u g q h 1.500 t f 0 f g 280 824 1 1
## 3 b 27.83000 1.540 u g w v 3.750 t t 5 t g 100 3 1 1
## 4 b 32.08000 4.000 u g m v 2.500 t f 0 t g 360 0 1 0
## 5 b 33.17000 1.040 u g r h 6.500 t f 0 t g 164 31285 1 1
## 6 a 22.92000 11.585 u g cc v 0.040 t f 0 f g 80 1349 1 1
write.csv(copy_dataTest_pred, file = "match_predictions.csv")
Let’s check the Area Under the Curve (AUC) and the ROC plot
#install.packages("ROCR") library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
predicted <- predict(model, newdata=dataTest[,-16], type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = ## ifelse(type == : prediction from a rank-deficient fit may be misleading
pr <- prediction(predicted, dataTest$V16)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
The ROC curve sits high above the diagonal and close to the top-left corner (toward 1), which is good.
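The same prf object can also suggest an alternative to the default 0.5 cutoff, for instance the threshold that maximizes TPR minus FPR (Youden’s J statistic); a sketch:

# Pull TPR, FPR and the corresponding cutoffs out of the performance object,
# then pick the cutoff with the largest TPR - FPR gap (Youden's J)
tpr <- prf@y.values[[1]]
fpr <- prf@x.values[[1]]
cutoffs <- prf@alpha.values[[1]]
cutoffs[which.max(tpr - fpr)]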
Let’s compute the area under the curve (AUC)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.9265748
Hopefully this gives you a basis for predicting whether or not someone will be approved for credit, based on some characteristics (variables) we know about them!