Credit Approval Data Set – Predicting Credit Approval Using Logistic Regression and Matching Predictions to the Data Set
When someone applies for credit, it is unfair to reject applicants who genuinely qualify, and it can be costly for the company to approve those who do not. Both mistakes are likely if we try to make such decisions on gut feeling alone.
So how do we use machine learning to increase our chances of approving the right applicants, and of screening out those who are not yet qualified, using the handful of details we know about them?
In this tutorial, we will be using the Credit Approval Data Set from the UCI Machine Learning Repository. The data can be downloaded from here: Credit Approval Data Set
Now let's read in our data set and proceed with our model building. We will be using logistic regression in R, but there are a host of other algorithms you could use.
library(data.table)
Let’s read in the credit data
crx.data <- data.table(read.table("crx.data.txt", header = FALSE, sep = ",", na.strings = "?"))
Let's preview our credit data.
head(crx.data)
##    V1    V2    V3 V4 V5 V6 V7   V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1:  b 30.83 0.000  u  g  w  v 1.25  t   t   1   f   g 202   0   +
## 2:  a 58.67 4.460  u  g  q  h 3.04  t   t   6   f   g  43 560   +
## 3:  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 280 824   +
## 4:  b 27.83 1.540  u  g  w  v 3.75  t   t   5   t   g 100   3   +
## 5:  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 120   0   +
## 6:  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 360   0   +
It is good to have an idea of each variable's class, its levels, and some sample values.
str(crx.data)
## Classes 'data.table' and 'data.frame': 690 obs. of 16 variables:
## $ V1 : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
## $ V2 : num 30.8 58.7 24.5 27.8 20.2 ...
## $ V3 : num 0 4.46 0.5 1.54 5.62 ...
## $ V4 : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
## $ V5 : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
## $ V6 : Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
## $ V7 : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
## $ V8 : num 1.25 3.04 1.5 3.75 1.71 ...
## $ V9 : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
## $ V10: Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
## $ V11: int 1 6 0 5 0 0 0 0 0 0 ...
## $ V12: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
## $ V13: Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
## $ V14: int 202 43 280 100 120 360 164 80 180 52 ...
## $ V15: int 0 560 824 3 0 0 31285 1349 314 1442 ...
## $ V16: Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, ".internal.selfref")=<externalptr>
Let's see the full column names in our credit data set.
names(crx.data)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11"
## [12] "V12" "V13" "V14" "V15" "V16"
Let's examine the distribution of the target variable, the one we are trying to predict. To view the distribution, we will convert its values to numbers. We saw that the target variable's values were "+" and "-", so in our conversion "+" = 1 and "-" = 0.
hist(as.numeric(crx.data$V16) - 1)
Let's see how our target variable now looks.
as.numeric(crx.data$V16) - 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0
## [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [316] 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [351] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [386] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [421] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [456] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [491] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
## [526] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
## [561] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [596] 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
## [631] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [666] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Let's now actually change the target variable to 0s and 1s in our data set for the rest of the analysis.
crx.data$V16 <- as.numeric(crx.data$V16) - 1
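A quick class-balance check is worthwhile before modelling (a small addition; with a 0/1 target, the mean is simply the share of approvals):
table(crx.data$V16)  # expected: 383 zeros ("-") and 307 ones ("+"), per the UCI description of this data set
mean(crx.data$V16)   # about 0.445, so the classes are fairly balanced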
Let's check the distribution of each of the numeric attributes of the credit data using summary statistics and histograms.
numeric_data <- as.data.frame(crx.data[, c(2, 3, 8, 11, 14, 15)])
summary(numeric_data)
##       V2              V3               V8              V11
## Min.   :13.75   Min.   : 0.000   Min.   : 0.000   Min.   : 0.0
## 1st Qu.:22.60   1st Qu.: 1.000   1st Qu.: 0.165   1st Qu.: 0.0
## Median :28.46   Median : 2.750   Median : 1.000   Median : 0.0
## Mean   :31.57   Mean   : 4.759   Mean   : 2.223   Mean   : 2.4
## 3rd Qu.:38.23   3rd Qu.: 7.207   3rd Qu.: 2.625   3rd Qu.: 3.0
## Max.   :80.25   Max.   :28.000   Max.   :28.500   Max.   :67.0
## NA's   :12
##      V14             V15
## Min.   :   0   Min.   :     0.0
## 1st Qu.:  75   1st Qu.:     0.0
## Median : 160   Median :     5.0
## Mean   : 184   Mean   :  1017.4
## 3rd Qu.: 276   3rd Qu.:   395.5
## Max.   :2000   Max.   :100000.0
## NA's   :13
par(mfrow = c(2, 3))
for (i in 1:6) {
  hist(numeric_data[, i], main = names(numeric_data)[i])
}
The variable V2 has a roughly bell-shaped curve, but most of the numeric variables in our data set are skewed to the right (their tails point to the right). This suggests we could preprocess or transform the data, using techniques like Box-Cox, to see whether that enhances performance after our first model is built. (We will not be covering that technique in this article.)
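As an illustration only, and using a plain log transform rather than a full Box-Cox fit, compressing the long right tail of V15 would look something like this:
# illustrative sketch: log1p rather than log, since V15 contains zeros
par(mfrow = c(1, 2))
hist(numeric_data$V15, main = "V15 (raw)")
hist(log1p(numeric_data$V15), main = "V15 (log1p)")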
A closer look at V14 and particularly V15 (together with the summary statistics above) indicates there might be some outliers. For instance, the mean of V15 is 1017.4 whereas its maximum is 100000.0, which is far from typical. There are advanced techniques to statistically detect outliers (not covered here), and we will not be removing these data points in this analysis.
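As a rough, illustrative check (not one of the formal detection techniques alluded to above), the common 1.5 × IQR rule counts how many V15 values sit far above the upper quartile:
# rough IQR-rule outlier count, for illustration only; we keep all points in this analysis
q3 <- quantile(numeric_data$V15, 0.75, na.rm = TRUE)
upper <- q3 + 1.5 * IQR(numeric_data$V15, na.rm = TRUE)
sum(numeric_data$V15 > upper, na.rm = TRUE)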
Let's take a closer look at V15 again.
par(mfrow = c(1, 1))
hist(numeric_data[, "V15"])
# hist(log10(numeric_data[, "V15"]))
summary(numeric_data$V15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.     Max.
##     0.0     0.0     5.0  1017.0   395.5 100000.0
Let's look at the non-numeric variables to get a sense of how they are distributed.
non_numeric <- as.data.frame(crx.data[, -c(2, 3, 8, 11, 14, 15)])
How many are there?
dim(non_numeric)
## [1] 690 10
What are they?
names(non_numeric)
## [1] "V1" "V4" "V5" "V6" "V7" "V9" "V10" "V12" "V13" "V16"
Let's see their distribution. V16 was a factor variable that we converted to numeric earlier, but we will still view it as part of this distribution.
par(mfrow = c(2, 5))
for (i in 1:10) {
  plot(non_numeric[, i], main = names(non_numeric)[i])
}
Let's examine the relationships between the numeric features of the applications.
plot(crx.data[, c(2, 3, 8, 11, 14, 15)], col = crx.data$V16 + 1)  # +1 so that class 0 gets a visible colour
Let's check the correlations between the numeric features.
correlations <- cor(crx.data[, c(2, 3, 8, 11, 14, 15)])
print(correlations)
##      V2        V3         V8         V11 V14        V15
## V2    1        NA         NA          NA  NA         NA
## V3   NA 1.0000000 0.29890156  0.27120674  NA 0.12312115
## V8   NA 0.2989016 1.00000000  0.32232967  NA 0.05134493
## V11  NA 0.2712067 0.32232967  1.00000000  NA 0.06369244
## V14  NA        NA         NA          NA   1         NA
## V15  NA 0.1231212 0.05134493  0.06369244  NA 1.00000000
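The NAs appear because V2 and V14 still contain missing values, and cor() propagates them by default. One option, before any imputation, is to compute each correlation from pairwise complete observations:
# use every pair of rows where both columns are observed
cor(crx.data[, c(2, 3, 8, 11, 14, 15)], use = "pairwise.complete.obs")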
Let's get a pairwise visualization of the whole data set.
pairs(V16 ~ ., data = non_numeric, col = non_numeric$V16 + 1)  # +1 so that class 0 gets a visible colour
Let's see a summary of the full data set.
summary(crx.data)
##    V1            V2              V3              V4          V5
## a   :210   Min.   :13.75   Min.   : 0.000   l   :  2   g   :519
## b   :468   1st Qu.:22.60   1st Qu.: 1.000   u   :519   gg  :  2
## NA's: 12   Median :28.46   Median : 2.750   y   :163   p   :163
##            Mean   :31.57   Mean   : 4.759   NA's:  6   NA's:  6
##            3rd Qu.:38.23   3rd Qu.: 7.207
##            Max.   :80.25   Max.   :28.000
##            NA's   :12
##       V6           V7            V8         V9      V10
## c      :137   v      :399   Min.   : 0.000   f:329   f:395
## q      : 78   h      :138   1st Qu.: 0.165   t:361   t:295
## w      : 64   bb     : 59   Median : 1.000
## i      : 59   ff     : 57   Mean   : 2.223
## aa     : 54   j      :  8   3rd Qu.: 2.625
## (Other):289   (Other): 20   Max.   :28.500
## NA's   :  9   NA's   :  9
##      V11       V12     V13          V14             V15
## Min.   : 0.0   f:374   g:625   Min.   :   0   Min.   :     0.0
## 1st Qu.: 0.0   t:316   p:  8   1st Qu.:  75   1st Qu.:     0.0
## Median : 0.0           s: 57   Median : 160   Median :     5.0
## Mean   : 2.4                   Mean   : 184   Mean   :  1017.4
## 3rd Qu.: 3.0                   3rd Qu.: 276   3rd Qu.:   395.5
## Max.   :67.0                   Max.   :2000   Max.   :100000.0
##                                NA's   :13
##      V16
## Min.   :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean   :0.4449
## 3rd Qu.:1.0000
## Max.   :1.0000
##
What percentage of the data set is missing values? If it is more than 10%, we should consider finding more (and more appropriate) data. Will this percentage of missing data have an impact on our model? How do we handle missing data?
missing_pct <- sum(!complete.cases(crx.data)) / nrow(crx.data) * 100
missing_pct
## [1] 5.362319
This shows that about 5.36% of the rows contain missing data.
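To see where those gaps are, a quick per-column count helps (a small addition to the walkthrough):
colSums(is.na(crx.data))
# per the summary above: V1 and V2 have 12 NAs each, V4 and V5 have 6, V6 and V7 have 9, and V14 has 13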
HANDLING MISSING VALUES
Impute the missing values in variable 1 (V1) with the most frequently occurring value, the MODE. The list of variables is short, so we will impute the missing values one variable at a time. Ideally, you would create a function to shorten the entire process (a sketch of such a helper follows the manual steps below).
Create a function to get the mode.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
v1.mode <- getmode(crx.data$V1)
Impute the missing values with the mode.
crx.data[is.na(crx.data$V1), 1] <- v1.mode
Preview the imputation.
table(crx.data$V1)
##
##   a   b
## 210 480
Impute the missing values in variable V2 with the mean.
crx.data[is.na(crx.data$V2), 2] <- mean(crx.data$V2, na.rm = TRUE)
Preview the imputation.
summary(crx.data$V2)
##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 13.75   22.67  28.62 31.57   37.71 80.25
Impute the missing values in V4 with the mode.
crx.data[is.na(crx.data$V4), 4] <- getmode(crx.data$V4)
# preview the imputation
table(crx.data$V4)
##
##   l   u   y
##   2 525 163
Impute the missing values in V5 with the mode.
crx.data[is.na(crx.data$V5), 5] <- getmode(crx.data$V5)
# preview the imputation
table(crx.data$V5)
##
##   g  gg   p
## 525   2 163
V6: impute the missing values with the mode.
crx.data[is.na(crx.data$V6), 6] <- getmode(crx.data$V6)
# preview the imputation
table(crx.data$V6)
##
##  aa   c  cc   d   e  ff   i   j   k   m   q   r   w   x
##  54 146  41  30  25  53  59  10  51  38  78   3  64  38
Any missing values left?
table(is.na(crx.data$V6))
##
## FALSE
##   690
V7: impute the missing values with the mode.
crx.data[is.na(crx.data$V7), 7] <- getmode(crx.data$V7)
# preview the imputation
table(crx.data$V7)
##
##  bb  dd  ff   h   j   n   o   v   z
##  59   6  57 138   8   4   2 408   8
V14: impute the missing values with the mean (rounded to an integer).
crx.data[is.na(crx.data$V14), 14] <- as.integer(mean(crx.data$V14, na.rm = TRUE))
# preview the imputation (note: column names are case-sensitive, so crx.data$v14 would return NULL)
summary(crx.data$V14)
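As noted earlier, a single helper could shorten all of the steps above. A minimal sketch (an addition to the original walkthrough, not part of it): numeric columns get the mean, everything else gets the mode.
# hypothetical helper: impute every column of a data.table in one pass
impute_all <- function(dt) {
  for (col in names(dt)) {
    miss <- is.na(dt[[col]])
    if (!any(miss)) next
    v <- dt[[col]]
    fill <- if (is.numeric(v)) mean(v, na.rm = TRUE) else getmode(v)
    if (is.integer(v)) fill <- as.integer(round(fill))  # keep integer columns integer
    data.table::set(dt, which(miss), col, fill)
  }
  dt
}
# crx.data <- impute_all(crx.data)  # would replace the manual steps above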
Let's load the mlbench and caret libraries. You can install them first if they are not already installed.
# install.packages("mlbench")
# install.packages("caret")
library(mlbench)
library(caret)
## Loading required package: ggplot2
Split the data into training and test sets. We define a 70%/30% train/test split of the data set.
set.seed(257)
trainIndex <- createDataPartition(crx.data$V16, p = 0.70, list = FALSE)
dataTrain <- crx.data[trainIndex, ]
dataTest <- crx.data[-trainIndex, ]
Run logistic regression on the training dataset
model <- glm(V16 ~ ., family = binomial(link = 'logit'), data = dataTrain)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Let's see the summary of our model results. (The warning above usually signals quasi-complete separation: some combinations of predictor levels almost perfectly determine the outcome, which also explains the enormous standard errors on several dummy variables below.)
summary(model)
##
## Call:
## glm(formula = V16 ~ ., family = binomial(link = "logit"), data = dataTrain)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -2.4383 -0.3272 -0.1221  0.4432  3.3254
##
## Coefficients: (2 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.243e+00  2.058e+03  -0.002 0.998355
## V1b         -2.702e-01  3.618e-01  -0.747 0.455196
## V2           1.543e-02  1.524e-02   1.012 0.311384
## V3          -1.895e-02  3.193e-02  -0.594 0.552808
## V4u          1.338e-01  2.058e+03   0.000 0.999948
## V4y         -3.783e-01  2.058e+03   0.000 0.999853
## V5gg                NA         NA      NA       NA
## V5p                 NA         NA      NA       NA
## V6c          3.418e-01  5.888e-01   0.580 0.561654
## V6cc         1.250e+00  8.525e-01   1.466 0.142633
## V6d          6.549e-01  9.276e-01   0.706 0.480152
## V6e          2.194e+00  1.282e+00   1.711 0.087154 .
## V6ff        -1.944e+01  1.455e+03  -0.013 0.989340
## V6i          3.286e-01  7.948e-01   0.413 0.679283
## V6j         -1.753e+01  1.455e+03  -0.012 0.990392
## V6k         -1.303e-01  7.723e-01  -0.169 0.866028
## V6m          3.100e-01  8.137e-01   0.381 0.703214
## V6q          6.079e-01  6.437e-01   0.944 0.344941
## V6r         -1.399e+01  1.455e+03  -0.010 0.992329
## V6w          1.226e+00  6.897e-01   1.777 0.075571 .
## V6x          2.662e+00  9.430e-01   2.823 0.004762 **
## V7dd        -3.426e-01  1.970e+00  -0.174 0.861909
## V7ff         1.865e+01  1.455e+03   0.013 0.989778
## V7h          1.237e+00  6.711e-01   1.844 0.065242 .
## V7j          1.900e+01  1.455e+03   0.013 0.989584
## V7n          3.458e+00  1.660e+00   2.083 0.037278 *
## V7o         -1.259e+01  1.455e+03  -0.009 0.993098
## V7v          5.835e-01  6.088e-01   0.958 0.337841
## V7z         -2.992e+00  1.861e+00  -1.607 0.107950
## V8           3.096e-02  5.223e-02   0.593 0.553275
## V9t          3.621e+00  3.997e-01   9.060  < 2e-16 ***
## V10t         9.237e-01  4.395e-01   2.102 0.035592 *
## V11          7.929e-02  6.318e-02   1.255 0.209516
## V12t        -1.047e-01  3.268e-01  -0.321 0.748577
## V13p         4.057e+00  1.051e+00   3.861 0.000113 ***
## V13s         2.915e-01  6.023e-01   0.484 0.628408
## V14         -2.854e-03  9.985e-04  -2.859 0.004256 **
## V15          6.357e-04  2.190e-04   2.903 0.003699 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 667.84 on 482 degrees of freedom
## Residual deviance: 291.88 on 447 degrees of freedom
## AIC: 363.88
##
## Number of Fisher Scoring iterations: 14
From the result, we can see that the variables V9t, V10t, V13p, V14, and V15, along with the dummy levels V6x and V7n, are statistically significant, as they have p-values less than 0.05.
For instance, from the result we can see that a change in V9 from f to t increases the log-odds of being approved for credit by 3.621, holding the other variables constant (the coefficient of V9t is positive).
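To put that on a more interpretable scale, exponentiating a log-odds coefficient gives an odds ratio; a quick worked example with the estimate above:
exp(3.621)        # about 37.4: V9 = t multiplies the odds of approval by roughly 37
exp(coef(model))  # odds ratios for every coefficient at once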
Now let's check our deviances. The smaller the deviance, the better.
Our null deviance = 667.84. The null deviance shows how well the model performs with no predictor variables, accounting only for the intercept. Our residual deviance = 291.88, and this indicates how well the model performs once we add the predictor variables.
We can see that the deviance drops substantially (which means our model performs better when we add our predictor variables). In other words, we can make better decisions on whether to approve a person for credit by taking these significant variables into account rather than simply guessing.
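One way to formalize that comparison (an addition to the original analysis) is a likelihood-ratio chi-square test on the drop in deviance, using the figures reported in the summary:
# deviance drop of 667.84 - 291.88 = 375.96 on 482 - 447 = 35 degrees of freedom
pchisq(667.84 - 291.88, df = 482 - 447, lower.tail = FALSE)
# the p-value is vanishingly small: the predictors jointly improve on the intercept-only model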
Let’s see a table of deviances by running anova on our model
anova(model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: V16
##
## Terms added sequentially (first to last)
##
##
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
## NULL                   482     667.84
## V1    1    1.406       481     666.43 0.2356401
## V2    1   14.303       480     652.13 0.0001556 ***
## V3    1   10.101       479     642.03 0.0014819 **
## V4    2   20.973       477     621.05 2.791e-05 ***
## V5    0    0.000       477     621.05
## V6   13   71.731       464     549.32 3.847e-10 ***
## V7    8    7.839       456     541.48 0.4493155
## V8    1   11.281       455     530.20 0.0007831 ***
## V9    1  175.302       454     354.90 < 2.2e-16 ***
## V10   1   21.597       453     333.30 3.364e-06 ***
## V11   1    5.416       452     327.89 0.0199566 *
## V12   1    2.241       451     325.65 0.1343758
## V13   2   13.604       449     312.04 0.0011113 **
## V14   1    8.351       448     303.69 0.0038541 **
## V15   1   11.814       447     291.88 0.0005877 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Taking note of the null deviance, we can observe that as the predictor variables are added sequentially to the model (first to last, as they appear in the result), the deviance falls (the model improves). The deviance drops most when V9 is added (from 530.20 down to 354.90), consistent with its strong significance in the model summary.
Let’s make predictions using the data we kept aside, which is the Test Data
probabilities <- predict(model, newdata = dataTest[, -16], type = 'response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
predictions <- ifelse(probabilities > 0.5, '1', '0')
Let’s summarize the accuracy of the predictions
table(predictions, dataTest$V16)
##
## predictions   0   1
##           0 112  10
##           1  15  70
It can be seen that the correct predictions are on the diagonal and the wrong ones are off-diagonal.
We predicted 70 to be "1" (TRUE, that they should be approved) and they were indeed "1" (TRUE, correctly approved). Those are our TRUE POSITIVES (TP).
We also predicted 112 to be "0" (FALSE, that they should be rejected) and they were indeed "0" (actually rejected in the test data set); these are our TRUE NEGATIVES (TN).
We predicted 15 to be "1" (that they should be accepted), but those predictions were wrong (in the test data set they were not accepted). Those are our FALSE POSITIVES (FP).
And we predicted 10 to be "0" (that they should be rejected), but that was wrong, as they were "1" (accepted in the test data set). Those are the FALSE NEGATIVES (FN).
Let's check the overall accuracy of our model.
misClasificError <- mean(predictions != dataTest$V16)
print(paste('Accuracy', 1 - misClasificError))
## [1] "Accuracy 0.879227053140097"
The overall accuracy of our model is 87.9%. This can be improved further with other advanced techniques and parameter tuning (not covered here).
Even with 87.9% overall accuracy, it is always better to also check the recall (the TRUE POSITIVE rate, or sensitivity): Recall = TP / (TP + FN),
and the precision: Precision = TP / (TP + FP).
And it is also good to look at the TRUE NEGATIVE rate (the specificity): Specificity = TN / (TN + FP).
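Computing these from the confusion matrix above (a small addition; the counts come straight from the table):
cm <- table(predictions, dataTest$V16)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
TP / (TP + FN)  # recall/sensitivity: 70 / 80   = 0.875
TP / (TP + FP)  # precision:          70 / 85   ~ 0.824
TN / (TN + FP)  # specificity:        112 / 127 ~ 0.882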
Let's match the predictions back to the original test data set.
copy_dataTest <- data.frame(dataTest)
dim(copy_dataTest)
## [1] 207 16
copy_dataTest_pred <- cbind(copy_dataTest, predictions)
head(copy_dataTest_pred)
##   V1       V2     V3 V4 V5 V6 V7    V8 V9 V10 V11 V12 V13 V14   V15 V16 predictions
## 1  b 30.83000  0.000  u  g  w  v 1.250  t   t   1   f   g 202     0   1           1
## 2  a 24.50000  0.500  u  g  q  h 1.500  t   f   0   f   g 280   824   1           1
## 3  b 27.83000  1.540  u  g  w  v 3.750  t   t   5   t   g 100     3   1           1
## 4  b 32.08000  4.000  u  g  m  v 2.500  t   f   0   t   g 360     0   1           0
## 5  b 33.17000  1.040  u  g  r  h 6.500  t   f   0   t   g 164 31285   1           1
## 6  a 22.92000 11.585  u  g cc  v 0.040  t   f   0   f   g  80  1349   1           1
(Only the first six rows are shown here; the full matched table is written to CSV below.)
Write the matched predictions out to a CSV file.
write.csv(copy_dataTest_pred, file = "match_predictions.csv")
Let's check the Area Under the Curve (AUC) and the ROC plot.
# install.packages("ROCR")
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
##     lowess
predicted <- predict(model, newdata = dataTest[, -16], type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pr <- prediction(predicted, dataTest$V16)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
The ROC curve rises quickly toward the top-left corner, staying close to a true positive rate of 1, which is a good sign.
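For reference (a small addition), you can redraw the curve with the diagonal that a random classifier would trace:
plot(prf)
abline(a = 0, b = 1, lty = 2)  # diagonal = performance of random guessing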
Let's compute the area under the curve (AUC).
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.9265748
Hopefully this gives you a basis for predicting whether or not someone will be approved for credit, based on some characteristics (variables) we know about them!