Provides utility functions for binary classification problems.
‘CollapseLevels’ provides functions to collapse the levels of an attribute based on response rates.
It also provides functions to compute and display Information Value and Weight of Evidence for attributes, and to convert numeric variables to categorical ones by binning.
These functions work only for binary classification problems and are intended for the data exploration part of such a problem.
The binary outcome variable may be a factor with two levels or an integer (or numeric) with two unique values.
This package includes a data set named “German_Credit”, which classifies customers as “Good” or “Bad” according to their credit risk. The data set was contributed by Professor Dr. Hans Hofmann and can be downloaded from the UCI Machine Learning Repository. The outcome variable of the downloaded data set is an integer with two unique values, 1 and 2.
## Loading required package: magrittr
## 'data.frame': 1000 obs. of 21 variables:
## $ Account_Balance : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Credit_History : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ Purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ Credit_Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ Saving_Accounts_Bonds : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ Current_Employment_Length : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ Installment_Rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ MaritalStatusnGender : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ Guarantors : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ Duration in Current Address: int 4 2 3 4 4 4 4 2 4 2 ...
## $ Valuable_Asset : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ Other_Credit : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ Existing_Credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ Job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ Dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ ForeignWorker : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ Good_Bad : int 1 2 1 1 2 1 1 1 1 2 ...
The functions in the package are as follows:
The function levelsCollapser displays the response rates by the levels of an attribute. Levels with similar response rates may be combined.
We will explore the levels of the attribute “Credit_History” in the German_Credit data set.
data("German_Credit")
# Create an empty list to hold the data structures returned by levelsCollapser
l<-list()
l<-levelsCollapser(German_Credit,resp="Good_Bad",bins=10)
## Warning: `group_by_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `group_by()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the CollapseLevels package.
## Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# dset holds the data set
# German_Credit is the data set
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins for categorizing/binning numeric variables
# The default value of bins is 10
# If you want the default value of bins, the parameter need not be specified in the function call
# The function returns a list.
# For every attribute in the data set, the list contains a table that shows the response rates
# by the levels of the attribute
# Collapse levels with similar response percentages.
l$Credit_History
## Credit_History tot response non_response response_pct non_response_pct
## 1 A30 40 25 15 8.333333 2.142857
## 2 A31 49 28 21 9.333333 3.000000
## 4 A33 88 28 60 9.333333 8.571429
## 5 A34 293 50 243 16.666667 34.714286
## 3 A32 530 169 361 56.333333 51.571429
## response_pct_change
## 1 0.00000
## 2 10.71429
## 4 0.00000
## 5 44.00000
## 3 70.41420
The column response gives the number of responses (binary outcome variable is 1) for the level.
The column response_pct gives the level's share of all responses, as a percentage; for A30 this is 25 of 300 responses, or 8.33.
The table is sorted by response_pct.
The column response_pct_change gives the percentage change in response_pct from one row to the next.
As seen from the response_pct_change column, the change in response_pct from level A31 to A33 is 0, so these levels may be combined.
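The following sketch reproduces the response_pct and response_pct_change columns from the response counts in the table above, using base R only. The formula is inferred from the printed numbers (the change is expressed relative to the current row), so treat it as an assumption about how the package computes the column rather than its actual implementation. The names `tab` and `prev` are illustrative.

```r
# Response counts per level of Credit_History, copied from the table above
# and already sorted by response_pct.
tab <- data.frame(
  Credit_History = c("A30", "A31", "A33", "A34", "A32"),
  response       = c(25, 28, 28, 50, 169)
)

# Each level's share of all responses, as a percentage.
tab$response_pct <- 100 * tab$response / sum(tab$response)

# Change from the previous row, relative to the current row (assumed formula).
# The first row has no predecessor, so it is compared with itself and gets 0.
prev <- c(tab$response_pct[1], head(tab$response_pct, -1))
tab$response_pct_change <- 100 * (tab$response_pct - prev) / tab$response_pct

print(tab)
```

Rows with a small response_pct_change sit next to a level with a similar response share, which is what makes them candidates for collapsing.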
The function numericToCategorical categorizes a numeric attribute.
We will categorize the numeric attribute “Duration” in the German_Credit data set.
# Create an empty list to hold the data structures returned by numericToCategorical
l<-list()
# Call the function numericToCategorical to categorize the numeric attribute Duration
# dset holds the data set
# German_Credit is the data set
# col specifies the name of the numeric variable we want to categorize
# resp specifies the name of the binary response variable
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
l<-numericToCategorical(dset=German_Credit,col="Duration",resp="Good_Bad",bins=10,adjFactor=0.5)
# The default value of bins is 10, and that of adjFactor is 0.5.
# If you want the default values for these parameters, they need not be specified in the
# function call
# l$categoricalVariable gives the binned categorized variable.
# A bin [a,b) denotes >=a and <b
# A bin [a,b] denotes >=a and <=b
head(l$categoricalVariable)
## [1] [0,9) [0,9) [0,9) [0,9) [0,9) [0,9)
## Levels: [0,9) [9,12) [12,15) [15,18) [18,24) [24,30) [30,36) [36,72]
## categoricalDuration tot response non_response response_pct non_response_pct
## 1 [0,9) 94 10 84 0.03333333 0.12000000
## 2 [9,12) 86 17 69 0.05666667 0.09857143
## 3 [12,15) 187 50 137 0.16666667 0.19571429
## 4 [15,18) 66 13 53 0.04333333 0.07571429
## 5 [18,24) 153 52 101 0.17333333 0.14428571
## 6 [24,30) 201 62 139 0.20666667 0.19857143
## 7 [30,36) 43 14 29 0.04666667 0.04142857
## 8 [36,72] 170 82 88 0.27333333 0.12571429
## woe iv
## 1 1.28093385 0.1110142666
## 2 0.55359530 0.0231982792
## 3 0.16066006 0.0046667922
## 4 0.55804470 0.0180700187
## 5 -0.18342106 0.0053279451
## 6 -0.03995831 0.0003234721
## 7 -0.11905936 0.0006236443
## 8 -0.77668029 0.1146528052
## [1] 0.2778772
# l$collapseLevels gives a table of the response rates by the levels of the categorized variable
# Levels with similar response rates may be collapsed
l$collapseLevels
## categoricalDuration tot response non_response response_pct non_response_pct
## 1 [0,9) 94 10 84 3.333333 12.000000
## 4 [15,18) 66 13 53 4.333333 7.571429
## 7 [30,36) 43 14 29 4.666667 4.142857
## 2 [9,12) 86 17 69 5.666667 9.857143
## 3 [12,15) 187 50 137 16.666667 19.571429
## 5 [18,24) 153 52 101 17.333333 14.428571
## 6 [24,30) 201 62 139 20.666667 19.857143
## 8 [36,72] 170 82 88 27.333333 12.571429
## response_pct_change
## 1 0.000000
## 4 23.076923
## 7 7.142857
## 2 17.647059
## 3 66.000000
## 5 3.846154
## 6 16.129032
## 8 24.390244
The change in response_pct from level [15,18) to [30,36) is only about 7 percent.
So these levels may be combined.
Similarly, levels [12,15) and [18,24) may be combined.
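To make the bin notation concrete, the sketch below uses base R's `cut()` to produce the same interval labels as the output above. The break points are copied from the printed levels; how numericToCategorical chooses its breaks is not shown in this document, so this illustrates the notation only, not the package's binning method. The names `duration`, `breaks`, and `binned` are illustrative.

```r
# First ten Duration values from the str() output, plus the maximum 72
duration <- c(6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 72)

# Break points copied from the levels printed above
breaks <- c(0, 9, 12, 15, 18, 24, 30, 36, 72)

# right = FALSE gives left-closed bins [a,b);
# include.lowest = TRUE closes the last bin on the right, giving [36,72]
binned <- cut(duration, breaks = breaks, right = FALSE, include.lowest = TRUE)

levels(binned)
table(binned)
```

A value of 36 falls in [36,72] rather than [30,36), and 72 itself is included only because the last bin is closed on both sides.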
The function IVCalc2 displays the Information Value of each attribute as a whole (not level-wise) for all the attributes in the data set.
# Create an empty data frame to hold the result returned by IVCalc2
d<-data.frame()
# dset holds the data set
# German_Credit is the data set
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# If you want the default values for these parameters, they need not be specified in the
# function call
# The function returns a data frame.
# For every attribute, the data frame gives the Information Value of the attribute
d<-IVCalc2(dset=German_Credit,resp="Good_Bad")
d
## Variable IV
## 1 Account_Balance 0.666011503
## 2 Duration 0.277877223
## 3 Credit_History 0.293233547
## 4 Purpose 0.169195066
## 5 Credit_Amount 0.113980630
## 6 Saving_Accounts_Bonds 0.196009557
## 7 Current_Employment_Length 0.086433631
## 8 Installment_Rate 0.020522345
## 9 MaritalStatusnGender 0.044670678
## 10 Guarantors 0.032019322
## 11 Duration in Current Address 0.003247037
## 12 Valuable_Asset 0.112638262
## 13 Age 0.121227707
## 14 Other_Credit 0.057614542
## 15 Housing 0.083293434
## 16 Existing_Credits 0.010083557
## 17 Job 0.008762766
## 18 Dependents 0.000000000
## 19 Telephone 0.006377605
## 20 ForeignWorker 0.043877412
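Because the result is an ordinary data frame with the columns Variable and IV, it can be ranked with base R. The sketch below rebuilds a few rows of the output above by hand so that it runs without the package; `d_example` and `ranked` are illustrative names, and in practice the same `order()` call could be applied to `d` directly.

```r
# A few rows copied from the IVCalc2 output above
d_example <- data.frame(
  Variable = c("Account_Balance", "Duration", "Credit_History",
               "Purpose", "Dependents"),
  IV       = c(0.666011503, 0.277877223, 0.293233547,
               0.169195066, 0.000000000)
)

# Sort attributes from highest to lowest Information Value
ranked <- d_example[order(d_example$IV, decreasing = TRUE), ]

print(ranked, row.names = FALSE)
```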
The function IVCalc displays the Information Values by the levels of an attribute. This information is displayed for all attributes in the data set.
# Create an empty list to hold the data structures returned by IVCalc function
l<-list()
# dset holds the data set
# German_Credit is the data set
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# If you want the default values for these parameters, they need not be specified in the
# function call
# The function returns a list.
# For every attribute, the list gives the Information Values by the levels of the
# attribute. It also gives the Information Value for the entire attribute
l<-IVCalc(dset=German_Credit,resp="Good_Bad")
#Explore Information Values for the attribute Credit_History
l$Credit_History
## $IVTable
## Credit_History tot response non_response response_pct non_response_pct
## 1 A30 40 25 15 0.08333333 0.02142857
## 2 A31 49 28 21 0.09333333 0.03000000
## 3 A32 530 169 361 0.56333333 0.51571429
## 4 A33 88 28 60 0.09333333 0.08571429
## 5 A34 293 50 243 0.16666667 0.34714286
## woe iv
## 1 -1.35812348 0.0840743109
## 2 -1.13497993 0.0718820624
## 3 -0.08831862 0.0042056484
## 4 -0.08515781 0.0006488214
## 5 0.73374058 0.1324227042
##
## $IV
## [1] 0.2932335
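The woe and iv columns above can be reproduced from the response and non_response counts alone. The sketch below does this in base R for Credit_History. The formulas (WoE as the log of non_response_pct over response_pct, and IV as the difference of the two shares times the WoE) are inferred from the printed numbers and match them, but they are not taken from the package source. The name `ivt` is illustrative.

```r
# Counts per level of Credit_History, copied from $IVTable above
ivt <- data.frame(
  Credit_History = c("A30", "A31", "A32", "A33", "A34"),
  response       = c(25, 28, 169, 28, 50),
  non_response   = c(15, 21, 361, 60, 243)
)

# Share of all responses / non-responses that fall in each level
ivt$response_pct     <- ivt$response / sum(ivt$response)
ivt$non_response_pct <- ivt$non_response / sum(ivt$non_response)

# Weight of Evidence per level (natural log), assumed formula
ivt$woe <- log(ivt$non_response_pct / ivt$response_pct)

# Information Value contribution per level
ivt$iv <- (ivt$non_response_pct - ivt$response_pct) * ivt$woe

print(ivt)

# The attribute's Information Value is the sum over its levels
sum(ivt$iv)
```

The final sum agrees with the `$IV` value printed above and with the Credit_History row of the IVCalc2 output.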
The function displayWOE displays the Weight of Evidence of the levels of an attribute.
# dset holds the data set
# German_Credit is the data set
# col specifies the name of the variable for which we want to display the Weight of Evidence values
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# If you want the default values for these parameters, they need not be specified in the
# function call
# Display the Weight of Evidence for the levels of the Job attribute
displayWOE(German_Credit,col="Job",resp="Good_Bad")
## Warning: Use of `df[[1]]` is discouraged.
## ℹ Use `.data[[1]]` instead.
The function displayResponseRatebyLevels displays the response percentages of the levels of an attribute.
# dset holds the data set
# German_Credit is the data set
# col specifies the name of the variable for which we want to display the response percents
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# If you want the default values for these parameters, they need not be specified in the
# function call
# Display the response percentages for the levels of the Account_Balance attribute
displayResponseRatebyLevels(German_Credit,col="Account_Balance",resp="Good_Bad")
## Warning: Use of `df[[1]]` is discouraged.
## ℹ Use `.data[[1]]` instead.
## Warning: Use of `df[["response_pct"]]` is discouraged.
## ℹ Use `.data[["response_pct"]]` instead.
The function displayIV displays the Information Values of the levels of an attribute.
# dset holds the data set
# German_Credit is the data set
# col specifies the name of the variable for which we want to display the IV values
# resp specifies the name of the binary response variable in the data set
# bins denotes the number of bins
# adjFactor denotes what is to be added to the response or non_response values for
# a level of the attribute if the response or non_response is zero for that level
# The default value of bins is 10, and that of adjFactor is 0.5.
# If you want the default values for these parameters, they need not be specified in the
# function call
# Display the IV values for the levels of the Account_Balance attribute
displayIV(German_Credit,col="Account_Balance",resp="Good_Bad")
## Warning: Use of `df[[1]]` is discouraged.
## ℹ Use `.data[[1]]` instead.
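The adjFactor parameter described in the comments throughout this document exists because a level with zero responses or zero non-responses would make the Weight of Evidence infinite. The sketch below illustrates the idea on made-up counts. The helper `adjust_zero` is hypothetical and is not a function of the package, and whether the package computes the shares before or after the adjustment is an assumption here.

```r
# Hypothetical helper: add adjFactor to any count that is exactly zero
adjust_zero <- function(counts, adjFactor = 0.5) {
  counts[counts == 0] <- counts[counts == 0] + adjFactor
  counts
}

# Made-up counts for three levels; the first level has no responses
response     <- c(0, 10, 20)
non_response <- c(5, 15, 30)

# Without the adjustment the first WoE is infinite (log of 1/6 divided by 0)
woe_raw <- log((non_response / sum(non_response)) / (response / sum(response)))

# With the adjustment every WoE is finite
response_adj <- adjust_zero(response)
woe_adj <- log((non_response / sum(non_response)) /
               (response_adj / sum(response_adj)))

woe_raw
woe_adj
```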