--- title: "CollapseLevels" author: "Krishanu Mukherjee" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{"CollapseLevels"} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- Provides utility functions for binary classification problems. 'CollapseLevels' provides functions to collapse levels of an attribute based on response rates. Also provides functions to compute and display Information Value, and Weight of Evidence for attributes,and to convert numeric variables to categorical by binning. These functions only work for binary classification problems. This package provides utility functions for the data exploration part of binary classification. The binary outcome variable may be a factor with two levels or an integer (or numeric ) with two unique values. ## Data Set This package includes a data set named "German_Credit". This data set classifies customers as "Good" or "Bad" as per their credit risks. This data set was contributed by Professor Dr. Hans Hofmann,and can be downloaded from the UCI Machine Learning Repository. The outcome variable of the downloaded data set is an integer with two unique values 1, and 2. ```{r} library(CollapseLevels) data("German_Credit") str(German_Credit) ``` ## Functions The functions in the package are as follows ### levelsCollapser This function displays the response rates by the levels of an attribute Levels with similar response rates may be combined We will explore the levels of the attribute "Credit_History" in the German_Gredit data set ```{r} data("German_Credit") # Create an empty list to hold the data structures returned by numericToCategorical l<-list() l<-levelsCollapser(German_Credit,resp="Good_Bad",bins=10) # dset holds the data set # German_Credit is the data set # resp specifies the name of the binary response variable in the data set # bins denotes the number of bins for categorizing/binning numeric variables # Default value for the parameter bin is 10 # If you are supplying default values for bin , the parameter need not be specified in the function # The function returns a list. # For every attribute in the data set , the list contains a table thats shows the response rates # by the levels of the attribute # Collapse levels with similar response percentages. l$Credit_History ``` The column response gives the total number of responses (binary outcome variable is 1) for the level. The column response_pct gives the response percentage for the level. The table is sorted by response_pct. The column response_pct_change gives the percentage change in response_pct from one column to the next. As seen from response_pct_change column the change in response_pct from level A31 to A33 is 0 . So these levels may be combined. ### numericToCategorical This functions categorizes a numeric attribute. We will categorize the numeric attribute "Duration" in the German_Gredit data set ```{r} # Create an empty list to hold the data structures returned by numericToCategorical l<-list() # Call the function numericToCategorical to categorize the numeric attribute Duration # dset holds the data set # German_Credit is the data set # col specifies the name of the numeric variable we want to categorize # resp specifies the name of the binary response variable # bins denotes the number of bins # adjFactor denotes what is to be added to the response or non_response values for # a level of the attribute if the response or non_response is zero for that level l<-numericToCategorical(dset=German_Credit,col="Duration",resp="Good_Bad",bins=10,adjFactor=0.5) # Default value of bins is 10, and that of adjFactor is 0.5. # If you are supplying default values for these parameters , then they need not be specified in the # function call # l$categoricalVariable gives the binned categorized variable. # A bin [a,b) denotes >=a and =a and <=b head(l$categoricalVariable) # l$IVTable gives the Information values of the levels of the binned categorized variable l$IVTable # l$IV gives the Information Value for the binned categorized variable l$IV # l$collapseLevels gives a table of the response rates by the levels of the categorized variable # Levels with similar response rates may be collapsed l$collapseLevels ``` The change in response_pct from level [15,18) to [30,36) is only 7 percent . So these levels may be combined. Similarly levels [12,15), and [18,24) may be combined. ### IVCalc2 This function displays the Information Values ( not level wise ) for all the attributes ```{r} # Create an empty data frame l<-list() # dset holds the data set # German_Credit is the data set # resp specifies the name of the binary response variable in the data set # bins denotes the number of bins # Default value for the parameter bin is 10 # adjFactor denotes what is to be added to the response or non_response values for # a level of the attribute if the response or non_response is zero for that level # Default value of bins is 10, and that of adjFactor is 0.5. # If you are supplying default values for these parameters , then they need not be specified in the # function call # The function returns a data frame. # For every attribute, the function displays the information values for the attribute d<-IVCalc2(dset=German_Credit,resp="Good_Bad") d ``` ### IVCalc This function displays the Information Values by the levels of an attribute This information is displayed for all attributes in the data set ```{r} # Create an empty list to hold the data structures returned by IVCalc function l<-list() # dset holds the data set # German_Credit is the data set # resp specifies the name of the binary response variable in the data set # bins denotes the number of bins # Default value for the parameter bin is 10 # adjFactor denotes what is to be added to the response or non_response values for # a level of the attribute if the response or non_response is zero for that level # Default value of bins is 10, and that of adjFactor is 0.5. # If you are supplying default values for these parameters , then they need not be specified in the # function call # The function returns a list. # For every attribute, the function displays the information values by levels of the # attribute . It also displays the Information Value for the entire attribute l<-IVCalc(dset=German_Credit,resp="Good_Bad") #Explore Information Values for the attribute Credit_History l$Credit_History ``` ### displayWOE This function displays the Weight of Evidence of the levels of an attribute. ```{r} # dset holds the data set # German_Credit is the data set # col specifies the name of the variable for which we want to display the Weight of Evidence values # resp specifies the name of the binary response variable in the data set # bins denotes the number of bins # Default value for the parameter bin is 10 # adjFactor denotes what is to be added to the response or non_response values for # a level of the attribute if the response or non_response is zero for that level # Default value of bins is 10, and that of adjFactor is 0.5. # If you are supplying default values for these parameters , then they need not be specified in the # function call # Display the Weight of Evidence for the levels of the Job attribute displayWOE(German_Credit,col="Job",resp="Good_Bad") ``` ### displayResponseRatebyLevels This function displays the response percentages of the levels of an attribute. ```{r} # dset holds the data set # German_Credit is the data set # col specifies the name of the variable for which we want to display the response percents # resp specifies the name of the binary response variable in the data set # bins denotes the number of bins # Default value for the parameter bin is 10 # adjFactor denotes what is to be added to the response or non_response values for # a level of the attribute if the response or non_response is zero for that level # Default value of bins is 10, and that of adjFactor is 0.5. # If you are supplying default values for these parameters , then they need not be specified in the # function call # Display the response percentages for the levels of the Account_Balance attribute displayResponseRatebyLevels(German_Credit,col="Account_Balance",resp="Good_Bad") ``` ### displayIV This function displays the Information Values of the levels of an attribute. ```{r} # dset holds the data set # German_Credit is the data set # col specifies the name of the variable for which we want to display the IV values # resp specifies the name of the binary response variable in the data set # bins denotes the number of bins # Default value for the parameter bin is 10 # adjFactor denotes what is to be added to the response or non_response values for # a level of the attribute if the response or non_response is zero for that level # Default value of bins is 10, and that of adjFactor is 0.5. # If you are supplying default values for these parameters , then they need not be specified in the # function call # Display the IV values for the levels of the Account_Balance attribute displayIV(German_Credit,col="Account_Balance",resp="Good_Bad") ```