Title: | Evaluation Metrics for Customer Scoring Models Depending on Binary Classifiers |
---|---|
Description: | Functions for evaluating and visualizing predictive model performance (specifically: binary classifiers) in the field of customer scoring. These metrics include lift, lift index, gain percentage, top-decile lift, F1-score, expected misclassification cost and absolute misclassification cost. See Berry & Linoff (2004, ISBN:0-471-47064-3), Witten & Frank (2005, ISBN:0-12-088407-0) and Blattberg, Kim & Neslin (2008, ISBN:978-0-387-72578-9) for details. Visualization functions are included for lift charts and gain percentage charts. All metrics that require class predictions can dynamically determine cutoff values for transforming real-valued probability predictions into class predictions. |
Authors: | Koen W. De Bock |
Maintainer: | Koen W. De Bock <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.0 |
Built: | 2024-11-16 04:18:56 UTC |
Source: | https://github.com/cran/CustomerScoringMetrics |
Perform check on the true class label vector.
checkDepVector(depTest)
depTest |
Vector with true data labels (outcome values) |
Koen W. De Bock, [email protected]
## Load response modeling predictions
data("response")
## Apply checkDepVector checking function
checkDepVector(response$test[,1])
Calculates a range of metrics based upon the confusion matrix: accuracy, true positive rate (TPR; sensitivity or recall), true negative rate (TNR; specificity), false positive rate (FPR), false negative rate (FNR) and F1-score, with the optional ability to dynamically determine an incidence-based cutoff value using validation sample predictions.
confMatrixMetrics(predTest, depTest, cutoff = 0.5, dyn.cutoff = FALSE, predVal = NULL, depVal = NULL)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with real class labels |
cutoff |
Threshold for converting real-valued predictions into class predictions. Default 0.5. |
dyn.cutoff |
Logical indicator to enable dynamic threshold determination using
validation sample predictions. In this case, the function determines, using validation
data, the incidence (occurrence percentage of the customer behavior or characteristic
of interest) and chooses a cutoff value so that the number of predicted positives is
equal to the number of true positives. If TRUE, then the value for the cutoff parameter is ignored. |
predVal |
Vector with predictions (real-valued or discrete) for validation data. Only used if dyn.cutoff is TRUE. |
depVal |
Optional vector with true class labels for validation data. Only used if dyn.cutoff is TRUE. |
A list with the following items:
accuracy |
accuracy value |
truePostiveRate |
TPR or true positive rate |
trueNegativeRate |
TNR or true negative rate |
falsePostiveRate |
FPR or false positive rate |
falseNegativeRate |
FNR or false negative rate |
F1Score |
F1-score |
cutoff |
the threshold value used to convert real-valued predictions to class predictions |
Koen W. De Bock, [email protected]
Witten, I.H., Frank, E. (2005): Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Chapter 5. Morgan Kaufmann.
## Load response modeling data set
data("response")
## Apply confMatrixMetrics function to obtain confusion matrix-based performance metrics
## achieved on the test sample. Use validation sample predictions to dynamically
## determine a cutoff value.
cmm<-confMatrixMetrics(response$test[,2],response$test[,1],dyn.cutoff=TRUE,
predVal=response$val[,2],depVal=response$val[,1])
## Retrieve F1-score
print(cmm$F1Score)
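To make the confusion-matrix metrics listed for confMatrixMetrics concrete, the following base R sketch computes them by hand for a fixed cutoff. It is an illustration only, not the package's implementation: the function name conf_metrics_sketch, the short element names, the toy vectors, and the assumptions that labels are coded 0/1 and that scores at or above the cutoff count as positive are all hypothetical.

```r
# Hypothetical helper (not part of CustomerScoringMetrics): confusion-matrix
# metrics from scores, 0/1 labels and a fixed cutoff.
conf_metrics_sketch <- function(pred, dep, cutoff = 0.5) {
  pred_class <- as.integer(pred >= cutoff)  # assumption: score >= cutoff means positive
  tp <- sum(pred_class == 1 & dep == 1)     # true positives
  tn <- sum(pred_class == 0 & dep == 0)     # true negatives
  fp <- sum(pred_class == 1 & dep == 0)     # false positives
  fn <- sum(pred_class == 0 & dep == 1)     # false negatives
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  list(
    accuracy = (tp + tn) / length(dep),
    tpr = recall,
    tnr = tn / (tn + fp),
    fpr = fp / (fp + tn),
    fnr = fn / (fn + tp),
    f1 = 2 * precision * recall / (precision + recall)
  )
}

# Toy data, made up for illustration
pred <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
dep  <- c(1,   1,   0,   1,   0,   0)
m <- conf_metrics_sketch(pred, dep)
print(m$f1)
```

With these six toy cases there are two true positives, two true negatives, one false positive and one false negative, so precision and recall are both 2/3.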
Visualize gain through a cumulative gains chart.
cumGainsChart(predTest, depTest, resolution = 1/10)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with true class labels |
resolution |
Value for the determination of percentile intervals. Default 1/10 (10%). |
Koen W. De Bock, [email protected]
Linoff, G.S. and Berry, M.J.A (2011): "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management - Third Edition". John Wiley & Sons.
topDecileLift, liftIndex, liftChart
## Load response modeling predictions
data("response")
## Apply cumGainsChart function to visualize cumulative gains of a customer response model
cumGainsChart(response$test[,2],response$test[,1])
Calculates a cumulative gains (cumulative lift) table, showing, for different percentiles of predicted scores, the percentage of customers with the behavior or characteristic of interest that is reached.
cumGainsTable(predTest, depTest, resolution = 1/10)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with true class labels |
resolution |
Value for the determination of percentile intervals. Default 1/10 (10%). |
A gain percentage table.
Koen W. De Bock, [email protected]
Linoff, G.S. and Berry, M.J.A (2011): "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management - Third Edition". John Wiley & Sons.
topDecileLift, liftIndex, liftChart
## Load response modeling predictions
data("response")
## Apply cumGainsTable function to obtain cumulative gains table for test sample results
## and print results
cgt<-cumGainsTable(response$test[,2],response$test[,1])
print(cgt)
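The idea behind a cumulative gains table can be sketched in a few lines of base R: rank customers by predicted score, then report which share of all positive cases is captured in the top 10%, 20%, and so on. This is a minimal sketch under stated assumptions, not the package's code: cum_gains_sketch and its column names are hypothetical, labels are assumed to be coded 0/1, and the toy data are made up.

```r
# Hypothetical helper (not part of CustomerScoringMetrics): cumulative gains
# per percentile of the score ranking.
cum_gains_sketch <- function(pred, dep, resolution = 1/10) {
  dep_sorted <- dep[order(pred, decreasing = TRUE)]  # highest scores first
  n <- length(dep)
  percentile <- seq(resolution, 1, by = resolution)
  n_selected <- pmax(1, round(percentile * n))       # customers selected per percentile
  data.frame(
    topPercentile = 100 * percentile,
    cumGainPct = 100 * cumsum(dep_sorted)[n_selected] / sum(dep)
  )
}

# Toy data, made up for illustration: 10 customers, 4 positives
pred <- (10:1) / 10
dep  <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
cgs <- cum_gains_sketch(pred, dep)
print(cgs)
```

In this toy sample the top 20% of scores already contain half of the positive cases, and all positives are reached at the 70th percentile.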
Visualize the sensitivity of a chosen metric to the choice of the threshold (cutoff) value used to transform continuous predictions into class predictions.
cutoffSensitivityPlot(predTest, depTest, metric = c("accuracy", "expMisclassCost", "misclassCost"), costType = c("costRatio", "costMatrix", "costVector"), costs = NULL, resolution = 1/50)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with true class labels |
metric |
Which metric to assess. Should be one of the following values:
"accuracy", "expMisclassCost" or "misclassCost". |
costType |
An argument that specifies how the cost information is provided.
This should be one of "costRatio", "costMatrix" or "costVector". |
costs |
see |
resolution |
Value for the determination of percentile intervals. Default 1/50 (2%). |
Koen W. De Bock, [email protected]
dynAccuracy, misclassCost, expMisclassCost
## Load response modeling predictions
data("response")
## Apply cutoffSensitivityPlot function to visualize how the cutoff value influences
## accuracy.
cutoffSensitivityPlot(response$test[,2],response$test[,1],metric="accuracy")
## Same exercise, but as a function of misclassification costs
costs <- runif(nrow(response$test), 1, 50)
cutoffSensitivityPlot(response$test[,2],response$test[,1],metric="misclassCost",
costType="costVector",costs=costs, resolution=1/10)
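The quantity behind such a sensitivity plot can be approximated by hand: evaluate the chosen metric over a grid of cutoff values. The sketch below does this for accuracy in base R. It is an illustration under assumptions rather than the package's code: the grid step of 1/50 mirrors the default resolution in the usage line, labels are assumed to be coded 0/1, scores at or above the cutoff are treated as positive, and the toy data are made up.

```r
# Toy data, made up for illustration
pred <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
dep  <- c(1,   1,   0,   1,   0,   0)

# Grid of candidate cutoffs (assumption: evenly spaced steps of 1/50)
cutoffs <- seq(0, 1, by = 1/50)

# Accuracy at each cutoff: share of cases whose class prediction matches the label
acc <- sapply(cutoffs, function(ct) mean(as.integer(pred >= ct) == dep))

# Best accuracy over the grid and the cutoffs that achieve it
best_acc <- max(acc)
best_cutoffs <- cutoffs[acc == best_acc]
print(best_acc)
```

Passing cutoffs and acc to plot() with type = "l" would draw the corresponding curve.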
Calculates accuracy (percentage of correctly classified instances) for real-valued classifier predictions, with the optional ability to dynamically determine an incidence-based cutoff value using validation sample predictions.
dynAccuracy(predTest, depTest, dyn.cutoff = FALSE, cutoff = 0.5, predVal = NULL, depVal = NULL)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with real class labels |
dyn.cutoff |
Logical indicator to enable dynamic threshold determination using
validation sample predictions. In this case, the function determines, using validation
data, the incidence (occurrence percentage of the customer behavior or characteristic
of interest) and chooses a cutoff value so that the number of predicted positives is
equal to the number of true positives. If TRUE, then the value for the cutoff parameter is ignored. |
cutoff |
Threshold for converting real-valued predictions into class predictions. Default 0.5. |
predVal |
Vector with predictions (real-valued or discrete) for validation data. Only used if dyn.cutoff is TRUE. |
depVal |
Optional vector with true class labels for validation data. Only used if dyn.cutoff is TRUE. |
Accuracy value
accuracy |
accuracy value |
cutoff |
the threshold value used to convert real-valued predictions to class predictions |
Koen W. De Bock, [email protected]
dynConfMatrix, confMatrixMetrics
## Load response modeling data set
data("response")
## Apply dynAccuracy function to obtain the accuracy that is achieved on the test sample.
## Use validation sample predictions to dynamically determine a cutoff value.
acc<-dynAccuracy(response$test[,2],response$test[,1],dyn.cutoff=TRUE,predVal=
response$val[,2],depVal=response$val[,1])
print(acc)
Calculates a confusion matrix for real-valued classifier predictions, with the optional ability to dynamically determine an incidence-based cutoff value using validation sample predictions.
dynConfMatrix(predTest, depTest, cutoff = 0.5, dyn.cutoff = FALSE, predVal = NULL, depVal = NULL, returnClassPreds = FALSE)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with real class labels |
cutoff |
Threshold for converting real-valued predictions into class predictions. Default 0.5. |
dyn.cutoff |
Logical indicator to enable dynamic threshold determination using validation sample predictions. In this case, the function determines, using validation data, the incidence (occurrence percentage of the customer behavior or characteristic of interest) and chooses a cutoff value so that the number of predicted positives is equal to the number of true positives. If TRUE, then the value for the cutoff parameter is ignored. |
predVal |
Vector with predictions (real-valued or discrete) for validation data. Only used if dyn.cutoff is TRUE. |
depVal |
Optional vector with true class labels for validation data. Only used if dyn.cutoff is TRUE. |
returnClassPreds |
Boolean value: should class predictions (using |
A list with the following elements:
confMatrix |
a confusion matrix |
cutoff |
the threshold value used to convert real-valued predictions to class predictions |
classPreds |
class predictions, if requested using returnClassPreds |
Koen W. De Bock, [email protected]
Witten, I.H., Frank, E. (2005): Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Chapter 5. Morgan Kaufmann.
dynAccuracy, confMatrixMetrics
## Load response modeling data set
data("response")
## Apply dynConfMatrix function to obtain a confusion matrix. Use validation sample
## predictions to dynamically determine an incidence-based cutoff value.
cm<-dynConfMatrix(response$test[,2],response$test[,1],dyn.cutoff=TRUE,
predVal=response$val[,2],depVal=response$val[,1])
print(cm)
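The incidence-based cutoff described for dyn.cutoff can be illustrated with a small base R sketch: count the positive cases in the validation sample, then pick the score of the validation case at that rank, so that the number of predicted positives in the validation data equals the number of actual positives. This is one plausible reading of the description, not the package's verified implementation. The helper incidence_cutoff_sketch is hypothetical, labels are assumed to be coded 0/1, scores are assumed to be free of ties, and the toy data are made up.

```r
# Hypothetical helper (not part of CustomerScoringMetrics): incidence-based cutoff.
# Assumes at least one positive case and no tied scores in the validation data.
incidence_cutoff_sketch <- function(predVal, depVal) {
  n_pos <- sum(depVal == 1)                 # number of positives in validation data
  sort(predVal, decreasing = TRUE)[n_pos]   # score of the n_pos-th highest-ranked case
}

# Toy validation data, made up for illustration: incidence 2/8
predVal <- c(0.95, 0.70, 0.65, 0.50, 0.35, 0.20, 0.15, 0.05)
depVal  <- c(1,    0,    1,    0,    0,    0,    0,    0)
cutoff <- incidence_cutoff_sketch(predVal, depVal)

# Apply the validation-based cutoff to toy test data and tabulate the result
predTest <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
depTest  <- c(1,   1,   0,   1,   0,   0)
predClass <- as.integer(predTest >= cutoff)
cm_sketch <- table(actual = factor(depTest, levels = c(1, 0)),
                   predicted = factor(predClass, levels = c(1, 0)))
print(cutoff)
print(cm_sketch)
```

Because the cutoff is taken from the validation sample, the number of predicted positives in the test sample is not guaranteed to match the test incidence exactly.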
Calculates the expected misclassification cost value for a set of predictions.
expMisclassCost(predTest, depTest, costType = c("costRatio", "costMatrix"), costs = NULL, cutoff = 0.5, dyn.cutoff = FALSE, predVal = NULL, depVal = NULL)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with real class labels |
costType |
An argument that specifies how the cost information is provided. This
should be either "costRatio" or "costMatrix". |
costs |
see |
cutoff |
Threshold for converting real-valued predictions into class predictions. Default 0.5. |
dyn.cutoff |
Logical indicator to enable dynamic threshold determination using
validation sample predictions. In this case, the function determines, using validation
data, the incidence (occurrence percentage of the customer behavior or characteristic
of interest) and chooses a cutoff value so that the number of predicted positives is
equal to the number of true positives. If TRUE, then the value for the cutoff parameter is ignored. |
predVal |
Vector with predictions (real-valued or discrete) for validation data. Only used if dyn.cutoff is TRUE. |
depVal |
Optional vector with true class labels for validation data. Only used if dyn.cutoff is TRUE. |
A list with the following elements:
EMC |
expected misclassification cost value |
cutoff |
the threshold value used to convert real-valued predictions to class predictions |
Koen W. De Bock, [email protected]
## Load response modeling data set
data("response")
## Apply expMisclassCost function to obtain the misclassification cost for the
## predictions for test sample. Assume a cost ratio of 5.
emc<-expMisclassCost(response$test[,2],response$test[,1],costType="costRatio", costs=5)
print(emc$EMC)
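As a rough illustration of what an expected misclassification cost can look like, the base R sketch below averages the cost of false positives and false negatives over all instances. Several points are assumptions rather than facts from this documentation: that a cost ratio expresses the cost of a false negative relative to a false positive cost of 1, that the expected cost is the average cost per instance, and that labels are coded 0/1. The package's exact computation may differ, and emc_sketch is a hypothetical name.

```r
# Hypothetical helper (not part of CustomerScoringMetrics): average
# misclassification cost per instance for a given cost ratio.
emc_sketch <- function(pred, dep, cost_ratio, cutoff = 0.5) {
  pred_class <- as.integer(pred >= cutoff)
  fp <- sum(pred_class == 1 & dep == 0)   # false positives, assumed cost 1 each
  fn <- sum(pred_class == 0 & dep == 1)   # false negatives, assumed cost cost_ratio each
  (fp * 1 + fn * cost_ratio) / length(dep)
}

# Toy data, made up for illustration
pred <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
dep  <- c(1,   1,   0,   1,   0,   0)
emc_value <- emc_sketch(pred, dep, cost_ratio = 5)
print(emc_value)
```

With one false positive and one false negative among six cases, a cost ratio of 5 yields a total cost of 6 and an average cost of 1 per instance.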
Visualize lift through a lift chart.
liftChart(predTest, depTest, resolution = 1/10)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with true class labels |
resolution |
Value for the determination of percentile intervals. Default 1/10 (10%). |
Koen W. De Bock, [email protected]
Berry, M.J.A. and Linoff, G.S. (2004): "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management - Second Edition". John Wiley & Sons.
Blattberg, R.C., Kim, B.D. and Neslin, S.A. (2008): "Database Marketing: Analyzing and Managing Customers". Springer.
topDecileLift, liftIndex, liftChart
## Load response modeling predictions
data("response")
## Apply liftChart function to visualize lift table results
liftChart(response$test[,2],response$test[,1])
Calculates lift index metric.
liftIndex(predTest, depTest)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with true class labels |
Lift index value
Koen W. De Bock, [email protected]
Berry, M.J.A. and Linoff, G.S. (2004): "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management - Second Edition". John Wiley & Sons.
liftTable, topDecileLift, liftChart
## Load response modeling predictions
data("response")
## Calculate lift index for test sample results
li<-liftIndex(response$test[,2],response$test[,1])
print(li)
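This page does not spell out the lift index formula. One commonly used definition weights the positives found in each score decile by 1.0, 0.9, ..., 0.1 and divides by the total number of positives. The base R sketch below implements that definition as an assumption. It may not match what liftIndex computes, the helper lift_index_sketch is hypothetical, labels are assumed to be coded 0/1, and the sample is assumed to hold at least 10 cases.

```r
# Hypothetical helper (not part of CustomerScoringMetrics): decile-weighted
# lift index, assuming weights 1.0, 0.9, ..., 0.1 for deciles 1 to 10.
lift_index_sketch <- function(pred, dep) {
  dep_sorted <- dep[order(pred, decreasing = TRUE)]  # highest scores first
  n <- length(dep)
  decile <- ceiling(seq_len(n) * 10 / n)             # decile 1 holds the highest scores
  pos_per_decile <- tapply(dep_sorted, factor(decile, levels = 1:10), sum)
  weights <- (10:1) / 10
  sum(weights * pos_per_decile) / sum(dep)
}

# Toy data, made up for illustration: positives in deciles 1, 2, 4 and 7
pred <- (10:1) / 10
dep  <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
li_sketch <- lift_index_sketch(pred, dep)
print(li_sketch)
```

Under this definition, the toy ranking scores (1.0 + 0.9 + 0.7 + 0.4) / 4, while a ranking with all four positives in the top four deciles would score (1.0 + 0.9 + 0.8 + 0.7) / 4.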
Calculates a lift table, showing for different percentiles of predicted scores how much more the characteristic or action of interest occurs than for the overall sample.
liftTable(predTest, depTest, resolution = 1/10)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with true class labels |
resolution |
Value for the determination of percentile intervals. Default 1/10 (10%). |
A lift table.
Koen W. De Bock, [email protected]
Berry, M.J.A. and Linoff, G.S. (2004): "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management - Second Edition". John Wiley & Sons.
topDecileLift, liftIndex, liftChart
## Load response modeling predictions
data("response")
## Apply liftTable function to obtain lift table for test sample results and print
## results
lt<-liftTable(response$test[,2],response$test[,1])
print(lt)
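A lift table of this kind can be sketched in base R by comparing the incidence among the highest-scored customers with the overall incidence. The sketch below reports cumulative lift for the top 10%, 20%, and so on. Whether liftTable reports cumulative or per-interval lift is not stated here, so the cumulative form is an assumption, as are the hypothetical name lift_table_sketch, its column names, the 0/1 label coding and the toy data.

```r
# Hypothetical helper (not part of CustomerScoringMetrics): cumulative lift
# per percentile of the score ranking.
lift_table_sketch <- function(pred, dep, resolution = 1/10) {
  dep_sorted <- dep[order(pred, decreasing = TRUE)]  # highest scores first
  n <- length(dep)
  percentile <- seq(resolution, 1, by = resolution)
  n_selected <- pmax(1, round(percentile * n))       # customers selected per percentile
  top_incidence <- cumsum(dep_sorted)[n_selected] / n_selected
  data.frame(
    topPercentile = 100 * percentile,
    lift = top_incidence / mean(dep)                 # incidence in selection vs. overall
  )
}

# Toy data, made up for illustration: overall incidence 40%
pred <- (10:1) / 10
dep  <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)
lts <- lift_table_sketch(pred, dep)
print(lts)
```

The last row always has a lift of 1, because selecting the whole sample reproduces the overall incidence.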
Calculates the absolute misclassification cost value for a set of predictions.
misclassCost(predTest, depTest, costType = c("costRatio", "costMatrix", "costVector"), costs = NULL, cutoff = 0.5, dyn.cutoff = FALSE, predVal = NULL, depVal = NULL)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with real class labels |
costType |
An argument that specifies how the cost information is provided. This
should be one of "costRatio", "costMatrix" or "costVector". |
costs |
see |
cutoff |
Threshold for converting real-valued predictions into class predictions. Default 0.5. |
dyn.cutoff |
Logical indicator to enable dynamic threshold determination using
validation sample predictions. In this case, the function determines, using validation
data, the incidence (occurrence percentage of the customer behavior or characteristic
of interest) and chooses a cutoff value so that the number of predicted positives is
equal to the number of true positives. If TRUE, then the value for the cutoff parameter is ignored. |
predVal |
Vector with predictions (real-valued or discrete) for validation data. Only used if dyn.cutoff is TRUE. |
depVal |
Optional vector with true class labels for validation data. Only used if dyn.cutoff is TRUE. |
A list with the following elements:
misclassCost |
Total misclassification cost value |
cutoff |
the threshold value used to convert real-valued predictions to class predictions |
Koen W. De Bock, [email protected]
Witten, I.H., Frank, E. (2005): Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Chapter 5. Morgan Kaufmann.
dynConfMatrix, expMisclassCost, dynAccuracy
## Load response modeling data set
data("response")
## Generate cost vector
costs <- runif(nrow(response$test), 1, 100)
## Apply misclassCost function to obtain the misclassification cost for the
## predictions for test sample, using the cost vector generated above.
mc<-misclassCost(response$test[,2],response$test[,1],costType="costVector", costs=costs)
print(mc$misclassCost)
Predicted customer response probabilities and true responses for a customer scoring model. Includes results for two data samples: a test sample (response$test) and a validation sample (response$val).
data(response)
A list with two elements: response$test and response$val. Both are data frames with data for 2 variables: preds and dep.
Authors: Koen W. De Bock Maintainer: [email protected]
# Load data
data(response)
# Calculate incidence in test sample
print(sum(response$test[,1]=="cl1")/nrow(response$test))
Calculates top-decile lift, a metric that expresses how the incidence in the 10% of customers with the highest model predictions compares to the overall sample incidence. A top-decile lift of 1 is expected for a random model. A top-decile lift of 3 indicates that, among the 10% highest predictions, the model identifies 3 times more positive cases than would be expected for a random selection of instances. The upper bound of the metric depends on the sample incidence: it is 100% / incidence % when the incidence is at least 10%, and 10 otherwise. For example, when the incidence is 10%, top-decile lift can be no higher than 10.
topDecileLift(predTest, depTest)
predTest |
Vector with predictions (real-valued or discrete) |
depTest |
Vector with true class labels |
Top-decile lift value
Koen W. De Bock, [email protected]
Berry, M.J.A. and Linoff, G.S. (2004): "Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management - Second Edition". John Wiley & Sons.
liftTable, liftIndex, liftChart
## Load response modeling predictions
data("response")
## Calculate top-decile lift for test sample results
tdl<-topDecileLift(response$test[,2],response$test[,1])
print(tdl)
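The definition of top-decile lift given above translates directly into a short base R computation: the incidence among the 10% highest-scored customers divided by the overall incidence. The sketch below is a hand-rolled illustration, not the package's code. The helper top_decile_lift_sketch is hypothetical, labels are assumed to be coded 0/1, and the toy data are made up so that the incidence is exactly 10%.

```r
# Hypothetical helper (not part of CustomerScoringMetrics): top-decile lift.
top_decile_lift_sketch <- function(pred, dep) {
  n_top <- max(1, round(0.1 * length(dep)))              # size of the top decile
  top <- order(pred, decreasing = TRUE)[seq_len(n_top)]  # indices of the highest scores
  mean(dep[top]) / mean(dep)                             # top-decile vs. overall incidence
}

# Toy data, made up for illustration: 20 customers, incidence 10%
pred <- (20:1) / 20

# Both positives ranked in the top decile: the metric reaches its upper bound
dep_perfect <- c(1, 1, rep(0, 18))
tdl_perfect <- top_decile_lift_sketch(pred, dep_perfect)

# Only one of the two positives ranked in the top decile
dep_partial <- c(1, 0, 1, rep(0, 17))
tdl_partial <- top_decile_lift_sketch(pred, dep_partial)

print(c(tdl_perfect, tdl_partial))
```

With an incidence of 10%, the first ranking attains the maximum value of 10, while the second reaches half of it.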