Validating Machine Learning Models with R

2019-12-12 08:00:00 · 飞浪

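Introduction

Building machine learning models is an important part of predictive modeling. Without proper model validation, however, you can never be very confident that a trained model will perform well on unseen data. Model validation helps ensure that the model performs well on new data, and it helps in selecting the best model, parameters, and accuracy metrics.

In this guide, we will learn the basics and implementation of several model validation techniques:

Holdout validation
K-fold cross-validation
Repeated k-fold cross-validation
Leave-one-out cross-validation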

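Data

In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 9 variables, as described below:

Marital_status: whether the applicant is married ("Yes") or not ("No")
Is_graduate: whether the applicant is a graduate ("Yes") or not ("No")
Income: annual income of the applicant (in USD)
Loan_amount: loan amount (in USD) for which the application was submitted
Credit_score: whether the applicant's credit score is satisfactory ("Satisfactory") or not
approval_status: whether the loan application was approved ("Yes") or not ("No")
Age: the applicant's age in years
Sex: whether the applicant is male ("M") or female ("F")
Investment: total investment in stocks and mutual funds (in USD), as declared by the applicant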

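Let's start by loading the required libraries and the data.

library(plyr)
library(readr)
library(dplyr)
library(caret)
library(klaR)   # provides the naive Bayes implementation behind caret's method = "nb"

dat <- read_csv("dataset.csv")
dat$Purpose <- NULL   # drop the Purpose column, which is not used in this guide

glimpse(dat)

Output:

Observations: 600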
Variables: 9
$ Marital_status  <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", ...
$ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <chr> "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <chr> "F", "F", "M", "F", "M", "M", "M", "F", "F", "F", "M",...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
    

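The output shows that the dataset has four numeric variables (labeled int) and five character variables (labeled chr). We will convert the character variables into factors using the lines of code below.

factor_cols <- c(1, 2, 5, 6, 8)   # positions of the five character columns
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)

Output:

Observations: 600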
Variables: 9
$ Marital_status  <fct> Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, No, No...
$ Is_graduate     <fct> No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Y...
$ Income          <int> 30000, 30000, 30000, 30000, 89900, 133300, 136700, 136...
$ Loan_amount     <int> 60000, 90000, 90000, 90000, 80910, 119970, 123030, 123...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> Yes, Yes, No, No, Yes, No, Yes, Yes, Yes, No, No, No, ...
$ Age             <int> 25, 29, 27, 33, 29, 25, 29, 27, 33, 29, 25, 29, 27, 33...
$ Sex             <fct> F, F, M, F, M, M, M, F, F, F, M, F, F, M, M, M, M, M, ...
$ Investment      <int> 21000, 21000, 21000, 21000, 62930, 93310, 95690, 95690...
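
Selecting columns by position breaks silently if the column order ever changes. As a hedged alternative, the base-R sketch below picks the character columns by type instead. The small `toy` data frame is made up for illustration and only mimics a few of the dataset's columns; `is_chr` is likewise just an illustrative name.

```r
# Hypothetical stand-in for a few columns of the loan data
toy <- data.frame(
  Marital_status  = c("Yes", "No", "Yes"),
  Income          = c(30000L, 30000L, 89900L),
  approval_status = c("Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert only those to factors
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

The same two lines should work on `dat` as well, since a tibble accepts a logical vector for column selection.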
    

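Holdout Validation

The holdout validation approach involves creating a training set and a holdout set. The training data is used to train the model, while the holdout data is used to validate model performance. A common split ratio is 70:30, while for small datasets the ratio can be 90:10.

The first line of code below loads the caTools package, which is used for data partitioning, and the second line sets the random seed to ensure reproducibility of the results. The third to fifth lines create the training and test sets. The training set contains 70% of the data (420 observations of 9 variables) and the test set contains the remaining 30% (180 observations of 9 variables).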

library(caTools)
set.seed(100)

# Split on the outcome so that both sets keep a similar Yes/No ratio
spl <- sample.split(dat$approval_status, SplitRatio = 0.7)
train <- subset(dat, spl == TRUE)
test <- subset(dat, spl == FALSE)

print(dim(train)); print(dim(test))

Output:

[1] 420   9
[1] 180   9
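
To make the idea of a label-preserving split concrete, here is a minimal base-R sketch that samples 70% of the rows within each class. It illustrates the concept on a made-up label vector and is not meant to reproduce the internals of sample.split; the names `labels` and `stratified_split` are hypothetical.

```r
set.seed(100)

# Made-up outcome vector: 60 "No" and 140 "Yes"
labels <- factor(rep(c("No", "Yes"), times = c(60, 140)))

# Return a logical vector marking training rows, sampled class by class
stratified_split <- function(y, ratio = 0.7) {
  in_train <- logical(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    # sample positions within the class, then map back to row indices
    picked <- idx[sample.int(length(idx), n_train)]
    in_train[picked] <- TRUE
  }
  in_train
}

spl <- stratified_split(labels, ratio = 0.7)

# Class proportions in the training part and in the holdout part
print(prop.table(table(labels[spl])))
print(prop.table(table(labels[!spl])))
```

Because the sampling is done per class, both parts keep the 30/70 class mix of the full vector, which is the property that matters when the outcome is imbalanced.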

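Build, Predict, and Evaluate the Model

The first line of code below fits a logistic regression model on the training data, while the second line generates predictions on the test data. The third line produces the confusion matrix, and the fourth line computes and prints the accuracy.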

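# Fit a logistic regression model on the training set
model_glm <- glm(approval_status ~ ., family = "binomial", data = train)

# Predicted probabilities on the test set
predictTest <- predict(model_glm, newdata = test, type = "response")

# Confusion matrix on the test set, using a 0.5 probability cutoff
table(test$approval_status, predictTest >= 0.5)

# Accuracy as reported in the original: 158 correct predictions out of 180 (about 88%)
158/nrow(test)

Output: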

      FALSE TRUE
  No     35   22
  Yes    10  113

[1] 0.8777778
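
Typing the number of correct predictions by hand is easy to get wrong. A minimal sketch that derives accuracy from the matrix itself is shown here, using the four counts from the confusion matrix in the output above; the object names `cm` and `accuracy` are illustrative.

```r
# Counts copied from the confusion matrix printed above
# (rows = actual class, columns = predicted probability >= 0.5)
cm <- matrix(c(35, 10, 22, 113), nrow = 2,
             dimnames = list(actual = c("No", "Yes"),
                             predicted = c("FALSE", "TRUE")))

# Correct predictions sit on the diagonal:
# actual "No" predicted FALSE, and actual "Yes" predicted TRUE
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 148 / 180 for these counts
```

With a real `table()` result the same expression applies, provided both classes appear among the predictions so that the table is square.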

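The output reports an accuracy of about 87.8% on the test data. Note, however, that the diagonal of the confusion matrix as printed sums to 148 (35 + 113), which corresponds to 148/180, or about 82.2%, so the hard-coded count of 158 does not match the table. The holdout technique is useful, but it has a weakness. The split is very important, and if it is unrepresentative, the model may overfit or underfit on new data. This can be addressed with resampling methods, which repeat the calculation multiple times on randomly selected subsets of the full data. We will discuss the popular cross-validation techniques in the following sections of this guide.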

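K-fold Cross-Validation

In k-fold cross-validation, the data is divided into k folds. The model is trained on k-1 folds, with one fold held back for testing. This process is repeated so that each fold of the dataset gets the chance to be the held-back set. Once the process is complete, we can summarize the evaluation metric using the mean and/or the standard deviation.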
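
The mechanics described above can be sketched by hand in base R. The example below is only an illustration under stated assumptions: it uses simulated data rather than the loan dataset, a logistic regression as the model, and a 0.5 cutoff; the names `sim`, `fold_id`, and `fold_acc` are made up for this sketch.

```r
set.seed(100)

# Simulated binary-classification data (not the loan dataset)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.8 * x1 - 0.5 * x2))
sim <- data.frame(y = y, x1 = x1, x2 = x2)

k <- 5
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_acc <- numeric(k)
for (i in seq_len(k)) {
  train_part <- sim[fold_id != i, ]   # k-1 folds for training
  test_part  <- sim[fold_id == i, ]   # one fold held back

  fit  <- glm(y ~ x1 + x2, family = binomial, data = train_part)
  prob <- predict(fit, newdata = test_part, type = "response")
  pred <- as.integer(prob >= 0.5)

  fold_acc[i] <- mean(pred == test_part$y)
}

# Summarize the k fold-level accuracies
print(mean(fold_acc))
print(sd(fold_acc))
```

Every row is used for testing exactly once and for training k-1 times, which is what makes the averaged estimate less dependent on any single split.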

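We will use five-fold cross-validation for this problem, set up in the first line of code below. The second line trains the algorithm, here a naive Bayes classifier (method = "nb"), and the third line prints the model results.

control <- trainControl(method = "cv", number = 5)

kfold_model <- train(approval_status ~ ., data = dat, trControl = control, method = "nb")

print(kfold_model)

Output:

Naive Bayes 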

600 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 480, 480, 480, 480, 480 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa     
  FALSE      0.7616667  0.39489399
   TRUE      0.6816667  0.05721624

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
    

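The mean accuracy of the model under five-fold cross-validation is 76.17%, lower than the 88% reported for the holdout approach. The two numbers are not directly comparable, though: the holdout example fits a logistic regression, while the cross-validation examples use naive Bayes, so the gap reflects the change of algorithm as well as the validation method.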

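Repeated K-fold Cross-Validation

The process of splitting the data into k folds can be repeated a number of times. This is called repeated k-fold cross-validation, in which the final model accuracy is taken as the mean over the repeats.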
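
As a rough illustration of the averaging step, the base-R sketch below wraps one pass of k-fold cross-validation in a function and repeats it with freshly shuffled folds. It assumes simulated data and a logistic regression, not the loan dataset or naive Bayes; `kfold_accuracy`, `sim`, and `acc` are hypothetical names.

```r
set.seed(100)

# Simulated binary-classification data (not the loan dataset)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.8 * sim$x1 - 0.5 * sim$x2))

# One pass of k-fold cross-validation; returns the k fold accuracies
kfold_accuracy <- function(data, k = 5) {
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    fit  <- glm(y ~ x1 + x2, family = binomial, data = data[fold_id != i, ])
    prob <- predict(fit, newdata = data[fold_id == i, ], type = "response")
    mean(as.integer(prob >= 0.5) == data$y[fold_id == i])
  })
}

# Repeat the whole procedure 3 times, reshuffling the folds each time
repeats <- 3
acc <- replicate(repeats, kfold_accuracy(sim, k = 5))  # 5 x 3 matrix

print(colMeans(acc))  # mean accuracy within each repeat
print(mean(acc))      # overall mean across all 15 resamples
```

Repeating with different fold assignments reduces the influence of any one partition on the final estimate, at the cost of fitting the model more times.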

The lines of code below use five-fold cross-validation with three repeats to estimate the accuracy of naive Bayes on the dataset.

control2 <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

repeated_kfold_model <- train(approval_status ~ ., data = dat, trControl = control2, method = "nb")

print(repeated_kfold_model)

Output:

Naive Bayes 

600 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 480, 480, 480, 480, 480, 480, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.7594444  0.3937285
   TRUE      0.6844444  0.0492689

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
    

The mean accuracy of the model under repeated k-fold cross-validation is 75.94%.

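Leave-One-Out Cross-Validation (LOOCV)

Leave-one-out cross-validation (LOOCV) is a cross-validation technique in which each fold has size 1, so k is set to the number of observations in the data. This variant is useful when the training data is of limited size and the number of parameters to be tested is not high. The lines of code below repeat the steps above.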
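
Written out by hand, LOOCV is a loop with one model fit per observation. The following base-R sketch shows this on simulated data with a logistic regression, both of which are assumptions made for illustration; `sim` and `correct` are made-up names.

```r
set.seed(100)

# Simulated binary-classification data (not the loan dataset)
n   <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(0.8 * sim$x1 - 0.5 * sim$x2))

correct <- logical(n)
for (i in seq_len(n)) {
  # Train on all rows except row i, then predict row i
  fit  <- glm(y ~ x1 + x2, family = binomial, data = sim[-i, ])
  prob <- predict(fit, newdata = sim[i, , drop = FALSE], type = "response")
  correct[i] <- as.integer(prob >= 0.5) == sim$y[i]
}

# LOOCV accuracy is the share of held-out rows predicted correctly
print(mean(correct))
```

The loop fits n models, which is why LOOCV becomes expensive as the dataset grows.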

control3 <- trainControl(method = "LOOCV")

loocv_model <- train(approval_status ~ ., data = dat, trControl = control3, method = "nb")

print(loocv_model)

Output:

Naive Bayes 

600 samples
  8 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 599, 599, 599, 599, 599, 599, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa     
  FALSE      0.7700000  0.41755768
   TRUE      0.6833333  0.01893287

Tuning parameter 'fL' was held constant at a value of 0
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust = 1.
    

The mean accuracy of the model under leave-one-out cross-validation is 77%.

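Conclusion

In this guide, you have learned about various model validation techniques in R. The accuracy results reported for these techniques are summarized below:

Holdout validation (logistic regression): reported accuracy of 88%
K-fold cross-validation (naive Bayes): mean accuracy of 76%
Repeated k-fold cross-validation (naive Bayes): mean accuracy of 76%
Leave-one-out cross-validation (naive Bayes): mean accuracy of 77%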
