Finding Relationships in Data with R

2019-11-12 08:00:00 · 飞浪

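Learn techniques for finding relationships in data with numerical and categorical variables, and how to interpret the results with statistical validation.

Introduction

Building high-performing machine learning models depends on identifying the relationships between variables. Knowing these relationships helps with feature engineering and with choosing a machine learning algorithm. In this guide, you will learn techniques for finding relationships in data using R.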

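Data

In this guide, we will use a fictitious dataset of loan applicants containing 200 observations and 10 variables, described below:

Marital_status: whether the applicant is married ("Yes") or not married ("No"); the frequency table later in this guide also shows a third level, "Divorced"
Is_graduate: whether the applicant is a graduate ("Yes") or not ("No")
Income: annual income of the applicant (in US dollars)
Loan_amount: loan amount (in US dollars) for which the application was submitted
Credit_score: whether the applicant's credit score is good ("Good") or not ("Bad")
approval_status: whether the loan application was approved ("Yes") or not ("No")
Investment: investments in stocks and mutual funds (in US dollars) as declared by the applicant
gender: whether the applicant is "Female" or "Male"
age: the applicant's age in years
work_exp: the applicant's work experience in years

Let's start by loading the required libraries and the data.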

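# Load the required libraries (plyr before dplyr, so dplyr's verbs are not masked)
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)

# Read the data and inspect its structure
dat <- read_csv("data_test.csv")
glimpse(dat)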

Output:

Observations: 200
Variables: 10
$ Marital_status  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad"...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
    

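The output shows that the dataset has five numerical variables (labelled int and dbl) and five character variables (labelled chr). We will convert the character variables to factors using the lines of code below.

# Column positions of the five character variables
factor_cols <- c(1, 2, 5, 6, 8)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)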

Output:

Observations: 200
Variables: 10
$ Marital_status  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <fct> Female, Female, Female, Female, Female, Female, Female...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
    

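Relationships between Numerical Variables

Many machine learning algorithms assume that the continuous predictor variables are not strongly correlated with one another. When they are, the problem is called multicollinearity. Measuring the relationships between numerical variables is a common step for detecting and treating multicollinearity.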
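As a minimal sketch of how such relationships can be measured, the block below computes a correlation matrix with cor() and runs a correlation test with cor.test(). Because data_test.csv is not reproduced here, it uses simulated data: the data frame sim, its column values, and the seed are assumptions made for illustration only, not the article's data.

```r
# Simulated stand-in for the loan data (hypothetical values, not data_test.csv)
set.seed(42)
n <- 200
work_exp <- runif(n, min = 1, max = 10)
sim <- data.frame(
  Income   = 8000 * work_exp + rnorm(n, sd = 5000),  # built to depend on work_exp
  work_exp = work_exp,
  age      = rnorm(n, mean = 35, sd = 5)             # independent of the others
)

# Pairwise Pearson correlations, rounded for readability
corr <- round(cor(sim), 1)
print(corr)

# Test whether the Income / work_exp correlation differs from zero
res <- cor.test(sim$Income, sim$work_exp)
print(res$estimate)
print(res$p.value)
```

A large absolute correlation between two predictors, backed by a small p-value from the test, is the usual signal that one of the pair may need to be dropped or combined before modelling.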

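Relationships between Categorical Variables

In the previous section, we covered techniques for finding relationships between numerical variables. It is equally important to understand and estimate the relationships between categorical variables.

Frequency Table

Creating a frequency table is a simple but effective way of examining the joint distribution of two categorical variables. The table() function can be used to create a two-way table between two variables.

In the code below, the first line creates a two-way table between the variables Marital_status and approval_status, and the second line prints it. The remaining three lines print the cell, row, and column proportion tables, in that order.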

# 2-way table
two_way <- table(dat$Marital_status, dat$approval_status)
two_way

prop.table(two_way)    # cell proportions
prop.table(two_way, 1) # row proportions
prop.table(two_way, 2) # column proportions

Output:

#Output - two_way table

           No Yes
  Divorced 31  29
  No       66  10
  Yes      52  12


#Output - cell percentages table

             No   Yes
  Divorced 0.155 0.145
  No       0.330 0.050
  Yes      0.260 0.060

#Output - row percentages table

                No       Yes
  Divorced 0.5166667 0.4833333
  No       0.8684211 0.1315789
  Yes      0.8125000 0.1875000


#Output - column percentages table

                 No       Yes
  Divorced 0.2080537 0.5686275
  No       0.4429530 0.1960784
  Yes      0.3489933 0.2352941
    

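The row percentages show that divorced applicants had a higher approval rate (48.3%) than married (18.8%) or unmarried (13.2%) applicants. The column percentages tell the complementary story: divorced applicants account for 56.9% of all approved applications, against 23.5% for married and 19.6% for unmarried applicants. To check whether this pattern is statistically significant, we use the chi-square test of independence.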
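The difference between row and column percentages is easy to mix up, so here is a small sketch that recomputes both from the counts printed in the frequency table. The matrix is typed in by hand from that output; the object names counts, approval_rate, and share_of_approved are illustrative choices, not part of the original code.

```r
# Counts copied from the two-way table output (rows: Marital_status, cols: approval_status)
counts <- matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(Marital_status  = c("Divorced", "No", "Yes"),
                  approval_status = c("No", "Yes"))
)

# Row proportions: within each marital status, the share of applications approved
approval_rate <- prop.table(counts, margin = 1)[, "Yes"]

# Column proportions: among approved applications, the share from each marital status
share_of_approved <- prop.table(counts, margin = 2)[, "Yes"]

print(round(approval_rate, 3))
print(round(share_of_approved, 3))
```

Row proportions answer "how likely is approval for this group?", while column proportions answer "which groups make up the approved applications?".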

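Chi-square Test of Independence

The chi-square test of independence is used to determine whether there is an association between two categorical variables. In our case, we want to test whether the marital status of the applicants has any association with the approval status.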
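A minimal sketch of the test follows, assuming the counts printed in the frequency table above; with the full dataset loaded, calling chisq.test(two_way) on the table object would be the direct equivalent. The null hypothesis is that marital status and approval status are independent. The names counts and chi are illustrative.

```r
# Counts copied from the two-way table output (rows: Marital_status, cols: approval_status)
counts <- matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(Marital_status  = c("Divorced", "No", "Yes"),
                  approval_status = c("No", "Yes"))
)

# Pearson's chi-square test of independence on the 3 x 2 table
chi <- chisq.test(counts)

print(chi$statistic)  # test statistic
print(chi$parameter)  # degrees of freedom: (rows - 1) * (cols - 1)
print(chi$p.value)    # a small value is evidence against independence
print(chi$expected)   # expected counts under independence
```

If the p-value falls below the chosen significance level (commonly 0.05), the hypothesis of independence is rejected, which would support an association between the two variables.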



Correlation matrix of the numerical variables, output:

            Income Loan_amount Investment  age work_exp
Income         1.0         0.0        0.1 -0.2      0.9
Loan_amount    0.0         1.0        0.8  0.0      0.0
Investment     0.1         0.8        1.0  0.0      0.1
age           -0.2         0.0        0.0  1.0     -0.1
work_exp       0.9         0.0        0.1 -0.1      1.0

Correlogram, code:

library(ggcorrplot)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3,
           method = "circle", colors = c("blue", "white", "red"),
           outline.color = "gray", show.legend = TRUE, show.diag = FALSE,
           title = "Correlogram of loan variables")

Correlation test for Investment and work_exp, output:

Pearson's product-moment correlation

data:  dat$Investment and dat$work_exp
t = 1.0801, df = 198, p-value = 0.2814
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.0628762  0.2130117
sample estimates:
       cor
0.07653245

Correlation test for Income and work_exp, output:

Pearson's product-moment correlation

data:  dat$Income and dat$work_exp
t = 25.869, df = 198, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8423810  0.9066903
sample estimates:
      cor
0.8784546


