Encoding Data with R

2019-11-12 08:00:00 · 飞浪

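In R, label encoding, one-hot encoding, and the encoding of continuous (numeric) variables prepare data for use with powerful machine learning algorithms.

Introduction

R offers several powerful machine learning algorithms. To get the most out of them, however, the data must first be converted into the format they expect. A common step in that conversion is encoding the data, which can improve both the performance and the efficiency of the algorithms. In this guide, you will learn different techniques for encoding data with R.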

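Data

In this guide, we will use a fictitious dataset of loan applications containing 600 observations and 10 variables:

Marital_status: Whether the applicant is married (“Yes”) or not (“No”)
Dependents: Number of dependents of the applicant
Is_graduate: Whether the applicant is a graduate (“Yes”) or not (“No”)
Income: Annual income of the applicant (in USD)
Loan_amount: Loan amount (in USD) for which the application was submitted
Credit_score: Whether the applicant's credit score is good (“Satisfactory”) or not (“Not_satisfactory”)
approval_status: Whether the loan application was approved (“1”) or not (“0”)
Age: The applicant's age in years
Sex: Whether the applicant is male (“M”) or female (“F”)
Purpose: Purpose of applying for the loan

Let's start by loading the required libraries and the data.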

    library(plyr)
    library(readr)
    library(dplyr)
    library(caret)

    dat <- read_csv("data_eng.csv")

    glimpse(dat)
    

Output:

    Observations: 600
    Variables: 10
    $ Marital_status  <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
    $ Dependents      <int> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, ...
    $ Is_graduate     <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes",...
    $ Income          <int> 298500, 315500, 295100, 319300, 333300, 277700, 332100...
    $ Loan_amount     <int> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64000...
    $ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
    $ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    $ Age             <int> 74, 71, 71, 68, 64, 64, 63, 61, 60, 59, 56, 55, 54, 54...
    $ Sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
    $ Purpose         <chr> "Wedding", "Wedding", "Wedding", "Wedding", "Wedding",...
    

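The output shows that the dataset has five integer variables (labeled int) and five character variables (labeled chr). Note that approval_status is stored as an integer even though it represents a category. We are now ready to carry out the encoding steps.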
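For readers who do not have data_eng.csv at hand, the sketch below builds a small data frame with the same column names and types and counts the columns of each type. The data frame `toy` and all of its values are hypothetical and serve only as an illustration; they are not rows from the actual dataset.

```r
# Hypothetical toy rows that mimic the column layout of the loan data.
# All values are made up for illustration only.
toy <- data.frame(
  Marital_status  = c("Yes", "No", "Yes", "No"),
  Dependents      = c(1L, 0L, 2L, 0L),
  Is_graduate     = c("No", "Yes", "Yes", "No"),
  Income          = c(298500L, 315500L, 295100L, 319300L),
  Loan_amount     = c(71000L, 75500L, 70000L, 64000L),
  Credit_score    = c("Satisfactory", "Not_satisfactory",
                      "Satisfactory", "Satisfactory"),
  approval_status = c(1L, 0L, 1L, 1L),
  Age             = c(74L, 71L, 35L, 28L),
  Sex             = c("M", "F", "M", "F"),
  Purpose         = c("Wedding", "Travel", "Education", "Personal"),
  stringsAsFactors = FALSE
)

# Class of each column, then the number of columns per class
col_types <- sapply(toy, class)
n_character <- sum(col_types == "character")
n_integer   <- sum(col_types == "integer")
print(table(col_types))
```

Counting column classes this way is a quick check that the types reported by glimpse() match what the encoding steps expect.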

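There are many ways to encode categorical variables, and the right choice depends on the distribution of labels in the variable and on the end objective. In the following sections, we will cover the most widely used techniques for encoding categorical variables.

Label Encoding

In simple terms, label encoding is the process of replacing the different levels of a categorical variable with dummy numbers. For example, the variable Credit_score has two levels, “Satisfactory” and “Not_satisfactory”. These can be encoded as 1 and 0, respectively. The first line of code below performs this task, while the second line prints a table of the encoded levels.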

    dat$Credit_score <- ifelse(dat$Credit_score == "Satisfactory", 1, 0)

    table(dat$Credit_score)

Output:

 1 
472
    

The output above shows the result of the label encoding. This is straightforward when the categorical variable has only two levels, as with Credit_score. With more than two levels, integer codes become less intuitive, because they suggest an ordering among the levels that may not exist. For example, the Purpose variable has six levels, as can be seen from the output below.

    table(dat$Purpose)

Output:

     Business Education Furniture  Personal    Travel   Wedding
           43       191        38       166       123        39
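To make the concern concrete, here is a minimal sketch of what integer label encoding looks like for a multi-level variable. It uses base R's factor() and as.integer() on a short, hypothetical vector of purposes rather than on dat; the vector `purpose` is made up for illustration.

```r
# Hypothetical sample of loan purposes (not taken from the dataset)
purpose <- c("Wedding", "Travel", "Education", "Business", "Travel", "Personal")

# factor() sorts the levels alphabetically by default,
# and as.integer() returns the position of each value's level
purpose_factor <- factor(purpose)
purpose_code   <- as.integer(purpose_factor)

print(levels(purpose_factor))
print(data.frame(purpose, purpose_code))
```

The codes depend only on the alphabetical order of the levels, so a model that treats them as numbers would read "Wedding" as larger than "Business" even though the categories have no natural order.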
    

In such cases, one-hot encoding is preferred.

One-Hot Encoding

In this technique, each categorical feature is replaced by a set of binary (dummy) columns, one per category level. In each dummy variable, the value 1 indicates that the observation has that level, while the value 0 indicates that it does not.

We will apply this technique to all the remaining categorical variables. The first line of code below loads the caret package, while the second line uses the dummyVars() function to set up a full set of dummy variables. dummyVars() acts on the categorical variables and leaves numeric variables unchanged. Note that the second line contains the argument fullRank = T, which creates n-1 columns for a categorical variable with n unique levels.

The third line uses the output of dummyVars() to transform the dataset, dat, so that all the categorical variables are encoded as numeric variables. The fourth line prints the structure of the resulting data, dat_transformed, which confirms that the one-hot encoding is complete. The approval_status.1 column in the output suggests that approval_status was treated as a factor when this output was produced; as an integer column it would pass through under its original name.

    library(caret)

    dmy <- dummyVars(" ~ .", data = dat, fullRank = T)
    dat_transformed <- data.frame(predict(dmy, newdata = dat))

    glimpse(dat_transformed)

Output:

    Observations: 600
    Variables: 14
    $ Marital_status.Yes <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...
    $ Dependents         <dbl> 1, 0, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, ...
    $ Is_graduate.Yes    <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, ...
    $ Income             <dbl> 298500, 315500, 295100, 319300, 333300, 277700, 332...
    $ Loan_amount        <dbl> 71000, 75500, 70000, 70000, 98000, 71000, 58000, 64...
    $ Credit_score       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ...
    $ approval_status.1  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    $ Age                <dbl> 74, 71, 71, 68, 64, 64, 63, 61, 60, 59, 56, 55, 54,...
    $ Sex.M              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    $ Purpose.Education  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    $ Purpose.Furniture  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    $ Purpose.Personal   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    $ Purpose.Travel     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    $ Purpose.Wedding    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
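The n-1 column behavior can also be seen without caret. The sketch below swaps in base R's model.matrix(), which is a different function from dummyVars() but applies the same idea of dropping one reference level per factor. The data frame `toy` and its values are hypothetical.

```r
# Hypothetical toy data: one factor with three levels and one numeric column
toy <- data.frame(
  Purpose = factor(c("Wedding", "Travel", "Education", "Travel")),
  Age     = c(74, 35, 28, 51)
)

# With the default treatment contrasts, model.matrix() drops the first
# level ("Education") and keeps n - 1 indicator columns for the factor
mm <- model.matrix(~ ., data = toy)

# Remove the intercept column so only the encoded predictors remain
onehot <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]
print(onehot)
```

A row with zeros in every Purpose indicator column belongs to the dropped reference level, which is why n-1 columns are enough to represent n levels. Note that the column names differ from those produced by dummyVars().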
    

Encoding Continuous (or Numeric) Variables

In the previous sections, we learned how to encode categorical variables. However, it is sometimes useful to encode numerical variables as well. For example, some algorithms, such as certain implementations of Naive Bayes, work with categorical predictors, so numerical variables must be encoded first. This is also called binning.

We will consider the Income variable as an example. Let’s look at the summary statistics of this variable.

    summary(dat$Income)

Output:

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     133300  384975  508350  706302  766100 8444900
    

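The values of Income range from $133,300 to about $8.44 million, and the mean ($706,302) is well above the median ($508,350), which indicates that the distribution is right skewed. An additional benefit of binning is that it limits the influence of outliers. Let’s create three levels of the variable Income, using the quartiles as cutoffs: “Low” for incomes at or below the first quartile ($384,975), “High” for incomes above the third quartile ($766,100), and “Mid50” for the middle 50 percent of the income distribution.

The first step is to create a vector of these cutoff points, which is done in the first line of code below. The second line assigns names to the resulting bins. The third line uses the cut() function to split the vector at the cutoff points. Finally, we use the summary() function to compare the original Income variable with the binned Income_new variable.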

    bins <- c(-Inf, 384975, 766100, Inf)

    bin_names <- c("Low", "Mid50", "High")

    dat$Income_new <- cut(dat$Income, breaks = bins, labels = bin_names)

    summary(dat$Income)

    summary(dat$Income_new)

Output:

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     133300  384975  508350  706302  766100 8444900

      Low Mid50  High
      150   301   149
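The cutoffs used above equal the first and third quartiles reported by summary(), so they can also be derived with quantile() instead of being typed by hand. The following is a minimal sketch on a short, hypothetical income vector; the names `income`, `q`, and `edge_case` are illustrative and not part of the original code.

```r
# Hypothetical incomes (made up for illustration)
income <- c(133300, 250000, 384975, 508350, 640000, 766100, 900000, 8444900)

# Derive the cutoffs from the 25th and 75th percentiles of the data
q    <- quantile(income, probs = c(0.25, 0.75))
bins <- c(-Inf, unname(q), Inf)
bin_names <- c("Low", "Mid50", "High")

income_new <- cut(income, breaks = bins, labels = bin_names)
print(table(income_new))

# cut() uses right-closed intervals by default (right = TRUE), so a value
# exactly equal to a cutoff falls into the lower bin
edge_case <- cut(c(384975, 384976),
                 breaks = c(-Inf, 384975, 766100, Inf),
                 labels = bin_names)
print(edge_case)
```

Deriving the cutoffs from the data keeps the bins in step with the distribution if the dataset changes, while hard-coded cutoffs keep the bins stable across datasets; which is preferable depends on the project.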
    

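The above output shows that the variable has been binned. It is also possible to create the bin cutoffs automatically, as shown in the code below. In this case, we create 5 bins of approximately equal width for the variable Age.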

    dat$Age_new <- cut(dat$Age, breaks = 5, labels = c("Bin1", "Bin2", "Bin3", "Bin4", "Bin5"))

    summary(dat$Age)

    summary(dat$Age_new)

Output:

       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      22.00   36.00   50.00   49.31   61.00   76.00

    Bin1 Bin2 Bin3 Bin4 Bin5
     108  117  114  162   99
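Custom labels such as Bin1 to Bin5 hide the cutoffs that cut() chose. One way to inspect them, sketched below on a hypothetical age vector, is to omit the labels argument so that each level is named after its interval.

```r
# Hypothetical ages (made up for illustration)
age <- c(22, 30, 36, 44, 50, 58, 61, 70, 76)

# With a single number for breaks and no labels, cut() splits the range
# into that many equal-width intervals and names each level by its interval
age_bins <- cut(age, breaks = 5)

print(levels(age_bins))
print(table(age_bins))
```

When breaks is a single number, cut() extends the range slightly at both ends, so the minimum and maximum values are still assigned to a bin.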
    

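Conclusion

In this guide, you learned methods for encoding data with R. You applied these techniques to both quantitative and qualitative variables. Depending on the objective of your project, you can apply any or all of these encoding techniques.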
