Dealing with Missing, Invalid and Duplicate Data in R
Introduction
An important part of data science is cleaning the data and getting it ready for predictive modeling. The most common problems associated with data cleaning are dealing with missing data, invalid records, and duplicate values.
In this guide, you will learn techniques for dealing with missing, invalid, and duplicate data in the statistical programming language R.
Data
In this guide, we will use a fictitious dataset of loan applications containing 600 observations and 12 variables:
1. UID - unique identifier of an applicant
2. Marital_status - whether the applicant is married ("Yes") or not ("No")
3. Dependents - number of dependents of the applicant
4. Is_graduate - whether the applicant is a graduate ("Yes") or not ("No")
5. Income - annual income of the applicant (in USD)
6. Loan_amount - loan amount (in USD) for which the application was submitted
7. Term_months - tenure of the loan (in months)
8. Credit_score - whether the applicant's credit score is good ("Satisfactory") or not ("Not_satisfactory")
9. Approval_status - whether the loan application was approved ("1") or not ("0")
10. Age - the applicant's age (in years)
11. Sex - whether the applicant is male ("M") or female ("F")
12. Purpose - purpose of applying for the loan
Let's start by loading the required libraries and the data.
library(readr)
library(dplyr)    # glimpse() comes from dplyr
dat <- read_csv("data_cleaning.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 12
$ UID <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status <chr> "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Y...
$ Dependents <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", ...
$ Income <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex <chr> "M", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M",...
$ Purpose <chr> "Education", "Travel", "Personal", "Business", "Educat...
The output shows that the dataset has six numerical variables (labelled 'int'), while the remaining six are categorical variables (labelled 'chr'). We will convert these into 'factor' variables (except the 'UID' variable) using the lines of code below.
names <- c(2, 4, 8, 9, 11, 12)    # positions of the categorical variables, excluding UID
dat[, names] <- lapply(dat[, names], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 12
$ UID <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status <fct> No, Yes, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes...
$ Dependents <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Y...
$ Income <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex <fct> M, M, F, F, F, M, M, M, M, M, M, M, M, F, F, M, M, M, ...
$ Purpose <fct> Education, Travel, Personal, Business, Education, Educ...
We are now ready to carry out the data cleaning steps in the sections below.
Duplicate Values
The first step is to check for duplicate records, one of the most common errors in real-world data. Duplicate records increase computation time and can degrade model accuracy, so they should be removed. In our dataset, 'UID' is the unique identifier variable and will be used to drop the duplicate records. The first line of code below uses the duplicated() function to find duplicates, while the second line prints their number.
dup_records <- duplicated(dat$UID)
sum(dup_records)
Output:
[1] 3
The output shows that there are three duplicate records. We will drop these records using the first line of code below. The second line prints the dimensions of the resulting dataset: 597 observations and 12 variables.
dat <- dat[!duplicated(dat$UID), ]
dim(dat)
Output:
[1] 597 12
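Here we keyed on 'UID', but duplicated() can also flag rows that are identical across every column. A minimal sketch on a hypothetical data frame (the `df` below is illustrative, not the loan data):

```r
# Hypothetical data frame in which row 3 fully duplicates row 2
df <- data.frame(UID = c("A1", "A2", "A2", "A3"),
                 Income = c(100, 200, 200, 300),
                 stringsAsFactors = FALSE)

dup_rows <- duplicated(df)      # TRUE only when an ENTIRE row repeats an earlier one
df_clean <- df[!dup_rows, ]     # keep the first occurrence of each row

nrow(df_clean)                  # 3
```

Keying on the identifier is stricter: two applicants with the same UID but different incomes would still be treated as duplicates, whereas whole-row matching would keep both.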
Invalid Values
When we looked at the data using the glimpse() function in the previous section, we noticed that the 'Age' variable has incorrect entries. Let's look at its summary.
summary(dat$Age)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-12.00 36.50 51.00 49.03 61.00 76.00 2
The output shows that the minimum value of the variable 'Age' is -12. This is impossible and brings us to the next common problem in real-world datasets: the presence of inaccurate records. It is safe to assume that for loan applications, the minimum age should be 18 years. This means that we will remove records of applicants below 18 years of age.
The first two lines of code below give us the number of records in the dataset for which the age is below 18 years. There are seven such records, and they are removed with the third line of code. The fourth line prints the dimensions of the new data: 590 observations and 12 variables.
Finally, we look again at the summary of the 'Age' variable. It shows that the range of age is now 23 to 76 years, indicating that the correction has been made.
age_18 <- dat[which(dat$Age<18),]
dim(age_18)
dat <- dat[-which(dat$Age<18),]
dim(dat)
summary(dat$Age)
Output:
[1] 590 12
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
23.00 37.00 51.00 49.72 61.00 76.00 2
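Range checks like the one above can be wrapped in a small helper so the same rule is applied consistently across columns. A sketch, with an illustrative data frame and bounds of our choosing (not part of the original data):

```r
# Count values of a numeric vector that fall outside [lo, hi], ignoring NAs
count_out_of_range <- function(x, lo, hi) sum(x < lo | x > hi, na.rm = TRUE)

# Toy data: two invalid ages (-3 and 101) under an assumed valid range of 18-100
df <- data.frame(Age = c(-3, 23, 45, 101),
                 Income = c(1e5, 2e5, 3e5, 4e5))

count_out_of_range(df$Age, 18, 100)       # 2
df <- df[df$Age >= 18 & df$Age <= 100, ]  # drop the invalid records
nrow(df)                                  # 2
```

Counting before dropping, as the guide does with dim(age_18), is a useful sanity check that the filter removes only what you expect.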
It was easy to detect incorrect entries in the age variable. In other cases, invalid values take the form of outliers. Consider the 'Income' variable. Its summary shows that the minimum and maximum income levels are USD 136,700 and USD 3,321,001, respectively. This is a highly skewed range, indicating some extreme values. To better understand the distribution, we use the quantile() function, which gives us the 0th to 100th percentile values of the variable in steps of one percentile.
summary(dat$Income)
quantile(dat$Income,seq(0,1,0.01))
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
136700 386700 512800 687874 775300 3321001
0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
136700 136700 210676 240000 244288 254540 263684 274664 286108 295100
10% 11% 12% 13% 14% 15% 16% 17% 18% 19%
301560 311100 317700 320000 329100 332220 333300 335288 344400 346700
20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
352860 358800 363084 371396 383332 386700 391172 397980 401508 405556
30% 31% 32% 33% 34% 35% 36% 37% 38% 39%
410220 412724 421052 422244 424804 431960 437016 444400 448620 454972
40% 41% 42% 43% 44% 45% 46% 47% 48% 49%
458920 465068 468448 476468 479696 486180 491380 495548 500000 506956
50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
512800 515552 523184 532012 536480 551820 555504 563080 572852 577700
60% 61% 62% 63% 64% 65% 66% 67% 68% 69%
585380 607584 611276 620300 625904 633300 648308 656708 666700 683156
70% 71% 72% 73% 74% 75% 76% 77% 78% 79%
700000 721040 733300 753968 761484 775300 788132 800000 807740 821696
80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
834660 853300 880008 914712 963752 1010680 1058180 1111100 1149276 1219460
90% 91% 92% 93% 94% 95% 96% 97% 98% 99%
1262060 1333300 1392412 1502676 1664032 1944400 2064768 2223884 2608396 3197268
100%
3321001
We could remove the outliers using the method described in the previous section, but we can also address them by flooring and capping the extreme values. The first line of code below floors the lower outliers at the first-percentile value, i.e., USD 136,700 (which here coincides with the observed minimum, so no values change). The second line caps the higher outliers at the 99th-percentile value, i.e., USD 3,197,268, taken from the quantile output above.
The third line of code prints the new maximum of the variable, confirming that the cap has been applied.
dat$Income[which(dat$Income < 136700)] <- 136700
dat$Income[which(dat$Income > 3197268)] <- 3197268
max(dat$Income)
Output:
[1] 3197268
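Flooring and capping at fixed percentiles is often called winsorizing, and it can be factored into a reusable function so the cut points are computed rather than hard-coded. A sketch on simulated data (the function name and default probabilities are our choice, not from the guide):

```r
# Floor values below the lower percentile and cap values above the upper one
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, c(lower, upper), na.rm = TRUE)
  x[x < bounds[1]] <- bounds[1]
  x[x > bounds[2]] <- bounds[2]
  x
}

# Simulated incomes with one extreme high and one extreme low value
set.seed(42)
income <- c(rnorm(98, mean = 5e5, sd = 1e5), 5e6, 1e4)
capped <- winsorize(income)

max(capped) < 5e6   # TRUE: the top outlier has been pulled down to the 99th percentile
min(capped) > 1e4   # TRUE: the bottom outlier has been pulled up to the 1st percentile
```

Computing the bounds inside the function keeps the cleaning step reproducible if the data changes, instead of relying on constants read off a previous printout.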
Missing Values
Missing value treatment is the most common data cleaning step performed in a data science project. The line of code below prints the number of missing values in each of the variables.
sapply(dat, function(x) sum(is.na(x)))
Output:
UID Marital_status Dependents Is_graduate Income
0 0 3 6 0
Loan_amount Term_months Credit_score approval_status Age
7 6 0 0 2
Sex Purpose
0 0
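An equivalent, slightly more idiomatic way to count missing values per column is colSums() over is.na(), since is.na() on a data frame returns a logical matrix. A sketch on a toy data frame (illustrative, not the loan data):

```r
# Toy data frame with one NA in column a and two in column b
df <- data.frame(a = c(1, NA, 3),
                 b = c(NA, NA, "x"),
                 stringsAsFactors = FALSE)

colSums(is.na(df))   # a: 1, b: 2
```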
The output above shows the presence of missing values across several variables, all of them numerical except 'Is_graduate', which is categorical.
Missing Value Imputation for Numerical Variables
The most widely used technique for imputing values for a numerical variable is to replace the missing values with the mean or the median. In the lines of code below, we replace the missing values in 'Loan_amount' with the median, while the missing values in 'Term_months' are replaced with the mean. The output shows that the missing values have been imputed.
dat$Loan_amount[is.na(dat$Loan_amount)] <- median(dat$Loan_amount, na.rm = TRUE)
table(is.na(dat$Loan_amount))
dat$Term_months[is.na(dat$Term_months)] <- mean(dat$Term_months, na.rm = TRUE)
table(is.na(dat$Term_months))
Output:
FALSE
590
FALSE
590
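The same fill-with-a-statistic pattern can be wrapped in a helper that works for any numeric column, which avoids repeating the is.na() bookkeeping. A sketch (the function name and the toy vector are ours, not from the guide):

```r
# Replace NAs in a numeric vector with a summary statistic (median by default)
impute_numeric <- function(x, fun = median) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

loan <- c(100, NA, 200, 700)
impute_numeric(loan)               # the NA becomes the median, 200
impute_numeric(loan, fun = mean)   # or the mean of the observed values
```

Passing the statistic as a function argument lets the same helper serve both the median and mean strategies used above.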
Missing Value Imputation for Categorical Variables
For categorical variables, it is important to know the frequency distribution, which can be printed with the line of code below.
table(dat$Is_graduate)
Output:
No Yes
127 457
The output shows that most of the applicants are graduates, labelled 'Yes'. The line of code below replaces the missing values with the most frequent label, 'Yes'.
dat$Is_graduate[is.na(dat$Is_graduate)] <- "Yes"
table(dat$Is_graduate)
Output:
No Yes
127 463
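Replacing missing categorical values with the most frequent level (the mode) can likewise be generalized into a helper. A sketch (the helper name is ours, not from the guide):

```r
# Replace NAs in a factor with its most frequent level
impute_mode <- function(x) {
  mode_val <- names(which.max(table(x)))  # table() drops NAs by default
  x[is.na(x)] <- mode_val
  x
}

grad <- factor(c("Yes", "No", "Yes", NA, "Yes"))
impute_mode(grad)   # the NA becomes "Yes"
```

Note that assigning the label back into the factor works because the mode is, by construction, an existing level of the factor.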