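Validating Data Using Asserts in R
Introduction
Data quality plays a vital role in machine learning. Without good data, errors creep in and distort both the analysis and the measured performance of models. These errors are often hard to detect and only surface late in the analysis. Worse, some errors are never detected and flow through the data, producing inaccurate results. The remedy is data validation. Enter asserts: debugging aids that test a condition and can be used to check data programmatically.
In this guide, you will learn to validate data using asserts in R. Specifically, we'll use the assertr package, which provides a range of functions designed to verify assumptions about data early in a data analysis pipeline.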
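As a minimal illustration of what an assert does, the sketch below uses base R's stopifnot() rather than assertr. The ages vector is hypothetical toy data, not taken from the guide's dataset, and tryCatch() is used only so that the failure message can be captured instead of halting the script.

```r
# Hypothetical toy values, not taken from the guide's dataset
ages <- c(34, 51, -10, 47)

# stopifnot() raises an error as soon as one of its conditions is not TRUE.
# tryCatch() captures that error so the message can be inspected.
result <- tryCatch(
  {
    stopifnot(is.numeric(ages), all(ages >= 0))
    "all checks passed"
  },
  error = function(e) conditionMessage(e)
)

# The captured message names the condition that failed
print(result)
```

The assertr functions used in the rest of this guide follow the same idea but work on data frames inside a pipeline and report which rows failed.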
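Data
In this guide, we will use a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
Marital_status: Whether the applicant is married ("Yes") or not ("No").
Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
Income: Annual income of the applicant (in USD).
Loan_amount: Loan amount (in USD) for which the application was submitted.
Credit_score: Whether the applicant's credit score was satisfactory or not.
approval_status: Whether the loan application was approved or not, stored as the binary values 0 and 1.
Age: The applicant's age.
Sex: Whether the applicant is male ("M") or female ("F").
Dependents: The number of dependents in the applicant's household.
Purpose: The purpose of applying for the loan.
Let's start by loading the required libraries and the data.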
library(readr)
library(assertr)
library(assertive)
library(magrittr)
library(dplyr)
dat <- read_csv("dataset.csv")
dim(dat)
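Output:
[1] 600  10
The Importance of Asserts
The example below shows why asserts matter. We summarize the average age of applicants grouped by approval status. The first line of code converts the approval_status variable into a factor, and the statement that follows performs the required computation.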
dat$approval_status = as.factor(dat$approval_status)
dat %>%
group_by(approval_status) %>%
summarise(avg_age=mean(Age))
Output:
approval_status avg_age
<fctr> <dbl>
0 47.40000
1 48.61463
Nothing looks wrong in the output above, but let's look at the summary of the Age variable.
summary(dat$Age)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-10.00 36.00 50.00 48.23 61.00 76.00
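The output above shows a minimum age of -10, so some applicants have a negative age, which is impossible. This is incorrect data, yet the earlier group_by operation did not detect it. This is where assertr's verify() function can be used to ensure that such errors do not go unnoticed.
The verify() function takes a data frame (dat) and a logical expression (Age >= 0). It then evaluates that expression in the context of the supplied data. If the condition is not met, verify() raises an error and terminates further processing of the pipeline. In this example, the lines of code below perform this task.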
dat %>%
verify(Age >= 0) %>%
group_by(approval_status) %>%
summarise(avg_age=mean(Age))
Output:
verification [Age >= 0] failed! (10 failures)
verb redux_fn predicate column index value
1 verify NA Age >= 0 NA 1 NA
2 verify NA Age >= 0 NA 2 NA
3 verify NA Age >= 0 NA 3 NA
4 verify NA Age >= 0 NA 4 NA
5 verify NA Age >= 0 NA 193 NA
6 verify NA Age >= 0 NA 194 NA
7 verify NA Age >= 0 NA 195 NA
8 verify NA Age >= 0 NA 199 NA
9 verify NA Age >= 0 NA 209 NA
10 verify NA Age >= 0 NA 600 NA
Error: assertr stopped execution
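The output lists the 10 instances where Age is negative, identified by their row index. Finally, the message Error: assertr stopped execution indicates that execution was halted, which is why the desired summary is not displayed.
The same task can be performed with assertr's assert() function. In the code below, assert() takes the data, dat, and applies the predicate function within_bounds(0, Inf) to the column of interest, Age. The range is set to accept only non-negative values (both bounds are inclusive by default), but it can be changed as needed. The code below raises an error when the condition is not met.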
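The halting behavior described above can be sketched on a small example. The toy data frame below is hypothetical and merely stands in for the loan dataset. The sketch assumes assertr's success_logical and error_logical handlers, passed through the success_fun and error_fun arguments of verify(), to turn the check into a plain TRUE/FALSE result instead of an error.

```r
library(assertr)

# Hypothetical toy data standing in for the loan dataset
toy <- data.frame(
  Age = c(34, -2, 51, -5),
  approval_status = c(1, 0, 1, 1)
)

# Default behavior: verify() prints a failure report and stops with an error.
# tryCatch() records whether that error occurred.
stopped <- tryCatch(
  {
    verify(toy, Age >= 0)
    FALSE
  },
  error = function(e) TRUE
)

# Alternative handlers: return TRUE on success and FALSE on failure,
# so the pipeline is not interrupted.
ok <- verify(toy, Age >= 0,
             success_fun = success_logical,
             error_fun = error_logical)

print(stopped)
print(ok)
```

Returning a logical can be convenient for scripts that want to branch on the result, whereas the default stop-on-failure behavior is usually the safer choice inside an analysis pipeline.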
dat %>%
assert(within_bounds(0,Inf), Age) %>%
group_by(approval_status) %>%
summarise(avg_age=mean(Age))
Output:
Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times
verb redux_fn predicate column index value
1 assert NA within_bounds(0, Inf) Age 1 -2
2 assert NA within_bounds(0, Inf) Age 2 -3
3 assert NA within_bounds(0, Inf) Age 3 -4
4 assert NA within_bounds(0, Inf) Age 4 -5
5 assert NA within_bounds(0, Inf) Age 193 -5
[omitted 5 rows]
Error: assertr stopped execution
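The first line of the output, Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times, indicates that ten rows have a negative age value.
Combining Multiple Asserts
Validating data points one at a time with asserts can be time-consuming and inefficient. A more efficient approach is to use the assert family of functions and chain such commands together for data validation, as in the example below.
Suppose we want to verify the following conditions in the data.
The data contains all ten variables described in the initial section of the guide. This is achieved with the verify(has_all_names()) command in the code below.
The dataset contains more than 120 observations, which is 20% of the initial data. This is achieved with the verify(nrow(.) > 120) command below.
The variable Age takes only positive values. This is achieved with the verify(Age > 0) command below.
The variables Income and Loan_amount should be within three standard deviations of their respective means. This is achieved with the insist(within_n_sds(3), Income) and insist(within_n_sds(3), Loan_amount) commands in the code below.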
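Whether zero counts as a valid age depends on how the bounds are defined. The sketch below, on hypothetical toy data, contrasts the default inclusive bounds of within_bounds() with an exclusive lower bound set through its include.lower argument. The error_logical and success_logical handlers are used only so that the result can be shown as TRUE or FALSE.

```r
library(assertr)

# within_bounds() returns a predicate function; both bounds are inclusive by default
non_negative <- within_bounds(0, Inf)
strictly_positive <- within_bounds(0, Inf, include.lower = FALSE)

zero_is_non_negative <- non_negative(0)
zero_is_positive <- strictly_positive(0)

# Hypothetical toy data containing an age of exactly zero
toy <- data.frame(Age = c(0, 25, 40))

# Passes with inclusive bounds, fails once the lower bound is exclusive
passes_inclusive <- assert(toy, non_negative, Age,
                           success_fun = success_logical,
                           error_fun = error_logical)
passes_exclusive <- assert(toy, strictly_positive, Age,
                           success_fun = success_logical,
                           error_fun = error_logical)

print(c(zero_is_non_negative, zero_is_positive))
print(c(passes_inclusive, passes_exclusive))
```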
The target variable, approval_status, contains only the binary values zero and one. This is achieved with the assert(in_set(0,1), approval_status) command in the code below.
Each row in the data contains at most six missing records. This is achieved with the assert_rows(num_row_NAs, within_bounds(0,6), everything()) command below.
Each row is unique jointly between the Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, and Credit_score variables. This is achieved with the assert_rows(col_concat, is_uniq,...) command below.
dat %>%
verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents", "Is_graduate", "Credit_score", "approval_status", "Age", "Sex", "Purpose")) %>%
verify(nrow(.) > 120) %>%
verify(Age > 0) %>%
insist(within_n_sds(3), Income) %>%
insist(within_n_sds(3), Loan_amount) %>%
assert(in_set(0,1), approval_status) %>%
assert_rows(num_row_NAs, within_bounds(0,6), everything()) %>%
assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, Credit_score) %>%
group_by(approval_status) %>%
summarise(avg.Age=mean(Age))
Output:
verification [Age > 0] failed! (10 failures)
verb redux_fn predicate column index value
1 verify NA Age > 0 NA 1 NA
2 verify NA Age > 0 NA 2 NA
3 verify NA Age > 0 NA 3 NA
4 verify NA Age > 0 NA 4 NA
5 verify NA Age > 0 NA 193 NA
6 verify NA Age > 0 NA 194 NA
7 verify NA Age > 0 NA 195 NA
8 verify NA Age > 0 NA 199 NA
9 verify NA Age > 0 NA 209 NA
10 verify NA Age > 0 NA 600 NA
Error: assertr stopped execution
The output shows that the first two requirements are met, but execution was halted at the third condition because the Age variable takes non-positive values. Let's correct this by creating a new data frame, dat2, that keeps only the rows with positive age values. This is done using the code below.
dat2 <- dat %>%
filter(Age > 0)
dim(dat2)
Output:
[1] 590  10
The resulting data has 590 observations because ten rows containing negative values of age were removed. We'll recheck the combination of the data conditions, specified above, using the code below.
dat2 %>%
verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents", "Is_graduate", "Credit_score", "approval_status", "Age", "Sex", "Purpose")) %>%
verify(nrow(.) > 120) %>%
verify(Age > 0) %>%
insist(within_n_sds(3), Income) %>%
insist(within_n_sds(3), Loan_amount) %>%
assert(in_set(0,1), approval_status) %>%
assert_rows(num_row_NAs, within_bounds(0,6), everything()) %>%
assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, Credit_score) %>%
group_by(approval_status) %>%
summarise(avg.Age=mean(Age))
Output:
Column 'Income' violates assertion 'within_n_sds(3)' 7 times
verb redux_fn predicate column index value
1 insist NA within_n_sds(3) Income 190 3173700
2 insist NA within_n_sds(3) Income 255 5219600
3 insist NA within_n_sds(3) Income 321 5333200
4 insist NA within_n_sds(3) Income 324 6901700
5 insist NA within_n_sds(3) Income 344 8444900
[omitted 2 rows]
The output shows that now there is no error alert for negative age values, since those were dropped. Instead, the insist() function found seven records where the Income variable was not within three standard deviations from the mean. The output also prints the index of these records, making it easier for us to treat them as outliers. In this way, we can go on validating the data assumptions and incorporating required corrections if needed.
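To make the three-standard-deviation rule concrete, the sketch below reproduces in base R the kind of bounds check that within_n_sds(3) applies: values are flagged when they fall outside the mean plus or minus three standard deviations. The income vector is hypothetical toy data, not the guide's dataset, and the variable names are illustrative only.

```r
# Hypothetical toy incomes: 20 ordinary values plus one extreme value
income <- c(seq(40000, 59000, by = 1000), 5000000)

# Mean and standard deviation of the column, ignoring missing values
mu <- mean(income, na.rm = TRUE)
sigma <- sd(income, na.rm = TRUE)

# Bounds at three standard deviations on either side of the mean
lower <- mu - 3 * sigma
upper <- mu + 3 * sigma

# Row indices of values falling outside the bounds
outlier_idx <- which(income < lower | income > upper)
print(outlier_idx)
```

Note that extreme values inflate both the mean and the standard deviation they are judged against, so after removing or correcting flagged records it is worth re-running the check on the cleaned data.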
Conclusion
In this guide, you have learned methods of validating data using asserts in R. You applied these assertions mainly through the verify() and assert() functions, and combined them with insist() and assert_rows() in a single validation chain. This knowledge will help you perform proper data validation, resulting in better data science and analytics results.
To learn more about Data Science with R, please refer to the following guides:
Hypothesis Testing - Interpreting Data with Statistical Models