Original source: http://stackoverflow.com/questions/2900510/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
R equivalent of SELECT DISTINCT on two or more fields/variables
Asked by wahalulu
Say I have a data frame df with two or more columns. Is there an easy way to use unique() or another R function to create a subset of the unique combinations of two or more columns?
I know I can use sqldf() and write a simple "SELECT DISTINCT var1, var2, ... varN" query, but I am looking for an R way of doing this.
It occurred to me to try ftable coerced to a data frame and use the field names, but then I also get cross tabulations of combinations that don't exist in the dataset:
uniques <- as.data.frame(ftable(df$var1, df$var2))
Answered by Marek
unique works on a data.frame, so unique(df[c("var1","var2")]) should be what you want.
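As a minimal sketch of the unique() approach, assuming a toy data frame (the column names var1 and var2 follow the question; the values and the extra value column are made up for illustration):

```r
# Toy data; only the column names var1/var2 come from the question.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# Select the two columns of interest, then drop duplicated rows.
uniques <- unique(df[c("var1", "var2")])
print(uniques)
```

Only combinations that actually occur in df are returned, unlike the ftable attempt in the question, and the original row names of the first occurrences are kept.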
Another option is distinct from the dplyr package:
df %>% distinct(var1, var2) # or distinct(df, var1, var2)
Note: for older versions of dplyr (< 0.5.0, 2016-06-24), distinct required an additional step:
df %>% select(var1, var2) %>% distinct
(or, the older way, distinct(select(df, var1, var2))).
Answered by Tjebo
@Marek's answer is obviously correct, but may be outdated. The current dplyr version (0.7.4) allows even simpler code:
Simply use:
df %>% distinct(var1, var2)
If you want to keep all columns, add .keep_all = TRUE:
df %>% distinct(var1, var2, .keep_all = TRUE)
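To make the difference between the two calls concrete, here is a small sketch. It assumes the dplyr package is installed and uses a made-up data frame; only the column names var1 and var2 come from the question.

```r
library(dplyr)

# Toy data; the values and the extra column are illustrative only.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# Returns only the var1/var2 columns, one row per distinct combination.
combos <- df %>% distinct(var1, var2)

# Keeps every column, taking the first row found for each combination.
first_rows <- df %>% distinct(var1, var2, .keep_all = TRUE)

print(combos)
print(first_rows)
```

With .keep_all = TRUE, the values in the other columns come from the first row of each combination, so the row order of df matters.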
Answered by sbaniwal
To keep all other variables in df, use this:
unique_rows <- !duplicated(df[c("var1","var2")])
unique.df <- df[unique_rows,]
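A short sketch of how the duplicated() filter behaves, assuming a made-up data frame (only the column names var1 and var2 come from the question). It also shows the fromLast argument of duplicated(), which is not part of the original answer, for keeping the last row of each combination instead of the first:

```r
# Toy data; the values and the extra column are illustrative only.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# TRUE for the first occurrence of each var1/var2 combination.
unique_rows <- !duplicated(df[c("var1", "var2")])
unique.df <- df[unique_rows, ]

# Variation: keep the last occurrence of each combination instead.
last_rows <- !duplicated(df[c("var1", "var2")], fromLast = TRUE)
last.df <- df[last_rows, ]

print(unique.df)
print(last.df)
```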
Another, less recommended method uses row.names() (see David's comment on the original answer):
unique_rows <- row.names(unique(df[c("var1","var2")]))
unique.df <- df[unique_rows,]
Answered by Zaki
In addition to the answers above, the data.table version:
setDT(df)
unique_dt = unique(df, by = c('var1', 'var2'))
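For illustration, a self-contained sketch of the data.table approach. It assumes the data.table package is installed and uses a made-up data frame. Note that setDT() converts df in place (by reference); the sketch uses as.data.table() instead, which works on a copy and leaves the original data frame untouched.

```r
library(data.table)

# Toy data; only the column names var1/var2 come from the question.
df <- data.frame(
  var1  = c("a", "a", "b", "b", "a"),
  var2  = c(1, 1, 2, 3, 1),
  value = c(10, 20, 30, 40, 50)
)

# Copy into a data.table rather than converting df by reference.
dt <- as.data.table(df)

# One row per var1/var2 combination; all columns are kept,
# taken from the first row of each combination.
unique_dt <- unique(dt, by = c("var1", "var2"))
print(unique_dt)
```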