Original source: http://stackoverflow.com/questions/36345915/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Matching two Columns in R
Asked by Saul Garcia
I have a big dataset df (354903 rows) with two columns named df$CompleteName and df$CompleteName.1
head(df)
       CompleteName       CompleteName.1
1   Lefebvre Arnaud Lefebvre Schuhl Anne
1.1 Lefebvre Arnaud              Abe Lyu
1.2 Lefebvre Arnaud              Abe Lyu
1.3 Lefebvre Arnaud       Louvet Nicolas
1.4 Lefebvre Arnaud   Muller Jean Michel
1.5 Lefebvre Arnaud  De Dinechin Florent
I am trying to create labels to see whether the two names are the same or not. When I try a small subset it works [1 if they are the same, 0 if not]:
> match(df$CompleteName[1], df$CompleteName.1[1], nomatch = 0)
[1] 0
> match(df$CompleteName[1:10], df$CompleteName.1[1:10], nomatch = 0)
[1] 0 0 0 0 0 0 0 0 0 0
But as soon as I pass in the complete columns, it gives me completely different values, which seem like nonsense to me:
> match(df$CompleteName, df$CompleteName.1, nomatch = 0)
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[23] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[45] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
Should I use sapply? I could not figure it out; I tried this and got an error:
sapply(df, function(x) match(x$CompleteName, x$CompleteName.1, nomatch = 0))
Please help!!!
Answered by andrechalom
From the man page of match,
'match' returns a vector of the positions of (first) matches of its first argument in its second.
So your data seem to indicate that the first match of "Lefebvre Arnaud" (the first position in the first argument) is in row 101 of the second column. I believe what you intended to do is a simple row-by-row comparison, and that's just the equality operator ==.
Some sample data:
> a <- rep ("Lefebvre Arnaud", 6)
> b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
> x <- data.frame(a,b, stringsAsFactors=F)
> x
                a                   b
1 Lefebvre Arnaud             Abe Lyu
2 Lefebvre Arnaud             Abe Lyu
3 Lefebvre Arnaud     Lefebvre Arnaud
4 Lefebvre Arnaud De Dinechin Florent
5 Lefebvre Arnaud De Dinechin Florent
6 Lefebvre Arnaud De Dinechin Florent
> x$a == x$b
[1] FALSE FALSE TRUE FALSE FALSE FALSE
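To see why match() can return the same position for every row while == gives a per-row answer, the two can be put side by side. This is only a minimal sketch on made-up vectors; the names first and second are hypothetical and do not come from the question:

```r
# Hypothetical toy vectors standing in for the two name columns
first  <- c("Lefebvre Arnaud", "Lefebvre Arnaud", "Abe Lyu")
second <- c("Abe Lyu", "Louvet Nicolas", "Lefebvre Arnaud")

# match() searches the WHOLE second vector and reports, for each element
# of the first vector, the position of its first occurrence (0 if absent)
pos <- match(first, second, nomatch = 0)
print(pos)       # 3 3 1

# == compares element by element: row 1 with row 1, row 2 with row 2, ...
same_row <- first == second
print(same_row)  # FALSE FALSE FALSE
```

Because == is already vectorized over the whole column, no sapply loop is needed for the row-wise comparison.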
EDIT: Also, you need to make sure that you are comparing apples to apples, so double check the data type of your columns. Use str(df) to see whether the columns are strings or factors. You can either construct the data frame with "stringsAsFactors = FALSE", or convert from factor to character. There are several ways to do that, check here: Convert data.frame columns from factors to characters
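One possible way to do the factor-to-character conversion is sketched below. The data frame is made up for illustration, and stringsAsFactors = TRUE is set on purpose so that the columns start out as factors regardless of the R version:

```r
# Hypothetical data frame whose columns are deliberately factors
x <- data.frame(
  a = c("Lefebvre Arnaud", "Lefebvre Arnaud"),
  b = c("Abe Lyu", "Lefebvre Arnaud"),
  stringsAsFactors = TRUE
)
str(x)  # both columns are reported as Factor

# Convert every factor column to character, leaving other columns alone
is_fac <- vapply(x, is.factor, logical(1))
x[is_fac] <- lapply(x[is_fac], as.character)

# The comparison is now a plain string comparison
same <- x$a == x$b
print(same)  # FALSE TRUE
```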
Answered by jaimedash
As others have pointed out, match isn't the right tool here. What you want is equality, which you can test with ==, which gives you TRUE/FALSE. Then as.numeric will give you the desired 1/0, or which will give you the indices.
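As a rough sketch of those two options, assuming character columns (the column name label is a hypothetical choice, not something from the question):

```r
a <- rep("Lefebvre Arnaud", 6)
b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
df <- data.frame(CompleteName = a, CompleteName.1 = b, stringsAsFactors = FALSE)

# Element-wise comparison gives a logical vector
same <- df$CompleteName == df$CompleteName.1

# 1 where the names are equal, 0 otherwise
df$label <- as.integer(same)
print(df$label)  # 0 0 1 0 0 0

# Row indices of the matching rows
idx <- which(same)
print(idx)       # 3
```

as.integer is used here instead of as.numeric only to keep the labels as whole numbers; either gives 1/0.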
But you may still have an issue with factors!
# making up some similar data (adapted from earlier answer)
a <- rep ("Lefebvre Arnaud", 6)
b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
df <- data.frame(CompleteName = a, CompleteName.1 = b)
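which(df$CompleteName == df$CompleteName1)  # "CompleteName1" lacks the dot, so the right-hand side is NULL
#integer(0)
#Warning message:
#In is.na(e2) : is.na() applied to non-(list or vector) of type 'NULL'
# With the correct name, df$CompleteName.1, comparing two factors whose
# level sets differ stops with "level sets of factors are different"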
str(df)
# 'data.frame': 6 obs. of 2 variables:
# $ CompleteName : Factor w/ 1 level "Lefebvre Arnaud": 1 1 1 1 1 1
# $ CompleteName.1: Factor w/ 3 levels "Abe Lyu","De Dinechin Florent",..: 1 1 3 2 2 2
stringsAsFactors
Above, the data.frame wasn't constructed with stringsAsFactors=FALSE, and that caused an error. Unfortunately, out of the box R (before version 4.0.0) will coerce strings to factors when loading a csv or creating a data.frame. This can be fixed when creating the data.frame by explicitly specifying stringsAsFactors=FALSE:
df <- data.frame(CompleteName = a, CompleteName.1 = b, stringsAsFactors = FALSE)
df[which(df$CompleteName == df$CompleteName.1), ]
##      CompleteName  CompleteName.1
## 3 Lefebvre Arnaud Lefebvre Arnaud
To avoid the issue in the future, run options(stringsAsFactors = FALSE) at the beginning of your R session (or put it at the top of your .R script). More discussion here:
Answered by Matt Weller
Here's a solution using a data.table, with a performance comparison against the data.frame solution, based on the same number of records as in your case.
col1 = sample(x = letters, size = 354903, replace = TRUE)
col2 = sample(x = letters, size = 354903, replace = TRUE)
library(data.table)
dt = data.table(col1 = col1, col2 = col2)
df = data.frame(col1 = col1, col2 = col2)
# comparing the 2 columns
system.time(dt$col1==dt$col2)
system.time(df$col1==df$col2)
# storing the comparison in the table/frame itself
system.time(dt[, col3:= (col1==col2)])
system.time({df$col3 = (df$col1 == df$col2)})
The data.table approach offers a significant speedup on my machine: from 0.020s to 0.008s.
Try it for yourself and see. I know this is not really significant with such a small number of rows, but multiply that by 1000 and you'll see a major difference!
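A single system.time call at the millisecond scale can be noisy, so one way to try this yourself is to repeat each measurement and take the median. The sketch below is only an illustration: median_elapsed is a hypothetical helper, the number of repetitions is an arbitrary choice, and the data.table part runs only if that package is installed. Timings will differ from machine to machine.

```r
set.seed(1)
n <- 354903
col1 <- sample(letters, size = n, replace = TRUE)
col2 <- sample(letters, size = n, replace = TRUE)
df <- data.frame(col1 = col1, col2 = col2, stringsAsFactors = FALSE)

# Hypothetical helper: median elapsed time (seconds) over several runs of f
median_elapsed <- function(f, reps = 20) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

# Time the element-wise comparison on the data.frame columns
t_df <- median_elapsed(function() df$col1 == df$col2)
cat("data.frame:", t_df, "seconds\n")

# Time the same comparison on a data.table, if the package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::data.table(col1 = col1, col2 = col2)
  t_dt <- median_elapsed(function() dt$col1 == dt$col2)
  cat("data.table:", t_dt, "seconds\n")
}
```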