Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/36345915/
Date: 2020-09-08 16:31:26  Source: igfitidea

Matching two Columns in R

Tags: r, string, match, sapply

Asked by Saul Garcia

I have a big dataset df (354903 rows) with two columns named df$CompleteName and df$CompleteName.1:

head(df)
       CompleteName       CompleteName.1
1   Lefebvre Arnaud Lefebvre Schuhl Anne
1.1 Lefebvre Arnaud              Abe Lyu
1.2 Lefebvre Arnaud              Abe Lyu
1.3 Lefebvre Arnaud       Louvet Nicolas
1.4 Lefebvre Arnaud   Muller Jean Michel
1.5 Lefebvre Arnaud  De Dinechin Florent

I am trying to create labels to see whether the two names in each row are the same or not. When I try a small subset it works [1 if they are the same, 0 if not]:

> match(df$CompleteName[1], df$CompleteName.1[1], nomatch = 0)
[1] 0
> match(df$CompleteName[1:10], df$CompleteName.1[1:10], nomatch = 0)
[1] 0 0 0 0 0 0 0 0 0 0

But as soon as I pass in the complete columns, it gives me completely different values, which make no sense to me:

> match(df$CompleteName, df$CompleteName.1, nomatch = 0)
[1] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[23] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101
[45] 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101

Should I use sapply? I could not figure it out. I tried this and got an error:

 sapply(df, function(x) match(x$CompleteName, x$CompleteName.1, nomatch = 0))

Please help!!!


Answered by andrechalom

From the help page of match (?match):

'match' returns a vector of the positions of (first) matches of its first argument in its second.

So your data seem to indicate that the first match of "Lefebvre Arnaud" (the first element of the first argument) is in row 101 of the second column. Because that name is repeated in the first column, every one of those rows gets the same position. I believe what you intended is a simple row-by-row comparison, which is just the equality operator ==.

Some sample data:


> a <- rep("Lefebvre Arnaud", 6)
> b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
> x <- data.frame(a, b, stringsAsFactors = FALSE)
> x
                a                   b
1 Lefebvre Arnaud             Abe Lyu
2 Lefebvre Arnaud             Abe Lyu
3 Lefebvre Arnaud     Lefebvre Arnaud
4 Lefebvre Arnaud De Dinechin Florent
5 Lefebvre Arnaud De Dinechin Florent
6 Lefebvre Arnaud De Dinechin Florent
> x$a == x$b
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
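
To connect this back to the question, here is a minimal sketch with made-up data. The vectors `first` and `second` are hypothetical stand-ins for the two columns. It shows why match() can return 0 on a small subset and one repeated position on the full data, while == compares row by row.

```r
# Hypothetical stand-ins for df$CompleteName and df$CompleteName.1
first  <- rep("Lefebvre Arnaud", 5)
second <- c("Abe Lyu", "Louvet Nicolas", "Abe Lyu", "Lefebvre Arnaud", "Abe Lyu")

# match() looks up each element of `first` anywhere in `second` and returns
# the position of the first hit, so every repeated name gets the same position
pos <- match(first, second, nomatch = 0)
print(pos)

# On a subset of `second` that does not contain the name, nothing is found
pos_subset <- match(first[1:3], second[1:3], nomatch = 0)
print(pos_subset)

# == compares element i of `first` with element i of `second`
same <- first == second
print(same)
```

Only the fourth row holds the same name in both vectors, and that is the only TRUE that == reports.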

EDIT: Also, you need to make sure that you are comparing apples to apples, so double-check the data type of your columns. Use str(df) to see whether the columns are strings or factors. You can either construct the data frame with stringsAsFactors = FALSE, or convert from factor to character. There are several ways to do that; see: Convert data.frame columns from factors to characters
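
One possible way to do the conversion, sketched on a small hypothetical data frame `x` (the names and values are made up for illustration). Factor columns are forced here with stringsAsFactors = TRUE, which was the default before R 4.0.0.

```r
# Hypothetical data frame with factor columns
x <- data.frame(a = c("Abe Lyu", "Lefebvre Arnaud"),
                b = c("Abe Lyu", "Louvet Nicolas"),
                stringsAsFactors = TRUE)
str(x)

# Find the factor columns and convert only those to character
is_fac <- vapply(x, is.factor, logical(1))
x[is_fac] <- lapply(x[is_fac], as.character)

# The columns are now plain strings and can be compared row by row
str(x)
print(x$a == x$b)
```

Converting only the factor columns leaves any numeric or logical columns untouched.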

Answered by jaimedash

As others have pointed out, match isn't the right tool here. What you want is equality, which you can get by testing with ==, which gives you TRUE/FALSE. Then as.numeric will give you the desired 1/0, or which will give you the indices of the matching rows.
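
A minimal sketch of those two steps, using made-up character vectors `a` and `b` in place of the real columns:

```r
# Hypothetical name vectors standing in for the two columns
a <- c("Lefebvre Arnaud", "Lefebvre Arnaud", "Abe Lyu")
b <- c("Abe Lyu", "Lefebvre Arnaud", "Abe Lyu")

# Row-by-row comparison gives a logical vector
same <- a == b

# as.numeric() turns TRUE/FALSE into the 1/0 labels
label <- as.numeric(same)
print(label)

# which() gives the row indices where the names are equal
idx <- which(same)
print(idx)
```

If either column contains NA, == returns NA for that row; which() skips those rows, while as.numeric() keeps them as NA.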

But you may still have an issue with factors!

 # making up some similar data (adapted from the earlier answer)
 a <- rep("Lefebvre Arnaud", 6)
 b <- c("Abe Lyu", "Abe Lyu", "Lefebvre Arnaud", rep("De Dinechin Florent", 3))
 df <- data.frame(CompleteName = a, CompleteName.1 = b)
 which(df$CompleteName == df$CompleteName.1)
 # Error in Ops.factor(df$CompleteName, df$CompleteName.1) :
 #   level sets of factors are different

 str(df)
 # 'data.frame':    6 obs. of  2 variables:
 # $ CompleteName  : Factor w/ 1 level "Lefebvre Arnaud": 1 1 1 1 1 1
 # $ CompleteName.1: Factor w/ 3 levels "Abe Lyu","De Dinechin Florent",..: 1 1 3 2 2 2

stringsAsFactors


Above, the data.frame wasn't constructed with stringsAsFactors=FALSE, which caused an error. Unfortunately, out of the box R (before version 4.0.0) will coerce strings to factors when loading a csv or creating a data.frame. This can be fixed when creating the data.frame by explicitly specifying stringsAsFactors=FALSE:

df <- data.frame(CompleteName = a, CompleteName.1 = b, stringsAsFactors = FALSE)
df[which(df$CompleteName == df$CompleteName.1), ]
##     CompleteName CompleteName.1
## 3 Lefebvre Arnaud Lefebvre Arnaud
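
The same argument applies when loading a csv. The sketch below writes a tiny made-up file to a temporary path (the file contents and the name `csv_df` are assumptions for illustration) and reads it back with read.csv:

```r
# Write a tiny hypothetical CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("CompleteName,CompleteName.1",
             "Lefebvre Arnaud,Abe Lyu",
             "Lefebvre Arnaud,Lefebvre Arnaud"), path)

# Read it back, keeping the text columns as character rather than factor
csv_df <- read.csv(path, stringsAsFactors = FALSE)
str(csv_df)

# The columns are plain strings, so == works row by row
print(csv_df$CompleteName == csv_df$CompleteName.1)

# Clean up the temporary file
unlink(path)
```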

To avoid the issue in the future, run options(stringsAsFactors = FALSE) at the beginning of your R session (or put it at the top of your .R script). More discussion here:

Answered by Matt Weller

Here's a solution using a data.table, with a performance comparison against the data.frame solution, based on the same number of records as in your case.

col1 = sample(x = letters, size = 354903, replace = TRUE)
col2 = sample(x = letters, size = 354903, replace = TRUE)

library(data.table)
dt = data.table(col1 = col1, col2 = col2)
df = data.frame(col1 = col1, col2 = col2)

# comparing the 2 columns
system.time(dt$col1==dt$col2)
system.time(df$col1==df$col2)

# storing the comparison in the table/frame itself
system.time(dt[, col3:= (col1==col2)])
system.time({df$col3 = (df$col1 == df$col2)})

The data.table approach offers a significant speedup on my machine: from 0.020s to 0.008s.

Try it for yourself and see. I know the difference is not really significant with such a small number of rows, but multiply that by 1000 and you'll see a major difference!