Partial animal string matching in R
Original question: http://stackoverflow.com/questions/22949300/
This content comes from Stack Overflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by testname123
I have a data frame:
d <- data.frame(name = c("brown cat", "blue cat", "big lion", "tall tiger",
                         "black panther", "short cat", "red bird",
                         "short bird stuffed", "big eagle", "bad sparrow",
                         "dog fish", "head dog", "brown yorkie",
                         "lab short bulldog"), label = 1:14)
I'd like to search the name column, and if any of the words "cat", "lion", "tiger", or "panther" appears, assign the character string "feline" to the corresponding row of a new column, species.
If any of the words "bird", "eagle", or "sparrow" appears, I want to assign the character string "avian" to the corresponding row of the species column.
If any of the words "dog", "yorkie", or "bulldog" appears, I want to assign the character string "canine" to the corresponding row of the species column.
Ideally, I'd store this in a list or something similar that I can keep at the beginning of the script, because as new variants of the species show up in the name column, it would be nice to have easy access to update what qualifies as a feline, avian, or canine.
This question is almost answered in "How to create new column in dataframe based on partial string matching other column in R", but that answer doesn't address the multiple-name twist present in this problem.
Answered by ping
There may be a more elegant solution than this, but you could use grep with | to specify alternative matches.
d[grep("cat|lion|tiger|panther", d$name), "species"] <- "feline"
d[grep("bird|eagle|sparrow", d$name), "species"] <- "avian"
d[grep("dog|yorkie", d$name), "species"] <- "canine"
I've assumed you meant "avian", and left out "bulldog" since it contains "dog".
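To illustrate why "bulldog" is redundant in the pattern: grepl and grep match substrings by default, so "dog" is already found inside "bulldog". The short sketch below uses a made-up vector x (not the question's data). The word-boundary variant with \\b is an optional alternative shown for contrast, not something the answer above relies on.

```r
# Hypothetical example names, not taken from the question's data frame
x <- c("head dog", "lab short bulldog", "hotdog stand")

# Plain substring matching: "dog" is found inside "bulldog" and "hotdog" too
substring_hits <- grepl("dog", x)
substring_hits
# [1] TRUE TRUE TRUE

# With word boundaries, only the standalone word "dog" matches
whole_word_hits <- grepl("\\bdog\\b", x)
whole_word_hits
# [1]  TRUE FALSE FALSE
```

Whether substring or whole-word matching is preferable depends on the data; with whole-word matching, "bulldog" would have to be listed explicitly.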
You might want to add ignore.case = TRUE to the grep calls.
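As a small sketch of what ignore.case changes, assuming some names are capitalised (the vector x below is invented for illustration and is not part of the question's data):

```r
# Hypothetical names with mixed capitalisation
x <- c("Brown Cat", "big LION", "red bird")

# Matching is case-sensitive by default: neither "Cat" nor "LION" is found
default_hits <- grep("cat|lion|tiger|panther", x)
default_hits
# integer(0)

# With ignore.case = TRUE both names are found
nocase_hits <- grep("cat|lion|tiger|panther", x, ignore.case = TRUE)
nocase_hits
# [1] 1 2
```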
Output:
#                 name label species
#1           brown cat     1  feline
#2            blue cat     2  feline
#3            big lion     3  feline
#4          tall tiger     4  feline
#5       black panther     5  feline
#6           short cat     6  feline
#7            red bird     7   avian
#8  short bird stuffed     8   avian
#9           big eagle     9   avian
#10        bad sparrow    10   avian
#11           dog fish    11  canine
#12           head dog    12  canine
#13       brown yorkie    13  canine
#14  lab short bulldog    14  canine
Answered by testname123
An elegant-ish way of doing this (I say elegant-ish because, while it's the most elegant way I know of, it's not great) is something like:
# Define the regexes at the beginning of the code
regexes <- list(c("(cat|lion|tiger|panther)", "feline"),
                c("(bird|eagle|sparrow)", "avian"),
                c("(dog|yorkie|bulldog)", "canine"))
# ....
# Create a vector, the same length as the df
output_vector <- character(nrow(d))
# For each regex..
for (i in seq_along(regexes)) {
  # Grep through d$name, and when you find matches, insert the relevant 'tag'
  # into the output vector
  output_vector[grepl(x = d$name, pattern = regexes[[i]][1])] <- regexes[[i]][2]
}
# Insert that now-filled output vector into the dataframe
d$species <- output_vector
The advantages of this method are several-fold:
- You only have to modify the data frame once in the entire process, which increases the speed of the loop (data frames do not have modification-in-place; to modify a data frame 3 times, you're essentially relabelling and recreating it 3 times).
- By specifying the length of the vector in advance, since we know what it's going to be, you increase speed even more by ensuring that the output vector never needs more memory allotted after it is created.
- Because it's a loop, rather than repeated, manual calls, the addition of more rows and categories to the 'regexes' object will not require further modification of the code. It'll run just as it does now.
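Building on the last point, one possible way to make the list at the top of the script easier to edit is to store plain keyword vectors and build the patterns from them. This is only a sketch and not part of the original answer: the names species_keywords and tag_species are invented for illustration, the keywords are assumed to contain no regex metacharacters, and unmatched rows are assumed to be better left as NA than as an empty string.

```r
# Keywords per species, kept at the top of the script (hypothetical name)
species_keywords <- list(
  feline = c("cat", "lion", "tiger", "panther"),
  avian  = c("bird", "eagle", "sparrow"),
  canine = c("dog", "yorkie", "bulldog")
)

# Hypothetical helper: returns one tag per element of x (NA if nothing matches)
tag_species <- function(x, keywords) {
  out <- rep(NA_character_, length(x))
  for (tag in names(keywords)) {
    # Join the keywords into one alternation, e.g. "cat|lion|tiger|panther"
    pattern <- paste(keywords[[tag]], collapse = "|")
    out[grepl(pattern, x)] <- tag
  }
  out
}

# Small made-up data frame; "blue whale" matches no keyword
d <- data.frame(name = c("brown cat", "big eagle", "brown yorkie", "blue whale"),
                label = 1:4)
d$species <- tag_species(d$name, species_keywords)
d$species
# [1] "feline" "avian"  "canine" NA
```

Adding a new variant then only means appending a word to the relevant vector in species_keywords.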
The only disadvantage, and I think this applies to most solutions you're likely to get, is that if a name matches multiple patterns, the last pattern in the list that it matches determines its 'species' tag.
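To make the overlap behaviour concrete, here is a small sketch with a made-up name that matches two patterns. The second loop is one possible variation (an assumption, not something from the answer) in which the first matching pattern wins, because only rows that are still empty get filled.

```r
regexes <- list(c("(cat|lion|tiger|panther)", "feline"),
                c("(bird|eagle|sparrow)", "avian"),
                c("(dog|yorkie|bulldog)", "canine"))

# Hypothetical names; "cat dog" matches both the feline and the canine pattern
x <- c("cat dog", "red bird")

# As in the loop above: later patterns overwrite earlier ones
last_wins <- character(length(x))
for (i in seq_along(regexes)) {
  last_wins[grepl(regexes[[i]][1], x)] <- regexes[[i]][2]
}
last_wins
# [1] "canine" "avian"

# Variation: only tag rows that have not been tagged yet
first_wins <- character(length(x))
for (i in seq_along(regexes)) {
  hit <- grepl(regexes[[i]][1], x) & first_wins == ""
  first_wins[hit] <- regexes[[i]][2]
}
first_wins
# [1] "feline" "avian"
```

With the variation, the order of the entries in regexes acts as a priority order.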