Partial animal string matching in R

Question

Asked by testname123

I have a data frame:

d<-data.frame(name=c("brown cat", "blue cat", "big lion", "tall tiger",
                     "black panther", "short cat", "red bird",
                     "short bird stuffed", "big eagle", "bad sparrow",
                     "dog fish", "head dog", "brown yorkie",
                     "lab short bulldog"), label=1:14)

I'd like to search the name column, and if any of the words "cat", "lion", "tiger", or "panther" appears, assign the character string feline to the corresponding row of a new column, species.

If any of the words "bird", "eagle", or "sparrow" appears, I want to assign the character string avian to the corresponding row of species.

If any of the words "dog", "yorkie", or "bulldog" appears, I want to assign the character string canine to the corresponding row of species.

Ideally, I'd store these words in a list or something similar that I can keep at the beginning of the script, because as new variants of each species show up in the name column, it would be nice to be able to easily update what qualifies as feline, avian, or canine.

This question is almost answered here (How to create new column in dataframe based on partial string matching other column in R), but it doesn't address the multiple name twist that is present in this problem.

Answer 1

Answered by ping

There may be a more elegant solution than this, but you could use grep with | to specify alternative matches.

d[grep("cat|lion|tiger|panther", d$name), "species"] <- "feline"
d[grep("bird|eagle|sparrow", d$name), "species"] <- "avian"
d[grep("dog|yorkie", d$name), "species"] <- "canine"

I've assumed you meant "avian", and left out "bulldog" since it contains "dog".
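
As a small illustration of why "bulldog" is redundant here: grep patterns match substrings, so "dog" already matches inside "bulldog". The sketch below also shows one way to require a whole word instead, using `\\b` word boundaries with `perl = TRUE`. That option is an assumption about what you might want, not something the question asks for.

```r
# Substring matching: "dog" is found inside "bulldog"
substring_hit <- grepl("dog", "lab short bulldog")

# Whole-word matching (illustrative alternative): \\b marks a word boundary,
# so "dog" no longer matches inside "bulldog" but still matches "dog fish"
whole_word_bulldog <- grepl("\\bdog\\b", "lab short bulldog", perl = TRUE)
whole_word_dogfish <- grepl("\\bdog\\b", "dog fish", perl = TRUE)

print(c(substring_hit, whole_word_bulldog, whole_word_dogfish))
```

With whole-word matching, "bulldog" would have to be listed explicitly in the canine pattern.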

You might want to add ignore.case = TRUE to the grep call.
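
A minimal sketch of what ignore.case changes. The `animals` vector is a made-up example with mixed capitalisation, not data from the question.

```r
# Hypothetical input with mixed capitalisation (not from the question)
animals <- c("Brown Cat", "big LION", "red bird")

# Default is case-sensitive: "Cat" and "LION" do not match the lowercase pattern
default_hits <- grepl("cat|lion|tiger|panther", animals)

# With ignore.case = TRUE both feline names match
insensitive_hits <- grepl("cat|lion|tiger|panther", animals, ignore.case = TRUE)

print(default_hits)
print(insensitive_hits)
```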

Output:

#                 name label species
#1           brown cat     1  feline
#2            blue cat     2  feline
#3            big lion     3  feline
#4          tall tiger     4  feline
#5       black panther     5  feline
#6           short cat     6  feline
#7            red bird     7   avian
#8  short bird stuffed     8   avian
#9           big eagle     9   avian
#10        bad sparrow    10   avian
#11           dog fish    11  canine
#12           head dog    12  canine
#13       brown yorkie    13  canine
#14  lab short bulldog    14  canine

Answer 2

Answered by testname123

An elegant-ish way of doing this (I say elegant-ish because, while it's the most elegant way I know of, it's not great) is something like:

#Define the regexes at the beginning of the code
regexes <- list(c("(cat|lion|tiger|panther)","feline"),
                c("(bird|eagle|sparrow)","avian"),
                c("(dog|yorkie|bulldog)","canine"))

# ....


#Create a vector, the same length as the df
output_vector <- character(nrow(d))

#For each regex..
for(i in seq_along(regexes)){

    #Grep through d$name, and when you find matches, insert the relevant 'tag' into
    #The output vector
    output_vector[grepl(x = d$name, pattern = regexes[[i]][1])] <- regexes[[i]][2]

} 

#Insert that now-filled output vector into the dataframe
d$species <- output_vector

This method has several advantages:

You only modify the data frame once in the whole process, which speeds things up: R generally copies a data frame when it is modified rather than changing it in place, so modifying it three times can mean recreating it three times.
Because the output vector is created at its final length, which we know in advance, it never needs more memory allocated after it is created, which helps speed further.
Because it's a loop rather than repeated manual calls, adding more patterns and categories to the regexes object requires no further changes to the code. It will run just as it does now.
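
To illustrate the last point, here is a sketch of the same loop wrapped in a function and driven by a named list of plain words, which may be easier to maintain at the top of a script than hand-written regexes. The names `species_terms` and `tag_species` are hypothetical, the small data frame (including "green frog") is made up, and unlike the code above it leaves unmatched names as NA rather than an empty string.

```r
# Hypothetical lookup table kept at the top of the script: one entry per
# species, each holding the words that qualify a name for that species.
species_terms <- list(
  feline = c("cat", "lion", "tiger", "panther"),
  avian  = c("bird", "eagle", "sparrow"),
  canine = c("dog", "yorkie", "bulldog")
)

# Hypothetical helper: returns one label per element of x.
# Unmatched names stay NA; later entries overwrite earlier ones.
tag_species <- function(x, terms) {
  out <- rep(NA_character_, length(x))
  for (label in names(terms)) {
    pattern <- paste(terms[[label]], collapse = "|")
    out[grepl(pattern, x)] <- label
  }
  out
}

# Made-up sample data for the sketch
d <- data.frame(name = c("brown cat", "red bird", "head dog", "green frog"),
                label = 1:4)
d$species <- tag_species(d$name, species_terms)

# Adding a category later only touches the lookup table, not the loop
species_terms$amphibian <- c("frog", "toad")
d$species <- tag_species(d$name, species_terms)
print(d)
```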

The only disadvantage, and I think this applies to most solutions you're likely to get, is that if a name matches multiple patterns, the last pattern in the list that it matches determines its species tag.
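
A short sketch of that behaviour, using made-up names that match more than one pattern. The second loop is one possible variation, not part of the original answer: it fills only entries that are still empty, so the first matching pattern wins instead.

```r
regexes <- list(c("(cat|lion|tiger|panther)", "feline"),
                c("(bird|eagle|sparrow)", "avian"),
                c("(dog|yorkie|bulldog)", "canine"))

# Hypothetical names; the first two match more than one pattern
x <- c("cat dog", "bird dog", "big lion")

# Last match wins: each later pattern overwrites earlier tags
last_wins <- character(length(x))
for (i in seq_along(regexes)) {
  last_wins[grepl(regexes[[i]][1], x)] <- regexes[[i]][2]
}

# First match wins: only fill entries that have no tag yet
first_wins <- character(length(x))
for (i in seq_along(regexes)) {
  hit <- grepl(regexes[[i]][1], x) & first_wins == ""
  first_wins[hit] <- regexes[[i]][2]
}

print(data.frame(name = x, last_wins, first_wins))
```

Either way, the order of the entries in regexes decides ties, so it is worth ordering them deliberately.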
