Note: this page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/13043928/

Time: 2020-09-09 01:39:53  Source: igfitidea

Selecting rows where a column has a string like 'hsa..' (partial string match)

Asked by Asda

I have a 371MB text file containing micro RNA data. Essentially, I would like to only select those rows that have information about human microRNA.

I have read in the file using read.table. Usually, I'd accomplish what I want with sqldf - if it had a 'like' syntax (select * from <> where miRNA like 'hsa'). Unfortunately, sqldf does not support that syntax.

How can I do this in R? I have looked around stackoverflow and do not see an example of how I can do a partial string match. I even installed the stringr package - but it does not quite have what I need.

What I would like to do is something like this, where all rows matching hsa-* are selected.

selectedRows <- conservedData[, conservedData$miRNA %like% "hsa-"]

which of course, is not correct syntax.

Can somebody please help me with this? Thanks a lot for reading.

Asda

Answered by A5C1D2H2I1M1N2O1R2T1

I notice that you mention a function %like% in your current approach. I don't know if that's a reference to the %like% from "data.table", but if it is, you can definitely use it as follows.

Note that the object does not have to be a data.table (but also remember that subsetting approaches for data.frames and data.tables are not identical):

library(data.table)
mtcars[rownames(mtcars) %like% "Merc", ]
iris[iris$Species %like% "osa", ]

If that is what you had, then perhaps you had just mixed up row and column positions for subsetting data.

If you don't want to load a package, you can try using grep() to search for the string you're matching. Here's an example with the mtcars dataset, where we are matching all rows where the row name includes "Merc":

mtcars[grep("Merc", rownames(mtcars)), ]
#              mpg cyl  disp  hp drat   wt qsec vs am gear carb
# Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
# Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
# Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
# Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
# Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
# Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
# Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3

And, another example, using the iris dataset, searching for the string osa:

irisSubset <- iris[grep("osa", iris$Species), ]
head(irisSubset)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

For your problem try:

selectedRows <- conservedData[grep("hsa-", conservedData$miRNA), ]
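
As a minimal sketch of the same idea on made-up data (the rows and the score column below are invented for illustration, not taken from the asker's file), the following assumes the IDs of interest begin with hsa-. Plain "hsa-" matches anywhere in the string, so anchoring with "^" or using base R's startsWith() may be closer to what is wanted:

```r
# Hypothetical stand-in for conservedData; the real file has more rows and columns.
conservedData <- data.frame(
  miRNA = c("hsa-miR-21", "mmu-miR-21", "hsa-let-7a", "rno-miR-1", "dre-hsa-like"),
  score = c(0.9, 0.8, 0.7, 0.6, 0.5),
  stringsAsFactors = FALSE
)

# grepl() returns one TRUE/FALSE per row, so it belongs in the row position
# (before the comma). "^hsa-" anchors the match to the start of the string.
selectedRows <- conservedData[grepl("^hsa-", conservedData$miRNA), ]

# startsWith() performs the same prefix test without regular expressions.
# It needs a character vector, hence stringsAsFactors = FALSE above.
samePrefix <- conservedData[startsWith(conservedData$miRNA, "hsa-"), ]

print(selectedRows)
```

Here the row "dre-hsa-like" is excluded because "hsa-" does not appear at the start; an unanchored grep("hsa-", ...) would keep it.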

Answered by Sam Firke

Try str_detect() from the stringr package, which detects the presence or absence of a pattern in a string.

Here is an approach that also incorporates the %>% pipe and filter() from the dplyr package:

library(stringr)
library(dplyr)

CO2 %>%
  filter(str_detect(Treatment, "non"))

   Plant        Type  Treatment conc uptake
1    Qn1      Quebec nonchilled   95   16.0
2    Qn1      Quebec nonchilled  175   30.4
3    Qn1      Quebec nonchilled  250   34.8
4    Qn1      Quebec nonchilled  350   37.2
5    Qn1      Quebec nonchilled  500   35.3
...

This filters the sample CO2 data set (that comes with R) for rows where the Treatment variable contains the substring "non". You can adjust whether str_detect finds fixed matches or uses a regex - see the documentation for the stringr package.
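
To illustrate the fixed-versus-regex choice, here is a small sketch on an invented data frame (conservedData and its values are placeholders, not the asker's data). By default str_detect() treats the pattern as a regular expression; wrapping it in stringr's fixed() compares it literally:

```r
library(stringr)
library(dplyr)

# Hypothetical sample data for illustration only.
conservedData <- data.frame(
  miRNA = c("hsa-miR-21", "mmu-miR-21", "hsa-let-7a", "x.hsa-1"),
  stringsAsFactors = FALSE
)

# Default behaviour: the pattern is a regex, so "^" anchors to the start.
byRegex <- conservedData %>%
  filter(str_detect(miRNA, "^hsa-"))

# fixed(): the pattern is matched literally, anywhere in the string.
byFixed <- conservedData %>%
  filter(str_detect(miRNA, fixed("hsa-")))
```

With this data, byRegex keeps only the two IDs that start with hsa-, while byFixed also keeps "x.hsa-1" because the literal text appears inside it.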

Answered by user1609452

LIKE should work in sqlite:

require(sqldf)
df <- data.frame(name = c('bob','robert','peter'),id=c(1,2,3))
sqldf("select * from df where name LIKE '%er%'")
    name id
1 robert  2
2  peter  3
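
Applied to the question's case, a sketch might look like the following. The data frame is invented for illustration, and it assumes sqldf is using its default SQLite backend. Note that LIKE without a % wildcard only matches the whole value, which may be why the query in the question (like 'hsa') appeared not to work:

```r
library(sqldf)

# Hypothetical stand-in for the asker's conservedData.
conservedData <- data.frame(
  miRNA = c("hsa-miR-21", "mmu-miR-21", "hsa-let-7a"),
  score = c(0.9, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# 'hsa-%' means: starts with "hsa-", followed by anything.
selectedRows <- sqldf("select * from conservedData where miRNA like 'hsa-%'")
print(selectedRows)
```

Use '%hsa-%' instead if the text may appear anywhere in the value rather than only at the start.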