Extract numeric part of strings of mixed numbers and characters in R
Note: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/15451251/
Asked by user288609
I have a lot of strings, each of which has the following format: Ab_Cd-001234.txt
I want to replace each one with 001234. How can I achieve this in R?
Accepted answer by agstudy
Using gsub or sub you can do this:
gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
[1] "001234"
You can also use gregexpr with regmatches:
m <- gregexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
[[1]]
[1] "001234"
EDIT: Both methods are vectorized and work on a vector of strings.
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
[1] "001234" "001234"
m <- gregexpr('[0-9]+',x)
regmatches(x,m)
[[1]]
[1] "001234"
[[2]]
[1] "001234"
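Because gregexpr finds every match, regmatches returns a list here. When each string holds exactly one run of digits, a plain character vector may be more convenient. The following is a minimal sketch, assuming one number per string; the variable names first_digits and all_digits are only illustrative:

```r
x <- c('Ab_Cd-001234.txt', 'Ab_Cd-001234.txt')

# regexpr() locates only the first match in each string, so
# regmatches() returns a character vector instead of a list
# (strings without any match are dropped from the result)
m <- regexpr('[0-9]+', x)
first_digits <- regmatches(x, m)

# alternatively, flatten the list produced by the gregexpr() approach
all_digits <- unlist(regmatches(x, gregexpr('[0-9]+', x)))
```

If some strings might contain no digits at all, the sub approach is probably safer for keeping the result aligned with the input, since it always returns one element per string.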
Answer by Ben
The stringr package has lots of handy shortcuts for this kind of work:
# input data following @agstudy
data <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
# load library
library(stringr)
# prepare regular expression
regexp <- "[[:digit:]]+"
# process string
str_extract(data, regexp)
Which gives the desired result:
[1] "001234" "001234"
To explain the regexp a little:
[[:digit:]] matches any single digit, 0 to 9
+ means the preceding item (in this case, a digit) is matched one or more times
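To see what the + contributes, the pattern can be tried with and without it. This is a small sketch that uses base R's regexpr and regmatches rather than stringr, so that it runs without extra packages; the variable names are illustrative only:

```r
s <- 'Ab_Cd-001234.txt'

# without '+', the pattern matches a single digit: the first one found
one_digit <- regmatches(s, regexpr('[[:digit:]]', s))

# with '+', the match extends over the whole run of consecutive digits
digit_run <- regmatches(s, regexpr('[[:digit:]]+', s))
```

Here one_digit holds "0" and digit_run holds "001234".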
This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
Answer by Tyler Rinker
You could use genXtract from the qdap package. This takes a left character string and a right character string and extracts the elements between them.
library(qdap)
genXtract("Ab_Cd-001234.txt", "-", ".txt")
Though I much prefer agstudy's answer.
EDIT: Extending the answer to match agstudy's:
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
genXtract(x, "-", ".txt")
# $`- : .txt1`
# [1] "001234"
#
# $`- : .txt2`
# [1] "001234"
Answer by G. Grothendieck
gsub: Remove prefix and suffix:
gsub(".*-|\\.txt$", "", x)
tools package: Use file_path_sans_ext from tools to remove the extension and then use sub to remove the prefix:
library(tools)
sub(".*-", "", file_path_sans_ext(x))
strapplyc: Extract the digits after the - and before the dot. See the gsubfn home page for more info:
library(gsubfn)
strapplyc(x, "-(\\d+)\\.", simplify = TRUE)
Note that if we wanted to return a numeric, we could use strapply rather than strapplyc, like this:
strapply(x, "-(\\d+)\\.", as.numeric, simplify = TRUE)
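The same numeric conversion can be sketched in base R, without gsubfn. This is not part of the original answer, just an illustration under the assumption that every string has digits between a - and a dot; the names digits and nums are made up for the example:

```r
x <- c('Ab_Cd-001234.txt', 'Ab_Cd-001234.txt')

# capture the digits between '-' and the following '.';
# backslashes must be doubled inside an R string literal
digits <- sub('.*-([0-9]+)\\..*', '\\1', x)

# converting to numeric drops the leading zeros
nums <- as.numeric(digits)
```

Keep the character version (digits) if the leading zeros matter, for example when the number is an identifier rather than a quantity.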