Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original question on StackOverflow: http://stackoverflow.com/questions/11619616/

Date: 2020-09-09 01:32:52  Source: igfitidea

How to split a string into substrings of a given length?

Tags: string, r, split

Asked by MadSeb

I have a string such as:

"aabbccccdd"

I want to break this string into a vector of substrings of length 2:

"aa" "bb" "cc" "cc" "dd"

Answered by GSee

Here is one way:

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2))
#[1] "aa" "bb" "cc" "cc" "dd"

Or, more generally:

text <- "aabbccccdd"
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
#[1] "aa" "bb" "cc" "cc" "dd"
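
The same idea can be wrapped in a small helper that accepts any piece length. This is only a sketch and not part of the original answer: the name split_fixed and its parameter n are made up for illustration. It relies on substring() being vectorized over its start and end positions, and on end positions past the end of the string being truncated, so a short final piece is kept rather than dropped.

```r
# Hypothetical helper (not from the original answer):
# split `text` into pieces of `n` characters, keeping a short last piece.
split_fixed <- function(text, n) {
  len <- nchar(text)
  if (len == 0) return(character(0))   # seq(1, 0, by = n) would be an error
  starts <- seq(1, len, by = n)        # 1, 1 + n, 1 + 2n, ...
  substring(text, starts, starts + n - 1)
}

split_fixed("aabbccccdd", 2)
#[1] "aa" "bb" "cc" "cc" "dd"
split_fixed("abc", 2)
#[1] "ab" "c"
```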

Edit: this is much, much faster:

sst <- strsplit(text, "")[[1]]
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

It first splits the string into individual characters. Then it pastes each odd-positioned character together with the even-positioned character that follows it.
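
To make the indexing step concrete, here are the intermediate values for a short input. This is an illustrative sketch; the variable names chars, odd and even are not from the original answer. The logical vectors c(TRUE, FALSE) and c(FALSE, TRUE) are recycled along the character vector, which is what selects every other element.

```r
chars <- strsplit("abcdef", "")[[1]]   # "a" "b" "c" "d" "e" "f"

odd  <- chars[c(TRUE, FALSE)]   # positions 1, 3, 5
even <- chars[c(FALSE, TRUE)]   # positions 2, 4, 6

odd
#[1] "a" "c" "e"
even
#[1] "b" "d" "f"
paste0(odd, even)
#[1] "ab" "cd" "ef"
```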

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="")
g1 <- function(text) {
    substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2))
}
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}
identical(g1(text), g2(text))
#[1] TRUE
library(rbenchmark)
benchmark(g1=g1(text), g2=g2(text))
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0

Answered by Sven Hohenstein

There are two easy possibilities:

s <- "aabbccccdd"
  1. gregexpr and regmatches:

    regmatches(s, gregexpr(".{2}", s))[[1]]
    # [1] "aa" "bb" "cc" "cc" "dd"
    
  2. strsplit:

    strsplit(s, "(?<=.{2})", perl = TRUE)[[1]]
    # [1] "aa" "bb" "cc" "cc" "dd"
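
The pattern ".{2}" only matches complete pairs, so a trailing single character is not returned for an input of odd length. A possible variation, shown here as a sketch rather than as part of the original answer, is the quantifier {1,2}, which is greedy and therefore still prefers pairs but also accepts a final single character. The variable name odd_s is made up for illustration.

```r
odd_s <- "abc"   # illustrative input with an odd number of characters

# ".{2}" drops the leftover character
regmatches(odd_s, gregexpr(".{2}", odd_s))[[1]]
# [1] "ab"

# ".{1,2}" keeps it as a short final piece
regmatches(odd_s, gregexpr(".{1,2}", odd_s))[[1]]
# [1] "ab" "c"
```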
    
Answered by mindless.panda

string <- "aabbccccdd"
# total length of string
num.chars <- nchar(string)

# the indices where each substr will start
starts <- seq(1,num.chars, by=2)

# chop it up
sapply(starts, function(ii) {
  substr(string, ii, ii+1)
})

Which gives:

[1] "aa" "bb" "cc" "cc" "dd"
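
Because substr() truncates at the end of the string, this approach also copes with an odd number of characters. The sketch below is a variation, not the original answer's code: it swaps sapply for vapply so that the result is guaranteed to be a character vector, and the input value is chosen only for illustration.

```r
string <- "abc"                              # odd length, for illustration
starts <- seq(1, nchar(string), by = 2)      # 1, 3

# vapply checks that every piece is a single character string
vapply(starts, function(ii) substr(string, ii, ii + 1), character(1))
#[1] "ab" "c"
```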

Answered by Matthew Lundberg

One can use a matrix to group the characters:

s2 <- function(x) {
  m <- matrix(strsplit(x, '')[[1]], nrow=2)
  apply(m, 2, paste, collapse='')
}

s2('aabbccddeeff')
## [1] "aa" "bb" "cc" "dd" "ee" "ff"

Unfortunately, this breaks for an input of odd string length, giving a warning:

s2('abc')
## [1] "ab" "ca"
## Warning message:
## In matrix(strsplit(x, "")[[1]], nrow = 2) :
##   data length [3] is not a sub-multiple or multiple of the number of rows [2]

More unfortunate is that g1 and g2 from @GSee silently return incorrect results for an input of odd string length:

g1('abc')
## [1] "ab"

g2('abc')
## [1] "ab" "cb"

Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary:

s <- function(x, n) {
  sst <- strsplit(x, '')[[1]]
  m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n)
  m[seq_along(sst)] <- sst
  apply(m, 2, paste, collapse='')
}

s('hello world', 2)
## [1] "he" "ll" "o " "wo" "rl" "d" 
s('hello world', 3)
## [1] "hel" "lo " "wor" "ld" 

(It is indeed slower than g2, but faster than g1 by about a factor of 7.)
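
The grouping can also be expressed without filling a matrix, by computing a group number for each character and splitting on it. This is a sketch of an alternative, not the answer's method, and no claim is made about its speed; the function name chunk is made up for illustration.

```r
# Hypothetical alternative: label each character with its group number,
# then paste the characters of each group together.
chunk <- function(x, n) {
  sst <- strsplit(x, "")[[1]]
  groups <- ceiling(seq_along(sst) / n)   # 1,1,...,2,2,... (n per group)
  unname(vapply(split(sst, groups), paste, character(1), collapse = ""))
}

chunk("hello world", 2)
#[1] "he" "ll" "o " "wo" "rl" "d"
chunk("hello world", 3)
#[1] "hel" "lo " "wor" "ld"
```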

Answered by den2042

Ugly, but it works:

sequenceString <- "ATGAATAAAG"

J <- 3  # maximum sequence length in file
sequenceSmallVecStart <-
  substring(sequenceString, seq(1, nchar(sequenceString)-J+1, J), 
    seq(J,nchar(sequenceString), J))
sequenceSmallVecEnd <-
    substring(sequenceString, max(seq(J, nchar(sequenceString), J))+1)
sequenceSmallVec <-
    c(sequenceSmallVecStart,sequenceSmallVecEnd)
cat(sequenceSmallVec,sep = "\n")

This prints ATG, AAT, AAA and G, each on its own line.