string: Split a data frame string column into multiple columns

Question

Asked by jkebinger

I'd like to take data of the form

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
  attr          type
1    1   foo_and_bar
2   30 foo_and_bar_2
3    4   foo_and_bar
4    6 foo_and_bar_2

and use split() on the column "type" from above to get something like this:

  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

I came up with something unbelievably complex involving some form of apply that worked, but I've since misplaced it. It seemed far too complicated to be the best way. I can use strsplit as below, but then it's unclear how to get the result back into 2 columns in the data frame.

> strsplit(as.character(before$type),'_and_')
[[1]]
[1] "foo" "bar"

[[2]]
[1] "foo"   "bar_2"

[[3]]
[1] "foo" "bar"

[[4]]
[1] "foo"   "bar_2"

Thanks for any pointers. I've not quite grokked R lists just yet.

Answer 1

Answered by hadley

Use stringr::str_split_fixed:

library(stringr)
str_split_fixed(before$type, "_and_", 2)
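
str_split_fixed returns a character matrix rather than new data frame columns. A minimal sketch of attaching the pieces to the data frame follows. It assumes the column names type_1 and type_2 from the question and sets stringsAsFactors = FALSE so that type is a character vector; the name after is only illustrative.

```r
library(stringr)

before <- data.frame(attr = c(1, 30, 4, 6),
                     type = c("foo_and_bar", "foo_and_bar_2"),
                     stringsAsFactors = FALSE)

# str_split_fixed() returns a character matrix with one column per piece
parts <- str_split_fixed(before$type, "_and_", 2)

# Build the result from attr plus the two matrix columns
after <- data.frame(attr = before$attr,
                    type_1 = parts[, 1],
                    type_2 = parts[, 2],
                    stringsAsFactors = FALSE)
```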

Answer 2

Answered by hadley

Another option is to use the new tidyr package.

library(dplyr)
library(tidyr)

before <- data.frame(
  attr = c(1, 30 ,4 ,6 ), 
  type = c('foo_and_bar', 'foo_and_bar_2')
)

before %>%
  separate(type, c("foo", "bar"), "_and_")

##   attr foo   bar
## 1    1 foo   bar
## 2   30 foo bar_2
## 3    4 foo   bar
## 4    6 foo bar_2

Answer 3

Answered by David Arenburg

5 years later, adding the obligatory data.table solution:

library(data.table) ## v 1.9.6+ 
setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_")]
before
#    attr          type type1 type2
# 1:    1   foo_and_bar   foo   bar
# 2:   30 foo_and_bar_2   foo bar_2
# 3:    4   foo_and_bar   foo   bar
# 4:    6 foo_and_bar_2   foo bar_2

We could also make sure that the resulting columns have the correct types and improve performance by adding the type.convert and fixed arguments (since "_and_" isn't really a regex):

setDT(before)[, paste0("type", 1:2) := tstrsplit(type, "_and_", type.convert = TRUE, fixed = TRUE)]
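
To illustrate what those two arguments are about, here is a small base R sketch. It uses strsplit and utils::type.convert as stand-ins rather than tstrsplit itself, and the inputs are made up for the illustration: a delimiter that happens to be a regex metacharacter, and string pieces that look like numbers.

```r
# A delimiter containing a regex metacharacter: "." matches any character
x <- "a.b"

regex_split <- strsplit(x, ".")[[1]]                # "." treated as a regex
fixed_split <- strsplit(x, ".", fixed = TRUE)[[1]]  # "." treated literally

# type.convert() turns character data into a more specific type where possible
pieces <- c("1", "30", "4", "6")
converted <- utils::type.convert(pieces, as.is = TRUE)
```

With "_and_" both matching modes find the same split points, so fixed = TRUE should only skip the regex machinery there; with a delimiter such as "." the two results differ.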

Answer 4

Answered by Aniko

Yet another approach: use rbind on out:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))  
out <- strsplit(as.character(before$type),'_and_') 
do.call(rbind, out)

     [,1]  [,2]   
[1,] "foo" "bar"  
[2,] "foo" "bar_2"
[3,] "foo" "bar"  
[4,] "foo" "bar_2"

And to combine:

data.frame(before$attr, do.call(rbind, out))
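
One caveat: do.call(rbind, out) yields a clean matrix only when every element of out has the same number of pieces; with unequal lengths, rbind recycles the shorter vectors. A possible sketch for that case, assuming it is acceptable to pad short rows with NA (the input vector here is made up for the illustration):

```r
type <- c("foo_and_bar", "foo_and_bar_2_and_bar_3")
out <- strsplit(type, "_and_", fixed = TRUE)

# Widest row determines the number of columns
n <- max(lengths(out))

# Extending a vector with `length<-` fills the new slots with NA
padded <- lapply(out, function(p) {
  length(p) <- n
  p
})

mat <- do.call(rbind, padded)
```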

Answer 5

Answered by IRTFM

Notice that sapply with "[" can be used to extract either the first or the second item from each element of those lists, so:

before$type_1 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 1)
before$type_2 <- sapply(strsplit(as.character(before$type),'_and_'), "[", 2)
before$type <- NULL

And here's a gsub method:

before$type_1 <- gsub("_and_.+$", "", before$type)
before$type_2 <- gsub("^.+_and_", "", before$type)
before$type <- NULL
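
The gsub patterns rely on greedy matching, which matters if a value contains "_and_" more than once. A short sketch of the difference, using a made-up value with two delimiters and, as an assumed alternative, a lazy quantifier with perl = TRUE to split at the first delimiter only:

```r
x <- "foo_and_bar_2_and_bar_3"

# Patterns from the answer: the greedy ".+" runs to the last "_and_"
first_greedy  <- gsub("_and_.+$", "", x)   # everything before the first "_and_"
second_greedy <- gsub("^.+_and_", "", x)   # everything after the last "_and_"

# Lazy variant: keep everything after the first "_and_"
second_lazy <- sub("^.*?_and_", "", x, perl = TRUE)
```

For the data in the question, where each value has exactly one "_and_", the greedy and lazy versions agree.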

Answer 6

Answered by Ramnath

Here is a one-liner along the same lines as Aniko's solution, but using hadley's stringr package:

library(stringr)
do.call(rbind, str_split(before$type, '_and_'))

Answer 7

Answered by A5C1D2H2I1M1N2O1R2T1

To add to the options, you could also use my splitstackshape::cSplit function like this:

library(splitstackshape)
cSplit(before, "type", "_and_")
#    attr type_1 type_2
# 1:    1    foo    bar
# 2:   30    foo  bar_2
# 3:    4    foo    bar
# 4:    6    foo  bar_2

Answer 8

Answered by Gavin Simpson

An easy way is to use sapply() and the [ function:

before <- data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2'))
out <- strsplit(as.character(before$type),'_and_')

For example:

> data.frame(t(sapply(out, `[`)))
   X1    X2
1 foo   bar
2 foo bar_2
3 foo   bar
4 foo bar_2

sapply()'s result is a matrix that needs transposing and converting back to a data frame. A few simple manipulations then yield the result you wanted:

after <- with(before, data.frame(attr = attr))
after <- cbind(after, data.frame(t(sapply(out, `[`))))
names(after)[2:3] <- paste("type", 1:2, sep = "_")

At this point, after is what you wanted:

> after
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2
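
The need for t() comes from how the list is simplified: each list element becomes one column of the matrix. A small sketch of the shapes involved, using vapply as an assumed stricter stand-in for sapply (it fails if an element does not have exactly two pieces):

```r
type <- c("foo_and_bar", "foo_and_bar_2", "foo_and_bar", "foo_and_bar_2")
out <- strsplit(type, "_and_", fixed = TRUE)

# Each list element becomes one column, so the pieces run down the rows
m <- vapply(out, function(p) p, character(2))
dim(m)  # 2 rows (pieces) by 4 columns (input rows)

# Transposing puts one input row per matrix row
pieces <- t(m)
```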

Answer 9

Answered by Yannis P.

The subject is almost exhausted, but I'd like to offer a solution to a slightly more general version where you don't know the number of output columns a priori. For example, you have

before = data.frame(attr = c(1,30,4,6), type=c('foo_and_bar','foo_and_bar_2', 'foo_and_bar_2_and_bar_3', 'foo_and_bar'))
  attr                    type
1    1             foo_and_bar
2   30           foo_and_bar_2
3    4 foo_and_bar_2_and_bar_3
4    6             foo_and_bar

We can't use tidyr's separate() because we don't know the number of result columns before the split, so I created a function that uses stringr to split a column, given the pattern and a name prefix for the generated columns. I hope the coding patterns used are correct.

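library(stringr)  # str_split_fixed()
library(dplyr)    # %>%, bind_cols(), select(), as_tibble()
library(tidyr)    # gather()

split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Replace the ""s produced by right-padding the matrix with NAs, which are more useful
  cols[which(cols == "")] <- NA
  # Name the columns 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m',
  # where m = number of columns of 'cols'
  m <- dim(cols)[2]
  colnames(cols) <- paste(into_prefix, 1:m, sep = "_")
  # as_tibble() replaces the deprecated as.tibble(); naming the matrix first
  # avoids the warning about missing column names
  cols <- as_tibble(cols)
  return(cols)
}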

We can then use split_into_multiple in a dplyr pipe as follows:

after <- before %>% 
  bind_cols(split_into_multiple(.$type, "_and_", "type")) %>% 
  # selecting those that start with 'type_' will remove the original 'type' column
  select(attr, starts_with("type_"))

> after
  attr type_1 type_2 type_3
1    1    foo    bar   <NA>
2   30    foo  bar_2   <NA>
3    4    foo  bar_2  bar_3
4    6    foo    bar   <NA>

And then we can use gather to tidy up...

after %>% 
  gather(key, val, -attr, na.rm = TRUE)

   attr    key   val
1     1 type_1   foo
2    30 type_1   foo
3     4 type_1   foo
4     6 type_1   foo
5     1 type_2   bar
6    30 type_2 bar_2
7     4 type_2 bar_2
8     6 type_2   bar
11    4 type_3 bar_3
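
In tidyr 1.0.0 and later, gather has been superseded by pivot_longer. A possible equivalent is sketched below, assuming a reasonably recent tidyr; the after data frame is rebuilt by hand here so the snippet stands alone, and the name long is only illustrative. The rows come out grouped by attr rather than by key, so the order differs from the gather output above.

```r
library(tidyr)

after <- data.frame(attr = c(1, 30, 4, 6),
                    type_1 = c("foo", "foo", "foo", "foo"),
                    type_2 = c("bar", "bar_2", "bar_2", "bar"),
                    type_3 = c(NA, NA, "bar_3", NA),
                    stringsAsFactors = FALSE)

# One row per (attr, key) pair; values_drop_na drops the padded NA cells
long <- pivot_longer(after,
                     cols = starts_with("type_"),
                     names_to = "key",
                     values_to = "val",
                     values_drop_na = TRUE)
```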

Answer 10

Answered by lmo

Here is a base R one-liner that overlaps a number of previous solutions, but returns a data.frame with the proper names.

out <- setNames(data.frame(before$attr,
                  do.call(rbind, strsplit(as.character(before$type),
                                          split="_and_"))),
                  c("attr", paste0("type_", 1:2)))
out
  attr type_1 type_2
1    1    foo    bar
2   30    foo  bar_2
3    4    foo    bar
4    6    foo  bar_2

It uses strsplit to break up the variable, and data.frame with do.call/rbind to put the data back into a data.frame. The additional incremental improvement is the use of setNames to add variable names to the data.frame.
