Split comma-separated strings in a column into separate rows

Question

Asked by RoyalTS

I have a data frame, like so:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"),
           AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

As you can see, some entries in the director column are multiple names separated by commas. I would like to split these entries up into separate rows while maintaining the values of the other column. As an example, the first row in the data frame above should be split into two rows, with a single name each in the director column and 'A' in the AB column.

Answer 1

Accepted answer by Uwe

This old question is frequently used as a dupe target (it is tagged r-faq). As of today, it has been answered three times, offering six different approaches, but it lacks a benchmark to indicate which of the approaches is the fastest¹.

The benchmarked solutions include

Matthew Lundberg's base R approach, modified according to Rich Scriven's comment,
Jaap's two data.table methods and two dplyr / tidyr approaches,
Ananda's splitstackshape solution,
and two additional variants of Jaap's data.table methods.

Overall, 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).

The sample data given by the OP consists of only 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times, which gives problem sizes of up to 2 million rows.
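
The scaling step can be sketched in a few lines of base R. This is only an illustration: the two-row vectors are a shortened stand-in for the 20-row sample, and the value chosen for the repetition factor n is arbitrary.

```r
# Shortened stand-in for the OP's 20-row sample (assumption: 2 rows only)
director <- c("Aaron Blaise,Bob Walker", "Akira Kurosawa")
AB <- c("A", "B")

# Repeat the sample n times to get a larger data frame
n <- 1000L
DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
nrow(DF)  # 2 rows repeated 1000 times

# With the full 20-row sample, the six repetition factors give these sizes
sizes <- 20 * 10^(0:5)
format(sizes, scientific = FALSE, trim = TRUE)
```

The largest entry of sizes is the 2 million rows mentioned above.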

Benchmark results

The benchmark results show that for sufficiently large data frames all data.table methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, orders of magnitude faster than the slowest methods.

Remarkably, the timings of the two tidyverse methods and the splitstackshape solution are so similar that it is difficult to distinguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.

Code

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

Define function for benchmark runs of problem size `n`

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

Run benchmark for different problem sizes

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

Prepare data for plotting

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

Create chart

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

Session info & package versions (excerpt)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

¹ My curiosity was piqued by this exuberant comment, "Brilliant! Orders of magnitude faster!", on a tidyverse answer to a question that was closed as a duplicate of this question.

Answer 2

Answered by Jaap

Several alternatives:

1) two ways with data.table:

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2) a dplyr / tidyr combination:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3) with tidyr only: with tidyr 0.5.0 (and later), you can also just use separate_rows:

separate_rows(v, director, sep = ",")

You can use the convert = TRUE parameter to automatically convert numbers into numeric columns.
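
A minimal sketch of what convert = TRUE changes, assuming tidyr is installed. The data frame scores_df and its columns are hypothetical and not part of the question's data; they only illustrate a column of comma-separated numbers stored as text.

```r
library(tidyr)

# Hypothetical data: 'scores' holds comma-separated numbers stored as text
scores_df <- data.frame(id = c("a", "b"),
                        scores = c("1,2", "3"),
                        stringsAsFactors = FALSE)

# Without convert, the split values stay character
as_text <- separate_rows(scores_df, scores, sep = ",")
class(as_text$scores)

# With convert = TRUE, the split column is type-converted to a numeric type
as_num <- separate_rows(scores_df, scores, sep = ",", convert = TRUE)
class(as_num$scores)
```

For the director column in the question, conversion has no effect because the values are names, not numbers.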

4) with base R:

# the data frame is called 'v' here, as in the methods above
# if 'director' is a character column:
stack(setNames(strsplit(v$director, ','), v$AB))

# if 'director' is a factor column:
stack(setNames(strsplit(as.character(v$director), ','), v$AB))
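
One thing to watch with the base R stack() variant: the result carries the default column names values and ind, and ind is a factor. A minimal sketch of restoring the original names, using a two-row stand-in for the data (the variable name res is only for illustration):

```r
# Two-row stand-in for the question's data (assumption for illustration)
v <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa"),
                AB = c("A", "B"),
                stringsAsFactors = FALSE)

res <- stack(setNames(strsplit(v$director, ",", fixed = TRUE), v$AB))
names(res)   # default names given by stack(): "values" "ind"

# Restore the original column names and turn the factor back into character
names(res) <- c("director", "AB")
res$AB <- as.character(res$AB)
res
```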

Answer 3

Answered by Matthew Lundberg

Naming your original data.frame v, we have this:

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González Iñárritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

Note the use of rep to build the new AB column. Here, sapply returns the number of names in each of the original rows.
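
To make the role of rep and sapply concrete, here is a small sketch on a two-row stand-in for the data (an assumption for illustration; the variable name counts is not from the answer). It also shows lengths(), the shortcut used in the benchmarked variant of this approach.

```r
# Two-row stand-in for the question's data (assumption for illustration)
v <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa"),
                AB = c("A", "B"),
                stringsAsFactors = FALSE)

s <- strsplit(v$director, ",", fixed = TRUE)

# Number of names in each original row
counts <- sapply(s, FUN = length)
counts                          # 2 1

# lengths(s) gives the same counts without the sapply call (R >= 3.2.0)
identical(counts, lengths(s))   # TRUE

# Repeat each AB value once per name in its row
rep(v$AB, counts)               # "A" "A" "B"
```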

Answer 4

Answered by A5C1D2H2I1M1N2O1R2T1

Late to the party, but another generalized alternative is to use cSplit from my "splitstackshape" package, which has a direction argument. Set this to "long" to get the result you specify:

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B

Answer 5

Answered by zhang jing

devtools::install_github("yikeshu0611/onetree")

library(onetree)

dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")

head(dd)
            director AB
1       Aaron Blaise  A
2         Bob Walker  A
3     Akira Kurosawa  B
4     Alan J. Pakula  A
5        Alan Parker  A
6 Alejandro Amenabar  B
