string 将列中的逗号分隔字符串拆分为单独的行

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/13773770/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-09 01:44:43  来源:igfitidea点击:

Split comma-separated strings in a column into separate rows

rstringsplitr-faq

提问by RoyalTS

I have a data frame, like so:

我有一个数据框,如下所示:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González I?árritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

As you can see, some entries in the directorcolumn are multiple names separated by commas. I would like to split these entries up into separate rows while maintaining the values of the other column. As an example, the first row in the data frame above should be split into two rows, with a single name each in the directorcolumn and 'A' in the ABcolumn.

如您所见,director列中的某些条目是由逗号分隔的多个名称。我想将这些条目拆分为单独的行,同时保持另一列的值。例如,上面数据框中的第一行应该分成两行,director列中各有一个名称,列中各有一个名称AB

采纳答案by Uwe

This old question frequently is being used as dupe target (tagged with r-faq). As of today, it has been answered three times offering 6 different approaches but is lacking a benchmarkas guidance which of the approaches is the fastest1.

这个老问题经常被用作欺骗目标(用 标记r-faq)。截至今天,它已被回答 3 次,提供 6 种不同的方法,但缺乏基准作为指导,哪种方法最快1

The benchmarked solutions include

基准解决方案包括

Overall 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmarkpackage (see code below).

使用该microbenchmark包对 6 种不同大小的数据帧进行了总共 8 种不同的方法的基准测试(请参阅下面的代码)。

The sample data given by the OP consists only of 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times which give problem sizes of up to 2 million rows.

OP 给出的样本数据仅包含 20 行。为了创建更大的数据框,这 20 行简单地重复 1、10、100、1000、10000 和 100000 次,从而使问题大小高达 200 万行。

Benchmark results

基准测试结果

enter image description here

在此处输入图片说明

The benchmark results show that for sufficiently large data frames all data.tablemethods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.tablemethod 2 and the variant DT3are the fastest, magnitudes faster than the slowest methods.

基准测试结果表明,对于足够大的数据帧,所有data.table方法都比任何其他方法都快。对于超过大约 5000 行的数据帧,Jaap 的data.table方法 2 和变体DT3是最快的,比最慢的方法快几个数量级。

Remarkably, the timings of the two tidyversemethods and the splistackshapesolution are so similar that it's difficult to distiguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.

值得注意的是,这两种tidyverse方法的时间安排和splistackshape解决方案非常相似,以至于很难区分图表中的曲线。它们是所有数据帧大小中最慢的基准方法。

For smaller data frames, Matt's base R solution and data.tablemethod 4 seem to have less overhead than the other methods.

对于较小的数据帧,Matt 的基本 R 解决方案和data.table方法 4 似乎比其他方法具有更少的开销。

Code

代码

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González I?árritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

Define function for benchmark runs of problem size n

为问题规模的基准运行定义函数 n

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

Run benchmark for different problem sizes

为不同的问题规模运行基准测试

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

Prepare data for plotting

准备绘图数据

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

Create chart

创建图表

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

Session info & package versions (excerpt)

会话信息和软件包版本(摘录)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)


1My curiosity was piqued by this exuberant commentBrilliant! Orders of magnitude faster!to a tidyverseanswer of a questionwhich was closed as a duplicate of this question.

1这篇精彩的评论激起了我的好奇心太棒了!速度快几个数量级!对作为此问题的副本而关闭的问题的答案。tidyverse

回答by Jaap

Several alternatives:

几种选择:

1) two ways with data.table:

1)两种方式data.table

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2) a dplyr/tidyrcombination:Alternatively, you can also use the dplyr/ tidyrcombination:

2)一个dplyr/tidyr组合:或者,你也可以使用dplyr/tidyr组合:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3) with tidyronly:With tidyr 0.5.0(and later), you can also just use separate_rows:

3) with tidyronly:with tidyr 0.5.0(and later), 你也可以只使用separate_rows

separate_rows(v, director, sep = ",")

You can use the convert = TRUEparameter to automatically convert numbers into numeric columns.

您可以使用该convert = TRUE参数自动将数字转换为数字列。

4) with base R:

4) 以 R 为基数:

# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))

# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))

回答by Matthew Lundberg

Naming your original data.frame v, we have this:

命名您的原始 data.frame v,我们有这个:

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González I?árritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

Note the use of repto build the new AB column. Here, sapplyreturns the number of names in each of the original rows.

请注意使用rep来构建新的 AB 列。在这里,sapply返回每个原始行中的名称数。

回答by A5C1D2H2I1M1N2O1R2T1

Late to the party, but another generalized alternative is to use cSplitfrom my "splitstackshape" package that has a directionargument. Set this to "long"to get the result you specify:

迟到了,但另一个通用的替代方法是使用cSplit我的“splitstackshape”包,它有一个direction参数。将此设置"long"为获得您指定的结果:

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B

回答by zhang jing

devtools::install_github("yikeshu0611/onetree")

library(onetree)

dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")

head(dd)
            director AB
1       Aaron Blaise  A
2         Bob Walker  A
3     Akira Kurosawa  B
4     Alan J. Pakula  A
5        Alan Parker  A
6 Alejandro Amenabar  B