Original source: http://stackoverflow.com/questions/13773770/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Split comma-separated strings in a column into separate rows
Asked by RoyalTS
I have a data frame, like so:
data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa",
"Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))
As you can see, some entries in the director column are multiple names separated by commas. I would like to split these entries up into separate rows while maintaining the values of the other column. As an example, the first row in the data frame above should be split into two rows, with a single name each in the director column and 'A' in the AB column.
Accepted answer by Uwe
This old question is frequently used as a dupe target (tagged with r-faq). As of today, it has been answered three times, offering 6 different approaches, but it lacks a benchmark as guidance on which of the approaches is the fastest [1].
The benchmarked solutions include
- Matthew Lundberg's base R approach, but modified according to Rich Scriven's comment,
- Jaap's two data.table methods and two dplyr / tidyr approaches,
- Ananda's splitstackshape solution,
- and two additional variants of Jaap's data.table methods.
Overall, 8 different methods were benchmarked on 6 different sizes of data frames using the microbenchmark package (see code below).
The sample data given by the OP consists of only 20 rows. To create larger data frames, these 20 rows are simply repeated 1, 10, 100, 1000, 10000, and 100000 times, which gives problem sizes of up to 2 million rows.
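As a rough illustration of those problem sizes (a base R sketch; the name n_rows is chosen here for illustration and does not appear in the benchmark code), the row counts follow from multiplying the 20 sample rows by each repetition factor:

```r
# repetition factors used to scale the 20-row sample
n_rep <- 10^(0:5)

# resulting number of rows in each benchmark data frame
n_rows <- 20 * n_rep
print(format(n_rows, scientific = FALSE, big.mark = ","))
```

The largest entry corresponds to the 2 million rows mentioned above.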
Benchmark results
The benchmark results show that for sufficiently large data frames all data.table methods are faster than any other method. For data frames with more than about 5000 rows, Jaap's data.table method 2 and the variant DT3 are the fastest, orders of magnitude faster than the slowest methods.
Remarkably, the timings of the two tidyverse methods and the splitstackshape solution are so similar that it's difficult to distinguish the curves in the chart. They are the slowest of the benchmarked methods across all data frame sizes.
For smaller data frames, Matt's base R solution and data.table method 4 seem to have less overhead than the other methods.
Code
director <-
c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula",
"Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu",
"Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu",
"Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock",
"Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik",
"Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson",
"Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A",
"A", "B", "B", "B", "B", "B", "B", "A")
library(data.table)
library(magrittr)
Define function for benchmark runs of problem size n
run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L))
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>%
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))),
         by = .(AB)]},
    times = mb_times
  )
}
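The number of repetitions per benchmark, mb_times, shrinks as the problem size grows, because scales::squish() clamps 10000 / n into the range 3 to 100. A minimal base R sketch of the same clamping, using pmin() and pmax() as a stand-in for squish() (the helper name clamp_times is hypothetical and not part of the original code):

```r
# clamp 10000 / n into [lower, upper]; a base R stand-in for scales::squish()
clamp_times <- function(n, lower = 3, upper = 100) {
  pmin(pmax(10000 / n, lower), upper)
}

# repetition counts for the six problem sizes used in the benchmark
times <- clamp_times(10^(0:5))
print(times)
```

So the small problems run 100 times each, while the largest ones run only 3 times, which keeps the total run time manageable.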
Run benchmark for different problem sizes
# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)
Prepare data for plotting
mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]
Create chart
library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) +
geom_point() + geom_smooth(se = FALSE) +
scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() +
xlab("number of rows") + ylab("median of execution time [ms]") +
ggtitle("microbenchmark results") + theme_bw()
Session info & package versions (excerpt)
devtools::session_info()
#Session info
# version R version 3.3.2 (2016-10-31)
# system x86_64, mingw32
#Packages
# data.table * 1.10.4 2017-02-01 CRAN (R 3.3.2)
# dplyr 0.5.0 2016-06-24 CRAN (R 3.3.1)
# forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
# ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
# magrittr * 1.5 2014-11-22 CRAN (R 3.3.0)
# microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
# splitstackshape 1.4.2 2014-10-23 CRAN (R 3.3.3)
# tidyr 0.6.1 2017-01-10 CRAN (R 3.3.2)
[1] My curiosity was piqued by this exuberant comment, "Brilliant! Orders of magnitude faster!", on a tidyverse answer to a question which was closed as a duplicate of this question.
Answer by Jaap
Several alternatives:
1) two ways with data.table:
library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
][,.(director = V1, AB)]
2) a dplyr / tidyr combination:
library(dplyr)
library(tidyr)
v %>%
mutate(director = strsplit(as.character(director), ",")) %>%
unnest(director)
3) with tidyr only: with tidyr 0.5.0 (and later), you can also just use separate_rows:
separate_rows(v, director, sep = ",")
You can use the convert = TRUE parameter to automatically convert numbers into numeric columns.
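A small sketch of what convert = TRUE does, assuming a hypothetical toy data frame named scores (not part of the original question) whose comma-separated values are numbers stored as strings:

```r
library(tidyr)

# hypothetical toy data: comma-separated numbers stored as strings
scores <- data.frame(id = c("a", "b"), value = c("1,2", "3"),
                     stringsAsFactors = FALSE)

# without convert the split column stays character;
# with convert = TRUE it is type-converted after splitting
out <- separate_rows(scores, value, sep = ",", convert = TRUE)
str(out)
```

For the director column in the question this makes no difference, since names stay character either way.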
4) with base R:
# if 'director' is a character-column:
stack(setNames(strsplit(v$director,','), v$AB))
# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(v$director),','), v$AB))
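One thing to keep in mind with this variant is that stack() names its output columns values and ind rather than director and AB, and ind comes back as a factor. A minimal sketch, assuming a two-row stand-in for v, that renames the columns afterwards:

```r
# two-row stand-in for the data frame 'v' (an assumption for illustration)
v <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa"),
                AB = c("A", "B"), stringsAsFactors = FALSE)

# split the names, label each list element with its AB value, then stack
res <- stack(setNames(strsplit(v$director, ",", fixed = TRUE), v$AB))

# stack() returns columns 'values' and 'ind'; rename them to match the input
names(res) <- c("director", "AB")
print(res)
```

Because the list names can carry only one vector, this approach keeps just one additional column, so it suits two-column data like the example here.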
Answer by Matthew Lundberg
Naming your original data.frame v, we have this:
> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
director AB
1 Aaron Blaise A
2 Bob Walker A
3 Akira Kurosawa B
4 Alan J. Pakula A
5 Alan Parker A
6 Alejandro Amenabar B
7 Alejandro Gonzalez Inarritu B
8 Alejandro Gonzalez Inarritu B
9 Benicio Del Toro B
10 Alejandro González Iñárritu A
11 Alex Proyas B
12 Alexander Hall A
13 Alfonso Cuaron B
14 Alfred Hitchcock A
15 Anatole Litvak A
16 Andrew Adamson B
17 Marilyn Fox B
18 Andrew Dominik B
19 Andrew Stanton B
20 Andrew Stanton B
21 Lee Unkrich B
22 Angelina Jolie B
23 John Stevenson B
24 Anne Fontaine B
25 Anthony Harvey A
Note the use of rep to build the new AB column. Here, sapply returns the number of names in each of the original rows.
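To make the role of rep more concrete, here is a minimal sketch on two rows of the sample data. It uses lengths(), which gives the same counts as sapply(s, FUN=length) and is what the modified version in the benchmark uses; the names counts and ab are chosen here for illustration:

```r
# split two sample entries on the comma
s <- strsplit(c("Aaron Blaise,Bob Walker", "Akira Kurosawa"), ",", fixed = TRUE)

# number of names per original row; equivalent to sapply(s, FUN = length)
counts <- lengths(s)

# repeat each AB value once per name found in its row
ab <- rep(c("A", "B"), counts)
print(data.frame(director = unlist(s), AB = ab))
```

The first row contributes two names, so its AB value is repeated twice, while the second row's value appears once.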
Answer by A5C1D2H2I1M1N2O1R2T1
Late to the party, but another generalized alternative is to use cSplit from my "splitstackshape" package, which has a direction argument. Set this to "long" to get the result you specify:
library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
# director AB
# 1: Aaron Blaise A
# 2: Bob Walker A
# 3: Akira Kurosawa B
# 4: Alan J. Pakula A
# 5: Alan Parker A
# 6: Alejandro Amenabar B
Answer by zhang jing
devtools::install_github("yikeshu0611/onetree")
library(onetree)
dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")
head(dd)
director AB
1 Aaron Blaise A
2 Bob Walker A
3 Akira Kurosawa B
4 Alan J. Pakula A
5 Alan Parker A
6 Alejandro Amenabar B