Original source: http://stackoverflow.com/questions/2851327/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverFlow
Convert a list of data frames into one data frame
Asked by JD Long
I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame.
I got some pointers from an earlier question which was trying to do something similar but more complex.
Here's an example of what I am starting with (this is grossly simplified for illustration):
listOfDataFrames <- vector(mode = "list", length = 100)
for (i in 1:100) {
  listOfDataFrames[[i]] <- data.frame(a = sample(letters, 500, replace = TRUE),
                                      b = rnorm(500), c = rnorm(500))
}
I am currently using this:
df <- do.call("rbind", listOfDataFrames)
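For readers unfamiliar with do.call, here is a minimal sketch of what the line above does: it passes every element of the list as a separate argument to a single rbind() call. The three small data frames and the names small_list, combined and by_hand are made up for illustration.

```r
# Three tiny data frames standing in for the real list (illustrative data only)
small_list <- list(
  data.frame(a = c("x", "y"), b = c(1, 2)),
  data.frame(a = "z", b = 3),
  data.frame(a = c("u", "v"), b = c(4, 5))
)

# do.call() builds one call: rbind(small_list[[1]], small_list[[2]], small_list[[3]])
combined <- do.call("rbind", small_list)

# Writing the call out by hand gives the same result
by_hand <- rbind(small_list[[1]], small_list[[2]], small_list[[3]])

nrow(combined)                        # total number of rows across the list
isTRUE(all.equal(combined, by_hand))  # do the two approaches agree?
```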
Accepted answer by joeklieg
Use bind_rows() from the dplyr package:
bind_rows(list_of_dataframes, .id = "column_label")
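A short sketch of what .id does, assuming the list is named. The names jan and feb, the column label "month" and the data are invented for this example. Each row is tagged with the name of the list element it came from.

```r
library(dplyr)

# A named list of data frames; the names are hypothetical
monthly <- list(
  jan = data.frame(x = 1:2, y = c("a", "b")),
  feb = data.frame(x = 3:4, y = c("c", "d"))
)

# .id adds a leading column holding the name of the source list element
stacked <- bind_rows(monthly, .id = "month")

stacked$month  # one entry per row, taken from the list names
```

If the list has no names, bind_rows() falls back to a numeric sequence for the identifier.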
Answer by Shane
One other option is to use a plyr function:
df <- ldply(listOfDataFrames, data.frame)
This is a little slower than the original:
> system.time({ df <- do.call("rbind", listOfDataFrames) })
user system elapsed
0.25 0.00 0.25
> system.time({ df2 <- ldply(listOfDataFrames, data.frame) })
user system elapsed
0.30 0.00 0.29
> identical(df, df2)
[1] TRUE
My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find unless you can do something like (a) use matrices instead of data.frames and (b) preallocate the final matrix and assign to it rather than growing it.
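To illustrate points (a) and (b), here is a rough sketch. It assumes every piece is an all-numeric matrix with the same number of columns, which is not the case in the question's example (column a holds letters). The names pieces, rows_per_piece and out are made up for this sketch.

```r
set.seed(1)

# Hypothetical input: 100 numeric matrices, 500 rows by 2 columns each
pieces <- lapply(1:100, function(i) matrix(rnorm(1000), nrow = 500, ncol = 2))

rows_per_piece <- vapply(pieces, nrow, integer(1))

# (b) allocate the full result once instead of growing it piece by piece
out <- matrix(NA_real_, nrow = sum(rows_per_piece), ncol = ncol(pieces[[1]]))

# Copy each piece into its slot in the preallocated matrix
end_rows   <- cumsum(rows_per_piece)
start_rows <- end_rows - rows_per_piece + 1L
for (i in seq_along(pieces)) {
  out[start_rows[i]:end_rows[i], ] <- pieces[[i]]
}

dim(out)
```

Whether this beats do.call("rbind", ...) on a given data set is something to measure rather than assume.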
Edit 1:
Based on Hadley's comment, here's the latest version of rbind.fill from CRAN:
> system.time({ df3 <- rbind.fill(listOfDataFrames) })
user system elapsed
0.24 0.00 0.23
> identical(df, df3)
[1] TRUE
This is easier than rbind, and marginally faster (these timings hold up over multiple runs). And as far as I understand it, the version of plyr on GitHub is even faster than this.
Answer by andrekos
For the purpose of completeness, I thought the answers to this question required an update. "My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find..." It was probably true for May 2010 and some time after, but in about Sep 2012 a new function rbindlist was introduced in the data.table package version 1.8.2, with a remark that "This does the same as do.call("rbind",l), but much faster". How much faster?
library(rbenchmark)
benchmark(
do.call = do.call("rbind", listOfDataFrames),
plyr_rbind.fill = plyr::rbind.fill(listOfDataFrames),
plyr_ldply = plyr::ldply(listOfDataFrames, data.frame),
data.table_rbindlist = as.data.frame(data.table::rbindlist(listOfDataFrames)),
replications = 100, order = "relative",
columns=c('test','replications', 'elapsed','relative')
)
                  test replications elapsed relative
4 data.table_rbindlist          100    0.11    1.000
1              do.call          100    9.39   85.364
2      plyr_rbind.fill          100   12.08  109.818
3           plyr_ldply          100   15.14  137.636
Answer by rmf
Code:
library(microbenchmark)
dflist <- vector(length=100,mode="list")  # one slot per data frame created in the loop below
for(i in 1:100)
{
dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
c=rep(LETTERS,10),d=rep(LETTERS,10))
}
mb <- microbenchmark(
plyr::rbind.fill(dflist),
dplyr::bind_rows(dflist),
data.table::rbindlist(dflist),
plyr::ldply(dflist,data.frame),
do.call("rbind",dflist),
times=1000)
ggplot2::autoplot(mb)
Session:
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.5.0’
> packageVersion("data.table")
[1] ‘1.9.6’
UPDATE: Rerun 31-Jan-2018. Ran on the same computer. New versions of packages. Added seed for seed lovers.
set.seed(21)
library(microbenchmark)
dflist <- vector(length=100,mode="list")  # one slot per data frame created in the loop below
for(i in 1:100)
{
dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
c=rep(LETTERS,10),d=rep(LETTERS,10))
}
mb <- microbenchmark(
plyr::rbind.fill(dflist),
dplyr::bind_rows(dflist),
data.table::rbindlist(dflist),
plyr::ldply(dflist,data.frame),
do.call("rbind",dflist),
times=1000)
ggplot2::autoplot(mb) + ggplot2::theme_bw()  # theme_bw() needs the namespace because ggplot2 is not attached
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.7.2’
> packageVersion("data.table")
[1] ‘1.10.4’
UPDATE: Rerun 06-Aug-2019.
set.seed(21)
library(microbenchmark)
dflist <- vector(length=100,mode="list")  # one slot per data frame created in the loop below
for(i in 1:100)
{
dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
c=rep(LETTERS,10),d=rep(LETTERS,10))
}
mb <- microbenchmark(
plyr::rbind.fill(dflist),
dplyr::bind_rows(dflist),
data.table::rbindlist(dflist),
plyr::ldply(dflist,data.frame),
do.call("rbind",dflist),
purrr::map_df(dflist,dplyr::bind_rows),
times=1000)
ggplot2::autoplot(mb) + ggplot2::theme_bw()  # theme_bw() needs the namespace because ggplot2 is not attached
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.8.3’
> packageVersion("data.table")
[1] ‘1.12.2’
> packageVersion("purrr")
[1] ‘0.3.2’
Answer by TheVTM
There is also bind_rows(x, ...) in dplyr.
> system.time({ df.Base <- do.call("rbind", listOfDataFrames) })
user system elapsed
0.08 0.00 0.07
>
> system.time({ df.dplyr <- as.data.frame(bind_rows(listOfDataFrames)) })
user system elapsed
0.01 0.00 0.02
>
> identical(df.Base, df.dplyr)
[1] TRUE
Answer by yeedle
Here's another way this can be done (just adding it to the answers because reduce is a very effective functional tool that is often overlooked as a replacement for loops. In this particular case, neither of these is significantly faster than do.call)
using base R:
df <- Reduce(rbind, listOfDataFrames)
or, using the tidyverse:
library(tidyverse) # or, library(dplyr); library(purrr)
df <- listOfDataFrames %>% reduce(bind_rows)
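A small base R sketch of how the two approaches differ in the work they do: Reduce() calls rbind() once per additional list element and carries a growing intermediate result, whereas do.call() makes a single call with all elements. The data and the names parts, steps, via_reduce and via_do_call are illustrative only.

```r
# Illustrative list of four small data frames
parts <- lapply(1:4, function(i) data.frame(id = i, value = i * 10))

# accumulate = TRUE keeps every intermediate result, exposing the pairwise steps
steps <- Reduce(rbind, parts, accumulate = TRUE)
vapply(steps, nrow, integer(1))  # row count after each step

# The final Reduce() result and the single do.call() result hold the same data
via_reduce  <- Reduce(rbind, parts)
via_do_call <- do.call("rbind", parts)
isTRUE(all.equal(via_reduce, via_do_call))
```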
Answer by Nick
How it should be done in the tidyverse:
df.dplyr.purrr <- listOfDataFrames %>% map_df(bind_rows)
Answer by Nova
An updated visual for those wanting to compare some of the recent answers (I wanted to compare the purrr and dplyr solutions). Basically I combined answers from @TheVTM and @rmf.
Code:
library(microbenchmark)
library(data.table)
library(tidyverse)
dflist <- vector(length=100,mode="list")  # one slot per data frame created in the loop below
for(i in 1:100)
{
dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
c=rep(LETTERS,10),d=rep(LETTERS,10))
}
mb <- microbenchmark(
dplyr::bind_rows(dflist),
data.table::rbindlist(dflist),
purrr::map_df(dflist, bind_rows),
do.call("rbind",dflist),
times=500)
ggplot2::autoplot(mb)
Session Info:
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Package Versions:
> packageVersion("tidyverse")
[1] ‘1.1.1’
> packageVersion("data.table")
[1] ‘1.10.0’
Answer by f0nzie
The only thing that the solutions with data.table are missing is the identifier column to know which dataframe in the list the data is coming from.
Something like this:
df_id <- data.table::rbindlist(listOfDataFrames, idcol = TRUE)
The idcol parameter adds a column (.id) identifying the origin of the dataframe contained in the list. The result would look something like this:
.id a           b           c
  1 u -0.05315128 -1.31975849
  1 b -1.00404849  1.15257952
  1 y  1.17478229 -0.91043925
  1 q -1.65488899  0.05846295
  1 c -1.43730524  0.95245909
  1 b  0.56434313  0.93813197
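As a variation, idcol can also be given a column name, and if the list is named the identifier holds those names rather than positions. A minimal sketch with made-up data and names (batch_a, batch_b and the column label "batch" are assumptions for this example):

```r
library(data.table)

# Named list of data frames; names and values are hypothetical
batches <- list(
  batch_a = data.frame(a = c("u", "b"), b = c(0.1, 0.2)),
  batch_b = data.frame(a = c("y", "q"), b = c(0.3, 0.4))
)

# idcol = "batch" names the identifier column instead of the default ".id"
df_named <- rbindlist(batches, idcol = "batch")

df_named$batch  # source name for each row
```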