Convert a list of data frames into one data frame

Question

Asked by JD Long

I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame.

I got some pointers from an earlier question, which was trying to do something similar but more complex.

Here's an example of what I am starting with (this is grossly simplified for illustration):

listOfDataFrames <- vector(mode = "list", length = 100)

for (i in 1:100) {
    listOfDataFrames[[i]] <- data.frame(a=sample(letters, 500, replace=TRUE),
                             b=rnorm(500), c=rnorm(500))
}

I am currently using this:

  df <- do.call("rbind", listOfDataFrames)

Answer 1

Accepted answer by joeklieg

Use bind_rows() from the dplyr package:

bind_rows(list_of_dataframes, .id = "column_label")
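
As a small illustration of what the .id argument does, here is a sketch that assumes dplyr is installed. The list names (jan, feb) and the column values are invented for the example. When the list is named, bind_rows() copies each element's name into the new column:

```r
library(dplyr)

# Hypothetical named list of data frames; the names become the labels.
list_of_dataframes <- list(
  jan = data.frame(x = 1:2, y = c("a", "b")),
  feb = data.frame(x = 3L, y = "c")
)

combined <- bind_rows(list_of_dataframes, .id = "column_label")

# column_label records which list element each row came from.
print(combined)
```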

Answer 2

Answered by Shane

One other option is to use a plyr function:

df <- ldply(listOfDataFrames, data.frame)

This is a little slower than the original:

> system.time({ df <- do.call("rbind", listOfDataFrames) })
   user  system elapsed 
   0.25    0.00    0.25 
> system.time({ df2 <- ldply(listOfDataFrames, data.frame) })
   user  system elapsed 
   0.30    0.00    0.29
> identical(df, df2)
[1] TRUE

My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find unless you can do something like (a) use matrices instead of data.frames and (b) preallocate the final matrix and assign to it rather than growing it.
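
To make options (a) and (b) concrete, here is a rough base-R sketch. It is only an illustration under stated assumptions: the sizes mirror the question's example, the names n_chunks, rows_per_chunk, and result are invented, and only the numeric columns b and c are kept because a matrix holds a single type.

```r
# Assumption: the character column `a` from the question is dropped,
# since a numeric matrix cannot store it alongside b and c.
n_chunks <- 100
rows_per_chunk <- 500

# Allocate the full result once instead of growing it inside the loop.
result <- matrix(NA_real_, nrow = n_chunks * rows_per_chunk, ncol = 2,
                 dimnames = list(NULL, c("b", "c")))

for (i in seq_len(n_chunks)) {
  # Row range that chunk i occupies in the preallocated matrix.
  rows <- seq.int((i - 1) * rows_per_chunk + 1, i * rows_per_chunk)
  result[rows, "b"] <- rnorm(rows_per_chunk)
  result[rows, "c"] <- rnorm(rows_per_chunk)
}

# Convert once at the end if a data frame is needed.
df <- as.data.frame(result)
```

Whether this beats do.call("rbind", ...) for a given workload would need to be measured; the sketch only shows the shape of the approach.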

Edit 1:

Based on Hadley's comment, here's the latest version of rbind.fill from CRAN:

> system.time({ df3 <- rbind.fill(listOfDataFrames) })
   user  system elapsed 
   0.24    0.00    0.23 
> identical(df, df3)
[1] TRUE

This is easier than rbind, and marginally faster (these timings hold up over multiple runs). And as far as I understand it, the version of plyr on GitHub is even faster than this.
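
One way rbind.fill can be more convenient than rbind (my reading, not something the answer states) is that it takes the list directly and tolerates pieces whose columns do not match. A minimal sketch, assuming plyr is installed and using two invented data frames:

```r
library(plyr)

# Hypothetical data frames whose columns only partly overlap.
frames <- list(
  data.frame(a = 1:2, b = c(0.5, 1.5)),
  data.frame(a = 3L, c = "z")
)

# rbind.fill() accepts the list as-is; columns missing from a piece are
# filled with NA in the rows that piece contributes.
combined <- rbind.fill(frames)
print(combined)
```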

Answer 3

Answered by andrekos

For the purpose of completeness, I thought the answers to this question required an update. "My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find..." It was probably true for May 2010 and some time after, but in about Sep 2012 a new function rbindlist was introduced in the data.table package version 1.8.2, with a remark that "This does the same as do.call("rbind",l), but much faster". How much faster?

library(rbenchmark)
benchmark(
  do.call = do.call("rbind", listOfDataFrames),
  plyr_rbind.fill = plyr::rbind.fill(listOfDataFrames), 
  plyr_ldply = plyr::ldply(listOfDataFrames, data.frame),
  data.table_rbindlist = as.data.frame(data.table::rbindlist(listOfDataFrames)),
  replications = 100, order = "relative", 
  columns=c('test','replications', 'elapsed','relative')
  )

                  test replications elapsed relative
4 data.table_rbindlist          100    0.11    1.000
1              do.call          100    9.39   85.364
2      plyr_rbind.fill          100   12.08  109.818
3           plyr_ldply          100   15.14  137.636
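
For readers who have not used rbindlist before, here is a minimal sketch, assuming data.table is installed and using two invented data frames. It shows that rbindlist returns a data.table and that converting it back with as.data.frame (as the benchmark above does) yields the same rows as the base approach:

```r
library(data.table)

# Hypothetical input: two small data frames with identical columns.
frames <- list(
  data.frame(a = c("x", "y"), b = c(1.0, 2.0)),
  data.frame(a = c("z", "x"), b = c(3.0, 4.0))
)

via_base <- do.call("rbind", frames)

# rbindlist() returns a data.table; as.data.frame() converts it back.
via_dt <- as.data.frame(rbindlist(frames))

# Compare the two results column by column.
print(all.equal(via_base, via_dt))
```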

Answer 4

Answered by rmf

Code:

library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
plyr::rbind.fill(dflist),
dplyr::bind_rows(dflist),
data.table::rbindlist(dflist),
plyr::ldply(dflist,data.frame),
do.call("rbind",dflist),
times=1000)

ggplot2::autoplot(mb)

Session:

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.5.0’
> packageVersion("data.table")
[1] ‘1.9.6’

UPDATE: Rerun 31-Jan-2018. Ran on the same computer. New versions of packages. Added seed for seed lovers.

set.seed(21)
library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  plyr::rbind.fill(dflist),
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  plyr::ldply(dflist,data.frame),
  do.call("rbind",dflist),
  times=1000)

ggplot2::autoplot(mb)+ggplot2::theme_bw()


R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.7.2’
> packageVersion("data.table")
[1] ‘1.10.4’

UPDATE: Rerun 06-Aug-2019.

set.seed(21)
library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  plyr::rbind.fill(dflist),
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  plyr::ldply(dflist,data.frame),
  do.call("rbind",dflist),
  purrr::map_df(dflist,dplyr::bind_rows),
  times=1000)

ggplot2::autoplot(mb)+ggplot2::theme_bw()

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.8.3’
> packageVersion("data.table")
[1] ‘1.12.2’
> packageVersion("purrr")
[1] ‘0.3.2’

Answer 5

Answered by TheVTM

There is also bind_rows(x, ...) in dplyr.

> system.time({ df.Base <- do.call("rbind", listOfDataFrames) })
   user  system elapsed 
   0.08    0.00    0.07 
> 
> system.time({ df.dplyr <- as.data.frame(bind_rows(listOfDataFrames)) })
   user  system elapsed 
   0.01    0.00    0.02 
> 
> identical(df.Base, df.dplyr)
[1] TRUE

Answer 6

Answered by yeedle

Here's another way this can be done (just adding it to the answers because reduce is a very effective functional tool that is often overlooked as a replacement for loops. In this particular case, neither of these is significantly faster than do.call)

using base R:

df <- Reduce(rbind, listOfDataFrames)

or, using the tidyverse:

library(tidyverse) # or, library(dplyr); library(purrr)
df <- listOfDataFrames %>% reduce(bind_rows)
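
A possible reason the reduce variants do not beat do.call: Reduce applies rbind pairwise, so every step builds a new, larger data frame from the accumulated result, while do.call hands all pieces to a single rbind call. A base-R sketch with three invented data frames that checks both routes produce the same result:

```r
# Hypothetical input: three small data frames with identical columns.
frames <- lapply(1:3, function(i) data.frame(id = i, value = i * c(1, 10)))

# Reduce() folds the list pairwise:
# rbind(rbind(frames[[1]], frames[[2]]), frames[[3]])
via_reduce <- Reduce(rbind, frames)

# do.call() passes every element to one rbind() call.
via_docall <- do.call("rbind", frames)

print(all.equal(via_reduce, via_docall))
```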

Answer 7

Answered by Nick

How it should be done in the tidyverse:

df.dplyr.purrr <- listOfDataFrames %>% map_df(bind_rows)
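
For comparison, bind_rows() also accepts the list directly, so the map_df() wrapper should not change the result. A small sketch that checks this, assuming dplyr and purrr are installed and using two invented data frames:

```r
library(dplyr)
library(purrr)

# Hypothetical input: two small data frames with identical columns.
frames <- list(
  data.frame(a = 1:2, b = c("x", "y")),
  data.frame(a = 3:4, b = c("z", "w"))
)

# map_df() calls bind_rows() on each element (which leaves a single data
# frame's contents unchanged) and then row-binds the pieces.
via_map_df <- map_df(frames, bind_rows)

# bind_rows() can take the whole list in one call.
via_bind_rows <- bind_rows(frames)

print(all.equal(via_map_df, via_bind_rows))
```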

Answer 8

Answered by Nova

An updated visual for those wanting to compare some of the recent answers (I wanted to compare the purrr solution to the dplyr one). Basically I combined answers from @TheVTM and @rmf.

Code:

library(microbenchmark)
library(data.table)
library(tidyverse)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  purrr::map_df(dflist, bind_rows),
  do.call("rbind",dflist),
  times=500)

ggplot2::autoplot(mb)

Session Info:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Package Versions:

> packageVersion("tidyverse")
[1] ‘1.1.1’
> packageVersion("data.table")
[1] ‘1.10.0’

Answer 9

Answered by f0nzie

The only thing that the solutions with data.table are missing is an identifier column showing which data frame in the list each row came from.

Something like this:

df_id <- data.table::rbindlist(listOfDataFrames, idcol = TRUE)

The idcol parameter adds a column (.id) identifying which data frame in the list each row came from. The result would look something like this:

.id  a            b            c
  1  u  -0.05315128  -1.31975849
  1  b  -1.00404849   1.15257952
  1  y   1.17478229  -0.91043925
  1  q  -1.65488899   0.05846295
  1  c  -1.43730524   0.95245909
  1  b   0.56434313   0.93813197
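
If the list is named, the identifier can carry those names rather than positions, and passing a string to idcol sets the column's name. A short sketch, assuming data.table is installed; the list names (first, second), the column name source, and the values are invented for the example:

```r
library(data.table)

# Hypothetical named list of data frames.
frames <- list(
  first = data.frame(a = c("u", "b"), b = c(-0.05, -1.00)),
  second = data.frame(a = c("y", "q"), b = c(1.17, -1.65))
)

# idcol = "source" names the identifier column; with a named list the
# element names are used as identifiers instead of 1, 2, ...
df_id <- rbindlist(frames, idcol = "source")
print(df_id)
```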
