Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original address and author information, and attribute it to the original authors (not me). Original StackOverflow address: http://stackoverflow.com/questions/8091303/

Time: 2020-09-11 01:44:09  Source: igfitidea

Simultaneously merge multiple data.frames in a list

Tags: r, list, merge, dataframe, r-faq

Asked by bshor

I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called "var1" and "var2" in the code below). If the data.frames were identical in terms of columns, I could merely rbind, for which plyr's rbind.fill would do the job, but that's not the case with these data.

Because the merge command only works on 2 data.frames, I turned to the Internet for ideas. I found the following function, which worked perfectly in R 2.7.2, which is what I had at the time:

merge.rec <- function(.list, ...){
    if(length(.list)==1) return(.list[[1]])
    Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}

And I would call the function like so:

df <- merge.rec(my.list, by.x = c("var1", "var2"), 
                by.y = c("var1", "var2"), all = T, suffixes=c("", ""))

But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

(Incidentally, I see other references to this error elsewhere, with no resolution.)

Is there any way to solve this?

Accepted answer by Paul Rougieux

Another question asked specifically how to perform multiple left joins using dplyr in R. That question was marked as a duplicate of this one, so I answer here, using the 3 sample data frames below:

x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)

Update June 2018: I divided the answer into three sections representing three different ways to perform the merge. You probably want to use the purrr way if you are already using the tidyverse packages. For comparison purposes, you'll find a base R version below using the same sample dataset.



1) Join them with reduce from the purrr package:

The purrr package provides a reduce function which has a concise syntax:

library(tidyverse)
list(x, y, z) %>% reduce(left_join, by = "i")
#  A tibble: 3 x 4
#  i       j     k     l
#  <chr> <int> <int> <int>
# 1 a      1    NA     9
# 2 b      2     4    NA
# 3 c      3     5     7

You can also perform other joins, such as a full_join or inner_join:

list(x, y, z) %>% reduce(full_join, by = "i")
# A tibble: 4 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 a     1     NA     9
# 2 b     2     4      NA
# 3 c     3     5      7
# 4 d     NA    6      8

list(x, y, z) %>% reduce(inner_join, by = "i")
# A tibble: 1 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 c     3     5     7


2) dplyr::left_join() with base R Reduce():

list(x,y,z) %>%
    Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)

#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7


3) Base R merge() with base R Reduce():

And for comparison purposes, here is a base R version of the left join:

 Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
        list(x,y,z))
#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7
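
For a full outer join in base R, the same pattern should work with all = TRUE. The sketch below reuses the sample data frames x, y and z from this answer and checks that Reduce() gives the same result as nesting the merge() calls by hand. The names full and nested are only illustrative and are not part of the original answer.

```r
# Same sample data as in the answer above
x <- data.frame(i = c("a", "b", "c"), j = 1:3, stringsAsFactors = FALSE)
y <- data.frame(i = c("b", "c", "d"), k = 4:6, stringsAsFactors = FALSE)
z <- data.frame(i = c("c", "d", "a"), l = 7:9, stringsAsFactors = FALSE)

# Fold merge() over the list, keeping unmatched rows from both sides
full <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all = TRUE),
               list(x, y, z))

# Reduce() on a list of three is just a left-to-right nesting of two merges
nested <- merge(merge(x, y, by = "i", all = TRUE), z, by = "i", all = TRUE)

identical(full, nested)
full
```

Because merge() sorts on the key by default, the rows come back ordered by i, with NA wherever a data frame had no row for that key.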

Answer by Charles

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)

Here's a full example using some mock data:

set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
#    x  a  b         y
#12 12 NA 18        NA
#13 13 NA 19        NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]

#  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA>
#2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA>
#3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA>


Note: It looks like this is arguably a bug in merge. The problem is that there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame, which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year  
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.

The easiest fix is to not leave the renaming of duplicate fields (of which there are many here) up to merge. E.g.:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
      names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))

The merge/Reduce will then work fine.
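
As a rough illustration of the renaming fix, here is a self-contained sketch. The four small data frames and their contents are made up for this example (they are not the asker's data); only the Map/setNames step and the Reduce/merge call follow the approach described above.

```r
# Hypothetical mock data: every data frame shares the keys var1/var2
# and also has a non-key column called "name" that would clash.
my.list <- list(
  data.frame(var1 = c(1, 2),    var2 = c("a", "b"),      name = c("n1", "n2"),
             stringsAsFactors = FALSE),
  data.frame(var1 = c(2, 3),    var2 = c("b", "c"),      name = c("m2", "m3"),
             stringsAsFactors = FALSE),
  data.frame(var1 = c(1, 3),    var2 = c("a", "c"),      name = c("p1", "p3"),
             stringsAsFactors = FALSE),
  data.frame(var1 = c(1, 2, 3), var2 = c("a", "b", "c"), name = c("q1", "q2", "q3"),
             stringsAsFactors = FALSE)
)
match.by <- c("var1", "var2")

# Give every non-key column a suffix based on its position in the list,
# so merge() never has to invent suffixes itself.
my.list2 <- Map(function(x, i) {
  setNames(x, ifelse(names(x) %in% match.by,
                     names(x), sprintf("%s.%d", names(x), i)))
}, my.list, seq_along(my.list))

merged <- Reduce(function(a, b) merge(a, b, by = match.by, all = TRUE), my.list2)
names(merged)
```

With the columns renamed up front, each merge sees only unique non-key names, so the result keeps one name.N column per input data frame.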

Answer by Ramnath

You can do it using merge_all in the reshape package. You can pass parameters to merge using the ... argument:

reshape::merge_all(list_of_dataframes, ...)

Here is an excellent resource on different methods to merge data frames.

Answer by SFun28

You can use recursion to do this. I haven't verified the following, but it should give you the right idea:

MergeListOfDf = function( data , ... )
{
    if ( length( data ) == 2 ) 
    {
        return( merge( data[[ 1 ]] , data[[ 2 ]] , ... ) )
    }    
    return( merge( MergeListOfDf( data[ -1 ] , ... ) , data[[ 1 ]] , ... ) )
}
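
Since the function above is described as unverified, here is a small variant that can be run as-is. It is a sketch rather than the answerer's code: the name merge_list is made up, it adds a base case for a single-element list, and it merges the head of the list with the merged tail so that columns appear in list order. The sample data frames are borrowed from the accepted answer.

```r
# Recursive merge of a list of data frames; extra arguments go to merge().
merge_list <- function(dfs, ...) {
  stopifnot(length(dfs) >= 1)
  if (length(dfs) == 1) {
    return(dfs[[1]])                      # base case: nothing left to merge
  }
  merge(dfs[[1]], merge_list(dfs[-1], ...), ...)
}

x <- data.frame(i = c("a", "b", "c"), j = 1:3, stringsAsFactors = FALSE)
y <- data.frame(i = c("b", "c", "d"), k = 4:6, stringsAsFactors = FALSE)
z <- data.frame(i = c("c", "d", "a"), l = 7:9, stringsAsFactors = FALSE)

res <- merge_list(list(x, y, z), by = "i", all = TRUE)
res
```

For long lists, Reduce() is likely the safer choice, since deep recursion in R can run into the expression nesting limit.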

Answer by dmi3kno

I will reuse the data example from @PaulRougieux

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)

Here's a short and sweet solution using purrr and tidyr:

library(tidyverse)

 list(x, y, z) %>% 
  map_df(gather, key=key, value=value, -i) %>% 
  spread(key, value)
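
The idea behind the pipeline above is to stack every data frame into long form (key column, variable name, value) and then spread the result back out to one column per variable. Note that gather() and spread() are superseded in current tidyr by pivot_longer() and pivot_wider(). The base R sketch below illustrates the same long-then-wide idea without any packages; it assumes each data frame has exactly one value column besides i and that all value columns share a type, and it returns a matrix rather than a data frame.

```r
x <- data.frame(i = c("a", "b", "c"), j = 1:3, stringsAsFactors = FALSE)
y <- data.frame(i = c("b", "c", "d"), k = 4:6, stringsAsFactors = FALSE)
z <- data.frame(i = c("c", "d", "a"), l = 7:9, stringsAsFactors = FALSE)

# Long form: one row per (i, variable name, value)
long <- do.call(rbind, lapply(list(x, y, z), function(d) {
  col <- setdiff(names(d), "i")           # assumed: a single value column
  data.frame(i = d$i, key = col, value = d[[col]], stringsAsFactors = FALSE)
}))

# Wide form: rows are keys, columns are variables, NA where nothing matched
wide <- tapply(long$value, list(i = long$i, key = long$key), function(v) v[1])
wide
```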

Answer by Moody_Mudskipper

The function eat of my package safejoin has such a feature: if you give it a list of data.frames as a second input, it will join them recursively to the first input.

Borrowing and extending the accepted answer's data:

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)
z2 <- data_frame(i = c("a","b","c"), l = rep(100L,3),l2 = rep(100L,3)) # for later

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
eat(x, list(y,z), .by = "i")
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7


We don't have to take all columns; we can use select helpers from tidyselect to choose which ones to keep (as we start from .x, all .x columns are kept):

eat(x, list(y,z), starts_with("l") ,.by = "i")
# # A tibble: 3 x 3
#   i         j     l
#   <chr> <int> <int>
# 1 a         1     9
# 2 b         2    NA
# 3 c         3     7

or remove specific ones:

eat(x, list(y,z), -starts_with("l") ,.by = "i")
# # A tibble: 3 x 3
#   i         j     k
#   <chr> <int> <int>
# 1 a         1    NA
# 2 b         2     4
# 3 c         3     5

If the list is named, the names will be used as prefixes:

eat(x, dplyr::lst(y,z), .by = "i")
# # A tibble: 3 x 4
#   i         j   y_k   z_l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

If there are column conflicts, the .conflict argument allows you to resolve them, for example by taking the first or second one, adding them, coalescing them, or nesting them.

keep first:

eat(x, list(y, z, z2), .by = "i", .conflict = ~.x)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

keep last:

eat(x, list(y, z, z2), .by = "i", .conflict = ~.y)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA   100
# 2 b         2     4   100
# 3 c         3     5   100

add:

eat(x, list(y, z, z2), .by = "i", .conflict = `+`)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA   109
# 2 b         2     4    NA
# 3 c         3     5   107

coalesce:

eat(x, list(y, z, z2), .by = "i", .conflict = dplyr::coalesce)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA     9
# 2 b         2     4   100
# 3 c         3     5     7

nest:

eat(x, list(y, z, z2), .by = "i", .conflict = ~tibble(first=.x, second=.y))
# # A tibble: 3 x 4
#   i         j     k l$first $second
#   <chr> <int> <int>   <int>   <int>
# 1 a         1    NA       9     100
# 2 b         2     4      NA     100
# 3 c         3     5       7     100

NA values can be replaced by using the .fill argument.

eat(x, list(y, z), .by = "i", .fill = 0)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <dbl> <dbl>
# 1 a         1     0     9
# 2 b         2     4     0
# 3 c         3     5     7

By default it's an enhanced left_join, but all dplyr joins are supported through the .mode argument. Fuzzy joins are also supported, either through the match_fun argument (it wraps the fuzzyjoin package) or by giving a formula such as ~ X("var1") > Y("var2") & X("var3") < Y("var4") to the by argument.

Answer by Elias EstatisticsEU

When you have a list of dfs in which one column contains the ID, but some IDs are missing from some of the dfs, you can use this version of Reduce/merge to join them while keeping the rows whose IDs or labels are missing elsewhere:

Reduce(function(x, y) merge(x=x, y=y, by="V1", all.x=T, all.y=T), list_of_dfs)

Answer by englealuze

Here is a generic wrapper which can be used to convert a binary function into a multi-parameter function. The benefit of this solution is that it is very generic and can be applied to any binary function. You just need to define it once and then you can apply it anywhere.

To demo the idea, I use simple recursion for the implementation. It can of course be implemented in a more elegant way that benefits from R's good support for the functional paradigm.

fold_left <- function(f) {
    return(function(...) {
        args <- list(...)
        return(function(...) {
            iter <- function(result, rest) {
                if (length(rest) == 0) {
                    return(result)
                } else {
                    return(iter(f(result, rest[[1]], ...), rest[-1]))
                }
            }
            return(iter(args[[1]], args[-1]))
        })
    })
}

Then you can simply wrap any binary function with it and call the result with positional parameters (usually data.frames) in the first parentheses and named parameters in the second parentheses (such as by = or suffix =). If there are no named parameters, leave the second parentheses empty.

merge_all <- fold_left(merge)
merge_all(df1, df2, df3, df4, df5)(by.x = c("var1", "var2"), by.y = c("var1", "var2"))

left_join_all <- fold_left(left_join)
left_join_all(df1, df2, df3, df4, df5)(c("var1", "var2"))
left_join_all(df1, df2, df3, df4, df5)()
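
One possible "more elegant" variant, offered only as a sketch, delegates the iteration to base R's Reduce() instead of hand-written recursion. The calling convention is kept the same as above; the sample data frames are borrowed from the accepted answer, and merge_all is just an illustrative name.

```r
# fold_left built on Reduce(): the first call collects the positional inputs,
# the second call supplies extra arguments passed to every f() call.
fold_left <- function(f) {
  function(...) {
    args <- list(...)
    function(...) {
      Reduce(function(acc, nxt) f(acc, nxt, ...), args)
    }
  }
}

x <- data.frame(i = c("a", "b", "c"), j = 1:3, stringsAsFactors = FALSE)
y <- data.frame(i = c("b", "c", "d"), k = 4:6, stringsAsFactors = FALSE)
z <- data.frame(i = c("c", "d", "a"), l = 7:9, stringsAsFactors = FALSE)

merge_all <- fold_left(merge)
res <- merge_all(x, y, z)(by = "i", all = TRUE)
res

# Works for any binary function, e.g. addition with no extra arguments
total <- fold_left(`+`)(1, 2, 3)()
total
```

Inside the innermost anonymous function, ... is not a formal argument, so it resolves lexically to the arguments of the second-stage call, which is what lets by = and all = reach every merge().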

Answer by Elias EstatisticsEU

I had a list of dataframes with no common id column. Many of the dfs had missing data, including NULL values. The dataframes were produced using the table function. Reduce, merge, rbind, rbind.fill, and the like could not get me to my aim, which was to produce an understandable merged dataframe regardless of the missing data and the lack of a common id column.

Therefore, I made the following function. Maybe this function can help someone.

##########################################################
####             Dependencies                        #####
##########################################################

# Depends on Base R only

##########################################################
####             Example DF                          #####
##########################################################

# Example df
ex_df           <- cbind(c( seq(1, 10, 1), rep("NA", 0), seq(1,10, 1) ), 
                         c( seq(1, 7, 1),  rep("NA", 3), seq(1, 12, 1) ), 
                         c( seq(1, 3, 1),  rep("NA", 7), seq(1, 5, 1), rep("NA", 5) ))

# Making colnames and rownames
colnames(ex_df) <- 1:dim(ex_df)[2]
rownames(ex_df) <- 1:dim(ex_df)[1]

# Making an unequal list of dfs, 
# without a common id column
list_of_df      <- apply(ex_df=="NA", 2, ( table) )

The function follows:

##########################################################
####             The function                        #####
##########################################################


# The function to rbind it
rbind_null_df_lists <- function ( list_of_dfs ) {
  length_df     <- do.call(rbind, (lapply( list_of_dfs, function(x) length(x))))
  max_no        <- max(length_df[,1])
  max_df        <- length_df[max(length_df),]
  name_df       <- names(length_df[length_df== max_no,][1])
  names_list    <- names(list_of_dfs[ name_df][[1]])

  df_dfs <- list()
  for (i in 1:max_no ) {

    df_dfs[[i]]            <- do.call(rbind, lapply(1:length(list_of_dfs), function(x) list_of_dfs[[x]][i]))

  }

  df_cbind               <- do.call( cbind, df_dfs )
  rownames( df_cbind )   <- rownames (length_df)
  colnames( df_cbind )   <- names_list

  df_cbind

}

Running the example

##########################################################
####             Running the example                 #####
##########################################################

rbind_null_df_lists ( list_of_df )