Note: this page reproduces a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow post: http://stackoverflow.com/questions/49787325/

Full outer join of two or more data frames

Tags: python, python-3.x, pandas

Asked by ps0604

Given the following three pandas data frames, I need to merge them in a way similar to an SQL full outer join. Note that the join key is the pair of columns type_N and id_N, with N = 1, 2, 3:

import pandas as pd

raw_data = {
        'type_1': [0, 1, 1,1],
        'id_1': ['3', '4', '5','5'],
        'name_1': ['Alex', 'Amy', 'Allen', 'Jane']}
df_a = pd.DataFrame(raw_data, columns = ['type_1', 'id_1', 'name_1' ])

raw_datab = {
        'type_2': [1, 1, 1, 0],
        'id_2': ['4', '5', '5', '7'],
        'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']}
df_b = pd.DataFrame(raw_datab, columns = ['type_2', 'id_2', 'name_2'])

raw_datac = {
        'type_3': [1, 0],
        'id_3': ['4', '7'],
        'name_3': ['School', 'White']}
df_c = pd.DataFrame(raw_datac, columns = ['type_3', 'id_3', 'name_3'])

The expected result is:

type_1   id_1   name_1   type_2   id_2   name_2   type_3   id_3   name_3
0        3      Alex     NaN      NaN    NaN      NaN      NaN    NaN
1        4      Amy      1        4      Bill     1        4      School
1        5      Allen    1        5      Brian    NaN      NaN    NaN
1        5      Allen    1        5      Joe      NaN      NaN    NaN
1        5      Jane     1        5      Brian    NaN      NaN    NaN
1        5      Jane     1        5      Joe      NaN      NaN    NaN
NaN      NaN    NaN      0        7      Bryce    0        7      White

How can this be achieved in pandas?

Answered by piRSquared

I'd suggest making life less complicated by not using different names for the columns you want to merge on. Move the key columns into the index and give the index levels the same names in all three frames:

da = df_a.set_index(['type_1', 'id_1']).rename_axis(['type', 'id'])
db = df_b.set_index(['type_2', 'id_2']).rename_axis(['type', 'id'])
dc = df_c.set_index(['type_3', 'id_3']).rename_axis(['type', 'id'])

d = da.join(db, how='outer').join(dc, how='outer')
d

        name_1 name_2  name_3
type id                      
0    3    Alex    NaN     NaN
     7     NaN  Bryce   White
1    4     Amy   Bill  School
     5   Allen  Brian     NaN
     5   Allen    Joe     NaN
     5    Jane  Brian     NaN
     5    Jane    Joe     NaN


Here's an obnoxious way to get those other columns (type_N and id_N for each frame) back:

from cytoolz.dicttoolz import merge

i = pd.DataFrame(d.index.values.tolist(), d.index, d.index.names)
d = d.assign(**merge(
    i.mask(d[f'name_{j}'].isna()).add_suffix(f'_{j}').to_dict('list')
    for j in [1, 2, 3]
))

d[sorted(d.columns, key=lambda x: x.split('_')[::-1])]

        id_1 name_1  type_1 id_2 name_2  type_2 id_3  name_3  type_3
type id                                                             
0    3     3   Alex     0.0  NaN    NaN     NaN  NaN     NaN     NaN
     7   NaN    NaN     NaN    7  Bryce     0.0    7   White     0.0
1    4     4    Amy     1.0    4   Bill     1.0    4  School     1.0
     5     5  Allen     1.0    5  Brian     1.0  NaN     NaN     NaN
     5     5  Allen     1.0    5    Joe     1.0  NaN     NaN     NaN
     5     5   Jane     1.0    5  Brian     1.0  NaN     NaN     NaN
     5     5   Jane     1.0    5    Joe     1.0  NaN     NaN     NaN
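If the extra cytoolz dependency is unwanted, one possible alternative is to keep the key columns when building the index, so nothing has to be reconstructed afterwards. The sketch below assumes the same sample data as the question; the helper name `keyed` is made up for illustration and is not part of the original answer.

```python
import pandas as pd

df_a = pd.DataFrame({'type_1': [0, 1, 1, 1],
                     'id_1': ['3', '4', '5', '5'],
                     'name_1': ['Alex', 'Amy', 'Allen', 'Jane']})
df_b = pd.DataFrame({'type_2': [1, 1, 1, 0],
                     'id_2': ['4', '5', '5', '7'],
                     'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']})
df_c = pd.DataFrame({'type_3': [1, 0],
                     'id_3': ['4', '7'],
                     'name_3': ['School', 'White']})


def keyed(df, n):
    # Hypothetical helper: drop=False keeps type_N / id_N as ordinary
    # columns in addition to using them as index levels.
    return (df.set_index([f'type_{n}', f'id_{n}'], drop=False)
              .rename_axis(['type', 'id']))


da, db, dc = keyed(df_a, 1), keyed(df_b, 2), keyed(df_c, 3)

# Outer-join on the shared index, then discard the helper index.
wide = da.join(db, how='outer').join(dc, how='outer').reset_index(drop=True)
print(wide)
```

Because the original key columns travel along with each frame, they come out as NaN exactly where that frame had no matching row. Integer columns that pick up NaN are shown as floats, as in the output above.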

Answered by user3483203

You could use two consecutive merges, first on df_a and df_b, and then on df_c:

In [49]: df_temp = df_a.merge(df_b, how='outer', left_on=['type_1', 'id_1'], right_on=['type_2', 'id_2'])

In [50]: df_temp.merge(df_c, how='outer', left_on=['type_2', 'id_2'], right_on=['type_3', 'id_3'])
Out[50]:
   type_1 id_1 name_1 type_2 id_2 name_2  type_3 id_3  name_3
0     0.0    3   Alex    NaN  NaN    NaN     NaN  NaN     NaN
1     1.0    4    Amy      1    4   Bill     1.0    4  School
2     1.0    5  Allen      1    5  Brian     NaN  NaN     NaN
3     1.0    5  Allen      1    5    Joe     NaN  NaN     NaN
4     1.0    5   Jane      1    5  Brian     NaN  NaN     NaN
5     1.0    5   Jane      1    5    Joe     NaN  NaN     NaN
6     NaN  NaN    NaN      0    7  Bryce     0.0    7   White
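One thing to watch with chained merges: the second merge matches df_c against type_2/id_2 only, and those are NaN for rows that came from df_a alone. With the sample data this makes no difference, but a df_c row whose key exists only in df_a would not line up. The sketch below illustrates this under an assumption: `df_c2` adds a hypothetical row (0, '3', 'Extra') that is not in the question, and the temporary columns `type` and `id` are names chosen here for the coalesced key.

```python
import pandas as pd

df_a = pd.DataFrame({'type_1': [0, 1, 1, 1],
                     'id_1': ['3', '4', '5', '5'],
                     'name_1': ['Alex', 'Amy', 'Allen', 'Jane']})
df_b = pd.DataFrame({'type_2': [1, 1, 1, 0],
                     'id_2': ['4', '5', '5', '7'],
                     'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']})
# Hypothetical variant of df_c: the last row's key exists in df_a only.
df_c2 = pd.DataFrame({'type_3': [1, 0, 0],
                      'id_3': ['4', '7', '3'],
                      'name_3': ['School', 'White', 'Extra']})

temp = df_a.merge(df_b, how='outer',
                  left_on=['type_1', 'id_1'], right_on=['type_2', 'id_2'])

# Chaining on df_b's key columns: 'Extra' cannot match Alex's row,
# because type_2 / id_2 are NaN there.
naive = temp.merge(df_c2, how='outer',
                   left_on=['type_2', 'id_2'], right_on=['type_3', 'id_3'])

# Coalesce the key first: take df_a's key where present, else df_b's.
temp['type'] = temp['type_1'].fillna(temp['type_2']).astype(int)
temp['id'] = temp['id_1'].fillna(temp['id_2'])

right = df_c2.assign(type=df_c2['type_3'], id=df_c2['id_3'])
out = (temp.merge(right, how='outer', on=['type', 'id'])
           .drop(columns=['type', 'id']))
print(out)
```

In `naive`, the 'Extra' row ends up on a line of its own, while in `out` it sits next to Alex. Whether that matters depends on the data; if every key in df_c also appears in df_b, the two-step merge above is sufficient.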

Answered by YOBEN_S

Let's create a new key column for this, built from the first two columns of each frame. I am using functools.reduce here to chain the merges:

import functools
dfs=[df_a,df_b,df_c]
dfs=[x.assign(key=list(zip(x.iloc[:,0],x.iloc[:,1]))) for x in dfs]
merged_df = functools.reduce(lambda left,right: pd.merge(left,right,on='key',how='outer'), dfs)
merged_df.drop(columns='key')
Out[110]: 
   type_1 id_1 name_1  type_2 id_2 name_2  type_3 id_3  name_3
0     0.0    3   Alex     NaN  NaN    NaN     NaN  NaN     NaN
1     1.0    4    Amy     1.0    4   Bill     1.0    4  School
2     1.0    5  Allen     1.0    5  Brian     NaN  NaN     NaN
3     1.0    5  Allen     1.0    5    Joe     NaN  NaN     NaN
4     1.0    5   Jane     1.0    5  Brian     NaN  NaN     NaN
5     1.0    5   Jane     1.0    5    Joe     NaN  NaN     NaN
6     NaN  NaN    NaN     0.0    7  Bryce     0.0    7   White
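The same idea can be written with explicit key column names instead of positional iloc, which may be easier to reuse when the keys are not the first two columns. This is only a sketch: the function `full_outer_join`, its parameters, and the temporary column names `_type` and `_id` are hypothetical and chosen here for illustration.

```python
from functools import reduce

import pandas as pd

df_a = pd.DataFrame({'type_1': [0, 1, 1, 1],
                     'id_1': ['3', '4', '5', '5'],
                     'name_1': ['Alex', 'Amy', 'Allen', 'Jane']})
df_b = pd.DataFrame({'type_2': [1, 1, 1, 0],
                     'id_2': ['4', '5', '5', '7'],
                     'name_2': ['Bill', 'Brian', 'Joe', 'Bryce']})
df_c = pd.DataFrame({'type_3': [1, 0],
                     'id_3': ['4', '7'],
                     'name_3': ['School', 'White']})


def full_outer_join(frames, key_columns, shared=('_type', '_id')):
    """Outer-join frames whose key columns have different names.

    key_columns[i] lists the key columns of frames[i], in the same
    order as the temporary shared names.
    """
    shared = list(shared)
    # Copy each frame's own key columns into identically named columns.
    tagged = [
        df.assign(**{s: df[k] for s, k in zip(shared, keys)})
        for df, keys in zip(frames, key_columns)
    ]
    merged = reduce(
        lambda left, right: left.merge(right, on=shared, how='outer'),
        tagged,
    )
    # The shared columns were only needed for matching.
    return merged.drop(columns=shared)


merged = full_outer_join(
    [df_a, df_b, df_c],
    [['type_1', 'id_1'], ['type_2', 'id_2'], ['type_3', 'id_3']],
)
print(merged)
```

Since the shared key columns are filled from every frame, each later merge can match rows that came from any earlier frame, and the function accepts any number of frames.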