Python Pandas: merging two dataframes with different columns

Notice: this page is adapted from a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/28097222/

Date: 2020-08-19 02:44:05  Source: igfitidea

Pandas merge two dataframes with different columns

python pandas dataframe data-munging

Asked by economy

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.

>df_may

  id  quantity  attr_1  attr_2
0  1        20       0       1
1  2        23       1       1
2  3        19       1       1
3  4        19       0       0

>df_jun

  id  quantity  attr_1  attr_3
0  5         8       1       0
1  6        13       0       1
2  7        20       1       1
3  8        25       1       1

I've tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except "id" like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to "on":

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.

Thanks in advance.

Accepted answer by EdChum

I think in this case concat is what you want:

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want. A column that is absent from one of the dfs is filled with NaN for that df's rows.
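
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the two frames from the question and stacks them with concat. The frame contents are copied from the question; the variable name `combined` is just an illustrative choice. On recent pandas versions the columns should keep their order of first appearance rather than being sorted alphabetically as in the output above, and the columns that receive NaN are stored as floats.

```python
import pandas as pd

# Frames rebuilt from the question's sample data.
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# axis=0 stacks rows; ignore_index=True renumbers the rows 0..7.
# Columns missing from one frame are filled with NaN for that frame's rows.
combined = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(combined)
```

Here attr_2 is NaN for the four June rows and attr_3 is NaN for the four May rows, which matches what the question asks for.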

Answer by Will H

I had this problem today with each of concat, append and merge, and I got around it by adding a sequentially numbered helper column to both frames and then doing an outer join on it:

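# Number the rows of df1 starting at 1, then continue the sequence in df2,
# so that no helper value appears in both frames.
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)
# With no matching helper values, the outer join keeps every row of both frames.
df1.merge(df2, on='helper', how='outer')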
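
To see what the helper-column approach produces, here is a small sketch that applies it to the sample data from the question. The frames are assumed to match the question's df_may and df_jun; the name `merged` is hypothetical. One thing worth noticing is that columns present in both frames (id, quantity, attr_1) come back with pandas' default _x/_y suffixes, so the result is not the single stacked set of columns that concat gives.

```python
import pandas as pd

# Sample frames assumed to match df_may and df_jun from the question.
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df2 = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Helper values 1..4 for df1 and 5..8 for df2, so no key matches across frames.
df1["helper"] = range(1, len(df1) + 1)
df2["helper"] = range(len(df1) + 1, len(df1) + len(df2) + 1)

# The outer join keeps all eight rows; shared column names get _x/_y suffixes.
merged = df1.merge(df2, on="helper", how="outer")
print(merged.columns.tolist())
print(len(merged))
```

If the goal is one id, quantity and attr_1 column with all rows appended, the concat call from the accepted answer is likely the simpler fit.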