Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverFlow original: http://stackoverflow.com/questions/47504975/
Date: 2020-09-14 04:49:17  Source: igfitidea

Joining two dataframes in pandas using full outer join

Tags: python, python-3.x, pandas, join, outer-join

Asked by IMK

I have two dataframes in pandas, as shown below. EmpID is the primary key in both dataframes.

import numpy as np
import pandas as pd

# np.nan is used instead of np.NaN, which was removed in NumPy 2.0
df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan],
     [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

I want to join these two dataframes on EmpID so that:

  1. Missing data in one dataframe is filled with the value from the other dataframe, if it exists and the key matches.
  2. Observations with new keys are appended to the resulting dataframe.

I used the code below to achieve this.

merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])

But this code gives me duplicate columns, which I don't want, so I used only the columns unique to the second table (plus the key) for merging.

ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
# select only the non-overlapping columns (plus the key) from df_second
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])

Now I don't get duplicate columns, but the missing values are not filled in either, even for rows where the key matches.
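The two behaviours described above can be reproduced on a reduced version of the data. This is only a minimal sketch: the three-row frames and the names merged_all, right_only and merged_unique are made up for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Reduced, illustrative versions of the two frames (assumption: not the full data)
df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR'], [3, 'C', 'Finance'], [8, 'B', 'Admin']],
    columns=['EmpID', 'Name', 'Department'])

# Plain outer merge: the shared non-key column 'Name' appears twice,
# with the default suffixes _x (left) and _y (right).
merged_all = pd.merge(df_first, df_second, how='outer', on='EmpID')
print(list(merged_all.columns))

# Outer merge after dropping the shared column from the right frame:
# no duplicate columns, but nothing is left to fill the missing Name
# for EmpID 3, and the new key 8 gets no Name at all.
right_only = ['EmpID'] + list(df_second.columns.difference(df_first.columns))
merged_unique = pd.merge(df_first, df_second[right_only], how='outer', on='EmpID')
print(merged_unique)
```

In other words, merge only aligns rows; it does not combine values from columns that exist on both sides.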

I would really appreciate it if someone could help me with this.

Regards, Kailash Negi

Accepted answer by jezrael

It seems you need combine_first together with set_index, so that rows are matched on an index built from the EmpID column:

df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
   EmpID   Department  Location Name  Salary
0      1           HR     Delhi    A  1000.0
1      2          NaN       NaN    B     NaN
2      3      Finance       NaN    C  3000.0
3      4          NaN       NaN    D  8000.0
4      5  Programming       NaN    E  6000.0
5      8        Admin    Mumbai    B     NaN
6      9          Ops  Banglore    D     NaN
7     10    Analytics    Mumbai    K     NaN
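To see why combine_first fits here, it may help to look at it on a tiny example. The sketch below uses made-up frames named left and right (not from the answer) and deliberately puts a conflicting Name ('Z') in right to show which side wins.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, already indexed by EmpID
left = pd.DataFrame(
    {'Name': ['A', np.nan], 'Salary': [1000.0, 3000.0]},
    index=pd.Index([1, 3], name='EmpID'))
right = pd.DataFrame(
    {'Name': ['Z', 'C', 'B'], 'Department': ['HR', 'Finance', 'Admin']},
    index=pd.Index([1, 3, 8], name='EmpID'))

# combine_first aligns on the union of row and column labels,
# keeps the caller's value wherever it is not null,
# and falls back to the other frame only for missing cells.
combined = left.combine_first(right)
print(combined)
```

Here EmpID 1 keeps Name 'A' from left, EmpID 3 receives 'C' from right, and EmpID 8 is added as a new row with no Salary.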

EDIT:

To get a specific column order, add reindex:

#concatenate all column names together and remove duplicates
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')

df = (df_first.set_index('EmpID')
      .combine_first(df_second.set_index('EmpID'))
      .reset_index()
      .reindex(columns=ColNames))
print (df)
   EmpID Name   Department  Location  Salary
0      1    A           HR     Delhi  1000.0
1      2    B          NaN       NaN     NaN
2      3    C      Finance       NaN  3000.0
3      4    D          NaN       NaN  8000.0
4      5    E  Programming       NaN  6000.0
5      8    B        Admin    Mumbai     NaN
6      9    D          Ops  Banglore     NaN
7     10    K    Analytics    Mumbai     NaN
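For comparison, the same result can probably also be reached while staying with merge, by coalescing the duplicated columns afterwards. This is not what the answer proposes; it is a rough sketch, and the '_second' suffix and the variable names merged and shared are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan],
     [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

# Keep the left column names unchanged and mark the right-hand duplicates.
merged = pd.merge(df_first, df_second, how='outer', on='EmpID',
                  suffixes=('', '_second'))

# For each column present in both frames (except the key), prefer the
# left value and fall back to the right one, then drop the helper column.
shared = df_first.columns.intersection(df_second.columns).drop('EmpID')
for col in shared:
    merged[col] = merged[col].combine_first(merged[col + '_second'])
    merged = merged.drop(columns=col + '_second')

merged = merged.sort_values('EmpID').reset_index(drop=True)
print(merged)
```

The combine_first version is shorter and handles any number of overlapping columns without a loop, which is likely why it was suggested.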