Joining two DataFrames in Pandas using a full outer join

Question

Asked by IMK

I have two DataFrames in pandas, as shown below. EmpID is the primary key in both.

import numpy as np
import pandas as pd
df_first = pd.DataFrame([[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]], columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame([[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan], [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']], columns=['EmpID', 'Name', 'Department', 'Location'])

I want to join these two DataFrames on EmpID so that:

Missing data in one DataFrame is filled with the value from the other, if it exists and the key matches.
Rows with new keys are appended to the resulting DataFrame.

I used the code below to achieve this.

merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])

But this code gives me duplicate columns, which I don't want, so I used only the columns unique to each table for merging.

ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])

Now I don't get duplicate columns, but the missing values are not filled in either for rows where the key matches.

I'd really appreciate it if someone could help me with this.

Regards, Kailash Negi

Answer 1

Accepted answer by jezrael

It seems you need combine_first with set_index, so that rows are matched on the index created from the EmpID column:

df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
   EmpID   Department  Location Name  Salary
0      1           HR     Delhi    A  1000.0
1      2          NaN       NaN    B     NaN
2      3      Finance       NaN    C  3000.0
3      4          NaN       NaN    D  8000.0
4      5  Programming       NaN    E  6000.0
5      8        Admin    Mumbai    B     NaN
6      9          Ops  Banglore    D     NaN
7     10    Analytics    Mumbai    K     NaN
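
To see why this gives the requested behaviour, here is a minimal sketch on two small made-up frames (the `left` and `right` names and their values are illustrative assumptions, not the data from the question). combine_first aligns both frames on the union of their index labels and columns, keeps the caller's non-null values, and only falls back to the other frame where the caller has a missing value or no row at all:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, already indexed by the key column.
left = pd.DataFrame(
    {"Name": ["A", np.nan]},
    index=pd.Index([1, 3], name="EmpID"),
)
right = pd.DataFrame(
    {"Name": ["Z", "C", "K"], "Dept": ["HR", "Finance", "Ops"]},
    index=pd.Index([1, 3, 10], name="EmpID"),
)

# The caller (left) has priority; right only fills gaps and adds new keys.
combined = left.combine_first(right)
print(combined)
```

Key 1 keeps the value from `left` even though `right` disagrees, key 3 is filled from `right` because `left` has NaN there, and key 10 is added because it exists only in `right`.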

EDIT:

To get a specific column order, use reindex:

# concatenate all column names together and remove duplicates
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')

df = (df_first.set_index('EmpID')
      .combine_first(df_second.set_index('EmpID'))
      .reset_index()
      .reindex(columns=ColNames))
print (df)
   EmpID Name   Department  Location  Salary
0      1    A           HR     Delhi  1000.0
1      2    B          NaN       NaN     NaN
2      3    C      Finance       NaN  3000.0
3      4    D          NaN       NaN  8000.0
4      5    E  Programming       NaN  6000.0
5      8    B        Admin    Mumbai     NaN
6      9    D          Ops  Banglore     NaN
7     10    K    Analytics    Mumbai     NaN
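
If you would rather stay with pd.merge, one possible alternative (not part of the answer above, just a sketch) is to keep the outer merge and then coalesce the overlapping columns by hand. The duplicates in the question appear because Name exists in both frames, so merge keeps both copies and tells them apart with suffixes. The `_second` suffix and the `merged` variable name below are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000],
     [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'],
     [3, 'C', 'Finance', np.nan], [9, 'D', 'Ops', 'Banglore'],
     [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

# Outer merge; overlapping columns from df_second get a '_second' suffix.
merged = pd.merge(df_first, df_second, how='outer', on='EmpID',
                  suffixes=('', '_second'))

# For every non-key column present in both frames, fill gaps in the
# df_first copy from the df_second copy, then drop the helper column.
overlap = df_first.columns.intersection(df_second.columns).drop('EmpID')
for col in overlap:
    merged[col] = merged[col].fillna(merged[col + '_second'])
    merged = merged.drop(columns=col + '_second')

merged = merged.sort_values('EmpID').reset_index(drop=True)
print(merged)
```

This is more verbose than combine_first, but it keeps the column order produced by the merge and makes the precedence rule (df_first wins, df_second fills gaps) explicit.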
