Joining two dataframes in pandas using full outer join
Note: this page reproduces a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original URL and authors, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/47504975/
Asked by IMK
I have two dataframes in pandas, as shown below. EmpID is a primary key in both dataframes.
import numpy as np
import pandas as pd
df_first = pd.DataFrame([[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]], columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame([[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan], [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']], columns=['EmpID', 'Name', 'Department', 'Location'])
I want to join these two dataframes on EmpID so that:
- Missing data in one dataframe is filled with the value from the other dataframe, if that value exists and the key matches
- Observations with new keys are appended to the resulting dataframe
I used the code below to try to achieve this.
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])
But this code gives me duplicate columns, which I don't want, so I merged using only the columns of the second table that are not already in the first (plus the key).
# columns that exist only in df_second, plus the join key
ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first, df_second[ColNames], how='outer', on=['EmpID'])
Now I no longer get duplicate columns, but the missing values are still not filled in for observations where the key matches.
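As a minimal sketch of the two symptoms described above, the following reproduces both attempts on the same data. The names merged_all, merged_unique and unique_cols are hypothetical, chosen only for this illustration, and the second attempt assumes the intent was to pass the reduced column list to the merge.

```python
import numpy as np
import pandas as pd

df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan],
     [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

# Attempt 1: plain outer merge. 'Name' exists in both frames, so pandas keeps
# both copies and tells them apart with its default suffixes.
merged_all = pd.merge(df_first, df_second, how='outer', on=['EmpID'])
print(list(merged_all.columns))
# ['EmpID', 'Name_x', 'Salary', 'Name_y', 'Department', 'Location']

# Attempt 2: merge only the columns unique to df_second, plus the key.
unique_cols = list(df_second.columns.difference(df_first.columns)) + ['EmpID']
merged_unique = pd.merge(df_first, df_second[unique_cols], how='outer', on=['EmpID'])

# No duplicate columns now, but df_second's 'Name' was dropped before the merge,
# so the name missing for EmpID 3 has nothing to be filled from.
print(merged_unique.loc[merged_unique['EmpID'] == 3, 'Name'].isna().all())  # True
```

In other words, a merge only places columns side by side; it does not by itself fill nulls in one column from a same-named column in the other frame.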
I'd really appreciate it if someone could help me with this.
Regards, Kailash Negi
Accepted answer by jezrael
It seems you need combine_first with set_index, so that rows are matched on an index created from the EmpID column:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
   EmpID   Department  Location Name  Salary
0      1           HR     Delhi    A  1000.0
1      2          NaN       NaN    B     NaN
2      3      Finance       NaN    C  3000.0
3      4          NaN       NaN    D  8000.0
4      5  Programming       NaN    E  6000.0
5      8        Admin    Mumbai    B     NaN
6      9          Ops  Banglore    D     NaN
7     10    Analytics    Mumbai    K     NaN
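To make the behaviour of combine_first more concrete, here is a small sketch on toy data. The frames left and right and their values are hypothetical and not part of the question; they are picked so that one key has a value on both sides, one has a gap on the left, and one exists only on the right.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, both indexed by the shared key.
left = pd.DataFrame({'Name': ['A', np.nan], 'Salary': [1000.0, 3000.0]},
                    index=pd.Index([1, 3], name='EmpID'))
right = pd.DataFrame({'Name': ['Z', 'C', 'K'], 'Location': ['Delhi', np.nan, 'Mumbai']},
                     index=pd.Index([1, 3, 10], name='EmpID'))

combined = left.combine_first(right)

# Values from `left` win when present; `right` only fills the gaps.
print(combined.loc[1, 'Name'])    # A  (left's value kept, right's 'Z' ignored)
print(combined.loc[3, 'Name'])    # C  (left was NaN, filled from right)

# Keys and columns found only in `right` are added to the result.
print(combined.index.tolist())    # [1, 3, 10]
print(list(combined.columns))     # ['Location', 'Name', 'Salary']
```

Because matching happens on the index (and on column labels), EmpID has to be moved into the index with set_index before the call, and the result comes back with its columns sorted by label.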
EDIT:
If you need a specific column order, use reindex:
# concatenate all column names together and remove duplicates
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
              .combine_first(df_second.set_index('EmpID'))
              .reset_index()
              .reindex(columns=ColNames))
print (df)
   EmpID Name   Department  Location  Salary
0      1    A           HR     Delhi  1000.0
1      2    B          NaN       NaN     NaN
2      3    C      Finance       NaN  3000.0
3      4    D          NaN       NaN  8000.0
4      5    E  Programming       NaN  6000.0
5      8    B        Admin    Mumbai     NaN
6      9    D          Ops  Banglore     NaN
7     10    K    Analytics    Mumbai     NaN
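As a rough end-to-end check, the sketch below rebuilds the result and tests it against the two requirements from the question. The names column_order, result and expected_keys are hypothetical, and building the column order with Index.append is an assumed alternative to the np.concatenate call above, not the answer's own code.

```python
import numpy as np
import pandas as pd

df_first = pd.DataFrame(
    [[1, 'A', 1000], [2, 'B', np.nan], [3, np.nan, 3000], [4, 'D', 8000], [5, 'E', 6000]],
    columns=['EmpID', 'Name', 'Salary'])
df_second = pd.DataFrame(
    [[1, 'A', 'HR', 'Delhi'], [8, 'B', 'Admin', 'Mumbai'], [3, 'C', 'Finance', np.nan],
     [9, 'D', 'Ops', 'Banglore'], [5, 'E', 'Programming', np.nan], [10, 'K', 'Analytics', 'Mumbai']],
    columns=['EmpID', 'Name', 'Department', 'Location'])

# Column order: df_second's columns first, then anything new from df_first.
column_order = df_second.columns.append(df_first.columns).drop_duplicates()

result = (df_first.set_index('EmpID')
          .combine_first(df_second.set_index('EmpID'))
          .reset_index()
          .reindex(columns=column_order))

# Requirement: every key from either frame appears exactly once.
expected_keys = sorted(set(df_first['EmpID']) | set(df_second['EmpID']))
print(result['EmpID'].tolist() == expected_keys)   # True

# Requirement: the name missing in df_first for EmpID 3 is filled from df_second.
print(result.set_index('EmpID').loc[3, 'Name'])    # C
```

Keys that exist in only one frame (2 and 4, or 8, 9 and 10) keep NaN in the columns the other frame would have supplied, which matches the output shown above.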