Python Pandas: How to merge based on an "OR" condition?
Source: http://stackoverflow.com/questions/45869886/
License: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors.
Asked by alwaysaskingquestions
Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber. However, if I simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, how="left", on=['ShipNumber', 'TrackNumber'])
then the values in both the ShipNumber and TrackNumber columns of the two tables MUST MATCH.
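To make the "must match on both" behaviour concrete, here is a minimal sketch. The sample values are made up for illustration and are not from the question; only the column names are.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
tab1 = pd.DataFrame({
    "ShipNumber": [100, 101, 102],
    "TrackNumber": ["T1", "T2", "T3"],
    "Quantity": [5, 6, 7],
})
tab2 = pd.DataFrame({
    "ShipNumber": [100, 101, 999],
    "TrackNumber": ["T1", "XX", "T3"],
    "AmountReceived": [50.0, 60.0, 70.0],
})

# Passing a list of keys means every key has to match (an AND condition).
both_keys = tab1.merge(tab2, how="left", on=["ShipNumber", "TrackNumber"])

# Rows that matched on only one of the two keys end up with NaN here.
matched = both_keys["AmountReceived"].notna().tolist()
print(both_keys)
print(matched)
```

In this sample only the first row agrees on both keys, so the other two rows come back without an AmountReceived value even though each of them shares one key with a row in tab2.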
However, in my case, sometimes the ShipNumber column values will match and sometimes the TrackNumber column values will match; as long as one of the two values matches for a row, I want the merge to happen.
In other words, if the ShipNumber in row 1 of tab1 matches the ShipNumber in row 3 of tab2, but the TrackNumber values of those two records do not match, I still want to match the two rows from the two tables.
So basically this is an either/or match condition (pseudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
then merge
I hope my question makes sense... Any help is really really appreciated!
As suggested, I looked into this post: "Python pandas merge with OR logic". But I don't think it is quite the same issue: the OP of that post has a mapping file, so they can simply do two merges to solve it. I don't have a mapping file; instead, I have two DataFrames with the same key columns (ShipNumber, TrackNumber).
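For reference, the either/or condition from the pseudo code can be written quite literally as a cross join followed by a filter. This is only a brute-force sketch with made-up sample values, not a method proposed in the thread; it assumes pandas 1.2 or later (for how="cross") and builds every row pair, so it is unlikely to scale to large tables.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
tab1 = pd.DataFrame({
    "ShipNumber": [100, 101, 102],
    "TrackNumber": ["T1", "T2", "T3"],
    "Quantity": [5, 6, 7],
})
tab2 = pd.DataFrame({
    "ShipNumber": [100, 101, 999],
    "TrackNumber": ["T1", "XX", "T3"],
    "AmountReceived": [50.0, 60.0, 70.0],
})

# Pair every row of tab1 with every row of tab2 (3 x 3 = 9 candidate pairs).
pairs = tab1.merge(tab2, how="cross", suffixes=("_1", "_2"))

# Keep a pair if either key agrees: the OR condition from the pseudo code.
either = (
    (pairs["ShipNumber_1"] == pairs["ShipNumber_2"])
    | (pairs["TrackNumber_1"] == pairs["TrackNumber_2"])
)
or_matches = pairs.loc[either].reset_index(drop=True)
print(or_matches)
```

Because the filter is applied once per pair, a pair that agrees on both keys appears only once, so no de-duplication step is needed here.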
Accepted answer by andrew_reece
Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).
import pandas as pd

df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})
df1          df2
   A  B         A  B
0  3  7      0  1  4
1  2  8      1  5  1
2  1  9      2  6  8
3  4  5      3  4  5
With these data frames we should see:
df1.loc[2] matches A on df2.loc[0]
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
# Stack the matches on A and the matches on B; keep the result as df for the next step.
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])
df
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
1 4.0 NaN NaN NaN 5.0 5.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.
dupes = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dupes]
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
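Applied to the question's column names, the same merge, concat and drop-duplicates idea could look like the sketch below. It is a variation rather than the answer's exact code: the sample values and the helper columns row_1 and row_2 are made up for illustration, and duplicates are detected by row pair instead of by comparing suffixed columns.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
tab1 = pd.DataFrame({
    "ShipNumber": [100, 101, 102],
    "TrackNumber": ["T1", "T2", "T3"],
    "Quantity": [5, 6, 7],
})
tab2 = pd.DataFrame({
    "ShipNumber": [999, 100, 101],
    "TrackNumber": ["T3", "T1", "XX"],
    "AmountReceived": [70.0, 50.0, 60.0],
})

# Carry each table's original row label along (row_1 / row_2 are made-up names).
left = tab1.reset_index().rename(columns={"index": "row_1"})
right = tab2.reset_index().rename(columns={"index": "row_2"})

# One inner merge per key, keeping only the pair of row labels that matched.
on_ship = left.merge(right, on="ShipNumber")[["row_1", "row_2"]]
on_track = left.merge(right, on="TrackNumber")[["row_1", "row_2"]]

# Stack both sets of matches; a pair that matched on both keys appears twice.
pairs = pd.concat([on_ship, on_track], ignore_index=True).drop_duplicates()

# Join the full rows back onto the unique pairs.
result = (
    pairs.merge(left, on="row_1")
         .merge(right, on="row_2", suffixes=("_1", "_2"))
)
print(result)
```

Tracking row labels keeps the de-duplication independent of the data values, at the cost of two extra helper columns in the result.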
Answer by pratpor
I would suggest an alternative way of doing a merge like this, which seems easier to me: build a single key column that uses ShipNumber when it is present and falls back to TrackNumber otherwise.
table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"],
    axis=1)
You can add the same column to table2 as well if needed, and then pass it as left_on or right_on, depending on your requirement.
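A minimal end-to-end sketch of this idea follows. The sample values are made up, and fillna is used here as a shorter equivalent of the row-wise apply above. Note the assumptions: this only helps when ShipNumber is missing in the rows that should match on TrackNumber, and it assumes ShipNumber and TrackNumber values never collide. It does not cover rows where both keys are present but only one of them agrees.

```python
import pandas as pd

# Hypothetical sample data (values invented for illustration).
table1 = pd.DataFrame({
    "ShipNumber": ["S1", None, "S3"],
    "TrackNumber": ["T1", "T2", None],
    "Quantity": [5, 6, 7],
})
table2 = pd.DataFrame({
    "ShipNumber": ["S1", None, "S3"],
    "TrackNumber": [None, "T2", "T9"],
    "AmountReceived": [50.0, 60.0, 70.0],
})

# Use ShipNumber when present, otherwise fall back to TrackNumber.
for table in (table1, table2):
    table["id_to_be_merged"] = table["ShipNumber"].fillna(table["TrackNumber"])

# The key column has the same name in both tables, so on= is enough here;
# left_on= / right_on= would be needed if the names differed.
merged = table1.merge(
    table2, how="left", on="id_to_be_merged", suffixes=("_1", "_2")
)
print(merged)
```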