Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/45869886/

Time: 2020-09-14 04:18:51  Source: igfitidea

Python Pandas: How to merge based on an "OR" condition?

Tags: python, pandas, dataframe, merge

Asked by alwaysaskingquestions

Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables based on both ShipNumber and TrackNumber. However, if I simply use merge in the following way (pseudo code, not real code):

tab1.merge(tab2, how="left", on=['ShipNumber', 'TrackNumber'])

then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
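To make that behavior concrete, here is a minimal sketch. Only the column names come from the question; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
tab1 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S3"],
    "TrackNumber": ["T1", "T2", "T3"],
    "Quantity": [10, 20, 30],
})
tab2 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S9"],
    "TrackNumber": ["T1", "T8", "T3"],
    "AmountReceived": [100, 200, 300],
})

# Left merge on both keys: a row pairs up only when BOTH values agree.
both = tab1.merge(tab2, how="left", on=["ShipNumber", "TrackNumber"])
print(both)
```

In this sample only the S1/T1 row receives an AmountReceived value. The S2 row (same ShipNumber, different TrackNumber) and the S3 row (same TrackNumber, different ShipNumber) are left unmatched.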

However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.

In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.

So basically this is an either/or match condition (pseudo code):

if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge
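For illustration, the condition above can be written almost literally as a cross join followed by a filter. This is only a sketch: it assumes the tables are small (a cross join builds every pair of rows), it relies on `how="cross"` from pandas 1.2 or later, and the sample values are made up.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
tab1 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S3"],
    "TrackNumber": ["T1", "T2", "T3"],
    "Quantity": [10, 20, 30],
})
tab2 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S9"],
    "TrackNumber": ["T1", "T8", "T3"],
    "AmountReceived": [100, 200, 300],
})

# Build every (tab1 row, tab2 row) pair, then keep the pairs where
# either key agrees. Shared column names get the _1 / _2 suffixes.
pairs = tab1.merge(tab2, how="cross", suffixes=("_1", "_2"))
either = (
    (pairs["ShipNumber_1"] == pairs["ShipNumber_2"])
    | (pairs["TrackNumber_1"] == pairs["TrackNumber_2"])
)
matched = pairs[either].reset_index(drop=True)
print(matched)
```

Each matching pair appears once, even when both keys agree, but the intermediate frame has len(tab1) * len(tab2) rows, so this form is only practical for small inputs.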

I hope my question makes sense... Any help is really really appreciated!

As suggested, I looked into this post: "Python pandas merge with OR logic". But I don't think it is quite the same issue: the OP of that post has a mapping file, so they can simply do two merges to solve it. I don't have a mapping file; rather, I have two DataFrames with the same key columns (ShipNumber, TrackNumber).

Accepted answer by andrew_reece

Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks @Scott Boston for that final step).

import pandas as pd

df1 = pd.DataFrame({'A':[3,2,1,4], 'B':[7,8,9,5]})
df2 = pd.DataFrame({'A':[1,5,6,4], 'B':[4,1,8,5]})

df1         df2
   A  B        A  B
0  3  7     0  1  4
1  2  8     1  5  1
2  1  9     2  6  8
3  4  5     3  4  5

With these data frames we should see:

  • df1.loc[2] matches A on df2.loc[0]
  • df1.loc[1] matches B on df2.loc[2]
  • df1.loc[3] matches both A and B on df2.loc[3]

We'll use suffixes to keep track of what matched where:

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']

# Keep the result in df so the de-duplication step below can use it.
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
1  4.0             NaN             NaN  NaN             5.0             5.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN

Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.

dupes = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dupes]

     A  A_on_B_match_1  A_on_B_match_2    B  B_on_A_match_1  B_on_A_match_2
0  1.0             NaN             NaN  NaN             9.0             4.0
0  NaN             2.0             6.0  8.0             NaN             NaN
1  NaN             4.0             4.0  5.0             NaN             NaN
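The same merge-then-concat idea can be applied to the question's column names. The sketch below is one possible variation, not the answer's exact method: it tracks the original row labels in two helper columns, `row_1` and `row_2` (hypothetical names introduced here), and uses `drop_duplicates()` on those pairs instead of comparing suffixed columns. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
tab1 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S3"],
    "TrackNumber": ["T1", "T2", "T3"],
    "Quantity": [10, 20, 30],
})
tab2 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S9"],
    "TrackNumber": ["T1", "T8", "T3"],
    "AmountReceived": [100, 200, 300],
})

# Keep each table's original row label so a pair found by both merges
# can be recognized as the same pair.
left = tab1.reset_index().rename(columns={"index": "row_1"})
right = tab2.reset_index().rename(columns={"index": "row_2"})

# One inner merge per key; keep only the (row_1, row_2) pairs, then
# drop the pairs that matched on both keys and so appear twice.
# Note: merge treats missing keys (NaN) as equal to each other, so rows
# with missing keys may need to be dropped first.
pairs = pd.concat([
    left.merge(right, on="ShipNumber")[["row_1", "row_2"]],
    left.merge(right, on="TrackNumber")[["row_1", "row_2"]],
]).drop_duplicates()

# Attach the full rows from both tables to each surviving pair.
matched = (
    pairs.merge(left, on="row_1")
         .merge(right, on="row_2", suffixes=("_1", "_2"))
         .sort_values(["row_1", "row_2"])
         .reset_index(drop=True)
)
print(matched)
```

Working with row-label pairs keeps the final frame free of the half-empty suffixed columns that a direct concat of the two merges produces.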

Answer by pratpor

I would suggest this alternative way of doing a merge like this; it seems easier to me.

table1["id_to_be_merged"] = table1.apply(
    lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)

You can add the same column to table2 as well if needed, and then pass it to left_on or right_on based on your requirement.
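A possible end-to-end sketch of this approach follows. It swaps the row-wise apply for `fillna()`, which should give the same column, and it assumes a missing ShipNumber is stored as a null value. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
table1 = pd.DataFrame({
    "ShipNumber": ["S1", None, "S3"],
    "TrackNumber": ["T1", "T2", "T3"],
    "Quantity": [10, 20, 30],
})
table2 = pd.DataFrame({
    "ShipNumber": ["S1", None, "S5"],
    "TrackNumber": ["T9", "T2", "T3"],
    "AmountReceived": [100, 200, 300],
})

# Use ShipNumber where present, otherwise fall back to TrackNumber.
for t in (table1, table2):
    t["id_to_be_merged"] = t["ShipNumber"].fillna(t["TrackNumber"])

merged = table1.merge(table2, on="id_to_be_merged", how="left",
                      suffixes=("_1", "_2"))
print(merged)
```

In this sample the first row matches through ShipNumber and the second through TrackNumber, but the third stays unmatched even though both tables share TrackNumber T3, because each side has a different non-null ShipNumber. The combined key therefore covers the case where one key is missing, not a general either/or match.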