Python: Pandas left outer join results in a table larger than the left table

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must likewise follow the CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverflow. Original question: http://stackoverflow.com/questions/22720739/

Date: 2020-08-19 01:28:56  Source: igfitidea

Pandas Left Outer Join results in table larger than left table

Tags: python, pandas

Asked by Terence Chow

From what I understand about a left outer join, the resulting table should never have more rows than the left table...Please let me know if this is wrong...


My left table is 192572 rows and 8 columns.


My right table is 42160 rows and 5 columns.


My Left table has a field called 'id' which matches with a column in my right table called 'key'.


Therefore I merge them as such:


combined = pd.merge(a,b,how='left',left_on='id',right_on='key')

But then the combined shape is 236569.


What am I misunderstanding?

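
A quick way to confirm that duplicate keys are the cause is to count how often each key appears in the right table before merging. This sketch uses the frame names `a` and `b` from the question, but with small illustrative frames rather than the real data:

```python
import pandas as pd

# Small frames standing in for the question's `a` (left) and `b` (right)
a = pd.DataFrame({'id': [1, 2, 3], 'x': ['p', 'q', 'r']})
b = pd.DataFrame({'key': [1, 1, 2], 'y': [10, 11, 20]})

# Any key count > 1 in the right table inflates the left-join result
dup_keys = b['key'].value_counts()
print(dup_keys[dup_keys > 1])

combined = pd.merge(a, b, how='left', left_on='id', right_on='key')
print(len(a), len(combined))  # left join grows: 3 rows in, 4 rows out
```

Here `id` 1 matches two rows of `b`, so that left row is repeated once in the output, which is exactly the 192572 → 236569 growth seen in the question, just at a smaller scale.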

Accepted answer by Andy Hayden

You can expect this to increase if keys match more than one row in the other DataFrame:


In [11]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

In [12]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

In [13]: df.merge(df2, how='left')  # merges on columns A
Out[13]: 
   A  B   C
0  1  3   5
1  1  3   6
2  2  4 NaN

To avoid this behaviour, drop the duplicates in df2:


In [21]: df2.drop_duplicates(subset=['A'])  # keep='first' by default; keep='last' keeps the last duplicate instead
Out[21]: 
   A  C
0  1  5

In [22]: df.merge(df2.drop_duplicates(subset=['A']), how='left')
Out[22]: 
   A  B   C
0  1  3   5
1  2  4 NaN
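
The console session above predates current pandas; `take_last=True` has since been removed in favour of `keep='last'`. A self-contained sketch of the de-duplicate-then-merge approach with the modern spelling:

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])
df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

# take_last=True was removed from pandas; keep='last' is the modern spelling
deduped = df2.drop_duplicates(subset=['A'], keep='last')
merged = df.merge(deduped, how='left')  # merges on the shared column A
print(merged)
#    A  B    C
# 0  1  3  6.0
# 1  2  4  NaN
```

Note that C becomes float because the unmatched row introduces a NaN.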

Answer by seeiespi

There are also strategies for avoiding this behavior that don't involve losing the duplicated data, for example when not all columns are duplicated. If you have


In [1]: df = pd.DataFrame([[1, 3], [2, 4]], columns=['A', 'B'])

In [2]: df2 = pd.DataFrame([[1, 5], [1, 6]], columns=['A', 'C'])

One way would be to take the mean of the duplicate (can also take the sum, etc...)


In [3]: df3 = df2.groupby('A').mean().reset_index()

In [4]: df3
Out[4]: 
   A    C
0  1  5.5

In [5]: merged = pd.merge(df,df3,on=['A'], how='outer')

In [6]: merged
Out[6]: 
   A  B    C
0  1  3  5.5
1  2  4  NaN

Alternatively, if you have non-numeric data that cannot be converted using pd.to_numeric() or if you simply do not want to take the mean, you can alter the merging variable by enumerating the duplicates. However, this strategy would apply when the duplicates exist in both datasets (which would cause the same problematic behavior and is also a common problem):


In [7]: import numpy as np

In [8]: df = pd.DataFrame([['a', 3], ['b', 4], ['b', 0]], columns=['A', 'B'])

In [9]: df2 = pd.DataFrame([['a', 3], ['b', 8], ['b', 5]], columns=['A', 'C'])

In [10]: df['count'] = df.groupby('A')['B'].cumcount()

In [11]: df['A'] = np.where(df['count'] > 0, df['A'] + df['count'].astype(str), df['A'].astype(str))

In [12]: df
Out[12]: 
    A  B  count
0   a  3      0
1   b  4      0
2  b1  0      1

Do the same for df2, drop the count variables in df and df2 and merge on 'A':


In [16]: merged
Out[16]: 
    A  B  C
0   a  3  3        
1   b  4  8        
2  b1  0  5        

A couple of notes. In this last case I use .cumcount() instead of .duplicated() because you may have more than one duplicate for a given observation. I also use .astype(str) to convert the count values to strings because I use the np.where() command, but using pd.concat() or something else might allow for different applications.


Finally, if it is the case that only one dataset has the duplicates but you still want to keep them then you can use the first half of the latter strategy to differentiate the duplicates in the resulting merge.

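
The middle steps of that session (applying the same enumeration to df2, dropping the count columns, and merging) are elided above. An end-to-end sketch of the whole strategy, using a small hypothetical helper `enumerate_duplicates` to avoid repeating the steps for each frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 3], ['b', 4], ['b', 0]], columns=['A', 'B'])
df2 = pd.DataFrame([['a', 3], ['b', 8], ['b', 5]], columns=['A', 'C'])

def enumerate_duplicates(frame, key, value_col):
    # Append the duplicate index to the key so repeated keys become
    # unique: 'b', 'b' -> 'b', 'b1'
    out = frame.copy()
    count = out.groupby(key)[value_col].cumcount()
    out[key] = np.where(count > 0, out[key] + count.astype(str), out[key])
    return out

merged = pd.merge(enumerate_duplicates(df, 'A', 'B'),
                  enumerate_duplicates(df2, 'A', 'C'),
                  on='A', how='left')
print(merged)
#     A  B  C
# 0   a  3  3
# 1   b  4  8
# 2  b1  0  5
```

Because both frames enumerate their duplicates in the same order, every key is unique on each side and the left join no longer multiplies rows.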

Answer by Tobias Dekker

A small addition on the given answers is that there is a parameter named validate which can be used to throw an error if there are duplicated IDs matched in the right table:


combined = pd.merge(a, b, how='left', left_on='id', right_on='key', validate='m:1')
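
With `validate='m:1'`, pandas checks that the merge keys are unique in the right table and raises `pandas.errors.MergeError` if they are not. A small demonstration (the frames below are illustrative, not the question's real data):

```python
import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'x': ['p', 'q']})
b = pd.DataFrame({'key': [1, 1], 'y': [10, 11]})  # key 1 is duplicated

try:
    pd.merge(a, b, how='left', left_on='id', right_on='key', validate='m:1')
except pd.errors.MergeError as e:
    print('MergeError:', e)
```

This turns the silent row-count inflation from the original question into an explicit, early failure.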