Note: this page is based on a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/23284409/

Time: 2020-08-19 02:38:27  Source: igfitidea

How to subtract rows of one pandas data frame from another?

python merge pandas

Asked by Roman

The operation that I want to do is similar to a merge. For example, with an inner merge we get a data frame that contains the rows that are present in the first AND the second data frame. With an outer merge we get a data frame that contains the rows that are present EITHER in the first OR in the second data frame.

What I need is a data frame that contains the rows that are present in the first data frame AND NOT present in the second one. Is there a fast and elegant way to do it?

Accepted answer by Karl D.

How about something like the following?

print(df1)

    Team  Year  foo
0   Hawks  2001    5
1   Hawks  2004    4
2    Nets  1987    3
3    Nets  1988    6
4    Nets  2001    8
5    Nets  2000   10
6    Heat  2004    6
7  Pacers  2003   12

print(df2)

    Team  Year  foo
0  Pacers  2003   12
1    Heat  2004    6
2    Nets  1988    6

As long as there is a non-key column with the same name in both frames, you can let the suffixes added by the merge do the work (if there is no common non-key column, you could create one to use temporarily: df1['common'] = 1 and df2['common'] = 1):

new = df1.merge(df2,on=['Team','Year'],how='left')
print(new[new.foo_y.isnull()])

     Team  Year  foo_x  foo_y
0  Hawks  2001      5    NaN
1  Hawks  2004      4    NaN
2   Nets  1987      3    NaN
4   Nets  2001      8    NaN
5   Nets  2000     10    NaN
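As a minimal sketch of the temporary-column variant mentioned above, assume the two frames share only the key columns Team and Year (the foo column is left out here purely for illustration). The column name common and the variable names merged and only_in_df1 are illustrative choices.

```python
import pandas as pd

# Frames that share only the key columns (foo is omitted for illustration).
df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
})

# Temporary non-key column present in both frames; the name is arbitrary.
df1["common"] = 1
df2["common"] = 1

# The left merge suffixes the clashing column as common_x / common_y.
merged = df1.merge(df2, on=["Team", "Year"], how="left")

# Rows without a partner in df2 have NaN in the right-hand copy (common_y).
only_in_df1 = merged[merged["common_y"].isnull()][["Team", "Year"]]
print(only_in_df1)
```

The temporary column can be dropped from df1 and df2 afterwards, for example with drop(columns="common").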

Or you can use isin, but you would have to create a single key:

df1['key'] = df1['Team'] + df1['Year'].astype(str)
df2['key'] = df2['Team'] + df2['Year'].astype(str)
print(df1[~df1.key.isin(df2.key)])

    Team  Year  foo        key
0  Hawks  2001    5  Hawks2001
1  Hawks  2004    4  Hawks2004
2   Nets  1987    3   Nets1987
4   Nets  2001    8   Nets2001
5   Nets  2000   10   Nets2000
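A possible alternative to concatenating the keys into a string, which is not part of the original answer, is to compare the key pairs as tuples through a MultiIndex. The sketch below assumes a pandas version that provides pd.MultiIndex.from_frame; the names keys, mask and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

keys = ["Team", "Year"]

# Build one (Team, Year) tuple per row and test membership against df2's tuples.
left_keys = pd.MultiIndex.from_frame(df1[keys])
right_keys = pd.MultiIndex.from_frame(df2[keys])
mask = left_keys.isin(right_keys)  # boolean array, one entry per df1 row

# Keep the df1 rows whose key pair does not occur in df2.
result = df1[~mask]
print(result)
```

Comparing tuples avoids adding a helper column and sidesteps accidental collisions between concatenated strings.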

Answer by RockyRollinghills

You could get incorrect results if your non-key column has cells with NaN.

print(df1)

    Team   Year  foo
0   Hawks  2001    5
1   Hawks  2004    4
2    Nets  1987    3
3    Nets  1988    6
4    Nets  2001    8
5    Nets  2000   10
6    Heat  2004    6
7  Pacers  2003   12
8 Problem  2112  NaN


print(df2)

     Team  Year  foo
0  Pacers  2003   12
1    Heat  2004    6
2    Nets  1988    6
3 Problem  2112  NaN

new = df1.merge(df2,on=['Team','Year'],how='left')
print(new[new.foo_y.isnull()])

      Team  Year  foo_x  foo_y
0    Hawks  2001      5    NaN
1    Hawks  2004      4    NaN
2     Nets  1987      3    NaN
4     Nets  2001      8    NaN
5     Nets  2000     10    NaN
8  Problem  2112    NaN    NaN

The problem team in 2112 has no value for foo in either table. So, the left join here will falsely return that row, which matches in both DataFrames, as not being present in the right DataFrame.

Solution:

What I do is add a uniquely named column to the right DataFrame (df2) and set a value for all of its rows. Then, after the left join, that column is NaN exactly for the rows that have no match in df2, so you can check it to find the records that exist only in the left DataFrame (df1).

df2['in_df2']='yes'

print(df2)

     Team  Year  foo  in_df2
0  Pacers  2003   12     yes
1    Heat  2004    6     yes
2    Nets  1988    6     yes
3 Problem  2112  NaN     yes


new = df1.merge(df2,on=['Team','Year'],how='left')
print(new[new.in_df2.isnull()])

    Team  Year  foo_x  foo_y in_df2
0  Hawks  2001      5    NaN    NaN
1  Hawks  2004      4    NaN    NaN
2   Nets  1987      3    NaN    NaN
4   Nets  2001      8    NaN    NaN
5   Nets  2000     10    NaN    NaN

NB: The Problem row is now correctly filtered out, because it has a value for in_df2:

8  Problem  2112    NaN    NaN    yes
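The difference between the two checks can be reproduced with a short, self-contained sketch. It rebuilds the example frames including the Problem row, and uses df2.assign to add the marker column without modifying df2 in place, which is a small deviation from the answer's code. The names keys, naive, marked and only_in_df1 are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers", "Problem"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003, 2112],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12, np.nan],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets", "Problem"],
    "Year": [2003, 2004, 1988, 2112],
    "foo": [12, 6, 6, np.nan],
})

keys = ["Team", "Year"]

# Naive check on foo_y: the Problem row matches on the keys, but its foo is
# NaN in df2, so it is wrongly reported as missing from df2.
naive = df1.merge(df2, on=keys, how="left")
naive = naive[naive["foo_y"].isnull()]

# Marker column: every df2 row carries a non-null value, so NaN after the
# left join can only mean "no matching key in df2".
marked = df2.assign(in_df2="yes")
new = df1.merge(marked, on=keys, how="left")
only_in_df1 = new[new["in_df2"].isnull()]

print(naive)
print(only_in_df1)
```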

Answer by Chirag Chhatbar

Consider the following:

  1. df_one is the first DataFrame
  2. df_two is the second DataFrame

Present in the first DataFrame and not in the second DataFrame

Solution: by index

df = df_one[~df_one.index.isin(df_two.index)]

index can be replaced by the column on which you wish to do the exclusion. In the example above, I've used the index as the reference between the two DataFrames.

Additionally, you can build a more complex query from boolean pandas.Series to solve the above.
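A rough sketch of the three variants, reusing the Team/Year/foo data from the accepted answer as an assumption (this answer does not define its own data). Note that with default integer indexes the index-based check compares row labels rather than row contents, and that an isin check on one column ignores the other columns. The names not_in_two, recent, by_index, by_team and combined are illustrative.

```python
import pandas as pd

df_one = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df_two = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

# By index: only meaningful when the index identifies a record. With the
# default RangeIndex this simply drops the labels 0, 1 and 2.
by_index = df_one[~df_one.index.isin(df_two.index)]

# By column: keep rows whose Team never appears in df_two.
not_in_two = ~df_one["Team"].isin(df_two["Team"])
by_team = df_one[not_in_two]

# More complex query: combine boolean Series with & (and), | (or), ~ (not).
recent = df_one["Year"] >= 2004
combined = df_one[not_in_two & recent]

print(by_index)
print(by_team)
print(combined)
```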

Answer by KUMN

I suggest using the 'indicator' parameter of merge. Also, if 'on' is None, the join keys default to the intersection of the columns in both DataFrames.

new = df1.merge(df2,how='left', indicator=True) # adds a new column '_merge'
new = new[(new['_merge']=='left_only')].copy() #rows only in df1 and not df2
new = new.drop(columns='_merge').copy()

    Team    Year    foo
0   Hawks   2001    5
1   Hawks   2004    4
2   Nets    1987    3
4   Nets    2001    8
5   Nets    2000    10

Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html

indicator : boolean or string, default False

If True, adds a column to the output DataFrame called "_merge" with information on the source of each row. The information column is Categorical-type and takes on a value of "left_only" for observations whose merge key only appears in the 'left' DataFrame, "right_only" for observations whose merge key only appears in the 'right' DataFrame, and "both" if the observation's merge key is found in both.
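The indicator approach can be wrapped in a small helper. The sketch below is one possible way to do it, not the answer's exact code: the function name anti_join is hypothetical, and it merges only on explicitly named key columns (rather than relying on the default of all common columns) so that non-key columns such as foo play no part in the match.

```python
import pandas as pd


def anti_join(left, right, on):
    """Return the rows of `left` whose key columns `on` have no match in `right`.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    # Only the key columns of `right` are merged, so no suffixed columns appear.
    merged = left.merge(right[on], on=on, how="left", indicator=True)
    # "left_only" marks rows whose key was found only in `left`.
    only_left = merged.loc[merged["_merge"] == "left_only"]
    return only_left.drop(columns="_merge")


df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

new = anti_join(df1, df2, on=["Team", "Year"])
print(new)
```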