How to subtract rows of one pandas data frame from another?
Original question: http://stackoverflow.com/questions/23284409/
License: provided under CC BY-SA 4.0. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Roman
The operation that I want to do is similar to a merge. For example, with an inner merge we get a data frame that contains the rows that are present in the first AND the second data frame. With an outer merge we get a data frame that contains the rows that are present EITHER in the first OR in the second data frame.
What I need is a data frame that contains the rows that are present in the first data frame AND NOT present in the second one. Is there a fast and elegant way to do it?
Accepted answer by Karl D.
How about something like the following?
print(df1)
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
print(df2)
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
As long as there is a commonly named non-key column, you can let the added suffixes do the work (if there is no common non-key column, you could create one to use temporarily, e.g. df1['common'] = 1 and df2['common'] = 1):
new = df1.merge(df2, on=['Team', 'Year'], how='left')
print(new[new.foo_y.isnull()])
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
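For reference, here is a self-contained sketch of the left-merge approach above in Python 3 syntax. The frames are rebuilt from the printed tables, and the name only_in_df1 is an illustrative choice, not something from the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

# Left merge on the key columns; the shared non-key column "foo" is
# split into foo_x (from df1) and foo_y (from df2).
new = df1.merge(df2, on=["Team", "Year"], how="left")

# Rows of df1 with no partner in df2 have NaN in foo_y.
only_in_df1 = new[new["foo_y"].isnull()]
print(only_in_df1)
```

Note that this relies on foo never being NaN in df2 for matching rows.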
Or you can use isin, but you would have to create a single key:
df1['key'] = df1['Team'] + df1['Year'].astype(str)
df2['key'] = df2['Team'] + df2['Year'].astype(str)
print(df1[~df1.key.isin(df2.key)])
Team Year foo key
0 Hawks 2001 5 Hawks2001
1 Hawks 2004 4 Hawks2004
2 Nets 1987 3 Nets1987
4 Nets 2001 8 Nets2001
5 Nets 2000 10 Nets2000
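One caveat with a concatenated string key is that distinct key pairs can collide: a team named "Nets1" in year 987 and a team named "Nets" in year 1987 would both become "Nets1987". A possible alternative, sketched here and not part of the original answer, compares the (Team, Year) pairs directly through a MultiIndex, so no helper column is needed. The names keys, k1, k2 and result are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

keys = ["Team", "Year"]

# Build one MultiIndex per frame from the key columns only.
k1 = pd.MultiIndex.from_frame(df1[keys])
k2 = pd.MultiIndex.from_frame(df2[keys])

# k1.isin(k2) is a boolean array, True where the (Team, Year) pair
# also occurs in df2; negate it to keep the rows unique to df1.
result = df1[~k1.isin(k2)]
print(result)
```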
Answer by RockyRollinghills
You could run into errors if your non-index column has cells with NaN.
print(df1)
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
3 Nets 1988 6
4 Nets 2001 8
5 Nets 2000 10
6 Heat 2004 6
7 Pacers 2003 12
8 Problem 2112 NaN
print(df2)
Team Year foo
0 Pacers 2003 12
1 Heat 2004 6
2 Nets 1988 6
3 Problem 2112 NaN
new = df1.merge(df2, on=['Team', 'Year'], how='left')
print(new[new.foo_y.isnull()])
Team Year foo_x foo_y
0 Hawks 2001 5 NaN
1 Hawks 2004 4 NaN
2 Nets 1987 3 NaN
4 Nets 2001 8 NaN
5 Nets 2000 10 NaN
8 Problem 2112 NaN NaN
The problem team in 2112 has no value for foo in either table. So, the left join here will falsely return that row, which matches in both DataFrames, as not being present in the right DataFrame.
Solution:
What I do is to add a unique column to the inner DataFrame and set a value for all rows. Then when you join, you can check to see if that column is NaN for the inner table to find unique records in the outer table.
df2['in_df2'] = 'yes'
print(df2)
Team Year foo in_df2
0 Pacers 2003 12 yes
1 Heat 2004 6 yes
2 Nets 1988 6 yes
3 Problem 2112 NaN yes
new = df1.merge(df2, on=['Team', 'Year'], how='left')
print(new[new.in_df2.isnull()])
Team Year foo_x foo_y in_df2
0 Hawks 2001 5 NaN NaN
1 Hawks 2004 4 NaN NaN
2 Nets 1987 3 NaN NaN
4 Nets 2001 8 NaN NaN
5 Nets 2000 10 NaN NaN
NB. The problem row is now correctly filtered out, because it has a value for in_df2:
8 Problem 2112 NaN NaN yes
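A reduced, runnable sketch of the difference between the two filters follows. The three-row frames are a cut-down version of the tables above, and the names marked, naive and correct are illustrative; using assign on a copy instead of modifying df2 in place is a choice made here, not something the answer prescribes.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Heat", "Problem"],
    "Year": [2001, 2004, 2112],
    "foo": [5, 6, np.nan],
})
df2 = pd.DataFrame({
    "Team": ["Heat", "Problem"],
    "Year": [2004, 2112],
    "foo": [6, np.nan],
})

# Add the marker column on a copy so df2 itself stays unchanged.
marked = df2.assign(in_df2="yes")
new = df1.merge(marked, on=["Team", "Year"], how="left")

# Filtering on foo_y wrongly keeps the Problem row, whose foo is NaN
# in df2 even though the row matched.
naive = new[new["foo_y"].isnull()]

# Filtering on the marker keeps only rows that have no match in df2.
correct = new[new["in_df2"].isnull()]

print(naive)
print(correct)
```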
Answer by Chirag Chhatbar
Consider the following:
- df_one is first DataFrame
- df_two is second DataFrame
Present in first DataFrame and not in second DataFrame
Solution: by index
df = df_one[~df_one.index.isin(df_two.index)]
index can be replaced by whichever column you wish to use for the exclusion. In the example above, I've used the index as the reference between the two DataFrames.
Additionally, you can build a more complex filter by combining several boolean pandas.Series masks.
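A small sketch of these variants, using made-up frames with a hypothetical id column (neither the data nor the column names come from the answer):

```python
import pandas as pd

df_one = pd.DataFrame(
    {"id": [1, 2, 3, 4], "val": ["a", "b", "c", "d"]},
    index=[10, 11, 12, 13],
)
df_two = pd.DataFrame(
    {"id": [2, 4], "val": ["b", "d"]},
    index=[11, 99],
)

# Exclude rows whose index label also appears in df_two.
by_index = df_one[~df_one.index.isin(df_two.index)]

# Exclude rows whose "id" value also appears in df_two.
by_column = df_one[~df_one["id"].isin(df_two["id"])]

# Combine both conditions into one boolean mask with &.
mask = ~df_one["id"].isin(df_two["id"]) & ~df_one.index.isin(df_two.index)
combined = df_one[mask]

print(by_index)
print(by_column)
print(combined)
```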
Answer by KUMN
I suggest using the indicator parameter of merge. Also, if on is None, it defaults to the intersection of the columns in both DataFrames.
new = df1.merge(df2,how='left', indicator=True) # adds a new column '_merge'
new = new[(new['_merge']=='left_only')].copy() #rows only in df1 and not df2
new = new.drop(columns='_merge').copy()
Team Year foo
0 Hawks 2001 5
1 Hawks 2004 4
2 Nets 1987 3
4 Nets 2001 8
5 Nets 2000 10
Reference: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
Information column is Categorical-type and takes on a value of
“left_only” for observations whose merge key only appears in 'left' DataFrame,
“right_only” for observations whose merge key only appears in 'right' DataFrame,
and “both” if the observation's merge key is found in both.
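To make the quoted description concrete, here is a self-contained sketch with the frames rebuilt from the tables earlier on this page. No on argument is passed, so the merge uses every shared column (Team, Year and foo) as the key; the names merged, n_left_only, n_both and only_in_df1 are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Team": ["Hawks", "Hawks", "Nets", "Nets", "Nets", "Nets", "Heat", "Pacers"],
    "Year": [2001, 2004, 1987, 1988, 2001, 2000, 2004, 2003],
    "foo": [5, 4, 3, 6, 8, 10, 6, 12],
})
df2 = pd.DataFrame({
    "Team": ["Pacers", "Heat", "Nets"],
    "Year": [2003, 2004, 1988],
    "foo": [12, 6, 6],
})

# indicator=True adds a categorical "_merge" column to the result.
merged = df1.merge(df2, how="left", indicator=True)

# Count how many rows fall into each category of interest.
n_left_only = int((merged["_merge"] == "left_only").sum())
n_both = int((merged["_merge"] == "both").sum())

# Keep the rows found only in df1 and drop the helper column.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

Because foo is part of the key here, a row whose Team and Year match but whose foo differs would count as left_only; passing on=['Team', 'Year'] restricts the comparison to those two columns.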