Python: how to remove one pandas DataFrame from another DataFrame
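Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use it, you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/37313691/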
How to remove a pandas dataframe from another dataframe
Asked by 176coding
How can I remove one pandas DataFrame from another, just like set subtraction:
a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]
Now we have two pandas DataFrames. How do we remove df2 from df1?
In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
   a  b
0  1  2
1  3  4
2  5  6
In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
   a  b
0  1  2
1  5  6
Then we expect the result of df1-df2 to be:
In [14]: df
Out[14]:
   a  b
0  3  4
How can I do this?
Thank you.
Answered by piRSquared
Solution
Use pd.concat followed by drop_duplicates(keep=False):
pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
The result looks like:
   a  b
1  3  4
Explanation
pd.concat adds the two DataFrames together by appending one right after the other. If there is any overlap, it will be captured by the drop_duplicates method. However, drop_duplicates by default keeps the first observation and removes every other observation. In this case, we want every duplicate removed, hence the keep=False parameter, which does exactly that.
A special note on the repeated df2. With only one df2, any row in df2 that is not in df1 won't be considered a duplicate and will remain. The solution with only one df2 works only when df2 is a subset of df1. However, if we concat df2 twice, every row of df2 is guaranteed to be a duplicate and will subsequently be removed.
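To make the point about the repeated df2 concrete, here is a minimal sketch. The df2 used here is a hypothetical variant (not the one from the question) that contains a row, (7, 8), which does not occur in df1.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# Hypothetical df2 for illustration: the row (7, 8) does not occur in df1.
df2 = pd.DataFrame([[1, 2], [7, 8]], columns=['a', 'b'])

# One copy of df2: (7, 8) occurs only once in the concatenation, so it survives.
once = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Two copies of df2: every row of df2 occurs at least twice, so all are dropped.
twice = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(once.values.tolist())
print(twice.values.tolist())
```

One thing to keep in mind with this approach: rows that are duplicated within df1 itself are also removed by keep=False, even if they never occur in df2.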
Answered by Stefan
You can use .duplicated, which has the benefit of being fairly expressive:
%%timeit
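# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
combined = pd.concat([df1, df2])
# Flag duplicated rows rather than duplicated index labels.
combined[~combined.duplicated(keep=False)]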
1000 loops, best of 3: 875 μs per loop
For comparison:
%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
100 loops, best of 3: 4.57 ms per loop
%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
1000 loops, best of 3: 987 μs per loop
%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]
1000 loops, best of 3: 546 μs per loop
In sum, the np.array comparison is the fastest of these. The .tolist() calls are not needed there.
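One caveat is worth checking before dropping .tolist(): on a NumPy array, the `in` operator is evaluated as `(array == candidate).any()`, so it reports a match when any single element matches, not only when a whole row matches. The sketch below uses made-up candidate rows (not data from the question) to show the difference.

```python
import numpy as np

rows = np.array([[1, 2], [5, 6]])

# Neither [1, 6] nor [1, 9] is a row of `rows`, yet both tests succeed,
# because `in` on an ndarray evaluates (rows == candidate).any().
partial_match = np.array([1, 6]) in rows
single_match = np.array([1, 9]) in rows

# Converting to nested lists compares whole rows instead.
exact_match = [1, 6] in rows.tolist()

print(partial_match, single_match, exact_match)
```

For the data in the question both variants happen to give the same result, since the row (3, 4) shares no element position with any row of df2.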
Answered by knagaev
My attempt, merging df1 and df2 from the question.
Using the 'indicator' parameter:
In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]:
   a  b
1  3  4
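The one-liner above can be unpacked into separate steps. This is only a sketch of the same idea; the drop_duplicates() call on df2 is an added assumption, so that repeated rows in df2 cannot multiply rows of df1 during the merge.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Left merge keeps every row of df1; indicator=True adds a '_merge' column
# holding 'both' where the (a, b) pair also occurs in df2, else 'left_only'.
merged = df1.merge(df2.drop_duplicates(), on=['a', 'b'], how='left', indicator=True)

# Keep the rows found only in df1, then drop the helper column.
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
```

Note that the merged frame gets a fresh integer index, so the result's index matches df1's only when df1 uses the default index.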
Answered by piRSquared
A set logic approach. Turn the rows of df1 and df2 into sets. Then use set subtraction to define a new DataFrame:
idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)
pd.DataFrame(list(idx1 - idx2), columns=df1.columns)
   a  b
0  3  4
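Because Python sets are unordered, the set subtraction above does not keep the row order or the original index of df1. If that matters, one possible variant (a sketch, not part of the original answer) builds a MultiIndex from each frame and uses isin to get a boolean mask instead.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Treat each row as one (a, b) key.
keys1 = pd.MultiIndex.from_frame(df1)
keys2 = pd.MultiIndex.from_frame(df2)

# True where a row of df1 also occurs as a row of df2.
mask = keys1.isin(keys2)

# Keep the rows of df1 that are absent from df2; order and index are preserved.
result = df1[~mask]
print(result)
```

This assumes df1 and df2 have the same columns in the same order.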
Answered by piRSquared
A masking approach
df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]
   a  b
1  3  4
Answered by Peter Abdou
I think the first tolist() needs to be removed, but keep the second one:
df1[df1.apply(lambda x: x.values() not in df2.values.tolist(), axis=1)]
Answered by frozen shine
The easiest option is to use indexes.
Append df2 to df1 and reset the index:
df = pd.concat([df1, df2])
df.reset_index(drop=True, inplace=True)
Then build a mask for the rows whose values occur in df2 and keep the rest, e.g.:
# Note: each column is tested independently, so this is not an exact row-wise match.
mask_df2 = df["a"].isin(df2["a"]) & df["b"].isin(df2["b"])
result_index = df.index[~mask_df2]
result_data = df.iloc[result_index, :]
Hope this helps new readers, even though the question was posted a while ago :)