Python: How to remove one pandas DataFrame from another

Question

Asked by 176coding

How can I remove one pandas DataFrame from another, just like set subtraction:

a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]

Now we have two pandas DataFrames. How do we remove df2 from df1?

In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
   a  b
0  1  2
1  3  4
2  5  6


In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
   a  b
0  1  2
1  5  6

Then we expect the result of df1 - df2 to be:

In [14]: df
Out[14]:
   a  b
0  3  4

How can I do this?

Thank you.

Answer 1

Answered by piRSquared

Solution

Use pd.concat followed by drop_duplicates(keep=False):

pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

The result looks like:

   a  b
1  3  4

Explanation

pd.concat adds the DataFrames together by appending one right after the other. If there is any overlap, it will be caught by the drop_duplicates method. However, drop_duplicates by default keeps the first occurrence and removes every other one. In this case, we want every duplicate removed, which is exactly what the keep=False parameter does.

A special note on the repeated df2. With only one copy of df2, any row in df2 that is not in df1 won't be considered a duplicate and will remain. The solution with a single df2 only works when df2 is a subset of df1. However, if we concat df2 twice, every row of df2 is guaranteed to be a duplicate and will subsequently be removed.
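
The effect of repeating df2 can be checked with a small sketch. The frame df2_extra below is a hypothetical variant of df2 (not from the question) containing a row, (7, 8), that does not occur in df1:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
# Hypothetical df2 with a row (7, 8) that does not appear in df1.
df2_extra = pd.DataFrame([[1, 2], [7, 8]], columns=["a", "b"])

# One copy of df2: (7, 8) occurs only once, so it is not a duplicate and survives.
once = pd.concat([df1, df2_extra]).drop_duplicates(keep=False)

# Two copies of df2: every df2 row now occurs at least twice and is dropped.
twice = pd.concat([df1, df2_extra, df2_extra]).drop_duplicates(keep=False)

print(once.values.tolist())   # rows left when df2 is concatenated once
print(twice.values.tolist())  # rows left when df2 is concatenated twice
```

One thing to keep in mind with this approach: rows that are duplicated within df1 itself are also removed by keep=False, even if they never appear in df2.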

Answer 2

Answered by Stefan

You can use .duplicated, which has the benefit of being fairly expressive:

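%%timeit
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
combined = pd.concat([df1, df2])
# compare row values; .index.duplicated would compare only the index labels
combined[~combined.duplicated(keep=False)]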

1000 loops, best of 3: 875 μs per loop

For comparison:

%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']

100 loops, best of 3: 4.57 ms per loop


%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

1000 loops, best of 3: 987 μs per loop


%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]

1000 loops, best of 3: 546 μs per loop

In sum, using the np.array comparison is fastest. The .tolist() isn't needed there.

Answer 3

Answered by knagaev

My attempt, using merge on df1 and df2 from the question.

Using the indicator parameter:

In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]: 
   a  b
1  3  4
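
The same idea can be wrapped in a small helper. This is only a sketch: subtract_rows is a hypothetical name, and unlike the one-liner above it filters the merged frame rather than df1, so it does not assume that df1 has a default 0..n-1 index. It also assumes the two frames share their column names, so that merge joins on all of them.

```python
import pandas as pd


def subtract_rows(left, right):
    """Return the rows of `left` that have no match in `right` (hypothetical helper)."""
    # Deduplicate `right` so a left row is never repeated in the merge result.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    # '_merge' is 'left_only' for rows found only in `left`.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=["a", "b"])

result = subtract_rows(df1, df2)
print(result)
```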

Answer 4

Answered by piRSquared

A set logic approach. Turn the rows of df1 and df2 into sets, then use set subtraction to define a new DataFrame:

idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)

pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

   a  b
0  3  4

Answer 5

Answered by piRSquared

A masking approach

df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

   a  b
1  3  4
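
A boolean mask can also be built without a row-wise apply. The following is a minimal sketch of one such variant, not part of the original answer: it assumes the columns listed in keys define row equality, and it relies on the isin method of the index produced by set_index.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=["a", "b"])

keys = ["a", "b"]  # assumed: the columns that define when two rows are equal

# True for every row of df1 whose (a, b) pair also occurs in df2.
mask = df1.set_index(keys).index.isin(df2.set_index(keys).index)

# Keep only the rows of df1 that are not in df2; df1's index is preserved.
result = df1[~mask]
print(result)
```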

Answer 6

Answered by Peter Abdou

I think the first tolist() needs to be removed, but keep the second one:

df1[df1.apply(lambda x: x.values() not in df2.values.tolist(), axis=1)]

Answer 7

Answered by frozen shine

The easiest option is to use indexes.

Append df1 and df2 and reset the index:

df = pd.concat([df1, df2])
df.reset_index(drop=True, inplace=True)

This gives the index labels of the rows that match df2, which are then excluded:

# labels of rows whose 'a' and 'b' values both occur in df2
indexes_df2 = df.index[df["a"].isin(df2["a"]) & df["b"].isin(df2["b"])]
# every other label
result_index = df.index.difference(indexes_df2)
result_data = df.loc[result_index, :]

Hope this helps new readers, even though the question was posted a while ago :)
