Python: How to remove a pandas DataFrame from another DataFrame

Notice: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/37313691/

Time: 2020-08-19 19:12:51  Source: igfitidea

How to remove a pandas dataframe from another dataframe

Tags: python, pandas, dataframe, subtraction

Asked by 176coding

How can I remove one pandas DataFrame from another, just like set subtraction:

a=[1,2,3,4,5]
b=[1,5]
a-b=[2,3,4]

Now we have two pandas DataFrames. How do I remove df2 from df1?

In [5]: df1=pd.DataFrame([[1,2],[3,4],[5,6]],columns=['a','b'])
In [6]: df1
Out[6]:
   a  b
0  1  2
1  3  4
2  5  6


In [9]: df2=pd.DataFrame([[1,2],[5,6]],columns=['a','b'])
In [10]: df2
Out[10]:
   a  b
0  1  2
1  5  6

The expected result of df1-df2 is:

In [14]: df
Out[14]:
   a  b
0  3  4

How can I do this?

Thank you.

Answered by piRSquared

Solution

Use pd.concat followed by drop_duplicates(keep=False):

pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

The result looks like:

   a  b
1  3  4

Explanation

pd.concat stacks the two DataFrames by appending one right after the other. If there is any overlap, the drop_duplicates method will catch it. However, drop_duplicates by default keeps the first observation and removes every other one. In this case, we want every duplicate removed, hence the keep=False parameter, which does exactly that.

A special note on the repeated df2: with only one df2, any row in df2 that is not in df1 won't be considered a duplicate and will remain. The solution with only one df2 works only when df2 is a subset of df1. However, if we concat df2 twice, each of its rows is guaranteed to be a duplicate and will subsequently be removed.
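To see why df2 is concatenated twice, here is a minimal sketch using the question's df1 and a hypothetical df2_extra (not in the original) that contains a row absent from df1:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
# Hypothetical second frame: (7, 8) does not occur in df1.
df2_extra = pd.DataFrame([[1, 2], [7, 8]], columns=['a', 'b'])

# One copy of df2_extra: (7, 8) appears only once, so it survives.
once = pd.concat([df1, df2_extra]).drop_duplicates(keep=False)

# Two copies: every row of df2_extra is now a duplicate and is removed.
twice = pd.concat([df1, df2_extra, df2_extra]).drop_duplicates(keep=False)

print(once.values.tolist())   # [[3, 4], [5, 6], [7, 8]]
print(twice.values.tolist())  # [[3, 4], [5, 6]]
```

Note that keep=False also removes rows that are repeated within df1 itself, even if they do not occur in df2.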

Answered by Stefan

You can use .duplicated, which has the benefit of being fairly expressive:

%%timeit
combined = df1.append(df2)
combined[~combined.index.duplicated(keep=False)]

1000 loops, best of 3: 875 μs per loop
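
Two things are worth keeping in mind with the snippet above: combined.index.duplicated compares index labels rather than row values, and DataFrame.append was removed in pandas 2.0. A minimal sketch of a row-value variant, assuming pd.concat as the replacement for append and, as with a single copy of df2 in the concat solution, that df2 is a subset of df1:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# pd.concat stands in for the removed DataFrame.append.
combined = pd.concat([df1, df2])

# DataFrame.duplicated compares row values; keep=False flags every copy.
result = combined[~combined.duplicated(keep=False)]
print(result)
```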

For comparison:

%timeit df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']

100 loops, best of 3: 4.57 ms per loop


%timeit pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

1000 loops, best of 3: 987 μs per loop


%timeit df1[df1.apply(lambda x: x.values not in df2.values, axis=1)]

1000 loops, best of 3: 546 μs per loop

In sum, the np.array comparison is the fastest. The .tolist() calls are not needed there.
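
One caveat with the np.array comparison: `in` on a 2-D NumPy array is true when any single element matches, not only when a whole row matches. A minimal sketch, with a hypothetical helper row_in (not from the original answers) that requires all columns to match:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# (1, 6) is not a row of df2, yet `in` reports True because
# the comparison is element-wise followed by any().
loose = np.array([1, 6]) in df2.values

def row_in(row, arr):
    """Hypothetical helper: True only if `row` equals a full row of `arr`."""
    return bool((arr == row).all(axis=1).any())

strict = row_in(np.array([1, 6]), df2.values)

# Row-wise mask built on the stricter check.
result = df1[df1.apply(lambda x: not row_in(x.values, df2.values), axis=1)]
print(loose, strict)
print(result)
```

On the question's data both checks give the same answer; they differ only when a row of df1 shares some, but not all, values with a row of df2.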

Answered by knagaev

My attempt, merging df1 and df2 from the question.

Using the 'indicator' parameter:

In [74]: df1.loc[pd.merge(df1, df2, on=['a','b'], how='left', indicator=True)['_merge'] == 'left_only']
Out[74]: 
   a  b
1  3  4
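
The one-liner above uses the mask from the merged frame to index df1, which relies on the merge result lining up with df1's index. A sketch that filters the merged frame directly, so it does not depend on df1 having a default index. The name anti_join is a hypothetical helper, and the original index of df1 is not preserved:

```python
import pandas as pd

def anti_join(left, right):
    """Hypothetical helper: rows of `left` with no match in `right`.

    Matches on all columns shared by both frames (merge's default).
    """
    merged = left.merge(right.drop_duplicates(), how='left', indicator=True)
    only_left = merged[merged['_merge'] == 'left_only']
    return only_left.drop(columns='_merge')

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

print(anti_join(df1, df2))
```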

Answered by piRSquared

A set logic approach. Turn the rows of df1 and df2 into sets, then use set subtraction to define a new DataFrame:

idx1 = set(df1.set_index(['a', 'b']).index)
idx2 = set(df2.set_index(['a', 'b']).index)

pd.DataFrame(list(idx1 - idx2), columns=df1.columns)

   a  b
0  3  4
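
The same idea can be sketched without going through Python sets, assuming pandas 0.24 or later for MultiIndex.from_frame. Here difference plays the role of set subtraction. Like the set version, it does not preserve df1's original row order or index:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Treat each row as one MultiIndex entry.
idx1 = pd.MultiIndex.from_frame(df1)
idx2 = pd.MultiIndex.from_frame(df2)

# difference() returns the entries of idx1 that are not in idx2.
result = idx1.difference(idx2).to_frame(index=False)
print(result)
```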

Answered by piRSquared

A masking approach:

df1[df1.apply(lambda x: x.values.tolist() not in df2.values.tolist(), axis=1)]

   a  b
1  3  4
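
For larger frames, the list-in-list test above scans df2 once per row of df1. A sketch of the same mask using a set of row tuples for the lookup (the names rows2 and mask are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 2], [5, 6]], columns=['a', 'b'])

# Each row of df2 as a hashable tuple, collected in a set.
rows2 = set(df2.itertuples(index=False, name=None))

# True for rows of df1 that do not appear in df2.
mask = [row not in rows2 for row in df1.itertuples(index=False, name=None)]

result = df1[mask]
print(result)
```

This keeps df1's original index and row order, since it only filters df1.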

Answered by Peter Abdou

I think the first tolist() needs to be removed, but the second one should be kept:

df1[df1.apply(lambda x: x.values() not in df2.values.tolist(), axis=1)]

Answered by frozen shine

The easiest option is to use indexes.

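  1. Concatenate df1 and df2 and reset the index.

    df = pd.concat([df1, df2])  # DataFrame has no .concat method
    df.reset_index(drop=True, inplace=True)

  2. e.g.:
    This will give the df2 indexes, and the remaining rows are the result.

    # True where both columns take values that occur in df2
    in_df2 = df["a"].isin(df2["a"]) & df["b"].isin(df2["b"])
    indexes_df2 = df.index[in_df2]
    result_index = df.index[~in_df2]
    result_data = df.iloc[result_index, :]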

Hope this helps new readers, even though the question was posted a while ago :)