Python: set difference for pandas

Disclaimer: this page is a translation of a popular StackOverFlow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/18180763/

Date: 2020-08-19 10:07:16  Source: igfitidea

set difference for pandas

python pandas dataframe

Asked by Robert Smith

A simple pandas question:

Is there a drop_duplicates() functionality to drop every row involved in the duplication?

An equivalent question is the following: Does pandas have a set difference for dataframes?

For example:

In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

In [7]: df1
Out[7]: 
   col1  col2
0     1     2
1     2     3
2     3     4

In [8]: df2
Out[8]: 
   col1  col2
0     4     6
1     2     3
2     5     5

so maybe something like df2.set_diff(df1) will produce this:

   col1  col2
0     4     6
2     5     5

However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.

By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.

Thanks!

Accepted answer by Mir Shahriar Sabuj

from pandas import DataFrame

df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})


print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))
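For reference, a self-contained run of the first expression with the question's data. Note that DataFrame.isin() aligns on both index labels and column names, so this sketch assumes the two frames share an index; with distinct indexes (as the asker has) it will not behave as a true set difference.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

# isin() compares aligned cells; a row of df2 is dropped only when the
# same row label in df1 holds identical values in every column.
diff = df2[~df2.isin(df1).all(1)]
print(diff)
```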

Answer by Joop

A bit convoluted, but if you want to totally ignore the index data, convert the contents of the dataframes to sets of tuples containing the columns:

ds1 = set([tuple(line) for line in df1.values])
ds2 = set([tuple(line) for line in df2.values])

This step will also get rid of any duplicates within each dataframe (index ignored):

set([(1, 2), (3, 4), (2, 3)])   # ds1

You can then use set methods to find anything, e.g. the difference:

ds1.difference(ds2)

gives: set([(1, 2), (3, 4)])

You can take that back to a dataframe if needed. Note that the set has to be transformed to a list first, as a set cannot be used to construct a dataframe:

pd.DataFrame(list(ds1.difference(ds2)))
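Putting the steps above together into one runnable sketch, this time computing df2 minus df1 so it matches the question's desired output (restoring the column names from df1 is an extra step the snippets above leave implicit):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)

# rows of df2 not present in df1; index information is discarded
result = pd.DataFrame(list(ds2.difference(ds1)), columns=df1.columns)
```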

Answer by Jeff

Apply by the columns of the object you want to map (df2); find the rows that are not in the set (isin is like a set operator)

In [32]: df2.apply(lambda x: df2.loc[~x.isin(df1[x.name]),x.name])
Out[32]: 
   col1  col2
0     4     6
2     5     5

Same thing, but checking against all values in df1, while still operating per column of df2:

In [33]: df2.apply(lambda x: df2.loc[~x.isin(df1.values.ravel()),x.name])
Out[33]: 
   col1  col2
0   NaN     6
2     5     5

2nd example

In [34]: g = pd.DataFrame({'x': [1.2,1.5,1.3], 'y': [4,4,4]})

In [35]: g.columns=df1.columns

In [36]: g
Out[36]: 
   col1  col2
0   1.2     4
1   1.5     4
2   1.3     4

In [32]: g.apply(lambda x: g.loc[~x.isin(df1[x.name]),x.name])
Out[32]: 
   col1  col2
0   1.2   NaN
1   1.5   NaN
2   1.3   NaN

Note, in 0.13, there will be an isin operator on the frame level, so something like df2.isin(df1) should be possible

Answer by ignacio

Get the indices of the intersection with a merge, then drop them:

>>> df_all = pd.DataFrame(np.arange(8).reshape((4,2)), columns=['A','B']); df_all
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
>>> df_completed = df_all.iloc[::2]; df_completed
   A  B
0  0  1
2  4  5
>>> merged = pd.merge(df_all.reset_index(), df_completed); merged
   index  A  B
0      0  0  1
1      2  4  5
>>> df_pending = df_all.drop(merged['index']); df_pending
   A  B
1  2  3
3  6  7
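A related merge-based sketch that skips the index bookkeeping: the indicator parameter of merge (available in modern pandas) tags each row with its origin, so the set difference is just the left_only rows. Applied to the question's frames:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

# _merge marks whether each row came from the left frame, the right
# frame, or both; the left_only rows are df2 minus df1
merged = df2.merge(df1, how='left', indicator=True)
diff = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
```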

Answer by radream

Here's another answer that keeps the index and does not require identical indexes in two data frames.

pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

It is fast and the result is

   col1  col2
0     4     6
2     5     5
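A runnable sketch of this trick with the question's frames. One caveat worth knowing: keep=False drops every copy of a duplicated row, so a row that appears twice within df2 itself (but never in df1) would also disappear.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

# df1 is concatenated twice so each of its rows is always duplicated
# and therefore always removed, along with any matching row of df2
diff = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
```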

Answer by Alex Petralia

I'm not sure how pd.concat() implicitly joins overlapping columns, but I had to do a little tweak on @radream's answer.

Conceptually, a set difference (symmetric) on multiple columns is a set union (outer join) minus a set intersection (or inner join):

df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
o = pd.merge(df1, df2, how='outer')
i = pd.merge(df1, df2)
set_diff = pd.concat([o, i]).drop_duplicates(keep=False)

This yields:

这产生:

   col1  col2
0     1     2
2     3     4
3     4     6
4     5     5

Answer by Piotr Zioło

There are 3 methods which work, but two of them have some flaws.

Method 1 (Hash method):

It worked for all cases I tested.

df1.loc[:, "hash"] = df1.apply(lambda x: hash(tuple(x)), axis = 1)
df2.loc[:, "hash"] = df2.apply(lambda x: hash(tuple(x)), axis = 1)
df1 = df1.loc[~df1["hash"].isin(df2["hash"]), :]
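A variant of the same idea that keeps the original frames unchanged, holding the row hashes in standalone Series instead of extra columns (df1 minus df2, matching the direction above):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

# hash each row once; equal rows always produce equal hashes
# (collisions are possible in principle, but unlikely in practice)
h1 = df1.apply(lambda x: hash(tuple(x)), axis=1)
h2 = df2.apply(lambda x: hash(tuple(x)), axis=1)
diff = df1.loc[~h1.isin(h2)]
```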

Method 2 (Dict method):

It fails if DataFrames contain datetime columns.

df1 = df1.loc[~df1.isin(df2.to_dict(orient="list")).all(axis=1), :]

Method 3 (MultiIndex method):

I encountered cases when it failed on columns with None's or NaN's.

df1 = df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index), :]

Answer by Jacek Pliszka

Assumptions:

  1. df1 and df2 have identical columns
  2. it is a set operation so duplicates are ignored
  3. sets are not extremely large so you do not worry about memory
union = pd.concat([df1,df2])
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]
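Applied to the question's frames, these four lines yield df1 minus df2 (swap the roles of df1 and df2 in the second concat to get the other direction):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

union = pd.concat([df1, df2])
# keep=False marks every copy, so this keeps rows in exactly one frame
sym_diff = union[~union.duplicated(keep=False)]
union_of_df1_and_sym_diff = pd.concat([df1, sym_diff])
# a row duplicated here is in both df1 and sym_diff, i.e. df1 minus df2
diff = union_of_df1_and_sym_diff[union_of_df1_and_sym_diff.duplicated()]
```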

Answer by Ian Kent

Edit: You can now make MultiIndex objects directly from data frames as of pandas 0.24.0, which greatly simplifies the syntax of this answer

df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
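A self-contained run of the pandas ≥ 0.24 version against the question's data (note that Index.difference() returns its result sorted, not in df2's original row order):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

df1mi = pd.MultiIndex.from_frame(df1)
df2mi = pd.MultiIndex.from_frame(df2)

# set difference on the indexes, then back to an ordinary DataFrame
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)
```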


Original Answer

Pandas MultiIndex objects have fast set operations implemented as methods, so you can convert the DataFrames to MultiIndexes, use the difference()method, then convert the result back to a DataFrame. This solution should be much faster (by ~100x or more from my brief testing) than the solutions given here so far, and it will not depend on the row indexing of the original frames. As Piotr mentioned for his answer, this will fail with null values, since np.nan != np.nan. Any row in df2 with a null value will always appear in the difference. Also, the columns should be in the same order for both DataFrames.

df1mi = pd.MultiIndex.from_arrays(df1.values.transpose(), names=df1.columns)
df2mi = pd.MultiIndex.from_arrays(df2.values.transpose(), names=df2.columns)
dfdiff = df2mi.difference(df1mi).to_frame().reset_index(drop=True)

Answer by SummmerFort

This should work even if you have multiple columns in both dataframes. But make sure that the column names of both dataframes are exactly the same.

set_difference = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

With multiple columns you can also use:

col_names = ['col_1', 'col_2']
set_difference = pd.concat([df2[col_names], df1[col_names],
                            df1[col_names]]).drop_duplicates(keep=False)