比较 Python Pandas DataFrames 以匹配行

Question

提问by edesz

I have this DataFrame (df1) in Pandas:

我df1在 Pandas 中有这个 DataFrame ( )：

df1 = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
print df1

       A         B         C         D
0.860379  0.726956  0.394529  0.833217
0.014180  0.813828  0.559891  0.339647
0.782838  0.698993  0.551252  0.361034
0.833370  0.982056  0.741821  0.006864
0.855955  0.546562  0.270425  0.136006
0.491538  0.445024  0.971603  0.690001
0.911696  0.065338  0.796946  0.853456
0.744923  0.545661  0.492739  0.337628
0.576235  0.219831  0.946772  0.752403
0.164873  0.454862  0.745890  0.437729

I would like to check if any row (all columns) from another dataframe (df2) are present in df1. Here is df2:

我想检查来自另一个数据框 ( df2) 的任何行（所有列）是否存在于df1. 这是df2：

df2 = df1.ix[4:8]
df2.reset_index(drop=True,inplace=True)
df2.loc[-1] = [2, 3, 4, 5]
df2.loc[-2] = [14, 15, 16, 17]
df2.reset_index(drop=True,inplace=True)
print df2

           A         B         C         D
    0.855955  0.546562  0.270425  0.136006
    0.491538  0.445024  0.971603  0.690001
    0.911696  0.065338  0.796946  0.853456
    0.744923  0.545661  0.492739  0.337628
    0.576235  0.219831  0.946772  0.752403
    2.000000  3.000000  4.000000  5.000000
   14.000000 15.000000 16.000000 17.000000

I tried using df.lookupto search for one row at a time. I did it this way:

我尝试使用一次df.lookup搜索一行。我是这样做的：

list1 = df2.ix[0].tolist()
cols = df1.columns.tolist()
print df1.lookup(list1, cols)

but I got this error message:

但我收到此错误消息：

  File "C:\Users\test.py", line 19, in <module>
    print df1.lookup(list1, cols)
  File "C:\python27\lib\site-packages\pandas\core\frame.py", line 2217, in lookup
    raise KeyError('One or more row labels was not found')
KeyError: 'One or more row labels was not found'

I also tried .all()using:

我也尝试.all()使用：

print (df2 == df1).all(1).any()

but I got this error message:

但我收到此错误消息：

  File "C:\Users\test.py", line 12, in <module>
    print (df2 == df1).all(1).any()
  File "C:\python27\lib\site-packages\pandas\core\ops.py", line 884, in f
    return self._compare_frame(other, func, str_rep)
  File "C:\python27\lib\site-packages\pandas\core\frame.py", line 3010, in _compare_frame
    raise ValueError('Can only compare identically-labeled '
ValueError: Can only compare identically-labeled DataFrame objects

I also tried isin()like this:

我也试过isin()这样：

print df2.isin(df1)

but I got Falseeverywhere, which is not correct:

但我False到处都是，这是不正确的：

    A      B      C      D
False  False  False  False
False  False  False  False
False  False  False  False
False  False  False  False
False  False  False  False
False  False  False  False
False  False  False  False
False  False  False  False
False  False  False  False
False  False  False  False

Is it possible to search for a set of rows in a DataFrame, by comparing it to another dataframe's rows?

是否可以通过将数据帧与另一个数据帧的行进行比较来搜索数据帧中的一组行？

EDIT: Is is possible to drop df2rows if those rows are also present in df1?

编辑：df2如果这些行也存在于中，是否可以删除行df1？

Answer 1

采纳答案by Andrew

One possible solution to your problem would be to use merge. Checking if any row (all columns) from another dataframe (df2) are present in df1 is equivalent to determining the intersection of the the two dataframes. This can be accomplished using the following function:

您的问题的一种可能解决方案是使用merge。检查 df1 中是否存在来自另一个数据帧 (df2) 的任何行（所有列）相当于确定两个数据帧的交集。这可以使用以下函数来完成：

pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], how='inner')

For example, if df1 was

例如，如果 df1 是

    A           B            C          D
0   0.403846    0.312230    0.209882    0.397923
1   0.934957    0.731730    0.484712    0.734747
2   0.588245    0.961589    0.910292    0.382072
3   0.534226    0.276908    0.323282    0.629398
4   0.259533    0.277465    0.043652    0.925743
5   0.667415    0.051182    0.928655    0.737673
6   0.217923    0.665446    0.224268    0.772592
7   0.023578    0.561884    0.615515    0.362084
8   0.346373    0.375366    0.083003    0.663622
9   0.352584    0.103263    0.661686    0.246862

and df2 was defined as:

df2 定义为：

     A          B            C           D
0   0.259533    0.277465    0.043652    0.925743
1   0.667415    0.051182    0.928655    0.737673
2   0.217923    0.665446    0.224268    0.772592
3   0.023578    0.561884    0.615515    0.362084
4   0.346373    0.375366    0.083003    0.663622
5   2.000000    3.000000    4.000000    5.000000
6   14.000000   15.000000   16.000000   17.000000

The function pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], how='inner')produces:

该函数pd.merge(df1, df2, on=['A', 'B', 'C', 'D'], how='inner')产生：

     A           B           C           D
0   0.259533    0.277465    0.043652    0.925743
1   0.667415    0.051182    0.928655    0.737673
2   0.217923    0.665446    0.224268    0.772592
3   0.023578    0.561884    0.615515    0.362084
4   0.346373    0.375366    0.083003    0.663622

The results are all of the rows (all columns) that are both in df1 and df2.

结果是 df1 和 df2 中的所有行（所有列）。

We can also modify this example if the columns are not the same in df1 and df2 and just compare the row values that are the same for a subset of the columns. If we modify the original example:

如果 df1 和 df2 中的列不相同，我们也可以修改此示例，只需比较列子集相同的行值。如果我们修改原始示例：

df1 = pd.DataFrame(np.random.rand(10,4),columns=list('ABCD'))
df2 = df1.ix[4:8]
df2.reset_index(drop=True,inplace=True)
df2.loc[-1] = [2, 3, 4, 5]
df2.loc[-2] = [14, 15, 16, 17]
df2.reset_index(drop=True,inplace=True)
df2 = df2[['A', 'B', 'C']] # df2 has only columns A B C

Then we can look at the common columns using common_cols = list(set(df1.columns) & set(df2.columns))between the two dataframes then merge:

然后我们可以查看common_cols = list(set(df1.columns) & set(df2.columns))两个数据帧之间使用的公共列，然后合并：

pd.merge(df1, df2, on=common_cols, how='inner')

EDIT:New question (comments), having identified the rows from df2 that were also present in the first dataframe (df1), is it possible to take the result of the pd.merge() and to then drop the rows from df2 that are also present in df1

编辑：新问题（评论）确定了 df2 中也存在于第一个数据帧（df1）中的行，是否可以获取 pd.merge() 的结果，然后从 df2 中删除行也存在于 df1

I do not know of a straightforward way to accomplish the task of dropping the rows from df2 that are also present in df1. That said, you could use the following:

我不知道有什么简单的方法可以完成从 df2 中删除也存在于 df1 中的行的任务。也就是说，您可以使用以下内容：

ds1 = set(tuple(line) for line in df1.values)
ds2 = set(tuple(line) for line in df2.values)
df = pd.DataFrame(list(ds2.difference(ds1)), columns=df2.columns)

There probably exists a better way to accomplish that task but i am unaware of such a method / function.

可能存在完成该任务的更好方法，但我不知道这样的方法/功能。

EDIT 2:How to drop the rows from df2 that are also present in df1 as shown in @WR answer.

编辑 2：如何从 df2 中删除也存在于 df1 中的行，如@WR 答案所示。

The method provided df2[~df2['A'].isin(df12['A'])]does not account for all types of situations. Consider the following DataFrames:

所提供的方法df2[~df2['A'].isin(df12['A'])]并未考虑所有类型的情况。考虑以下数据帧：

df1:

df1：

   A  B  C  D
0  6  4  1  6
1  7  6  6  8
2  1  6  2  7
3  8  0  4  1
4  1  0  2  3
5  8  4  7  5
6  4  7  1  1
7  3  7  3  4
8  5  2  8  8
9  3  2  8  4

df2:

df2：

   A  B  C  D
0  1  0  2  3
1  8  4  7  5
2  4  7  1  1
3  3  7  3  4
4  5  2  8  8
5  1  1  1  1
6  2  2  2  2

df12:

df12：

   A  B  C  D
0  1  0  2  3
1  8  4  7  5
2  4  7  1  1
3  3  7  3  4
4  5  2  8  8

Using the above DataFrames with the goal of dropping rows from df2 that are also present in df1 would result in the following:

使用上述 DataFrames 的目的是从 df2 中删除也存在于 df1 中的行将导致以下结果：

   A  B  C  D
0  1  1  1  1
1  2  2  2  2

Rows (1, 1, 1, 1) and (2, 2, 2, 2) are in df2 and not in df1. Unfortunately, using the provided method (df2[~df2['A'].isin(df12['A'])]) results in:

行 (1, 1, 1, 1) 和 (2, 2, 2, 2) 在 df2 中而不在 df1 中。不幸的是，使用提供的方法 ( df2[~df2['A'].isin(df12['A'])]) 会导致：

   A  B  C  D
6  2  2  2  2

This occurs because the value of 1 in column A is found in both the intersection DataFrame (i.e. (1, 0, 2, 3)) and df2 and thus removes both (1, 0, 2, 3) and (1, 1, 1, 1). This is unintended since the row (1, 1, 1, 1) is not in df1 and should not be removed.

发生这种情况是因为在交集 DataFrame（即 (1, 0, 2, 3)) 和 df2 中都找到了 A 列中的值 1，因此删除了 (1, 0, 2, 3) 和 (1, 1, 1、1)。这是意外的，因为行 (1, 1, 1, 1) 不在 df1 中，不应删除。

I think the following will provide a solution. It creates a dummy column that is later used to subset the DataFrame to the desired results:

我认为以下将提供解决方案。它创建了一个虚拟列，稍后用于将 DataFrame 子集化为所需的结果：

df12['key'] = 'x'
temp_df = pd.merge(df2, df12, on=df2.columns.tolist(), how='left')
temp_df[temp_df['key'].isnull()].drop('key', axis=1)

Answer 2

回答by edesz

@Andrew: I believe I found a way to drop the rows of one dataframe that are already present in another (i.e. to answer my EDIT) without using loops - let me know if you disagree and/or if my OP + EDIT did not clearly state this:

@Andrew：我相信我找到了一种方法，可以在不使用循环的情况下删除另一个数据帧中已经存在的行（即回答我的 EDIT）-如果您不同意和/或我的 OP + EDIT 不清楚，请告诉我声明：

THIS WORKS

这有效

The columns for both dataframes are always the same - A, B, Cand D. With this in mind, based heavily on Andrew's approach, here is how to drop the rows from df2that are also present in df1:

两个dataframes列总是相同的- ，，和。考虑到这一点，在很大程度上基于 Andrew 的方法，这里是如何从中删除也存在于中的行：ABCDdf2df1

common_cols = df1.columns.tolist()                         #generate list of column names
df12 = pd.merge(df1, df2, on=common_cols, how='inner')     #extract common rows with merge
df2 = df2[~df2['A'].isin(df12['A'])]

Line 3 does the following:

第 3 行执行以下操作：

Extract only rows from df2that do not match rows in df1:
In order for 2 rows to be different, ANY one column of one row must
necessarily be different that the correspondingcolumn in another row.
Here, I picked column Ato make this comparison - it is
possible to use any of the column names, but notALL of the
column names.

仅从中提取df2与中的行不匹配的行df1：
为了使两行不同，一行中的任何一列必须与另一行
中的相应列不同。
在这里，我选择了列A来进行比较 -
可以使用任何列名，但不能使用所有
列名。

NOTE: this method is essentially the equivalent of the SQL NOT IN().

注意：此方法本质上等同于 SQL NOT IN()。

比较 Python Pandas DataFrames 以匹配行

提问by edesz

采纳答案by Andrew

回答by edesz

相关推荐

最近更新

标签

比较 Python Pandas DataFrames 以匹配行

提问by edesz

采纳答案by Andrew

回答by edesz

相关推荐

Python 如何使用 numpy 导入 xlsx 文件

如何在 Mac OS 上安装 2 个 Anacondas（Python 2 和 3）

Python 如何在 Windows 上安装 lxml

Python 带参数的 Pandas read_sql

相关推荐

最近更新

标签