Python pandas: get rows which are not in another dataframe

Question

Asked by think nice things

I have two pandas data frames which have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

import pandas

df1 = pandas.DataFrame(data={'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data={'col1': [1, 2, 3], 'col2': [10, 11, 12]})

Answer 1

Accepted answer by EdChum

One method would be to store the result of an inner merge from both dfs; then we can simply select the rows where each column's values are not in this common result:

In [119]:

common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
   col1  col2
0     1    10
1     2    11
2     3    12
Out[119]:
   col1  col2
3     4    13
4     5    14

EDIT

Another method, as you've found, is to use isin, which will produce NaN rows which you can drop:

In [138]:

df1[~df1.isin(df2)].dropna()
Out[138]:
   col1  col2
3     4    13
4     5    14

However, if the rows of df2 do not line up with those of df1 (isin with a DataFrame argument only compares rows that share the same index label), then this won't work:

df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})

will produce the entire df:

In [140]:

df1[~df1.isin(df2)].dropna()
Out[140]:
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
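
To see why the whole frame comes back, it may help to look at the mask itself. The following is only a small sketch reusing the frames from this answer; the names mask and nothing_matched are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [2, 3, 4], 'col2': [11, 12, 13]})

# DataFrame.isin(DataFrame) compares cell by cell after aligning on index
# and column labels, so row 0 of df1 is compared with row 0 of df2, etc.
mask = df1.isin(df2)

# No row of df1 equals the df2 row carrying the same index label
# (and labels 3 and 4 do not exist in df2), so the mask is False everywhere.
nothing_matched = not mask.to_numpy().any()
print(mask)
print(nothing_matched)
```

Since nothing is flagged as a match, df1[~mask] keeps every value and dropna() has nothing to drop.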

Answer 2

Answered by Rune Lyngsoe

As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:

In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
   col1  col2
1     2    11
4     5    14
5     3    10

If the index should be taken into account, set_index has a keyword argument append to append columns to the existing index. If the columns do not line up, list(df.columns) can be replaced with column specifications to align the data.

pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())

could alternatively be used to create the indices, though I doubt this is more efficient.
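
For reuse, the set_index approach above could be wrapped in a small helper. This is just a sketch: rows_not_in and its cols parameter are hypothetical names, not part of pandas, and it assumes both frames contain the columns being compared.

```python
import pandas as pd

def rows_not_in(left, right, cols=None):
    """Return the rows of `left` whose values in `cols` do not occur in `right`.

    Hypothetical helper; `cols` defaults to all columns of `left`.
    """
    cols = list(left.columns) if cols is None else list(cols)
    # Turn the row contents into a (Multi)Index so whole rows are compared
    left_keys = left.set_index(cols).index
    right_keys = right.set_index(cols).index
    return left.loc[~left_keys.isin(right_keys)]

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3], 'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 3, 4], 'col2': [10, 12, 13]})
result = rows_not_in(df1, df2)
print(result)
```

With the frames from In [77] and In [78], this should select the same rows as Out[79].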

Answer 3

Answered by Pragalbh kulshrestha

Suppose you have two dataframes, df_1 and df_2, with multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps:

Step 1. Add a column key1 to df_1 and a column key2 to df_2.

Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.

Step 3. Select only those rows from df_1 where key1 is not equal to key2.

Step 4. Drop key1 and key2.

This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.

df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
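
A self-contained variant of the steps above, for trying them out. The sample data and the names left, right, merged and only_in_df_1 are invented for this sketch. It also restricts df_2 to the key fields and drops duplicates first, which is an addition to the original recipe that keeps the merge from multiplying rows.

```python
import pandas as pd

# Made-up sample data; field_x and field_y are the fields to compare on
df_1 = pd.DataFrame({'field_x': [1, 2, 3, 4],
                     'field_y': ['a', 'b', 'c', 'd'],
                     'value': [0.1, 0.2, 0.3, 0.4]})
df_2 = pd.DataFrame({'field_x': [2, 4], 'field_y': ['b', 'd']})

# Step 1: add the marker columns without modifying the inputs
left = df_1.assign(key1=1)
right = df_2[['field_x', 'field_y']].drop_duplicates().assign(key2=1)

# Step 2: left merge; rows of df_1 without a partner get NaN in key2
merged = left.merge(right, on=['field_x', 'field_y'], how='left')

# Steps 3 and 4: keep rows where key2 differs from key1, then drop the markers
only_in_df_1 = merged[merged['key2'] != merged['key1']].drop(columns=['key1', 'key2'])
print(only_in_df_1)
```

Note that the merge resets the index, so the result is numbered by position in df_1 rather than by df_1's original index labels.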

Answer 4

Answered by MaxU

You can do it using the isin(dict) method:

In [74]: df1[~df1.isin(df2.to_dict('list')).all(axis=1)]
Out[74]:
   col1  col2
3     4    13
4     5    14

Explanation:

In [75]: df2.to_dict('list')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}

In [76]: df1.isin(df2.to_dict('list'))
Out[76]:
    col1   col2
0   True   True
1   True   True
2   True   True
3  False  False
4  False  False

In [77]: df1.isin(df2.to_dict('list')).all(axis=1)
Out[77]:
0     True
1     True
2     True
3    False
4    False
dtype: bool

Answer 5

Answered by jabellcu

A bit late, but it might be worth checking the "indicator" parameter of pd.merge.

See this other question for an example: Compare pandas DataFrames and return rows that are missing from the first one

Answer 6

Answered by r.rz

My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry.

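# Flag column that exists only in df2
df2['Empt'] = 1
nonuni = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
# Rows without a match in df2 get NaN in 'Empt'; turn that into 0
nonuni['Empt'] = nonuni['Empt'].fillna(0)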

This makes it so every entry in df1 has a code - 0 if it is unique to df1, 1 if it is in both dataFrames. You then use this to restrict to what you want

answer = nonuni[nonuni['Empt'] == 0]

Answer 7

Answered by Dennis Golomazov

Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):

df1[~df1.index.isin(df2.index)]
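
A quick way to try this out, assuming df2 was produced by selecting rows from df1 so that it keeps df1's index labels. The iloc selection below is made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
# df2 is taken as a slice of df1, so its index labels (0 and 2) come from df1
df2 = df1.iloc[[0, 2]]

# Keep the rows of df1 whose index label does not appear in df2
remaining = df1[~df1.index.isin(df2.index)]

# Index.difference expresses the same selection (labels come back sorted)
same = df1.loc[df1.index.difference(df2.index)]
print(remaining)
```

If df2 was built independently and simply has a default 0..n-1 index, the labels carry no information about row contents and this shortcut does not apply.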

Answer 8

Answered by adamwlev

How about this:

import numpy as np
import pandas

df1 = pandas.DataFrame(data={'col1': [1, 2, 3, 4, 5],
                             'col2': [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data={'col1': [1, 2, 3],
                             'col2': [10, 11, 12]})
# Collect df2's rows as hashable tuples for fast membership tests
records_df2 = {tuple(row) for row in df2.values}
# True for every df1 row that also occurs in df2
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]

Answer 9

Answered by Ted Petrou

The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.

First, we need to modify the original DataFrame to add the row with data [3, 10].

df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 
                           'col2' : [10, 11, 12, 13, 14, 10]}) 
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
                           'col2' : [10, 11, 12]})

df1

   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14
5     3    10

df2

   col1  col2
0     1    10
1     2    11
2     3    12

Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with exactly 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.

df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'], 
                   how='left', indicator=True)
df_all

   col1  col2     _merge
0     1    10       both
1     2    11       both
2     3    12       both
3     4    13  left_only
4     5    14  left_only
5     3    10  left_only

Create a boolean condition:

df_all['_merge'] == 'left_only'

0    False
1    False
2    False
3     True
4     True
5     True
Name: _merge, dtype: bool
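
The last step, applying that boolean condition, might look like the sketch below. The variable names is_left_only and result are not from the answer, and dropping the helper column at the end is optional.

```python
import pandas as pd

df1 = pd.DataFrame(data={'col1': [1, 2, 3, 4, 5, 3],
                         'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data={'col1': [1, 2, 3],
                         'col2': [10, 11, 12]})

df_all = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                   how='left', indicator=True)

# Rows flagged 'left_only' exist in df1 but have no matching row in df2
is_left_only = df_all['_merge'] == 'left_only'

# Filter on the condition and drop the helper column again
result = df_all[is_left_only].drop(columns='_merge')
print(result)
```

The row [3, 10] is kept here, because the merge compares col1 and col2 together rather than one column at a time.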

Why other solutions are wrong

A few solutions make the same mistake: they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but has the values from both columns of df2, exposes the mistake:

common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

This solution gets the same wrong result:

df1.isin(df2.to_dict('list')).all(axis=1)

Answer 10

Answered by Semeon Balagula

You can also concat df1, df2:

x = pd.concat([df1, df2])

and then remove all duplicates:

y = x.drop_duplicates(keep=False, inplace=False)
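
A runnable sketch of this approach, with one caveat worth checking on your own data: keep=False drops every row that occurs more than once, so it relies on df2 being a subset of df1 and on df1 having no repeated rows of its own. The names df1_dup and y_dup are invented to show the second case.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# Rows present in both frames appear twice after concat and are all dropped
y = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(y)

# Caveat: a row repeated inside df1 itself is dropped as well.
# Here (4, 13) occurs twice in df1_dup and disappears from the result.
df1_dup = pd.concat([df1, df1.iloc[[3]]])
y_dup = pd.concat([df1_dup, df2]).drop_duplicates(keep=False)
print(y_dup)
```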
