Python pandas: get rows which are not in another dataframe
Original question: http://stackoverflow.com/questions/28901683/
Notice: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
pandas get rows which are NOT in other dataframe
Asked by think nice things
I have two pandas data frames which have some rows in common.
Suppose dataframe2 is a subset of dataframe1.
How can I get the rows of dataframe1 which are not in dataframe2?
import pandas
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})
Accepted answer by EdChum
One method would be to store the result of an inner merge from both dfs; then we can simply select the rows of df1 where the column values are not in this common set:
In [119]:
common = df1.merge(df2,on=['col1','col2'])
print(common)
df1[(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))]
col1 col2
0 1 10
1 2 11
2 3 12
Out[119]:
col1 col2
3 4 13
4 5 14
EDIT
Another method, as you've found, is to use isin, which will produce NaN rows which you can drop:
In [138]:
df1[~df1.isin(df2)].dropna()
Out[138]:
col1 col2
3 4 13
4 5 14
However, if the rows of df2 do not line up with the matching rows of df1 (that is, they do not sit at the same index positions), then this won't work:
df2 = pd.DataFrame(data = {'col1' : [2, 3,4], 'col2' : [11, 12,13]})
will produce the entire df:
In [140]:
df1[~df1.isin(df2)].dropna()
Out[140]:
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
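The result above follows from how DataFrame.isin treats a DataFrame argument: a cell only counts as a match when both its index label and its column label match. A minimal sketch of that behaviour, where the names df2_shifted and df2_relabelled are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})

# Same contents as rows 1-3 of df1, but stored under index labels 0-2
df2_shifted = pd.DataFrame({'col1': [2, 3, 4], 'col2': [11, 12, 13]})

# isin compares label by label, so nothing matches here
mask_shifted = df1.isin(df2_shifted)
kept_shifted = df1[~mask_shifted].dropna()

# Giving the rows the labels they have in df1 makes the comparison line up
df2_relabelled = df2_shifted.copy()
df2_relabelled.index = [1, 2, 3]
kept_relabelled = df1[~df1.isin(df2_relabelled)].dropna()

print(mask_shifted.any().any())
print(kept_relabelled.index.tolist())
```

Relabelling is rarely practical on real data, which is why the answers that match on row contents are usually the safer choice.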
Answer by Rune Lyngsoe
As already hinted at, isin requires columns and indices to be the same for a match. If the match should only be on row contents, one way to get the mask for filtering the rows present is to convert the rows to a (Multi)Index:
In [77]: df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3], 'col2' : [10, 11, 12, 13, 14, 10]})
In [78]: df2 = pandas.DataFrame(data = {'col1' : [1, 3, 4], 'col2' : [10, 12, 13]})
In [79]: df1.loc[~df1.set_index(list(df1.columns)).index.isin(df2.set_index(list(df2.columns)).index)]
Out[79]:
col1 col2
1 2 11
4 5 14
5 3 10
If the index should be taken into account, set_index has a keyword argument append to append columns to the existing index. If the columns do not line up, list(df.columns) can be replaced with column specifications to align the data.
pandas.MultiIndex.from_tuples(df<N>.to_records(index = False).tolist())
could alternatively be used to create the indices, though I doubt this is more efficient.
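As a rough sketch of the difference the append argument makes, the following compares a content-only match with one that also includes the existing index. It reuses the data from the session above; the variable names are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3], 'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 3, 4], 'col2': [10, 12, 13]})

cols = list(df1.columns)

# Match on row contents only: the old index is discarded by set_index
content_key1 = df1.set_index(cols).index
content_key2 = df2.set_index(cols).index
content_only = df1.loc[~content_key1.isin(content_key2)]

# Match on index AND row contents: append=True keeps the old index as a level
full_key1 = df1.set_index(cols, append=True).index
full_key2 = df2.set_index(cols, append=True).index
with_index = df1.loc[~full_key1.isin(full_key2)]

print(content_only.index.tolist())
print(with_index.index.tolist())
```

With append=True only the row stored under the same label in both frames (label 0 here) is treated as common, so more rows of df1 survive the filter.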
Answer by Pragalbh kulshrestha
Suppose you have two dataframes, df_1 and df_2, with multiple fields (column names), and you want to find only those entries in df_1 that are not in df_2 on the basis of some fields (e.g. field_x, field_y). Follow these steps.
Step 1. Add a column key1 to df_1 and a column key2 to df_2.
Step 2. Merge the dataframes as shown below. field_x and field_y are our desired columns.
Step 3. Select only those rows from df_1 where key1 is not equal to key2.
Step 4. Drop key1 and key2.
This method will solve your problem and works fast even with big data sets. I have tried it for dataframes with more than 1,000,000 rows.
df_1['key1'] = 1
df_2['key2'] = 1
df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how = 'left')
df_1 = df_1[~(df_1.key2 == df_1.key1)]
df_1 = df_1.drop(['key1','key2'], axis=1)
Answer by MaxU
You can do it using the isin(dict) method (the orient is spelled out as 'list' because recent pandas versions no longer accept the abbreviation 'l'):
In [74]: df1[~df1.isin(df2.to_dict('list')).all(1)]
Out[74]:
col1 col2
3 4 13
4 5 14
Explanation:
In [75]: df2.to_dict('list')
Out[75]: {'col1': [1, 2, 3], 'col2': [10, 11, 12]}
In [76]: df1.isin(df2.to_dict('list'))
Out[76]:
col1 col2
0 True True
1 True True
2 True True
3 False False
4 False False
In [77]: df1.isin(df2.to_dict('list')).all(1)
Out[77]:
0 True
1 True
2 True
3 False
4 False
dtype: bool
Answer by jabellcu
A bit late, but it might be worth checking the "indicator" parameter of pd.merge.
See this other question for an example: Compare PandaS DataFrames and return rows that are missing from the first one
Answer by r.rz
My way of doing this involves adding a new column that is unique to one dataframe and using this to choose whether to keep an entry:
# Flag every row of df2; 'Empt' is the marker column used below
df2['Empt'] = 1
# After the merge, rows that came only from df1 have NaN in 'Empt'
nonuni = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
nonuni['Empt'] = nonuni['Empt'].fillna(0)
This gives every entry from df1 a code: 0 if it is unique to df1, 1 if it is in both dataframes. You then use this code to restrict the result to what you want:
answer = nonuni[nonuni['Empt'] == 0]
Answer by Dennis Golomazov
Assuming that the indexes are consistent in the dataframes (not taking into account the actual col values):
df1[~df1.index.isin(df2.index)]
Answer by adamwlev
How about this:
import numpy as np
import pandas
df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5],
                               'col2' : [10, 11, 12, 13, 14]})
df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3],
                               'col2' : [10, 11, 12]})
# Rows of df2 as a set of tuples, for fast membership tests
records_df2 = set(tuple(row) for row in df2.values)
# True for every row of df1 that also appears in df2
in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
result = df1[~in_df2_mask]
Answer by Ted Petrou
The currently selected solution produces incorrect results. To correctly solve this problem, we can perform a left-join from df1 to df2, making sure to first get just the unique rows for df2.
First, we need to modify the original DataFrame to add the row with data [3, 10].
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12]})
df1
col1 col2
0 1 10
1 2 11
2 3 12
3 4 13
4 5 14
5 3 10
df2
col1 col2
0 1 10
1 2 11
2 3 12
Perform a left-join, eliminating duplicates in df2 so that each row of df1 joins with at most 1 row of df2. Use the parameter indicator to return an extra column indicating which table the row was from.
df_all = df1.merge(df2.drop_duplicates(), on=['col1','col2'],
how='left', indicator=True)
df_all
col1 col2 _merge
0 1 10 both
1 2 11 both
2 3 12 both
3 4 13 left_only
4 5 14 left_only
5 3 10 left_only
Create a boolean condition:
df_all['_merge'] == 'left_only'
0 False
1 False
2 False
3 True
4 True
5 True
Name: _merge, dtype: bool
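The boolean condition still has to be applied to get the rows themselves. One way this last step might look, rebuilding df_all so the snippet runs on its own (the names mask and result are illustrative, not part of the original answer):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 3],
                    'col2': [10, 11, 12, 13, 14, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 11, 12]})

# Left-join against the de-duplicated df2 and record where each row came from
df_all = df1.merge(df2.drop_duplicates(), on=['col1', 'col2'],
                   how='left', indicator=True)

# Keep only the rows found in df1 alone, then drop the helper column
mask = df_all['_merge'] == 'left_only'
result = df_all.loc[mask].drop(columns='_merge')
print(result)
```

Note that the merged frame gets a fresh 0..n-1 index. Here that coincides with the index of df1, but for a df1 with a custom index the original labels are not carried over.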
Why other solutions are wrong
A few solutions make the same mistake: they only check that each value is independently in each column, not together in the same row. Adding the last row, which is unique but whose values each appear in the corresponding column of df2, exposes the mistake:
common = df1.merge(df2,on=['col1','col2'])
(~df1.col1.isin(common.col1))&(~df1.col2.isin(common.col2))
0 False
1 False
2 False
3 True
4 True
5 False
dtype: bool
This solution gets the same wrong result:
df1.isin(df2.to_dict('list')).all(1)
Answer by Semeon Balagula
You can also concat df1, df2:
x = pd.concat([df1, df2])
and then remove all duplicates:
y = x.drop_duplicates(keep=False, inplace=False)
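This relies on df2 being a subset of df1, as the question states. A small sketch of what happens when that assumption is broken, and one possible workaround (concatenating df2 twice), is given below; df2_extra, leaky and fixed are made-up names for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 11, 12]})

# Subset case from the question: every row of df2 also appears in df1
y = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Hypothetical df2 with one row (9, 90) that is NOT in df1
df2_extra = pd.DataFrame({'col1': [1, 2, 3, 9], 'col2': [10, 11, 12, 90]})

# The extra row occurs only once, so it survives and leaks into the result
leaky = pd.concat([df1, df2_extra]).drop_duplicates(keep=False)

# Concatenating df2 twice makes each of its rows a duplicate, so all are dropped
fixed = pd.concat([df1, df2_extra, df2_extra]).drop_duplicates(keep=False)

print(y.values.tolist())
print(leaky.values.tolist())
print(fixed.values.tolist())
```

Either way, keep=False also discards rows that are repeated within df1 itself, so this approach only suits data where the rows of df1 are unique.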