Original question: http://stackoverflow.com/questions/33282119/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): Stack Overflow
pandas - filter dataframe by another dataframe by row elements
Asked by Fabio Lamanna
I have a dataframe df1 which looks like:
   c  k  l
0  A  1  a
1  A  2  b
2  B  2  a
3  C  2  a
4  C  2  d
and another called df2, like:
   c  l
0  A  b
1  C  a
I would like to filter df1, keeping only the rows whose values ARE NOT in df2. The values to filter out are expected to be the (A,b) and (C,a) tuples. So far I have tried to apply the isin method:
d = df1[~(df1['l'].isin(df2['l']) & df1['c'].isin(df2['c']))]
Apart from seeming too complicated to me, it returns:
   c  k  l
2  B  2  a
4  C  2  d
but I'm expecting:
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
Accepted answer by jakevdp
You can do this efficiently using isin on a MultiIndex constructed from the desired columns:
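import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
# use df2's columns as the key columns
keys = list(df2.columns.values)
# build a MultiIndex of (c, l) pairs for each frame
i1 = df1.set_index(keys).index
i2 = df2.set_index(keys).index
# keep only the rows of df1 whose (c, l) pair is not in df2
df1[~i1.isin(i2)]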
I think this improves on @IanS's similar solution because it doesn't assume any column type (i.e. it will work with numbers as well as strings).
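To illustrate the point about column types, here is a minimal sketch of the same MultiIndex approach on integer key columns. The frames left and right and their values are made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical data: the key columns 'c' and 'l' hold integers here
left = pd.DataFrame({'c': [1, 1, 2, 3, 3],
                     'k': [10, 20, 30, 40, 50],
                     'l': [7, 8, 7, 7, 9]})
right = pd.DataFrame({'c': [1, 3],
                      'l': [8, 7]})

keys = ['c', 'l']
left_idx = left.set_index(keys).index
right_idx = right.set_index(keys).index

# keep the rows of left whose (c, l) pair does not occur in right
result = left[~left_idx.isin(right_idx)]
print(result)
```

The pairs (1, 8) and (3, 7) are dropped, so the rows labelled 0, 2 and 4 remain. No string concatenation is involved, which is why the column dtype does not matter.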
(The answer above is an edit. My initial answer follows.)
Interesting! This is something I haven't come across before... I would probably solve it by merging the two arrays, then dropping rows where df2 is defined. Here is an example, which makes use of a temporary array:
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
# create a column marking df2 values
df2['marker'] = 1
# join the two, keeping all of df1's rows
joined = pd.merge(df1, df2, on=['c', 'l'], how='left')
joined
# extract desired columns where marker is NaN
joined[pd.isnull(joined['marker'])][df1.columns]
There may be a way to do this without using the temporary array, but I can't think of one. As long as your data isn't huge the above method should be a fast and sufficient answer.
Answer by IanS
How about:
df1['key'] = df1['c'] + df1['l']
d = df1[~df1['key'].isin(df2['c'] + df2['l'])].drop(['key'], axis=1)
Answer by Randy
Another option that avoids creating an extra column or doing a merge would be to do a groupby on df2 to get the distinct (c, l) pairs and then just filter df1 using that.
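# group by a list of column names; a tuple would be treated as a single key
gb = df2.groupby(["c", "l"]).groups
# gb maps each distinct (c, l) pair to its row labels, so tuples can be tested for membership
df1[[p not in gb for p in zip(df1['c'], df1['l'])]]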
For this small example, it actually seems to run a bit faster than the pandas-based approach (666 μs vs. 1.76 ms on my machine), but I suspect it could be slower on larger examples since it's dropping into pure Python.
Answer by Haroon Hassan
This is pretty succinct and works well, provided both dataframes are indexed by the columns being compared (for example after set_index(['c', 'l'])):
df1 = df1[~df1.index.isin(df2.index)]
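Note that this line compares index labels, not column values. Below is a minimal sketch using the sample df1 and df2 from the question, assuming both frames are first re-indexed by the key columns c and l. The names by_label, by_key, a and b are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

# With the default integer index, the filter compares row labels 0..4 with 0..1,
# so it drops rows 0 and 1 regardless of what they contain
by_label = df1[~df1.index.isin(df2.index)]

# Assumption: index both frames by the key columns before filtering
a = df1.set_index(['c', 'l'])
b = df2.set_index(['c', 'l'])
by_key = a[~a.index.isin(b.index)].reset_index()
print(by_key)
```

After reset_index the key columns come first, so by_key has the columns c, l, k rather than the original c, k, l order.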
Answer by dasilvadaniel
I think this is quite a simple approach when you want to filter a dataframe based on multiple columns from another dataframe, or even based on a custom list.
import pandas as pd

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})
# values of df2 columns 'c' and 'l' that will be used to filter df1
idxs = list(zip(df2.c.values, df2.l.values))  # [('A', 'b'), ('C', 'a')]
# mark the rows of df1 whose (c, l) pair is present in df2 (idxs)
mask = pd.Series(list(zip(df1.c, df1.l)), index=df1.index).isin(idxs)
# keep the rows NOT in df2, as the question asks; drop the ~ to keep the matching rows instead
df1 = df1[~mask]
Answer by Erfan
Using DataFrame.merge & DataFrame.query:
A more elegant method would be to do a left join with the argument indicator=True, then use query to keep only the rows marked left_only:
d = (
    df1.merge(df2,
              on=['c', 'l'],
              how='left',
              indicator=True)
    .query('_merge == "left_only"')
    .drop(columns='_merge')
)
print(d)
   c  k  l
0  A  1  a
2  B  2  a
4  C  2  d
indicator=True returns a dataframe with an extra column _merge, which marks each row as left_only, both, or right_only:
df1.merge(df2, on=['c', 'l'], how='left', indicator=True)
   c  k  l     _merge
0  A  1  a  left_only
1  A  2  b       both
2  B  2  a  left_only
3  C  2  a       both
4  C  2  d  left_only
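The merge-and-filter steps above could be wrapped in a small reusable function. This is only a sketch: anti_join is a hypothetical helper name, not part of pandas, and the de-duplication of the right-hand keys is an added assumption so that repeated keys in df2 cannot multiply rows during the join.

```python
import pandas as pd

def anti_join(left, right, on):
    """Return the rows of `left` whose `on` values do not occur in `right`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # keep only the key columns and de-duplicate them before joining
    right_keys = right[on].drop_duplicates()
    merged = left.merge(right_keys, on=on, how='left', indicator=True)
    # rows marked left_only have no matching key in `right`
    return merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

df1 = pd.DataFrame({'c': ['A', 'A', 'B', 'C', 'C'],
                    'k': [1, 2, 2, 2, 2],
                    'l': ['a', 'b', 'a', 'a', 'd']})
df2 = pd.DataFrame({'c': ['A', 'C'],
                    'l': ['b', 'a']})

d = anti_join(df1, df2, on=['c', 'l'])
print(d)
```

The index of the result comes from the merged frame, not from the original left frame, so call reset_index(drop=True) afterwards if a clean index matters.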