Finding common rows (intersection) in two Pandas dataframes
Source: http://stackoverflow.com/questions/19618912/
This page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by David Chouinard
Assume I have two dataframes of this format (call them df1 and df2):
+------------------------+------------------------------+--------+
| user_id                | business_id                  | rating |
+------------------------+------------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA       | 4      |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA       | 5      |
| mlBC3pN9GXlUUfQi1qBBZA | KtheitroaddcIfh3XWxiCeV1BDmA | 3      |
+------------------------+------------------------------+--------+
I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2 (i.e., if a user_id is in both df1 and df2, include the two rows in the output dataframe).
I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.
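For reference, the set-based approach described in the question can be sketched roughly as follows. The sample data and the names common_ids and common_rows are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample data; the IDs and ratings are invented for this sketch.
df1 = pd.DataFrame({"user_id": ["a", "b", "c"], "rating": [4, 5, 3]})
df2 = pd.DataFrame({"user_id": ["b", "c", "d"], "rating": [2, 1, 5]})

# Unique user_ids that appear in both frames.
common_ids = set(df1["user_id"]) & set(df2["user_id"])

# Filter each frame with the shared IDs, then stack the two filtered frames.
common_rows = pd.concat(
    [
        df1[df1["user_id"].isin(common_ids)],
        df2[df2["user_id"].isin(common_ids)],
    ],
    ignore_index=True,
)
print(common_rows)
```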
Accepted answer by aldorath
My understanding is that this question is better answered over in this post.
But briefly, the answer to the OP with this method is simply:
s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
This gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.
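As a small illustration of the 5-column result, here is a sketch with made-up data in the question's three-column layout. The values are hypothetical; the _x and _y suffixes are the pandas defaults for overlapping column names. Note that an inner merge returns one row per matching pair of rows, so it is a different shape from the stacked output the question asks for.

```python
import pandas as pd

# Hypothetical data with the same three columns as in the question.
df1 = pd.DataFrame(
    {"user_id": ["a", "b"], "business_id": ["p", "q"], "rating": [4, 5]}
)
df2 = pd.DataFrame(
    {"user_id": ["b", "c"], "business_id": ["r", "s"], "rating": [3, 2]}
)

# Inner merge keeps only user_ids present in both frames.
s1 = pd.merge(df1, df2, how="inner", on=["user_id"])
print(list(s1.columns))
# ['user_id', 'business_id_x', 'rating_x', 'business_id_y', 'rating_y']
```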
Answer by Phillip Cloud
If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():
In [80]: df1
Out[80]:
rating user_id
0 2 0x21abL
1 1 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
5 2 0x21abL
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
9 1 0x21abL
In [81]: df2
Out[81]:
rating user_id
0 2 0x1d14L
1 1 0xdbdcad7
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
5 1 0x5734a81e2
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
9 4 0x5734a81e2
In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)
In [83]: ind
Out[83]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 False
Name: user_id, dtype: bool
In [84]: df1[ind].append(df2[ind])
Out[84]:
rating user_id
0 2 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
0 2 0x1d14L
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if
In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')
In fact, it won't give the expected output if their row indices are not equal.
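One way to sidestep the index-alignment caveat is to build a separate mask for each frame, so that neither mask is applied to the other frame. The following is a minimal sketch under that assumption, not the answerer's code; the data and variable names are hypothetical. It uses pd.concat because DataFrame.append() was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data; df1 deliberately has a different index from df2.
df1 = pd.DataFrame(
    {"user_id": [1, 2, 3], "rating": [10, 15, 20]}, index=[10, 11, 12]
)
df2 = pd.DataFrame({"user_id": [3, 4, 1, 3], "rating": [30, 35, 40, 45]})

# Each mask is computed on, and applied to, its own frame only.
in_both_1 = df1["user_id"].isin(df2["user_id"])
in_both_2 = df2["user_id"].isin(df1["user_id"])

# Stack the filtered frames; ignore_index avoids duplicate row labels.
result = pd.concat([df1[in_both_1], df2[in_both_2]], ignore_index=True)
print(result)
```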
Answer by Roman Pekar
In SQL, this problem could be solved by several methods:
select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)
Or join and then unpivot (possible in SQL Server):
select
df1.user_id,
c.rating
from df1
inner join df2 on df2.user_id = df1.user_id
outer apply (
select df1.rating union all
select df2.rating
) as c
The second one could be written in pandas with something like:
>>> import pandas as pd
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
user_id rating
0 3 20
0 3 30
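The unpivot step could also be expressed with DataFrame.melt instead of two renamed slices. This is an alternative sketch, not part of the original answer; the names merged and stacked are made up, and "variable" is the default name melt gives the column holding the original column labels.

```python
import pandas as pd

df1 = pd.DataFrame({"user_id": [1, 2, 3], "rating": [10, 15, 20]})
df2 = pd.DataFrame({"user_id": [3, 4, 5], "rating": [30, 35, 40]})

# Join on user_id; the suffixes keep the two rating columns apart.
merged = pd.merge(df1, df2, on="user_id", suffixes=("_1", "_2"))

# Unpivot both rating columns into one, then drop the label column.
stacked = merged.melt(
    id_vars="user_id",
    value_vars=["rating_1", "rating_2"],
    value_name="rating",
).drop(columns="variable")
print(stacked)
```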