Python: finding common rows (intersection) in two Pandas dataframes

Note: this page is a translated copy of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/19618912/

Time: 2020-08-19 14:12:14  Source: igfitidea

Finding common rows (intersection) in two Pandas dataframes

Tags: python, pandas, dataframe, intersect

Asked by David Chouinard

Assume I have two dataframes of this format (call them df1 and df2):

+------------------------+------------------------+--------+
|        user_id         |      business_id       | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA |      4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA |      5 |
| mlBC3pN9GXlUUfQi1qBBZA | KtheitroaddcIfh3XWxiCeV1BDmA |      3 |
+------------------------+------------------------+--------+

I'm looking to get a dataframe of all the rows that have a common user_id in df1 and df2 (i.e., if a user_id is in both df1 and df2, include the two rows in the output dataframe).

I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
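
The set-based approach described above could be sketched roughly as follows. The sample frames and the names `common_ids` and `result` are made up for illustration and are not from the original question:

```python
import pandas as pd

# Hypothetical sample data; only the user_id column matters here.
df1 = pd.DataFrame({"user_id": ["a", "b", "c"], "rating": [4, 5, 3]})
df2 = pd.DataFrame({"user_id": ["b", "c", "d"], "rating": [1, 2, 5]})

# Intersection of the unique user_ids found in each frame.
common_ids = set(df1["user_id"]) & set(df2["user_id"])

# Filter each frame by membership in that set, then stack the results.
result = pd.concat(
    [
        df1[df1["user_id"].isin(list(common_ids))],
        df2[df2["user_id"].isin(list(common_ids))],
    ],
    ignore_index=True,
)
print(result)
```

Here `ignore_index=True` renumbers the stacked rows; drop it if the original row labels should be kept.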

Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.

Accepted answer by aldorath

My understanding is that this question is better answered over in this post.

But briefly, the answer to the OP with this method is simply:

s1 = pd.merge(df1, df2, how='inner', on=['user_id'])

Which gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.
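
As a small illustration of those five columns, here is a sketch with made-up data in the question's three-column layout. The values are hypothetical, and the `_x`/`_y` suffixes are the pandas defaults for overlapping column names:

```python
import pandas as pd

# Hypothetical frames with the same three columns as in the question.
df1 = pd.DataFrame(
    {"user_id": ["a", "b"], "business_id": ["x", "y"], "rating": [4, 5]}
)
df2 = pd.DataFrame(
    {"user_id": ["b", "c"], "business_id": ["z", "x"], "rating": [3, 1]}
)

# Inner join on user_id: only user_ids present in both frames survive.
s1 = pd.merge(df1, df2, how="inner", on=["user_id"])

print(list(s1.columns))
# ['user_id', 'business_id_x', 'rating_x', 'business_id_y', 'rating_y']
```

Note that the merge places each matching pair of rows side by side in one wide row instead of stacking them, so a user_id that occurs several times in both frames yields one row per combination.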

Answer by Phillip Cloud

If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():

In [80]: df1
Out[80]:
   rating  user_id
0       2  0x21abL
1       1  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
5       2  0x21abL
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
9       1  0x21abL

In [81]: df2
Out[81]:
   rating      user_id
0       2      0x1d14L
1       1    0xdbdcad7
2       1      0x21abL
3       3      0x21abL
4       3      0x21abL
5       1  0x5734a81e2
6       2      0x1d14L
7       0       0xdafL
8       0      0x1d14L
9       4  0x5734a81e2

In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)

In [83]: ind
Out[83]:
0     True
1    False
2     True
3     True
4     True
5    False
6     True
7     True
8     True
9    False
Name: user_id, dtype: bool

In [84]: df1[ind].append(df2[ind])
Out[84]:
   rating  user_id
0       2  0x21abL
2       1   0xdafL
3       0  0x21abL
4       4  0x1d14L
6       1  0x21abL
7       0   0xdafL
8       4  0x1d14L
0       2  0x1d14L
2       1  0x21abL
3       3  0x21abL
4       3  0x21abL
6       2  0x1d14L
7       0   0xdafL
8       0  0x1d14L

This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if

In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')

In fact, it won't give the expected output if their row indices are not equal.
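
One possible way around the index-alignment caveat is to build a separate boolean mask for each frame, so each mask is aligned with the frame it filters. The following is only a sketch under that assumption: the data and the names `in_both_1`, `in_both_2`, and `common` are hypothetical, and `pd.concat` is used because `DataFrame.append()` was removed in pandas 2.0:

```python
import pandas as pd

# Hypothetical frames whose row indices do not overlap at all.
df1 = pd.DataFrame(
    {"user_id": ["a", "b", "c"], "rating": [2, 1, 4]}, index=[0, 1, 2]
)
df2 = pd.DataFrame(
    {"user_id": ["c", "d", "a"], "rating": [3, 0, 1]}, index=[10, 11, 12]
)

# One mask per frame; isin() compares values, not index labels.
in_both_1 = df1["user_id"].isin(df2["user_id"])
in_both_2 = df2["user_id"].isin(df1["user_id"])

# Stack the filtered rows and renumber them to avoid duplicate labels.
common = pd.concat([df1[in_both_1], df2[in_both_2]], ignore_index=True)
print(common)
```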

Answer by Roman Pekar

In SQL, this problem could be solved by several methods:

select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)

Or join and then unpivot (possible in SQL Server):

select
    df1.user_id,
    c.rating
from df1
    inner join df2 on df2.user_id = df1.user_id
    outer apply (
        select df1.rating union all
        select df2.rating
    ) as c

The second one could be written in pandas with something like:

>>> import pandas as pd
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> # inner join on user_id; the rating columns become rating_1 and rating_2
>>> df = pd.merge(df1, df2, on='user_id', suffixes=('_1', '_2'))
>>> # "unpivot": turn each rating column back into a plain rating column
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
   user_id  rating
0        3      20
0        3      30