What is the difference between join and merge in Pandas?

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/22676081/

Date: 2020-08-19 01:23:26  Source: igfitidea

Tags: python, pandas, dataframe, join

Asked by munk

Suppose I have two DataFrames like so:

import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})

right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})

I want to merge them, so I try something like this:

pd.merge(left, right, left_on='key1', right_on='key2')

And I'm happy with the result:

    key1    lval    key2    rval
0   foo     1       foo     4
1   bar     2       bar     5

But I'm trying to use the join method, which I've been led to believe is pretty similar.

left.join(right, on=['key1', 'key2'])

And I get this:

//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
    406             if self.right_index:
    407                 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408                     raise AssertionError()
    409                 self.right_on = [None] * n
    410         elif self.right_on is not None:

AssertionError: 

What am I missing?

Accepted answer by Paul H

I always use join on indices:

import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')

     val_l  val_r
key            
foo      1      4
bar      2      5

The same functionality can be had by using merge on the columns, as follows:

left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on='key', suffixes=('_l', '_r'))

   key  val_l  val_r
0  foo      1      4
1  bar      2      5
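
For the frames in the question, one way to make join work is to move key2 into the index of right and pass on='key1'. This is only a minimal sketch, not part of the original answer, and the variable name joined is made up for illustration:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})

# join always looks up the right frame by its index, so move key2 there;
# on='key1' tells join which column of left to match against that index.
joined = left.join(right.set_index('key2'), on='key1')
print(joined)
```

Note that key2 does not appear as a column in the result, because it was consumed as the index of right.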

Answer by Noah

I believe that join() is just a convenience method. Try df1.merge(df2) instead, which allows you to specify left_on and right_on:

In [30]: left.merge(right, left_on="key1", right_on="key2")
Out[30]: 
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5

Answer by Romain Jouin

From the pandas documentation:

pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

merge(left, right, how='inner', on=None, left_on=None, right_on=None,
      left_index=False, right_index=False, sort=True,
      suffixes=('_x', '_y'), copy=True, indicator=False)

And:

DataFrame.join is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a very basic example: The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:

result = pd.merge(left, right, left_index=True, right_index=True,
                  how='outer')
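
To see the index-on-index equivalence from the quoted passage in action, here is a small sketch. The frames and the names via_join and via_merge are invented for illustration and do not come from the documentation:

```python
import pandas as pd

# Two frames with partially overlapping row labels.
left = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'B': [3, 4]}, index=['b', 'c'])

# join aligns on the indexes of both frames.
via_join = left.join(right, how='outer')

# merge can be told to do the same with left_index/right_index.
via_merge = pd.merge(left, right, left_index=True, right_index=True,
                     how='outer')

print(via_join)
print(via_join.equals(via_merge))
```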

Answer by Matthias Fripp

pandas.merge() is the underlying function used for all merge/join behavior.

DataFrames provide the pandas.DataFrame.merge() and pandas.DataFrame.join() methods as a convenient way to access the capabilities of pandas.merge(). For example, df1.merge(right=df2, ...) is equivalent to pandas.merge(left=df1, right=df2, ...).

These are the main differences between df.join() and df.merge():

  1. lookup on right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True).
  2. lookup on left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
  3. left vs inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df1.merge(df2) does an inner join by default (returns only matching rows of df1 and df2).
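
The default behaviors in the list above can be checked with a small sketch. The frames, the column names k, x, y and the result names merged and joined are all made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'k': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'k': ['a', 'b'], 'y': [10, 20]})

# merge: matches on the shared column 'k' and does an inner join by default,
# so the unmatched row 'c' is dropped.
merged = df1.merge(df2)

# join: looks up df2 by its index and does a left join by default,
# so row 'c' is kept with a missing y.
joined = df1.join(df2.set_index('k'), on='k')

print(merged)
print(joined)
```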

So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.

Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.

The related DataFrame.join method uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.

...

These two function calls are completely equivalent:

left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
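
As a rough check of the equivalence quoted above, the two calls can be compared on a toy example. The frames and the names a and b below are assumptions for illustration, with 'key' standing in for key_or_keys:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
# The right frame is indexed by the values that left['key'] refers to.
right = pd.DataFrame({'rval': [4, 5]}, index=['foo', 'bar'])

a = left.join(right, on='key')
b = pd.merge(left, right, left_on='key', right_index=True,
             how='left', sort=False)

print(a)
print(a.equals(b))
```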

Answer by steco

One of the differences is that merge creates a new index, while join keeps the left side index. This can have big consequences for your later transformations if you wrongly assume that your index isn't changed by merge.

For example:

import pandas as pd

df1 = pd.DataFrame({'org_index': [101, 102, 103, 104],
                    'date': [201801, 201801, 201802, 201802],
                    'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df1

       date  org_index  val
101  201801        101    1
102  201801        102    2
103  201802        103    3
104  201802        104    4

df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']}).set_index('date')
df2

       dateval
date          
201801       A
201802       B

df1.merge(df2, on='date')

     date  org_index  val dateval
0  201801        101    1       A
1  201801        102    2       A
2  201802        103    3       B
3  201802        104    4       B

df1.join(df2, on='date')
       date  org_index  val dateval
101  201801        101    1       A
102  201801        102    2       A
103  201802        103    3       B
104  201802        104    4       B
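
If you need merge but also want to keep the original index, one possible workaround (an assumption on my part, not something from the answer above) is to move the index into a column before merging and restore it afterwards. The names merged and kept are made up for this sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'date': [201801, 201801, 201802, 201802],
                    'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']})

# Plain merge: the result gets a fresh integer index.
merged = df1.merge(df2, on='date')
print(merged.index.tolist())

# Workaround sketch: reset_index() turns the unnamed index into a column
# called 'index', which survives the merge and can be set back afterwards.
kept = (df1.reset_index()
           .merge(df2, on='date')
           .set_index('index'))
print(kept.index.tolist())
```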

Answer by Harsh

  • Join: joins on the index by default. (If the two frames share any column name, it will throw an error in default mode because you have not defined lsuffix or rsuffix.)
df_1.join(df_2)
  • Merge: joins on columns with the same name by default. (If there is no common column name, it will throw an error in default mode.)
df_1.merge(df_2)
  • The on parameter has a different meaning in the two cases:
df_1.merge(df_2, on='column_1')

df_1.join(df_2, on='column_1')  # It will throw an error
df_1.join(df_2.set_index('column_1'), on='column_1')
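
The error cases described above can be reproduced with a short sketch. The frames and the flags join_failed and merge_failed are invented here for illustration, and the exception types are the ones I would expect from current pandas versions:

```python
import pandas as pd

df_1 = pd.DataFrame({'column_1': ['a', 'b'], 'x': [1, 2]})
df_2 = pd.DataFrame({'column_1': ['a', 'b'], 'y': [3, 4]})

# join without suffixes: both frames have 'column_1', so pandas refuses.
try:
    df_1.join(df_2)
    join_failed = False
except ValueError:
    join_failed = True

# merge without any shared column name: nothing to match on by default.
try:
    df_1[['x']].merge(df_2[['y']])
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True

# on='column_1' works for join once column_1 is the index of the right frame.
joined = df_1.join(df_2.set_index('column_1'), on='column_1')

print(join_failed, merge_failed)
print(joined)
```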

Answer by Kaustubh J

To put it analogously to SQL: "Pandas merge is to outer/inner join as Pandas join is to natural join". Hence, when you use merge in pandas, you want to specify which kind of SQL-style join you want to use, whereas when you use pandas join, you really want to have a matching column label to ensure it joins.