Difference between merge() and concat() in Python pandas

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/38256104/

Time: 2020-08-19 20:34:30  Source: igfitidea

Difference(s) between merge() and concat() in pandas

Tags: python, pandas, join, merge, concat

Asked by WindChimes

What are the essential differences between pd.DataFrame.merge() and pd.concat()?

So far, this is what I have found; please comment on how complete and accurate my understanding is:

  • .merge() can only use columns (plus row indices) and is semantically suited to database-style operations. .concat() can be used along either axis, using only indices, and gives the option of adding a hierarchical index.

  • Incidentally, this allows for the following redundancy: both can combine two dataframes using the row indices.

  • pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()
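
The hierarchical index option mentioned in the first bullet can be sketched as follows. This is only an illustration: the frame names `jan` and `feb` and their values are made up, and the only pandas feature relied on is the `keys` argument of `pd.concat`.

```python
import pandas as pd

# Two small illustrative frames (names and values are made up for this sketch).
jan = pd.DataFrame({"sales": [10, 20]}, index=["x", "y"])
feb = pd.DataFrame({"sales": [30, 40]}, index=["x", "y"])

# keys= adds an outer index level that records which input each row came from.
stacked = pd.concat([jan, feb], keys=["jan", "feb"])

print(stacked.index.nlevels)  # number of index levels after the concat
print(stacked.loc["feb"])     # select the rows that came from `feb`
```

Selecting on the outer level, as in the last line, recovers the rows of one of the original frames.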

(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )

Answered by Abhishek Sawant

At a very high level, merge() combines two dataframes on the basis of the values of common columns (indices can also be used, via left_index=True and/or right_index=True), while concat() appends one or more dataframes one below the other (or side by side, depending on whether the axis option is set to 0 or 1).

join() merges two dataframes on the basis of the index; instead of using merge() with left_index=True and right_index=True, we can use join().
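
A minimal sketch of that equivalence, assuming two made-up frames `left` and `right` that share an index. Note that join() keeps every row of the calling frame by default, so the matching merge() call needs how="left".

```python
import pandas as pd

# Illustrative frames indexed by a shared key (values are made up).
left = pd.DataFrame({"data1": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"data2": [10, 20, 40]}, index=["a", "b", "d"])

# join() aligns on the index and keeps every row of `left` by default.
joined = left.join(right)

# The equivalent merge() call spells out the index alignment and the left join.
merged = left.merge(right, left_index=True, right_index=True, how="left")

pd.testing.assert_frame_equal(joined, merged)  # passes silently when both match
print(joined)
```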

For example:

import pandas as pd

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

df1:
   Key  data1
0   b   0
1   b   1
2   a   2
3   c   3
4   a   4
5   a   5
6   b   6

df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

df2:
    Key data2
0   a   0
1   b   1
2   d   2

#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is 
# a common column in 2 dataframes

pd.merge(df1, df2)

   Key data1 data2
0   b    0    1
1   b    1    1
2   b    6    1
3   a    2    0
4   a    4    0
5   a    5    0

#Concat
# df2 dataframe is appended at the bottom of df1 

pd.concat([df1, df2])

   Key  data1  data2
0    b    0.0    NaN
1    b    1.0    NaN
2    a    2.0    NaN
3    c    3.0    NaN
4    a    4.0    NaN
5    a    5.0    NaN
6    b    6.0    NaN
0    a    NaN    0.0
1    b    NaN    1.0
2    d    NaN    2.0

Answered by Piyush Malhotra Nova_Outlaw

pd.concat takes an Iterable as its argument. Hence, it cannot take DataFrames directly as its argument; they have to be wrapped in a list or similar, as in pd.concat([df1, df2]). Also, the dimensions of the DataFrames should match along the other axis while concatenating; with the default outer join, labels missing from one DataFrame are filled with NaN.

pd.merge takes DataFrames directly as its arguments and is used to combine two DataFrames that share columns or an index. This can't be done with pd.concat, since it would show the shared column twice in the resulting DataFrame.

join, by contrast, can be used to join two DataFrames with different indices.
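
The points in this answer can be checked with a short sketch. The frames below are made up for illustration; the calls themselves (`pd.concat` with a list and `axis=1`, `pd.merge` with `on=`) are standard pandas.

```python
import pandas as pd

df1 = pd.DataFrame({"Key": ["a", "b"], "data1": [1, 2]})
df2 = pd.DataFrame({"Key": ["a", "b"], "data2": [3, 4]})

# Passing a bare DataFrame instead of a list is rejected.
try:
    pd.concat(df1)
    bare_frame_error = None
except TypeError as exc:
    bare_frame_error = exc
print(type(bare_frame_error).__name__)

# Side-by-side concat keeps both copies of the shared "Key" column...
side_by_side = pd.concat([df1, df2], axis=1)
print(list(side_by_side.columns))

# ...whereas merge uses "Key" to match rows and keeps a single copy.
merged = pd.merge(df1, df2, on="Key")
print(list(merged.columns))
```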

Answered by prosti

I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().

Nice question. The main difference:

pd.concat works on both axes.

Another difference is that pd.concat supports only outer (the default) and inner joins, while pd.DataFrame.merge() supports left, right, outer, and inner (the default) joins.
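
For pd.concat, the join type decides what happens to labels on the other axis. A small sketch with two made-up frames `a` and `b` that share only the column "y":

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"y": [5, 6], "z": [7, 8]})

# Default join="outer": the union of columns, with NaN where a frame lacks one.
outer = pd.concat([a, b], ignore_index=True)

# join="inner": only the columns present in every frame survive.
inner = pd.concat([a, b], join="inner", ignore_index=True)

print(list(outer.columns))
print(list(inner.columns))
```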

A third notable difference: pd.DataFrame.merge() has the option to set column suffixes when merging columns with the same name, while pd.concat offers no such option.
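
A short sketch of the suffix behaviour, using made-up frames whose non-key column has the same name. The suffix strings "_left" and "_right" are arbitrary choices for this example.

```python
import pandas as pd

left = pd.DataFrame({"Key": ["a", "b"], "value": [1, 2]})
right = pd.DataFrame({"Key": ["a", "b"], "value": [10, 20]})

# Both frames have a non-key column called "value"; suffixes= disambiguates them.
merged = pd.merge(left, right, on="Key", suffixes=("_left", "_right"))
print(list(merged.columns))

# concat has no suffixes option: the duplicate labels are simply kept twice.
side_by_side = pd.concat([left, right], axis=1)
print(list(side_by_side.columns))
```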



With pd.concat you stack the rows of multiple dataframes by default (axis=0); when you set axis=1, you mimic an index-aligned pd.DataFrame.merge().

Some useful examples of pd.concat:

# df, df1 and df2 are assumed to be existing DataFrames
df2 = pd.concat([df] * 2, ignore_index=True)  # double the rows of a dataframe

df2 = pd.concat([df, df.iloc[[0]]])  # add first row to the end

df3 = pd.concat([df1, df2], join='inner', ignore_index=True)  # concat two df's

Answered by Jake Wu

At a high level:

  • .concat() simply stacks multiple DataFrames together vertically, or stitches them horizontally after aligning on the index
  • .merge() first aligns two DataFrames on selected common column(s) or index, and then picks up the remaining columns from the aligned rows of each DataFrame.

More specifically, .concat():

  • Is a top-level pandas function
  • Combines two or more pandas DataFrames vertically or horizontally
  • Aligns only on the index when combining horizontally
  • Errors, when combining horizontally, if any of the DataFrames contains a duplicate index
  • Defaults to outer join with the option for inner join

And .merge():

  • Exists both as a top-level pandas function and as a DataFrame method (as of pandas 1.0)
  • Combines exactly two DataFrames horizontally
  • Aligns the calling DataFrame's column(s) or index with the other DataFrame's column(s) or index
  • Handles duplicate values on the joining columns or index by performing a cartesian product
  • Defaults to inner join with options for left, outer, and right
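
The join options in the last bullet can be illustrated with the df1 and df2 frames from the first answer. The use of `indicator=True` here is an addition for illustration only; it is a standard merge parameter that labels where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"Key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)})
df2 = pd.DataFrame({"Key": ["a", "b", "d"], "data2": range(3)})

# Default how="inner": only keys found in both frames ("a" and "b") survive.
inner = pd.merge(df1, df2, on="Key")

# how="outer" keeps unmatched keys too; indicator=True adds a "_merge" column
# saying whether each row came from the left frame, the right frame, or both.
outer = pd.merge(df1, df2, on="Key", how="outer", indicator=True)

print(len(inner), len(outer))
print(outer["_merge"].value_counts().to_dict())
```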

Note that when performing pd.merge(left, right), if left has two rows containing the same values in the joining columns or index, each row will combine with right's corresponding row(s), resulting in a cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicated index exists in either DataFrame.
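
Both behaviours can be seen in a small sketch with made-up frames. The exact exception class raised by the horizontal concat has varied across pandas versions, so the example catches a broad Exception and only reports its name.

```python
import pandas as pd

left = pd.DataFrame({"Key": ["a", "a"], "l": [1, 2]})
right = pd.DataFrame({"Key": ["a", "a"], "r": [10, 20]})

# Two "a" rows on each side: merge pairs every left row with every right row.
product = pd.merge(left, right, on="Key")
print(len(product))

# Horizontal concat has to align on the index, so a duplicated index label
# combined with a different index on the other frame cannot be resolved.
dup = pd.DataFrame({"l": [1, 2, 3]}, index=[0, 0, 1])
other = pd.DataFrame({"r": [10, 20, 30]}, index=[0, 1, 2])
try:
    pd.concat([dup, other], axis=1)
    concat_error = None
except Exception as exc:  # exact exception class differs between pandas versions
    concat_error = exc
print(type(concat_error).__name__)
```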

Practically speaking:

  • Consider .concat() first when combining homogeneous DataFrames, and .merge() first when combining complementary DataFrames.
  • If you need to combine vertically, go with .concat(). If you need to combine horizontally via columns, go with .merge(), which by default merges on the columns in common.

Reference: Pandas 1.x Cookbook

Answered by null

The main difference between merge and concat is that merge allows you to perform a more structured "join" of tables, whereas concat is broader in use and less structured.

Merge

According to the documentation, pd.DataFrame.merge takes right as a required argument; you can think of it as joining a left table and a right table according to some predefined, structured join operation. Note the definition of the parameter right.

Required Parameters

  • right: DataFrame or named Series

Optional Parameters

  • how: {'left', 'right', 'outer', 'inner'}, default 'inner'
  • on: label or list
  • left_on: label or list, or array-like
  • right_on: label or list, or array-like
  • left_index: bool, default False
  • right_index: bool, default False
  • sort: bool, default False
  • suffixes: tuple of (str, str), default ('_x', '_y')
  • copy: bool, default True
  • indicator: bool or str, default False
  • validate: str, optional

Important: pd.DataFrame.merge requires right to be a pd.DataFrame or a named pd.Series object.
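
A minimal sketch of what "named" means in practice. The frame and Series below are made up, and the name "v4" is an arbitrary choice for the example.

```python
import pandas as pd

df_left = pd.DataFrame({"v1": [1, 5, 9, 13]})
unnamed = pd.Series([5, 5, 5, 5])

# An unnamed Series has no column label to contribute, so merge refuses it.
try:
    df_left.merge(unnamed, left_index=True, right_index=True)
    unnamed_error = None
except ValueError as exc:
    unnamed_error = exc
print(unnamed_error)

# Giving the Series a name lets merge treat it as a one-column frame.
named = unnamed.rename("v4")
result = df_left.merge(named, left_index=True, right_index=True)
print(list(result.columns))
```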

Output

  • Returns: DataFrame

Furthermore, the docstring for the merge operation in pandas reads:

Perform a database (SQL) merge operation between two DataFrame or Series objects using either columns as keys or their row indexes

Concat

Referring to the documentation of pd.concat, first note that the parameter is not named table, data_frame, series, matrix, etc., but objs instead. That is, you can pass many "data containers", which are typed as:

Iterable[FrameOrSeriesUnion], Mapping[Optional[Hashable], FrameOrSeriesUnion]

Required Parameters

  • objs: a sequence or mapping of Series or DataFrame objects

Optional Parameters

  • axis: {0/'index', 1/'columns'}, default 0
  • join: {'inner', 'outer'}, default 'outer'
  • ignore_index: bool, default False
  • keys: sequence, default None
  • levels: list of sequences, default None
  • names: list, default None
  • verify_integrity: bool, default False
  • sort: bool, default False
  • copy: bool, default True

Output

  • Returns: object, type of objs
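
How the return type follows from the inputs, and how a mapping can be passed as objs, can be sketched as below. The Series names and the mapping keys "first" and "second" are made up for this illustration.

```python
import pandas as pd

v1 = pd.Series([1, 5, 9, 13], name="v1")
v2 = pd.Series([10, 100, 1000, 10000], name="v2")

# Stacking Series end to end (axis=0) gives back a Series...
stacked = pd.concat([v1, v2])
# ...while placing them side by side (axis=1) gives a DataFrame.
side_by_side = pd.concat([v1, v2], axis=1)
print(type(stacked).__name__, type(side_by_side).__name__)

# A mapping works too: its keys become the outer level of the resulting index.
labelled = pd.concat({"first": v1, "second": v2})
print(labelled.loc["second"].tolist())
```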

Example

Code

import pandas as pd

v1 = pd.Series([1, 5, 9, 13])
v2 = pd.Series([10, 100, 1000, 10000])
v3 = pd.Series([0, 1, 2, 3])

df_left = pd.DataFrame({
    "v1": v1,
    "v2": v2,
    "v3": v3
    })
df_right = pd.DataFrame({
    "v4": [5, 5, 5, 5],
    "v5": [3, 2, 1, 0]
    })


df_concat = pd.concat([v1, v2, v3])

# Performing operations on default

merge_result = df_left.merge(df_right, left_index=True, right_index=True)
concat_result = pd.concat([df_left, df_right], sort=False)
print(merge_result)
print('='*20)
print(concat_result)

Code Output

   v1     v2  v3  v4  v5
0   1     10   0   5   3
1   5    100   1   5   2
2   9   1000   2   5   1
3  13  10000   3   5   0
====================
     v1       v2   v3   v4   v5
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0

You can, however, reproduce the first output (merge) with concat by changing the axis parameter:

concat_result = pd.concat([df_left, df_right], sort=False, axis=1)

Observe the following behavior,

concat_result = pd.concat([df_left, df_right, df_left, df_right], sort=False)

outputs:

     v1       v2   v3   v4   v5
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0
0   1.0     10.0  0.0  NaN  NaN
1   5.0    100.0  1.0  NaN  NaN
2   9.0   1000.0  2.0  NaN  NaN
3  13.0  10000.0  3.0  NaN  NaN
0   NaN      NaN  NaN  5.0  3.0
1   NaN      NaN  NaN  5.0  2.0
2   NaN      NaN  NaN  5.0  1.0
3   NaN      NaN  NaN  5.0  0.0

You cannot perform a similar operation with merge, since it accepts only a single DataFrame or named Series.

merge_result = df_left.merge([df_right, df_left, df_right], left_index=True, right_index=True)

outputs:

TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed

Conclusion

As you may have noticed already, the inputs and outputs of "merge" and "concat" may differ.

As I mentioned at the beginning, the first (and main) difference is that "merge" performs a more structured join with a restricted set of objects and parameters, whereas "concat" performs a less strict, broader join with a broader set of objects and parameters.

All in all, "merge" is less tolerant of variation in its input, while "concat" is looser and less sensitive to it. You can reproduce an index-aligned "merge" with "concat", but the reverse is not always true.

The "merge" operation uses DataFrame columns (or the name of a pd.Series object) or row indices. Since it uses only those entities, it performs a horizontal merge of DataFrames or Series and, as a result, does not apply vertical operations.

If you want to see more, you can dive into the source code a bit.

Answered by vicpal

By default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
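
These defaults become visible once the indexes only partly overlap. The sketch below uses made-up frames and sets axis=1 on pd.concat so that all three calls combine columns; the join types are left at their defaults.

```python
import pandas as pd

# Partially overlapping indexes make the different defaults visible.
df1 = pd.DataFrame({"data1": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"data2": [10, 20, 40]}, index=["a", "b", "d"])

joined = df1.join(df2)                                          # left join on the index
merged = pd.merge(df1, df2, left_index=True, right_index=True)  # inner join
concatenated = pd.concat([df1, df2], axis=1)                    # outer join on the index

print(sorted(joined.index))
print(sorted(merged.index))
print(sorted(concatenated.index))
```

With fully matching indexes the three results coincide; the differences only show up for labels present in one frame but not the other.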

pd.concat:
takes an Iterable argument; thus, it cannot take DataFrames directly (use [df, df2])
dimensions of the DataFrames should match along the axis

join and pd.merge:
can take DataFrame arguments

The code below does the same thing when df1 and df2 share the same index and have no overlapping column names:

df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)