Difference between merge() and concat() in Python pandas
Original source: http://stackoverflow.com/questions/38256104/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Difference(s) between merge() and concat() in pandas
Asked by WindChimes
What's the essential difference between pd.DataFrame.merge() and pd.concat()?
So far, this is what I found, please comment on how complete and accurate my understanding is:
- .merge() can only use columns (plus row indices) and is semantically suitable for database-style operations. .concat() can be used with either axis, using only indices, and gives the option of adding a hierarchical index.
- Incidentally, this allows for the following redundancy: both can combine two dataframes using the row indices.
- pd.DataFrame.join() merely offers a shorthand for a subset of the use cases of .merge()
(Pandas is great at addressing a very wide spectrum of use cases in data analysis. It can be a bit daunting exploring the documentation to figure out what is the best way to perform a particular task. )
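The hierarchical-index option mentioned in the question can be sketched as follows. This is a minimal illustration, assuming two toy frames (q1, q2) and level names that are invented here and do not come from the original post.

```python
import pandas as pd

# Hypothetical toy frames; names and values are illustrative only.
q1 = pd.DataFrame({"sales": [10, 20]}, index=["north", "south"])
q2 = pd.DataFrame({"sales": [30, 40]}, index=["north", "south"])

# keys= adds an outer index level that records which frame each row came from.
stacked = pd.concat([q1, q2], keys=["q1", "q2"], names=["quarter", "region"])

print(stacked)
print(stacked.loc["q2"])  # select every row that came from q2
```

The outer level makes it possible to recover each original frame after stacking, which plain row-wise concatenation would not allow when the row labels repeat.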
Answer by Abhishek Sawant
A very high-level difference is that merge() is used to combine two dataframes on the basis of values of common columns (indices can also be used, via left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the options left_index=True and right_index=True, we can use join(). Note that join() performs a left join by default, whereas merge() performs an inner join.
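A small sketch of that equivalence, assuming two toy frames (left, right) invented for illustration. Because join() defaults to a left join, the matching merge() call needs how="left".

```python
import pandas as pd

# Hypothetical frames indexed by label; values are illustrative only.
left = pd.DataFrame({"data1": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"data2": [10, 30]}, index=["a", "c"])

joined = left.join(right)  # left join on the index by default
merged = left.merge(right, left_index=True, right_index=True, how="left")

# Without how="left", merge keeps only the labels present in both frames.
inner = left.merge(right, left_index=True, right_index=True)

print(joined)
print(joined.equals(merged))
print(len(inner))
```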
For example:
import pandas as pd

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df1:
Key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})
df2:
Key data2
0 a 0
1 b 1
2 d 2
#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is
# a common column in 2 dataframes
pd.merge(df1, df2)
Key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
#Concat
# df2 dataframe is appended at the bottom of df1
pd.concat([df1, df2])
Key data1 data2
0 b 0.0 NaN
1 b 1.0 NaN
2 a 2.0 NaN
3 c 3.0 NaN
4 a 4.0 NaN
5 a 5.0 NaN
6 b 6.0 NaN
0 a NaN 0.0
1 b NaN 1.0
2 d NaN 2.0
Answer by Piyush Malhotra Nova_Outlaw
pd.concat takes an Iterable as its argument. Hence, it cannot take DataFrames directly as its argument; wrap them in a list such as [df1, df2]. Also, the dimensions of the DataFrames should match along the axis while concatenating; with the default outer join, labels that do not match are filled with NaN.
pd.merge can take DataFrames as its arguments, and is used to combine two DataFrames with the same columns or index, which can't be done with pd.concat since it will show the repeated column in the DataFrame.
Whereas join can be used to join two DataFrames with different indices.
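The point about the Iterable argument can be checked with a short sketch. The frames df1 and df2 below are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df2 = pd.DataFrame({"x": [3, 4]})

# A list of frames is accepted.
combined = pd.concat([df1, df2], ignore_index=True)

# A bare DataFrame is rejected; capture the error instead of crashing.
try:
    pd.concat(df1)
    bare_frame_error = None
except TypeError as exc:
    bare_frame_error = exc

print(combined)
print(type(bare_frame_error).__name__)
```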
Answer by prosti
I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().
Nice question. The main difference:
pd.concat works on both axes.
The other difference is that pd.concat has only outer (default) and inner joins, while pd.DataFrame.merge() has left, right, outer, and inner (default) joins.
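A rough sketch of how the join options differ, assuming two small frames (a, b) invented here. In concat the join argument selects labels on the other axis, while in merge the how argument selects key values.

```python
import pandas as pd

# Hypothetical frames sharing only the column "k".
a = pd.DataFrame({"k": ["x", "y"], "v1": [1, 2]})
b = pd.DataFrame({"k": ["y", "z"], "v2": [3, 4]})

# concat: join= controls which labels on the *other* axis are kept.
outer_cols = pd.concat([a, b]).columns.tolist()                # union of columns
inner_cols = pd.concat([a, b], join="inner").columns.tolist()  # shared columns only

# merge: how= controls which key values are kept.
rows_per_how = {
    how: len(a.merge(b, on="k", how=how))
    for how in ["inner", "left", "right", "outer"]
}

print(outer_cols, inner_cols)
print(rows_per_how)
```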
A third notable difference is that pd.DataFrame.merge() has the option to set the column suffixes when merging columns with the same name, while for pd.concat this is not possible.
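To illustrate the suffix behaviour, here is a minimal sketch with two made-up frames that both carry a "score" column. The suffix strings are arbitrary choices for this example.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "score": [0.5, 0.7]})
right = pd.DataFrame({"id": [1, 2], "score": [0.6, 0.9]})

# merge renames the clashing non-key column using the given suffixes.
merged = left.merge(right, on="id", suffixes=("_old", "_new"))

# concat along columns keeps the duplicated names as they are.
stacked_sideways = pd.concat([left, right], axis=1)

print(merged.columns.tolist())
print(stacked_sideways.columns.tolist())
```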
With pd.concat, by default you stack the rows of multiple dataframes (axis=0), and when you set axis=1 you mimic an index-based pd.DataFrame.merge().
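The resemblance is not exact when the indexes differ, because the two functions have different default join types. A small sketch, assuming toy frames (left, right) that are not part of the original answer:

```python
import pandas as pd

left = pd.DataFrame({"v1": [1, 2, 3]}, index=["a", "b", "c"])
right = pd.DataFrame({"v2": [10, 20]}, index=["b", "d"])

side_by_side = pd.concat([left, right], axis=1)                # outer join on the index
only_shared = pd.concat([left, right], axis=1, join="inner")   # inner join on the index
merged = left.merge(right, left_index=True, right_index=True)  # inner join by default

print(side_by_side)
print(only_shared)
print(merged)
```

To reproduce merge's default result with concat, join="inner" has to be passed explicitly.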
Some useful examples of pd.concat:
df2 = pd.concat([df] * 2, ignore_index=True)  # double the rows of a dataframe
df2 = pd.concat([df, df.iloc[[0]]])  # add first row to the end
df3 = pd.concat([df1, df2], join='inner', ignore_index=True)  # concat two df's
Answer by Jake Wu
At a high level:
- .concat() simply stacks multiple DataFrames together either vertically, or stitches them horizontally after aligning on the index
- .merge() first aligns two DataFrames' selected common column(s) or index, and then picks up the remaining columns from the aligned rows of each DataFrame.
More specifically, .concat():
- Is a top-level pandas function
- Combines two or more pandas DataFrames vertically or horizontally
- Aligns only on the index when combining horizontally
- Can raise an error when combining horizontally if any of the DataFrames contains a duplicate index
- Defaults to outer join with the option for inner join
And .merge():
- Exists both as a top-level pandas function and a DataFrame method (as of pandas 1.0)
- Combines exactly two DataFrames horizontally
- Aligns the calling DataFrame's column(s) or index with the other DataFrame's column(s) or index
- Handles duplicate values on the joining columns or index by performing a cartesian product
- Defaults to inner join with options for left, outer, and right
Note that when performing pd.merge(left, right), if left has two rows containing the same values in the joining columns or index, each row will combine with right's corresponding row(s), resulting in a cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicated index exists in either DataFrame.
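The cartesian-product behaviour can be seen in a short sketch. The frames below are invented for illustration, and the use of the validate argument is an addition that the original answer does not mention.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical frames where the key "a" appears twice on each side.
left = pd.DataFrame({"key": ["a", "a", "b"], "l": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "b"], "r": [10, 20, 30]})

merged = left.merge(right, on="key")
# "a": 2 x 2 = 4 rows, "b": 1 x 1 = 1 row
print(len(merged))

# validate= asks merge to raise if the keys are not unique as declared.
try:
    left.merge(right, on="key", validate="one_to_one")
    validation_failed = False
except MergeError:
    validation_failed = True

print(validation_failed)
```

Passing validate is one way to catch an unintended row explosion early instead of discovering it from the size of the result.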
Practically speaking:
- Consider .concat() first when combining homogeneous DataFrames, and consider .merge() first when combining complementary DataFrames.
- If you need to merge vertically, go with .concat(). If you need to merge horizontally via columns, go with .merge(), which by default merges on the columns in common.
Reference: Pandas 1.x Cookbook
Answer by null
The main difference between merge and concat is that merge allows you to perform a more structured "join" of tables, whereas concat is broader and less structured in its use.
Merge
Referring to the documentation, pd.DataFrame.merge takes right as a required argument; you can think of it as joining a left table and a right table according to some pre-defined structured join operation. Note the definition of the parameter right.
Required Parameters
- right: DataFrame or named Series
Optional Parameters
- how: {'left', 'right', 'outer', 'inner'}, default 'inner'
- on: label or list
- left_on: label or list, or array-like
- right_on: label or list, or array-like
- left_index: bool, default False
- right_index: bool, default False
- sort: bool, default False
- suffixes: tuple of (str, str), default ('_x', '_y')
- copy: bool, default True
- indicator: bool or str, default False
- validate: str, optional
Important: pd.DataFrame.merge requires right to be a pd.DataFrame or named pd.Series object.
Output
- Returns: DataFrame
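As an illustration of two of the optional parameters listed above (how and indicator), here is a minimal sketch. The customers and orders frames are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical tables; cust_id 1 has no order and cust_id 4 has no customer row.
customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"cust_id": [2, 3, 4], "total": [25.0, 40.0, 15.0]})

# indicator=True adds a "_merge" column saying where each row came from.
report = customers.merge(orders, on="cust_id", how="outer", indicator=True)

# Map each key to its origin for easy inspection.
status = dict(zip(report["cust_id"], report["_merge"].astype(str)))

print(report)
print(status)
```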
Furthermore, the docstring for the merge operation in pandas reads:
Perform a database (SQL) merge operation between two DataFrame or Series objects using either columns as keys or their row indexes
Concat
Referring to the documentation of pd.concat, first note that the parameter is not named table, data_frame, series, matrix, etc., but objs instead. That is, you can pass many "data containers", which are defined as:
Iterable[FrameOrSeriesUnion], Mapping[Optional[Hashable], FrameOrSeriesUnion]
Required Parameters
- objs: a sequence or mapping of Series or DataFrame objects
Optional Parameters
- axis: {0/'index', 1/'columns'}, default 0
- join: {'inner', 'outer'}, default 'outer'
- ignore_index: bool, default False
- keys: sequence, default None
- levels: list of sequences, default None
- names: list, default None
- verify_integrity: bool, default False
- sort: bool, default False
- copy: bool, default True
Output
- Returns: object, type of objs
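A brief sketch of a few of these options: passing a mapping as objs, verify_integrity, and ignore_index. The jan and feb frames are made up for this example.

```python
import pandas as pd

jan = pd.DataFrame({"amount": [5, 7]})
feb = pd.DataFrame({"amount": [6, 8]})

# A dict (Mapping) is accepted; its keys become the outer index level.
by_month = pd.concat({"jan": jan, "feb": feb})

# Both frames use the default index 0, 1, so plain stacking repeats labels.
# verify_integrity=True turns that repetition into an error.
try:
    pd.concat([jan, feb], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([jan, feb], ignore_index=True)

print(by_month)
print(overlap_detected)
print(renumbered)
```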
Example
Code
import pandas as pd
v1 = pd.Series([1, 5, 9, 13])
v2 = pd.Series([10, 100, 1000, 10000])
v3 = pd.Series([0, 1, 2, 3])
df_left = pd.DataFrame({
    "v1": v1,
    "v2": v2,
    "v3": v3
})
df_right = pd.DataFrame({
    "v4": [5, 5, 5, 5],
    "v5": [3, 2, 1, 0]
})
df_concat = pd.concat([v1, v2, v3])
# Performing operations on default
merge_result = df_left.merge(df_right, left_index=True, right_index=True)
concat_result = pd.concat([df_left, df_right], sort=False)
print(merge_result)
print('='*20)
print(concat_result)
Code Output
v1 v2 v3 v4 v5
0 1 10 0 5 3
1 5 100 1 5 2
2 9 1000 2 5 1
3 13 10000 3 5 0
====================
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
You can, however, achieve the first output (merge) with concat by changing the axis parameter:
concat_result = pd.concat([df_left, df_right], sort=False, axis=1)
Observe the following behavior:
concat_result = pd.concat([df_left, df_right, df_left, df_right], sort=False)
outputs:
v1 v2 v3 v4 v5
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
0 1.0 10.0 0.0 NaN NaN
1 5.0 100.0 1.0 NaN NaN
2 9.0 1000.0 2.0 NaN NaN
3 13.0 10000.0 3.0 NaN NaN
0 NaN NaN NaN 5.0 3.0
1 NaN NaN NaN 5.0 2.0
2 NaN NaN NaN 5.0 1.0
3 NaN NaN NaN 5.0 0.0
You cannot perform a similar operation with merge, since it only accepts a single DataFrame or named Series:
merge_result = df_left.merge([df_right, df_left, df_right], left_index=True, right_index=True)
outputs:
TypeError: Can only merge Series or DataFrame objects, a <class 'list'> was passed
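When several frames do need to be merged, one common workaround is to chain pairwise merges. The sketch below uses functools.reduce for that; the frames and the shared "id" column are assumptions made for illustration and are not part of the original answer.

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that all share an "id" column.
frames = [
    pd.DataFrame({"id": [1, 2, 3], "a": [10, 20, 30]}),
    pd.DataFrame({"id": [1, 2, 3], "b": [0.1, 0.2, 0.3]}),
    pd.DataFrame({"id": [2, 3, 4], "c": ["x", "y", "z"]}),
]

# Fold the list pairwise: ((frames[0] merge frames[1]) merge frames[2]).
merged_all = reduce(lambda left, right: left.merge(right, on="id"), frames)

print(merged_all)
```

Each step is an ordinary two-frame inner merge, so only the ids present in every frame survive.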
Conclusion
As you may have noticed already, the inputs and outputs may differ between "merge" and "concat".
As I mentioned at the beginning, the very first (main) difference is that "merge" performs a more structured join with a restricted set of objects and parameters, whereas "concat" performs a less strict, broader join with a broader set of objects and parameters.
All in all, "merge" is less tolerant of changes in the input and "concat" is looser and less sensitive to changes in the input. You can reproduce an index-based "merge" by using "concat", but the reverse is not always true.
The "merge" operation uses DataFrame columns (or the name of a pd.Series object) or row indices, and since it uses only those entities it performs a horizontal merge of DataFrames or Series, and as a result does not apply a vertical operation.
If you want to see more, you can dive into the source code a bit.
Answer by vicpal
by default:
join is a column-wise left join
pd.merge is a column-wise inner join
pd.concat is a row-wise outer join
pd.concat:
takes Iterable arguments. Thus, it cannot take DataFrames directly (use [df,df2])
Dimensions of DataFrame should match along axis
Join and pd.merge:
can take DataFrame arguments
The calls below do the same thing when df1 and df2 share the same index and have no overlapping column names:
df1.join(df2)
pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)
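A quick way to check this is the sketch below. It assumes two toy frames with an identical index and distinct column names; with differing indexes the three calls would diverge, because their default join types are left, inner, and outer respectively.

```python
import pandas as pd

# Hypothetical frames with the same index and no shared column names.
df1 = pd.DataFrame({"v1": [1, 2, 3]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"v2": [4, 5, 6]}, index=["a", "b", "c"])

via_join = df1.join(df2)
via_merge = pd.merge(df1, df2, left_index=True, right_index=True)
via_concat = pd.concat([df1, df2], axis=1)

print(via_join)
print(via_merge)
print(via_concat)
```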