Python 熊猫合并 101
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/53645882/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Pandas Merging 101
提问by cs95
- How to perform a (
LEFT
|RIGHT
|FULL
) (INNER
|OUTER
) join with pandas? - How do I add NaNs for missing rows after merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- Cross join with pandas?
- How do I merge multiple DataFrames?
merge
?join
?concat
?update
? Who? What? Why?!
- 如何与熊猫执行 (
LEFT
|RIGHT
|FULL
) (INNER
|OUTER
) 连接? - 合并后如何为丢失的行添加 NaN?
- 合并后如何摆脱NaN?
- 我可以在索引上合并吗?
- 与熊猫交叉加入?
- 如何合并多个数据帧?
merge
?join
?concat
?update
? WHO?什么?为什么?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
... 和更多。我已经看到这些反复出现的问题询问熊猫合并功能的各个方面。今天关于合并及其各种用例的大部分信息都分散在数十个措辞恶劣、无法搜索的帖子中。这里的目的是为后代整理一些更重要的观点。
This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
这个 QnA 是关于常见 Pandas 习语的一系列有用的用户指南的下一部分(请参阅这篇关于旋转的文章,以及关于串联的这篇文章,我稍后会谈到)。
Please note that this post is notmeant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
请注意,这篇文章并不是要替代文档,所以也请阅读它!一些例子取自那里。
回答by cs95
This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.
这篇文章旨在向读者介绍 SQL 风格与 Pandas 的合并、如何使用它以及何时不使用它。
In particular, here's what this post will go through:
特别是,这是这篇文章将要经历的:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
- merging with different column names
- avoiding duplicate merge key column in output
- Merging with index under different conditions
- effectively using your named index
- merge key as the index of one and column of another
- Multiway merges on columns and indexes (unique and non-unique)
- Notable alternatives to
merge
andjoin
基础知识 - 连接类型(LEFT、RIGHT、OUTER、INNER)
- 合并不同的列名
- 避免输出中重复的合并键列
- 不同条件下与索引合并
- 有效地使用您的命名索引
- 合并键作为一个的索引和另一个的列
- 多路合并列和索引(唯一和非唯一)
- 值得注意的替代品
merge
和join
What this post will not go through:
这篇文章不会通过:
- Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
- Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
- 与性能相关的讨论和时间安排(目前)。在适当的地方,主要提到了更好的替代方案。
- 处理后缀、删除额外的列、重命名输出和其他特定用例。还有其他(阅读:更好)的帖子可以解决这个问题,所以弄清楚!
Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this poston how to read DataFrames from your clipboard.
Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.
注意
大多数示例在演示各种功能时默认使用 INNER JOIN 操作,除非另有说明。此外,这里的所有数据帧都可以复制和复制,以便您可以使用它们。另外,请参阅 有关如何从剪贴板读取数据帧的帖子。
最后,JOIN 操作的所有视觉表示都是使用 Google 绘图手绘的。灵感来自这里。
Enough Talk, just show me how to use merge
!
废话不多说,教我怎么用merge
!
Setup
设置
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right
key value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
For the sake of simplicity, the key column has the same name (for now).
为简单起见,键列具有相同的名称(暂时)。
An INNER JOINis represented by
一个内连接由下式表示
Note
This, along with the forthcoming figures all follow this convention:
- blueindicates rows that are present in the merge result
- redindicates rows that are excluded from the result (i.e., removed)
- greenindicates missing values that are replaced with NaNs in the result
注意
这与即将到来的数字都遵循这个约定:
- 蓝色表示合并结果中存在的行
- 红色表示从结果中排除(即删除)的行
- 绿色表示在结果中用 NaN 替换的缺失值
To perform an INNER JOIN, call merge
on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.
要执行 INNER JOIN,请调用merge
左侧的 DataFrame,指定右侧的 DataFrame 和连接键(至少)作为参数。
left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
This returns only rows from left
and right
which share a common key (in this example, "B" and "D).
这仅返回来自left
和right
共享公共键(在本例中为“B”和“D”)的行。
A LEFT OUTER JOIN, or LEFT JOIN is represented by
甲LEFT OUTER JOIN,或LEFT JOIN由下式表示
This can be performed by specifying how='left'
.
这可以通过指定来执行how='left'
。
left.merge(right, on='key', how='left')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
Carefully note the placement of NaNs here. If you specify how='left'
, then only keys from left
are used, and missing data from right
is replaced by NaN.
请仔细注意此处 NaN 的位置。如果指定how='left'
,则仅left
使用来自的键,并且丢失的数据来自right
NaN 替换。
And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...
同样,对于RIGHT OUTER JOIN或 RIGHT JOIN 是...
...specify how='right'
:
...指定how='right'
:
left.merge(right, on='key', how='right')
key value_x value_y
0 B 0.400157 1.867558
1 D 2.240893 -0.977278
2 E NaN 0.950088
3 F NaN -0.151357
Here, keys from right
are used, and missing data from left
is replaced by NaN.
这里right
使用了来自的键,来自的缺失数据left
由 NaN 替换。
Finally, for the FULL OUTER JOIN, given by
最后,对于FULL OUTER JOIN,由
specify how='outer'
.
指定how='outer'
。
left.merge(right, on='key', how='outer')
key value_x value_y
0 A 1.764052 NaN
1 B 0.400157 1.867558
2 C 0.978738 NaN
3 D 2.240893 -0.977278
4 E NaN 0.950088
5 F NaN -0.151357
This uses the keys from both frames, and NaNs are inserted for missing rows in both.
这使用来自两个帧的键,并为两者中缺失的行插入 NaN。
The documentation summarises these various merges nicely:
文档很好地总结了这些不同的合并:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
其他连接 - 左排除、右排除和完全排除/反连接
If you need LEFT-Excluding JOINsand RIGHT-Excluding JOINsin two steps.
如果您需要分两步进行LEFT-Exducing JOINs和RIGHT-Exducing JOINs。
For LEFT-Excluding JOIN, represented as
对于 LEFT-Exclusion JOIN,表示为
Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from left
only,
首先执行 LEFT OUTER JOIN,然后过滤(排除!)来自left
仅的行,
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
Where,
在哪里,
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge
0 A 1.764052 NaN left_only
1 B 0.400157 1.867558 both
2 C 0.978738 NaN left_only
3 D 2.240893 -0.977278 both
And similarly, for a RIGHT-Excluding JOIN,
同样,对于 RIGHT-Excluded JOIN,
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y
2 E NaN 0.950088
3 F NaN -0.151357
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),
最后,如果您需要进行合并,只保留左侧或右侧的键,但不能同时保留两者(IOW,执行ANTI-JOIN),
You can do this in similar fashion—
你可以用类似的方式做到这一点——
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y
0 A 1.764052 NaN
2 C 0.978738 NaN
4 E NaN 0.950088
5 F NaN -0.151357
Different names for key columns
键列的不同名称
If the key columns are named differently—for example, left
has keyLeft
, and right
has keyRight
instead of key
—then you will have to specify left_on
and right_on
as arguments instead of on
:
如果键列的名称不同——例如,left
haskeyLeft
和right
haskeyRight
而不是key
——那么你必须指定left_on
andright_on
作为参数而不是on
:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value
0 A 1.764052
1 B 0.400157
2 C 0.978738
3 D 2.240893
right2
keyRight value
0 B 1.867558
1 D -0.977278
2 E 0.950088
3 F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y
0 B 0.400157 B 1.867558
1 D 2.240893 D -0.977278
Avoiding duplicate key column in output
避免输出中的重复键列
When merging on keyLeft
from left
and keyRight
from right
, if you only want either of the keyLeft
or keyRight
(but not both) in the output, you can start by setting the index as a preliminary step.
合并keyLeft
fromleft
和keyRight
from 时right
,如果您只想要输出中的keyLeft
or keyRight
(但不是两者)之一,则可以通过将索引设置为初步步骤开始。
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y
0 0.400157 B 1.867558
1 2.240893 D -0.977278
Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
), you'll notice keyLeft
is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
将此与之前命令的输出(thst 是 的输出left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
)进行对比,您会注意到keyLeft
丢失了。您可以根据将哪个帧的索引设置为键来确定要保留的列。例如,在执行某些 OUTER JOIN 操作时,这可能很重要。
Merging only a single column from one of the DataFrames
仅合并来自其中之一的单个列 DataFrames
For example, consider
例如,考虑
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol
0 B 1.867558 0
1 D -0.977278 1
2 E 0.950088 2
3 F -0.151357 3
If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:
如果您只需要合并“new_val”(没有任何其他列),您通常可以在合并之前对列进行子集:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol
0 B 0.400157 0
1 D 2.240893 1
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map
:
如果您正在执行 LEFT OUTER JOIN,则性能更高的解决方案将涉及map
:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
As mentioned, this is similar to, but faster than
如前所述,这类似于,但比
left.merge(right3[['key', 'newcol']], on='key', how='left')
key value newcol
0 A 1.764052 NaN
1 B 0.400157 0.0
2 C 0.978738 NaN
3 D 2.240893 1.0
Merging on multiple columns
多列合并
To join on more than one column, specify a list for on
(or left_on
and right_on
, as appropriate).
要连接多个列,请指定on
(或left_on
和right_on
,视情况而定)的列表。
left.merge(right, on=['key1', 'key2'] ...)
Or, in the event the names are different,
或者,如果名称不同,
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
Other useful merge*
operations and functions
其他有用的merge*
操作和功能
- Merging a DataFrame with Series on index: See this answer.
Besides
merge
,DataFrame.update
andDataFrame.combine_first
are also used in certain cases to update one DataFrame with another.pd.merge_ordered
is a useful function for ordered JOINs.pd.merge_asof
(read: merge_asOf) is useful for approximatejoins.
- 将 DataFrame 与索引上的系列合并:请参阅此答案。
此外
merge
,DataFrame.update
和DataFrame.combine_first
也用在某些情况下,更新一个数据帧与另一个。pd.merge_ordered
是有序 JOIN 的有用函数。pd.merge_asof
(阅读:merge_asOf)对于近似连接很有用。
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge
, join
, and concat
as well as the links to the function specs.
本节仅涵盖非常基础的内容,旨在满足您的胃口。更多的例子和案例,看到的文档merge
,join
以及concat
还有链接的功能规格。
Index-based *-JOIN (+ index-column merge
s)
基于索引的 *-JOIN (+ index-column merge
s)
Setup
设置
np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'
left
value
idxkey
A -0.602923
B -0.402655
C 0.302329
D -0.524349
right
value
idxkey
B 0.543843
D 0.013135
E -0.326498
F 1.385076
Typically, a merge on index would look like this:
通常,索引上的合并如下所示:
left.merge(right, left_index=True, right_index=True)
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Support for index names
支持索引名称
If your index is named, then v0.23 users can also specify the level name to on
(or left_on
and right_on
as necessary).
如果您的索引已命名,则 v0.23 用户还可以指定级别名称on
(或left_on
和right_on
根据需要)。
left.merge(right, on='idxkey')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Merging on index of one, column(s) of another
合并一个索引,另一个列
It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,
可以(而且非常简单)使用一个的索引和另一个的列来执行合并。例如,
left.merge(right, left_on='key1', right_index=True)
Or vice versa (right_on=...
and left_index=True
).
反之亦然(right_on=...
和left_index=True
)。
right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
colkey value
0 B 0.543843
1 D 0.013135
2 E -0.326498
3 F 1.385076
left.merge(right2, left_index=True, right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
In this special case, the index for left
is named, so you can also use the index name with left_on
, like this:
在这种特殊情况下,索引 forleft
已命名,因此您也可以将索引名称与 一起使用left_on
,如下所示:
left.merge(right2, left_on='idxkey', right_on='colkey')
value_x colkey value_y
0 -0.402655 B 0.543843
1 -0.524349 D 0.013135
DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.join
which defaults to joins on the index. DataFrame.join
does a LEFT OUTER JOIN by default, so how='inner'
is necessary here.
DataFrame.join
除了这些,还有另一个简洁的选择。您可以使用DataFrame.join
哪些默认值来连接索引。DataFrame.join
默认情况下执行 LEFT OUTER JOIN,所以how='inner'
这里是必要的。
left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
value_x value_y
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Note that I needed to specify the lsuffix
and rsuffix
arguments since join
would otherwise error out:
请注意,我需要指定lsuffix
和rsuffix
参数join
,否则会出错:
left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Since the column names are the same. This would not be a problem if they were differently named.
由于列名相同。如果它们的名称不同,这将不是问题。
left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
leftvalue value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat
:
pd.concat
最后,作为基于索引的连接的替代方法,您可以使用pd.concat
:
pd.concat([left, right], axis=1, sort=False, join='inner')
value value
idxkey
B -0.402655 0.543843
D -0.524349 0.013135
Omit join='inner'
if you need a FULL OUTER JOIN (the default):
join='inner'
如果需要 FULL OUTER JOIN(默认值),请省略:
pd.concat([left, right], axis=1, sort=False)
value value
A -0.602923 NaN
B -0.402655 0.543843
C 0.302329 NaN
D -0.524349 0.013135
E NaN -0.326498
F NaN 1.385076
For more information, see this canonical post on pd.concat
by @piRSquared.
有关更多信息,请参阅@piRSquared发布的此规范帖子pd.concat
。
Generalizing: merge
ing multiple DataFrames
泛化:merge
多个 DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge
calls:
通常,当多个 DataFrame 合并在一起时会出现这种情况。天真地,这可以通过链接merge
调用来完成:
df1.merge(df2, ...).merge(df3, ...)
However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
但是,这对于许多 DataFrame 来说很快就会失控。此外,可能需要对未知数量的 DataFrame 进行泛化。
Here I introduce pd.concat
for multi-way joins on uniquekeys, and DataFrame.join
for multi-way joins on non-uniquekeys. First, the setup.
下面我介绍pd.concat
的多路连接上独一无二的密钥和DataFrame.join
用于多路连接上非唯一的密钥。首先,设置。
# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C]
# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')
dfs2 = [A2, B2, C2]
Multiway merge on unique keys (or index)
多路合并唯一键(或索引)
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat
. Note that pd.concat
joins DataFrames on the index.
如果您的键(在这里,键可以是列或索引)是唯一的,那么您可以使用pd.concat
. 请注意,pd.concat
在索引上连接 DataFrames。
# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')
valueA valueB valueC
key
D 2.240893 -0.977278 1.0
Omit join='inner'
for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join
, described below).
省略join='inner'
完全外部联接。请注意,您不能指定 LEFT 或 RIGHT OUTER 连接(如果需要这些,请使用join
,如下所述)。
Multiway merge on keys with duplicates
多路合并具有重复项的键
concat
is fast, but has its shortcomings. It cannot handle duplicates.
concat
速度快,但也有缺点。它无法处理重复项。
A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)
In this situation, we can use join
since it can handle non-unique keys (note that join
joins DataFrames on their index; it calls merge
under the hood and does a LEFT OUTER JOIN unless otherwise specified).
在这种情况下,我们可以使用join
它,因为它可以处理非唯一键(请注意,join
在它们的索引上连接 DataFrame;merge
除非另有说明,否则它会在后台调用并执行 LEFT OUTER JOIN)。
# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
[df.set_index('key') for df in (B, C)], how='inner').reset_index()
key valueA valueB valueC
0 D 2.240893 -0.977278 1.0
# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')
valueA valueB valueC
key
D 1.454274 -0.977278 1.0
D 0.761038 -0.977278 1.0