Python 熊猫合并 101

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/53645882/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 20:20:24  来源:igfitidea点击:

Pandas Merging 101

pythonpandasjoinmerge

提问by cs95

  • How to perform a (LEFT|RIGHT|FULL) (INNER|OUTER) join with pandas?
  • How do I add NaNs for missing rows after merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • Cross join with pandas?
  • How do I merge multiple DataFrames?
  • merge? join? concat? update? Who? What? Why?!
  • 如何与熊猫执行 ( LEFT| RIGHT| FULL) ( INNER| OUTER) 连接?
  • 合并后如何为丢失的行添加 NaN?
  • 合并后如何摆脱NaN?
  • 我可以在索引上合并吗?
  • 与熊猫交叉加入?
  • 如何合并多个数据帧?
  • merge? join? concat? update? WHO?什么?为什么?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

... 和更多。我已经看到这些反复出现的问题询问熊猫合并功能的各个方面。今天关于合并及其各种用例的大部分信息都分散在数十个措辞恶劣、无法搜索的帖子中。这里的目的是为后代整理一些更重要的观点。

This QnA is meant to be the next installment in a series of helpful user-guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).

这个 QnA 是关于常见 Pandas 习语的一系列有用的用户指南的下一部分(请参阅这篇关于旋转的文章,以及关于串联的这篇文章,我稍后会谈到)。

Please note that this post is notmeant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.

请注意,这篇文章并不是要替代文档,所以也请阅读它!一些例子取自那里。

回答by cs95

This post aims to give readers a primer on SQL-flavoured merging with pandas, how to use it, and when not to use it.

这篇文章旨在向读者介绍 SQL 风格与 Pandas 的合并、如何使用它以及何时不使用它。

In particular, here's what this post will go through:

特别是,这是这篇文章将要经历的:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • avoiding duplicate merge key column in output
  • Merging with index under different conditions
    • effectively using your named index
    • merge key as the index of one and column of another
  • Multiway merges on columns and indexes (unique and non-unique)
  • Notable alternatives to mergeand join
  • 基础知识 - 连接类型(LEFT、RIGHT、OUTER、INNER)

    • 合并不同的列名
    • 避免输出中重复的合并键列
  • 不同条件下与索引合并
    • 有效地使用您的命名索引
    • 合并键作为一个的索引和另一个的列
  • 多路合并列和索引(唯一和非唯一)
  • 值得注意的替代品mergejoin

What this post will not go through:

这篇文章不会通过:

  • Performance-related discussions and timings (for now). Mostly notable mentions of better alternatives, wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!
  • 与性能相关的讨论和时间安排(目前)。在适当的地方,主要提到了更好的替代方案。
  • 处理后缀、删除额外的列、重命名输出和其他特定用例。还有其他(阅读:更好)的帖子可以解决这个问题,所以弄清楚!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this poston how to read DataFrames from your clipboard.

Lastly, all visual representation of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.

注意
大多数示例在演示各种功能时默认使用 INNER JOIN 操作,除非另有说明。

此外,这里的所有数据帧都可以复制和复制,以便您可以使用它们。另外,请参阅 有关如何从剪贴板读取数据帧的帖子

最后,JOIN 操作的所有视觉表示都是使用 Google 绘图手绘的。灵感来自这里

Enough Talk, just show me how to use merge!

废话不多说,教我怎么用merge

Setup

设置

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})    
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

为简单起见,键列具有相同的名称(暂时)。

An INNER JOINis represented by

一个内连接由下式表示

Note
This, along with the forthcoming figures all follow this convention:

  • blueindicates rows that are present in the merge result
  • redindicates rows that are excluded from the result (i.e., removed)
  • greenindicates missing values that are replaced with NaNs in the result

注意
这与即将到来的数字都遵循这个约定:

  • 蓝色表示合并结果中存在的行
  • 红色表示从结果中排除(即删除)的行
  • 绿色表示在结果中用 NaN 替换的缺失值

To perform an INNER JOIN, call mergeon the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

要执行 INNER JOIN,请调用merge左侧的 DataFrame,指定右侧的 DataFrame 和连接键(至少)作为参数。

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from leftand rightwhich share a common key (in this example, "B" and "D).

这仅返回来自leftright共享公共键(在本例中为“B”和“D”)的行。

A LEFT OUTER JOIN, or LEFT JOIN is represented by

LEFT OUTER JOIN,或LEFT JOIN由下式表示

This can be performed by specifying how='left'.

这可以通过指定来执行how='left'

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from leftare used, and missing data from rightis replaced by NaN.

请仔细注意此处 NaN 的位置。如果指定how='left',则仅left使用来自的键,并且丢失的数据来自rightNaN 替换。

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN which is...

同样,对于RIGHT OUTER JOIN或 RIGHT JOIN 是...

...specify how='right':

...指定how='right'

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from rightare used, and missing data from leftis replaced by NaN.

这里right使用了来自的键,来自的缺失数据left由 NaN 替换。

Finally, for the FULL OUTER JOIN, given by

最后,对于FULL OUTER JOIN,由

specify how='outer'.

指定how='outer'

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

这使用来自两个帧的键,并为两者中缺失的行插入 NaN。

The documentation summarises these various merges nicely:

文档很好地总结了这些不同的合并:

enter image description here

enter image description here

Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

其他连接 - 左排除、右排除和完全排除/反连接

If you need LEFT-Excluding JOINsand RIGHT-Excluding JOINsin two steps.

如果您需要分两步进行LEFT-Exducing JOINsRIGHT-Exducing JOINs

For LEFT-Excluding JOIN, represented as

对于 LEFT-Exclusion JOIN,表示为

Start by performing a LEFT OUTER JOIN and then filtering (excluding!) rows coming from leftonly,

首先执行 LEFT OUTER JOIN,然后过滤(排除!)来自left仅的行,

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop('_merge', 1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

在哪里,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

同样,对于 RIGHT-Excluded JOIN,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', 1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN),

最后,如果您需要进行合并,只保留左侧或右侧的键,但不能同时保留两者(IOW,执行ANTI-JOIN),

You can do this in similar fashion—

你可以用类似的方式做到这一点——

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', 1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357

Different names for key columns

键列的不同名称

If the key columns are named differently—for example, lefthas keyLeft, and righthas keyRightinstead of key—then you will have to specify left_onand right_onas arguments instead of on:

如果键列的名称不同——例如,lefthaskeyLeftrighthaskeyRight而不是key——那么你必须指定left_onandright_on作为参数而不是on

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357

left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

避免输出中的重复键列

When merging on keyLeftfrom leftand keyRightfrom right, if you only want either of the keyLeftor keyRight(but not both) in the output, you can start by setting the index as a preliminary step.

合并keyLeftfromleftkeyRightfrom 时right,如果您只想要输出中的keyLeftor keyRight(但不是两者)之一,则可以通过将索引设置为初步步骤开始。

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (thst is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), you'll notice keyLeftis missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.

将此与之前命令的输出(thst 是 的输出left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner'))进行对比,您会注意到keyLeft丢失了。您可以根据将哪个帧的索引设置为键来确定要保留的列。例如,在执行某些 OUTER JOIN 操作时,这可能很重要。

Merging only a single column from one of the DataFrames

仅合并来自其中之一的单个列 DataFrames

For example, consider

例如,考虑

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "new_val" (without any of the other columns), you can usually just subset columns before merging:

如果您只需要合并“new_val”(没有任何其他列),您通常可以在合并之前对列进行子集:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

如果您正在执行 LEFT OUTER JOIN,则性能更高的解决方案将涉及map

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

如前所述,这类似于,但比

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

多列合并

To join on more than one column, specify a list for on(or left_onand right_on, as appropriate).

要连接多个列,请指定on(或left_onright_on,视情况而定)的列表。

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

或者,如果名称不同,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])

Other useful merge*operations and functions

其他有用的merge*操作和功能

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concatas well as the links to the function specs.

本节仅涵盖非常基础的内容,旨在满足您的胃口。更多的例子和案例,看到的文档mergejoin以及concat还有链接的功能规格。



Index-based *-JOIN (+ index-column merges)

基于索引的 *-JOIN (+ index-column merges)

Setup

设置

np.random.seed([3, 14])
left = pd.DataFrame({'value': np.random.randn(4)}, index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame({'value': np.random.randn(4)}, index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right

           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, a merge on index would look like this:

通常,索引上的合并如下所示:

left.merge(right, left_index=True, right_index=True)


         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Support for index names

支持索引名称

If your index is named, then v0.23 users can also specify the level name to on(or left_onand right_onas necessary).

如果您的索引已命名,则 v0.23 用户还可以指定级别名称on(或left_onright_on根据需要)。

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Merging on index of one, column(s) of another

合并一个索引,另一个列

It is possible (and quite simple) to use the index of one, and the column of another, to perform a merge. For example,

可以(而且非常简单)使用一个的索引和另一个的列来执行合并。例如,

left.merge(right, left_on='key1', right_index=True)

Or vice versa (right_on=...and left_index=True).

反之亦然(right_on=...left_index=True)。

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2

  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

In this special case, the index for leftis named, so you can also use the index name with left_on, like this:

在这种特殊情况下,索引 forleft已命名,因此您也可以将索引名称与 一起使用left_on,如下所示:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

DataFrame.join
Besides these, there is another succinct option. You can use DataFrame.joinwhich defaults to joins on the index. DataFrame.joindoes a LEFT OUTER JOIN by default, so how='inner'is necessary here.

DataFrame.join
除了这些,还有另一个简洁的选择。您可以使用DataFrame.join哪些默认值来连接索引。DataFrame.join默认情况下执行 LEFT OUTER JOIN,所以how='inner'这里是必要的。

left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Note that I needed to specify the lsuffixand rsuffixarguments since joinwould otherwise error out:

请注意,我需要指定lsuffixrsuffix参数join,否则会出错:

left.join(right)
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Since the column names are the same. This would not be a problem if they were differently named.

由于列名相同。如果它们的名称不同,这将不是问题。

left.rename(columns={'value':'leftvalue'}).join(right, how='inner')

        leftvalue     value
idxkey                     
B       -0.402655  0.543843
D       -0.524349  0.013135

pd.concat
Lastly, as an alternative for index-based joins, you can use pd.concat:

pd.concat
最后,作为基于索引的连接的替代方法,您可以使用pd.concat

pd.concat([left, right], axis=1, sort=False, join='inner')

           value     value
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Omit join='inner'if you need a FULL OUTER JOIN (the default):

join='inner'如果需要 FULL OUTER JOIN(默认值),请省略:

pd.concat([left, right], axis=1, sort=False)

      value     value
A -0.602923       NaN
B -0.402655  0.543843
C  0.302329       NaN
D -0.524349  0.013135
E       NaN -0.326498
F       NaN  1.385076

For more information, see this canonical post on pd.concatby @piRSquared.

有关更多信息,请参阅@piRSquared发布的此规范帖子pd.concat



Generalizing: mergeing multiple DataFrames

泛化:merge多个 DataFrames

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining mergecalls:

通常,当多个 DataFrame 合并在一起时会出现这种情况。天真地,这可以通过链接merge调用来完成:

df1.merge(df2, ...).merge(df3, ...)

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.

但是,这对于许多 DataFrame 来说很快就会失控。此外,可能需要对未知数量的 DataFrame 进行泛化。

Here I introduce pd.concatfor multi-way joins on uniquekeys, and DataFrame.joinfor multi-way joins on non-uniquekeys. First, the setup.

下面我介绍pd.concat的多路连接上独一无二的密钥和DataFrame.join用于多路连接上非唯一的密钥。首先,设置。

# Setup.
np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note, the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

Multiway merge on unique keys (or index)

多路合并唯一键(或索引)

If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concatjoins DataFrames on the index.

如果您的键(在这里,键可以是列或索引)是唯一的,那么您可以使用pd.concat. 请注意,pd.concat在索引上连接 DataFrames

# merge on `key` column, you'll need to set the index before concatenating
pd.concat([
    df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# merge on `key` index
pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

Omit join='inner'for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).

省略join='inner'完全外部联接。请注意,您不能指定 LEFT 或 RIGHT OUTER 连接(如果需要这些,请使用join,如下所述)。

Multiway merge on keys with duplicates

多路合并具有重复项的键

concatis fast, but has its shortcomings. It cannot handle duplicates.

concat速度快,但也有缺点。它无法处理重复项。

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})

pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, we can use joinsince it can handle non-unique keys (note that joinjoins DataFrames on their index; it calls mergeunder the hood and does a LEFT OUTER JOIN unless otherwise specified).

在这种情况下,我们可以使用join它,因为它可以处理非唯一键(请注意,join在它们的索引上连接 DataFrame;merge除非另有说明,否则它会在后台调用并执行 LEFT OUTER JOIN)。

# join on `key` column, set as the index first
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join(
    [df.set_index('key') for df in (B, C)], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# join on `key` index
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0

回答by eliu

A supplemental visual view of pd.concat([df0, df1], kwargs). Notice that, kwarg axis=0or axis=1's meaning is not as intuitive as df.mean()or df.apply(func)

的补充视觉视图pd.concat([df0, df1], kwargs)。请注意, kwarg axis=0oraxis=1的含义不如df.mean()or直观df.apply(func)



on pd.concat([df0, df1])

on pd.concat([df0, df1])