Note: this page is a side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/49620538/

Date: 2020-09-14 05:25:03  Source: igfitidea

What are the 'levels', 'keys', and names arguments for in Pandas' concat function?

Tags: python, pandas

Asked by piRSquared

Questions

  • How do I use pd.concat?
  • What is the levels argument for?
  • What is the keys argument for?
  • Are there a bunch of examples to help explain how to use all the arguments?

Pandas' concat function is the Swiss Army knife of the merging utilities. It is useful in a wide variety of situations. The existing documentation leaves out a few details on some of the optional arguments, among them the levels and keys arguments. I set out to figure out what those arguments do.

I'll pose a question that will act as a gateway into many aspects of pd.concat.

Consider the data frames d1, d2, and d3:

import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), [2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), [1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), [1, 3])

If I were to concatenate these together with

pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'])

I get the expected result, with a pandas.MultiIndex for my index object:

        A    B    C    D
d1 2  0.1  0.2  0.3  NaN
   3  0.1  0.2  0.3  NaN
d2 1  NaN  0.4  0.5  0.6
   2  NaN  0.4  0.5  0.6
d3 1  0.7  0.8  NaN  0.9
   3  0.7  0.8  NaN  0.9

However, I wanted to use the levels argument. The documentation says:

levels: list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise, they will be inferred from the keys.

So I passed

pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2']])

And got a ValueError:

ValueError: Key d3 not in level Index(['d1', 'd2'], dtype='object')

This made sense. The levels I passed were inadequate to describe the necessary levels indicated by the keys. Had I not passed anything, as I did above, the levels would have been inferred (as stated in the documentation). But how else can I use this argument to better effect?

If I tried this instead:

pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2', 'd3']])

I got the same results as above. But when I add one more value to the levels,

df = pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2', 'd3', 'd4']])

I end up with the same looking data frame, but the resulting MultiIndex has an unused level.

df.index.levels[0]

Index(['d1', 'd2', 'd3', 'd4'], dtype='object')
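
To make the unused level concrete, here is a minimal sketch. The variable names extra and trimmed are only illustrative, and it assumes a Pandas version that has MultiIndex.remove_unused_levels (0.20 or later). It rebuilds the frame with the extra 'd4' level and then drops that level again:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

df = pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'],
               levels=[['d1', 'd2', 'd3', 'd4']])

# 'd4' is stored as a level even though no row uses it.
extra = list(df.index.levels[0])

# remove_unused_levels() returns a new MultiIndex without 'd4'.
trimmed = list(df.index.remove_unused_levels().levels[0])

print(extra)
print(trimmed)
```

The data itself is unchanged in both cases; only the bookkeeping stored in index.levels differs.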

So what is the point of the levels argument, and should I be using keys differently?

I'm using Python 3.6 and Pandas 0.22.

Answered by piRSquared

In the process of answering this question for myself, I learned many things, and I wanted to put together a catalog of examples and some explanation.

The specific answer about the point of the levels argument comes towards the end.

pandas.concat: The Missing Manual

Link To Current Documentation

Imports and defining objects

import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])


Arguments

objs

The first argument we come across is objs:

objs: a sequence or mapping of Series, DataFrame, or Panel objects
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None, in which case a ValueError will be raised.

  • We typically see this used with a list of Series or DataFrame objects.
  • I'll show that a dict can be very useful as well.
  • Generators may also be used, which is handy when using map, as in map(f, list_of_df).
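
As a small sketch of the generator case, the snippet below passes a lazy map object straight to pd.concat. The add_suffix call and the '_x' suffix are only placeholders for whatever per-frame function f you would really apply:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

frames = [d1, d2, d3]

# map() returns a lazy iterator; pd.concat consumes it just like a list.
result = pd.concat(map(lambda d: d.add_suffix('_x'), frames))

print(result)
```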

For now, we'll stick with a list of some of the DataFrame and Series objects defined above. I'll show how dictionaries can be leveraged to give very useful MultiIndex results later.

pd.concat([d1, d2])

     A    B    C    D
2  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6


axis

The second argument we encounter is axis, whose default value is 0:

axis: {0/'index', 1/'columns'}, default 0
The axis to concatenate along.

Two DataFrames with axis=0 (stacked)

For values of 0 or index we mean to say: "Align along the columns and add to the index".

This is what we saw above, because 0 is the default value of axis. The index of d2 extends the index of d1 even though the value 2 appears in both:

pd.concat([d1, d2], axis=0)

     A    B    C    D
2  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6

Two DataFrames with axis=1 (side by side)

For values of 1 or columns we mean to say: "Align along the index and add to the columns":

pd.concat([d1, d2], axis=1)

     A    B    C    B    C    D
1  NaN  NaN  NaN  0.4  0.5  0.6
2  0.1  0.2  0.3  0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN

We can see that the resulting index is the union of the indices, and the resulting columns are the columns of d1 extended by the columns of d2.

Two (or Three) Series with axis=0 (stacked)

When combining pandas.Series along axis=0, we get back a pandas.Series. The name of the resulting Series will be None unless all Series being combined have the same name. Pay attention to the 'Name: A' when we print out the resulting Series. When it isn't present, we can assume the Series name is None.

               |                       |                        |  pd.concat(
               |  pd.concat(           |  pd.concat(            |      [s1.rename('A'),
 pd.concat(    |      [s1.rename('A'), |      [s1.rename('A'),  |       s2.rename('B'),
     [s1, s2]) |       s2])            |       s2.rename('A')]) |       s3.rename('A')])
-------------- | --------------------- | ---------------------- | ----------------------
2    1         | 2    1                | 2    1                 | 2    1
3    2         | 3    2                | 3    2                 | 3    2
1    3         | 1    3                | 1    3                 | 1    3
2    4         | 2    4                | 2    4                 | 2    4
dtype: int64   | dtype: int64          | Name: A, dtype: int64  | 1    5
               |                       |                        | 3    6
               |                       |                        | dtype: int64
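
The naming rule in the table above can also be checked directly on the name attribute. This is just a sketch; the variables same and mixed are illustrative names:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])

# Every input shares the name 'A', so the result keeps it.
same = pd.concat([s1.rename('A'), s2.rename('A')])

# The names differ, so the result has no name.
mixed = pd.concat([s1.rename('A'), s2.rename('B')])

print(same.name, mixed.name)
```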

Two (or Three) Series with axis=1 (side by side)

When combining pandas.Series along axis=1, it is the name attribute that we refer to in order to infer a column name in the resulting pandas.DataFrame.

                       |                       |  pd.concat(
                       |  pd.concat(           |      [s1.rename('X'),
 pd.concat(            |      [s1.rename('X'), |       s2.rename('Y'),
     [s1, s2], axis=1) |       s2], axis=1)    |       s3.rename('Z')], axis=1)
---------------------- | --------------------- | ------------------------------
     0    1            |      X    0           |      X    Y    Z
1  NaN  3.0            | 1  NaN  3.0           | 1  NaN  3.0  5.0
2  1.0  4.0            | 2  1.0  4.0           | 2  1.0  4.0  NaN
3  2.0  NaN            | 3  2.0  NaN           | 3  2.0  NaN  6.0

Mixed Series and DataFrame with axis=0 (stacked)

When performing a concatenation of a Series and a DataFrame along axis=0, we convert all Series to single-column DataFrames.

Take special note that this is a concatenation along axis=0; that means extending the index (rows) while aligning the columns. In the examples below, we see the index becomes [2, 3, 2, 3], which is an indiscriminate appending of indices. The columns do not overlap unless I force the naming of the Series column with the argument to to_frame:

 pd.concat(               |
     [s1.to_frame(), d1]) |  pd.concat([s1, d1])
------------------------- | ---------------------
     0    A    B    C     |      0    A    B    C
2  1.0  NaN  NaN  NaN     | 2  1.0  NaN  NaN  NaN
3  2.0  NaN  NaN  NaN     | 3  2.0  NaN  NaN  NaN
2  NaN  0.1  0.2  0.3     | 2  NaN  0.1  0.2  0.3
3  NaN  0.1  0.2  0.3     | 3  NaN  0.1  0.2  0.3

You can see the results of pd.concat([s1, d1]) are the same as if I had performed the to_frame myself.

However, I can control the name of the resulting column with a parameter to to_frame. Renaming the Series with the rename method does not control the column name in the resulting DataFrame.

 # Effectively renames       |                            |
 # `s1` but does not align   |  # Does not rename.  So    |  # Renames to something
 # with columns in `d1`      |  # Pandas defaults to `0`  |  # that does align with `d1`
 pd.concat(                  |  pd.concat(                |  pd.concat(
     [s1.to_frame('X'), d1]) |      [s1.rename('X'), d1]) |      [s1.to_frame('B'), d1])
---------------------------- | -------------------------- | ----------------------------
     A    B    C    X        |      0    A    B    C      |      A    B    C
2  NaN  NaN  NaN  1.0        | 2  1.0  NaN  NaN  NaN      | 2  NaN  1.0  NaN
3  NaN  NaN  NaN  2.0        | 3  2.0  NaN  NaN  NaN      | 3  NaN  2.0  NaN
2  0.1  0.2  0.3  NaN        | 2  NaN  0.1  0.2  0.3      | 2  0.1  0.2  0.3
3  0.1  0.2  0.3  NaN        | 3  NaN  0.1  0.2  0.3      | 3  0.1  0.2  0.3

Mixed Series and DataFrame with axis=1 (side by side)

This is fairly intuitive. The Series column name defaults to an enumeration of such Series objects when a name attribute is not available.

                    |  pd.concat(
 pd.concat(         |      [s1.rename('X'),
     [s1, d1],      |       s2, s3, d1],
     axis=1)        |      axis=1)
------------------- | -------------------------------
   0    A    B    C |      X    0    1    A    B    C
2  1  0.1  0.2  0.3 | 1  NaN  3.0  5.0  NaN  NaN  NaN
3  2  0.1  0.2  0.3 | 2  1.0  4.0  NaN  0.1  0.2  0.3
                    | 3  2.0  NaN  6.0  0.1  0.2  0.3


join

The third argument is join, which describes whether the resulting merge should be an outer merge (default) or an inner merge.

join: {'inner', 'outer'}, default 'outer'
How to handle indexes on other axis(es).

It turns out there is no left or right option, as pd.concat can handle more than just two objects to merge.

In the case of d1 and d2, the options look like:

outer

pd.concat([d1, d2], axis=1, join='outer')

     A    B    C    B    C    D
1  NaN  NaN  NaN  0.4  0.5  0.6
2  0.1  0.2  0.3  0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN

inner

pd.concat([d1, d2], axis=1, join='inner')

     A    B    C    B    C    D
2  0.1  0.2  0.3  0.4  0.5  0.6


join_axes

The fourth argument is the thing that allows us to do our left merge and more.

join_axes: list of Index objects
Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic.

Left Merge

pd.concat([d1, d2, d3], axis=1, join_axes=[d1.index])

     A    B    C    B    C    D    A    B    D
2  0.1  0.2  0.3  0.4  0.5  0.6  NaN  NaN  NaN
3  0.1  0.2  0.3  NaN  NaN  NaN  0.7  0.8  0.9

Right Merge

pd.concat([d1, d2, d3], axis=1, join_axes=[d3.index])

     A    B    C    B    C    D    A    B    D
1  NaN  NaN  NaN  0.4  0.5  0.6  0.7  0.8  0.9
3  0.1  0.2  0.3  NaN  NaN  NaN  0.7  0.8  0.9
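
Note that join_axes was deprecated in Pandas 0.25 and removed in 1.0, so the two calls above only run on older versions such as the 0.22 used here. As a hedged sketch of one possible replacement (not something the original answer proposes), an outer concat followed by reindex on the wanted index selects the same rows:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

combined = pd.concat([d1, d2, d3], axis=1)

# "Left merge": keep only the rows of the first frame.
left = combined.reindex(d1.index)

# "Right merge": keep only the rows of the last frame.
right = combined.reindex(d3.index)

print(left)
print(right)
```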


ignore_index

ignore_index: boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

For example, when I stack d1 on top of d2, if I don't care about the index values, I could reset them or ignore them.

                      |  pd.concat(             |  pd.concat(
                      |      [d1, d2],          |      [d1, d2]
 pd.concat([d1, d2])  |      ignore_index=True) |  ).reset_index(drop=True)
--------------------- | ----------------------- | -------------------------
     A    B    C    D |      A    B    C    D   |      A    B    C    D
2  0.1  0.2  0.3  NaN | 0  0.1  0.2  0.3  NaN   | 0  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN | 1  0.1  0.2  0.3  NaN   | 1  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6 | 2  NaN  0.4  0.5  0.6   | 2  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6 | 3  NaN  0.4  0.5  0.6   | 3  NaN  0.4  0.5  0.6

And when using axis=1:

                                   |     pd.concat(
                                   |         [d1, d2], axis=1,
 pd.concat([d1, d2], axis=1)       |         ignore_index=True)
-------------------------------    |    -------------------------------
     A    B    C    B    C    D    |         0    1    2    3    4    5
1  NaN  NaN  NaN  0.4  0.5  0.6    |    1  NaN  NaN  NaN  0.4  0.5  0.6
2  0.1  0.2  0.3  0.4  0.5  0.6    |    2  0.1  0.2  0.3  0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN    |    3  0.1  0.2  0.3  NaN  NaN  NaN


keys

We can pass a list of scalar values or tuples in order to assign tuple or scalar values to the corresponding MultiIndex. The passed list must have the same length as the number of items being concatenated.

keys: sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level

axis=0

When concatenating Series objects along axis=0 (extending the index), those keys become a new initial level of a MultiIndex object in the index attribute.

 #           length 3             length 3           #         length 2        length 2
 #          /--------\         /-----------\         #          /----\         /------\
 pd.concat([s1, s2, s3], keys=['A', 'B', 'C'])       pd.concat([s1, s2], keys=['A', 'B'])
----------------------------------------------      -------------------------------------
A  2    1                                           A  2    1
   3    2                                              3    2
B  1    3                                           B  1    3
   2    4                                              2    4
C  1    5                                           dtype: int64
   3    6
dtype: int64

However, we can use more than scalar values in the keys argument to create an even deeper MultiIndex. Here we pass tuples of length 2 to prepend two new levels of a MultiIndex:

 pd.concat(
     [s1, s2, s3],
     keys=[('A', 'X'), ('A', 'Y'), ('B', 'X')])
-----------------------------------------------
A  X  2    1
      3    2
   Y  1    3
      2    4
B  X  1    5
      3    6
dtype: int64
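
One practical payoff of tuple keys is that each original piece can be pulled back out by its key. A short sketch, where s and sub are illustrative names:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])

s = pd.concat([s1, s2, s3],
              keys=[('A', 'X'), ('A', 'Y'), ('B', 'X')])

# Two key levels plus the original index give three levels in total.
print(s.index.nlevels)

# Selecting by the full key tuple recovers the values of s2.
sub = s.loc[('A', 'Y')]
print(sub)
```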

axis=1

It's a bit different when extending along columns. When we used axis=0 (see above), our keys acted as MultiIndex levels in addition to the existing index. For axis=1, we are referring to an axis that Series objects don't have, namely the columns attribute.

Variety of two Series with axis=1

Notice that naming s1 and s2 matters so long as no keys are passed, but it gets overridden if keys are passed.

               |                       |                        |  pd.concat(
               |  pd.concat(           |  pd.concat(            |      [s1.rename('U'),
 pd.concat(    |      [s1, s2],        |      [s1.rename('U'),  |       s2.rename('V')],
     [s1, s2], |      axis=1,          |       s2.rename('V')], |       axis=1,
     axis=1)   |      keys=['X', 'Y']) |       axis=1)          |       keys=['X', 'Y'])
-------------- | --------------------- | ---------------------- | ----------------------
     0    1    |      X    Y           |      U    V            |      X    Y
1  NaN  3.0    | 1  NaN  3.0           | 1  NaN  3.0            | 1  NaN  3.0
2  1.0  4.0    | 2  1.0  4.0           | 2  1.0  4.0            | 2  1.0  4.0
3  2.0  NaN    | 3  2.0  NaN           | 3  2.0  NaN            | 3  2.0  NaN
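
The same override can be confirmed by looking at the resulting column labels. This is only a sketch; named and keyed are illustrative variable names:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])

# Without keys, the Series names become the column labels.
named = pd.concat([s1.rename('U'), s2.rename('V')], axis=1)

# With keys, the keys win and the Series names are ignored.
keyed = pd.concat([s1.rename('U'), s2.rename('V')],
                  axis=1, keys=['X', 'Y'])

print(list(named.columns), list(keyed.columns))
```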

MultiIndex with Series and axis=1

 pd.concat(
     [s1, s2],
     axis=1,
     keys=[('W', 'X'), ('W', 'Y')])
-----------------------------------
     W
     X    Y
1  NaN  3.0
2  1.0  4.0
3  2.0  NaN

Two DataFrame with axis=1

As with the axis=0 examples, keys add levels to a MultiIndex, but this time to the object stored in the columns attribute.

 pd.concat(                     |  pd.concat(
     [d1, d2],                  |      [d1, d2],
     axis=1,                    |      axis=1,
     keys=['X', 'Y'])           |      keys=[('First', 'X'), ('Second', 'X')])
------------------------------- | --------------------------------------------
     X              Y           |   First           Second
     A    B    C    B    C    D |       X                X
1  NaN  NaN  NaN  0.4  0.5  0.6 |       A    B    C      B    C    D
2  0.1  0.2  0.3  0.4  0.5  0.6 | 1   NaN  NaN  NaN    0.4  0.5  0.6
3  0.1  0.2  0.3  NaN  NaN  NaN | 2   0.1  0.2  0.3    0.4  0.5  0.6
                                | 3   0.1  0.2  0.3    NaN  NaN  NaN

Series and DataFrame with axis=1

This is tricky. In this case, a scalar key value cannot act as the only level of index for the Series object when it becomes a column while also acting as the first level of a MultiIndex for the DataFrame. So Pandas will again use the name attribute of the Series object as the source of the column name.

 pd.concat(           |  pd.concat(
     [s1, d1],        |      [s1.rename('Z'), d1],
     axis=1,          |      axis=1,
     keys=['X', 'Y']) |      keys=['X', 'Y'])
--------------------- | --------------------------
   X    Y             |    X    Y
   0    A    B    C   |    Z    A    B    C
2  1  0.1  0.2  0.3   | 2  1  0.1  0.2  0.3
3  2  0.1  0.2  0.3   | 3  2  0.1  0.2  0.3

Limitations of keys and MultiIndex inference

Pandas only seems to infer column names from the Series name, but it will not fill in the blanks when doing an analogous concatenation among data frames with a different number of column levels.

d1_ = pd.concat(
    [d1], axis=1,
    keys=['One'])
d1_

   One
     A    B    C
2  0.1  0.2  0.3
3  0.1  0.2  0.3

Then concatenate this with another data frame that has only one level in its columns object, and Pandas will refuse to try to make tuples of the MultiIndex object; it combines all data frames as if they had a single level of objects, scalars and tuples.

pd.concat([d1_, d2], axis=1)

   (One, A)  (One, B)  (One, C)    B    C    D
1       NaN       NaN       NaN  0.4  0.5  0.6
2       0.1       0.2       0.3  0.4  0.5  0.6
3       0.1       0.2       0.3  NaN  NaN  NaN
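
One way around this, sketched here under the assumption that you are free to pick a label for the second frame (the key 'Two' is made up for illustration), is to give every frame the same number of column levels before concatenating:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

# Lift both frames to two column levels first.
d1_ = pd.concat([d1], axis=1, keys=['One'])
d2_ = pd.concat([d2], axis=1, keys=['Two'])

# Now the column objects are compatible and stay a MultiIndex.
both = pd.concat([d1_, d2_], axis=1)

print(both.columns.nlevels)
```

This gives the same column structure as passing keys=['One', 'Two'] in a single concat call.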

Passing a dict instead of a list

When passing a dictionary, pandas.concat will use the keys from the dictionary as the keys parameter.

 # axis=0               |  # axis=1
 pd.concat(             |  pd.concat(
     {0: d1, 1: d2})    |      {0: d1, 1: d2}, axis=1)
----------------------- | -------------------------------
       A    B    C    D |      0              1
0 2  0.1  0.2  0.3  NaN |      A    B    C    B    C    D
  3  0.1  0.2  0.3  NaN | 1  NaN  NaN  NaN  0.4  0.5  0.6
1 1  NaN  0.4  0.5  0.6 | 2  0.1  0.2  0.3  0.4  0.5  0.6
  2  NaN  0.4  0.5  0.6 | 3  0.1  0.2  0.3  NaN  NaN  NaN
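
A quick sketch of that equivalence, using the made-up string keys 'x' and 'y' (already in sorted order, so the result does not depend on whether a given Pandas version sorts dict keys):

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

# Keys taken from the dict ...
from_dict = pd.concat({'x': d1, 'y': d2})

# ... versus keys passed explicitly alongside a list.
from_keys = pd.concat([d1, d2], keys=['x', 'y'])

print(from_dict.equals(from_keys))
```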


levels

This is used in conjunction with the keys argument. When levels is left as its default value of None, Pandas will take the unique values of each level of the resulting MultiIndex and use that as the object used in the resulting index.levels attribute.

levels: list of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.

If Pandas already infers what these levels should be, what is the advantage of specifying them ourselves? I'll show one example and leave it up to you to think up other reasons why this might be useful.

Example

Per the documentation, the levels argument is a list of sequences. This means that we can use another pandas.Index as one of those sequences.

Consider the data frame df that is the concatenation of d1, d2 and d3:

df = pd.concat(
    [d1, d2, d3], axis=1,
    keys=['First', 'Second', 'Fourth'])

df

  First           Second           Fourth
      A    B    C      B    C    D      A    B    D
1   NaN  NaN  NaN    0.4  0.5  0.6    0.7  0.8  0.9
2   0.1  0.2  0.3    0.4  0.5  0.6    NaN  NaN  NaN
3   0.1  0.2  0.3    NaN  NaN  NaN    0.7  0.8  0.9

The levels of the columns object are:

print(*df.columns.levels, sep='\n')

Index(['First', 'Second', 'Fourth'], dtype='object')
Index(['A', 'B', 'C', 'D'], dtype='object')

If we use sum within a groupby we get:

df.groupby(axis=1, level=0).sum()

   First  Fourth  Second
1    0.0     2.4     1.5
2    0.6     0.0     1.5
3    0.6     2.4     0.0

But what if, instead of just ['First', 'Second', 'Fourth'], there were other missing categories named Third and Fifth, and I wanted them included in the results of a groupby aggregation? We can do this if we have a pandas.CategoricalIndex, and we can specify that ahead of time with the levels argument.

So instead, let's define df as:

cats = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
lvl = pd.CategoricalIndex(cats, categories=cats, ordered=True)

df = pd.concat(
    [d1, d2, d3], axis=1,
    keys=['First', 'Second', 'Fourth'],
    levels=[lvl]
)

df

  First           Second           Fourth
      A    B    C      B    C    D      A    B    D
1   NaN  NaN  NaN    0.4  0.5  0.6    0.7  0.8  0.9
2   0.1  0.2  0.3    0.4  0.5  0.6    NaN  NaN  NaN
3   0.1  0.2  0.3    NaN  NaN  NaN    0.7  0.8  0.9

But the first level of the columns object is:

df.columns.levels[0]

CategoricalIndex(
    ['First', 'Second', 'Third', 'Fourth', 'Fifth'],
    categories=['First', 'Second', 'Third', 'Fourth', 'Fifth'],
    ordered=True, dtype='category')

And our groupby summation looks like:

df.groupby(axis=1, level=0).sum()

   First  Second  Third  Fourth  Fifth
1    0.0     1.5    0.0     2.4    0.0
2    0.6     1.5    0.0     0.0    0.0
3    0.6     0.0    0.0     2.4    0.0


names

This is used to name the levels of a resulting MultiIndex. The length of the names list should match the number of levels in the resulting MultiIndex.

names: list, default None
Names for the levels in the resulting hierarchical index

 # axis=0                     |  # axis=1
 pd.concat(                   |  pd.concat(
     [d1, d2],                |      [d1, d2],
     keys=[0, 1],             |      axis=1, keys=[0, 1],
     names=['lvl0', 'lvl1'])  |      names=['lvl0', 'lvl1'])
----------------------------- | ----------------------------------
             A    B    C    D | lvl0    0              1
lvl0 lvl1                     | lvl1    A    B    C    B    C    D
0    2     0.1  0.2  0.3  NaN | 1     NaN  NaN  NaN  0.4  0.5  0.6
     3     0.1  0.2  0.3  NaN | 2     0.1  0.2  0.3  0.4  0.5  0.6
1    1     NaN  0.4  0.5  0.6 | 3     0.1  0.2  0.3  NaN  NaN  NaN
     2     NaN  0.4  0.5  0.6 |
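
Once the levels are named, they can be referred to by name instead of by position. A minimal sketch, where named and by_name are illustrative variable names:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

named = pd.concat([d1, d2], keys=[0, 1], names=['lvl0', 'lvl1'])

print(list(named.index.names))

# Take the cross-section for key 1 using the level name.
by_name = named.xs(1, level='lvl0')
print(by_name)
```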


verify_integrity

The documentation is self-explanatory:

verify_integrity: boolean, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

Because the index that results from concatenating d1 and d2 is not unique, it would fail the integrity check.

pd.concat([d1, d2])

     A    B    C    D
2  0.1  0.2  0.3  NaN
3  0.1  0.2  0.3  NaN
1  NaN  0.4  0.5  0.6
2  NaN  0.4  0.5  0.6

And

pd.concat([d1, d2], verify_integrity=True)

> ValueError: Indexes have overlapping values: [2]

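
As a sketch of how this might be handled in practice, the snippet below catches the error and then shows that adding keys makes the combined index unique, so the same check passes. The key labels 'd1' and 'd2' are arbitrary:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

# The plain concat has the label 2 twice, so the check raises.
try:
    pd.concat([d1, d2], verify_integrity=True)
    message = None
except ValueError as err:
    message = str(err)

# With keys, each row label becomes a (key, label) pair, which is unique.
checked = pd.concat([d1, d2], keys=['d1', 'd2'], verify_integrity=True)

print(message)
print(checked.index.is_unique)
```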