What are the 'levels', 'keys', and 'names' arguments for in Pandas' concat function?
Source: http://stackoverflow.com/questions/49620538/
Note: this content comes from StackOverflow and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by piRSquared
Questions
- How do I use pd.concat?
- What is the levels argument for?
- What is the keys argument for?
- Are there a bunch of examples to help explain how to use all the arguments?
Pandas' concat function is the Swiss Army knife of the merging utilities. It is useful in a wide variety of situations. The existing documentation leaves out a few details on some of the optional arguments, among them the levels and keys arguments. I set out to figure out what those arguments do.
I'll pose a question that will act as a gateway into many aspects of pd.concat.
Consider the data frames d1, d2, and d3:
import pandas as pd
d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), [2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), [1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), [1, 3])
If I were to concatenate these together with
pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'])
I get the expected result with a pandas.MultiIndex for my index object:
A B C D
d1 2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
d2 1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
d3 1 0.7 0.8 NaN 0.9
3 0.7 0.8 NaN 0.9
However, I wanted to use the levels argument. Its documentation reads:
levels: list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise, they will be inferred from the keys.
So I passed
pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2']])
And got a ValueError:
ValueError: Key d3 not in level Index(['d1', 'd2'], dtype='object')
This made sense. The levels I passed were inadequate to describe the necessary levels indicated by the keys. Had I not passed anything, as I did above, the levels would have been inferred (as stated in the documentation). But how else can I use this argument to better effect?
If I tried this instead:
pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2', 'd3']])
I got the same results as above. But when I add one more value to the levels,
df = pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2', 'd3', 'd4']])
I end up with the same looking data frame, but the resulting MultiIndex has an unused level.
df.index.levels[0]
Index(['d1', 'd2', 'd3', 'd4'], dtype='object')
So what is the point of the levels argument, and should I be using keys differently?
I'm using Python 3.6 and Pandas 0.22.
Answered by piRSquared
In the process of answering this question for myself, I learned many things, and I wanted to put together a catalog of examples and some explanation.
The specific answer to the point of the levels argument will come towards the end.
pandas.concat: The Missing Manual
Imports and defining objects
import pandas as pd
d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])
s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])
Arguments
objs
The first argument we come across is objs:
objs: a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None, in which case a ValueError will be raised.
- We typically see this used with a list of Series or DataFrame objects.
- I'll show that dict can be very useful as well.
- Generators may also be used and can be useful when using map, as in map(f, list_of_df).
For now, we'll stick with a list of some of the DataFrame and Series objects defined above. I'll show how dictionaries can be leveraged to give very useful MultiIndex results later.
pd.concat([d1, d2])
A B C D
2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
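The generator case mentioned in the list above can be sketched as follows. The helper tag_source and the source column are hypothetical names introduced only for illustration; the point is simply that pd.concat consumes a lazy iterator the same way it consumes a list.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])


def tag_source(pair):
    # Hypothetical helper: record which frame each row came from.
    name, frame = pair
    return frame.assign(source=name)


# map returns a lazy iterator; pd.concat accepts it like any other iterable.
stacked = pd.concat(map(tag_source, [('d1', d1), ('d2', d2)]))
print(stacked)
```

This avoids building an intermediate list when each frame needs a small transformation before stacking.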
axis
The second argument we encounter is axis, whose default value is 0:
axis: {0/'index', 1/'columns'}, default 0. The axis to concatenate along.
Two DataFrames with axis=0 (stacked)
For values of 0 or index we mean to say: "Align along the columns and add to the index".
This is what happened above, because 0 is the default value for axis. We see that the index of d2 extends the index of d1 despite there being overlap of the value 2:
pd.concat([d1, d2], axis=0)
A B C D
2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
Two DataFrames with axis=1 (side by side)
For values of 1 or columns we mean to say: "Align along the index and add to the columns".
pd.concat([d1, d2], axis=1)
A B C B C D
1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN
We can see that the resulting index is the union of the indices, and the resulting columns are the columns of d1 extended by the columns of d2.
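A small sketch that checks both behaviours side by side. It only inspects the labels of the results; the row order of the axis=1 result may depend on the pandas version, so the index is sorted before printing.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

stacked = pd.concat([d1, d2], axis=0)       # rows appended, columns aligned
side_by_side = pd.concat([d1, d2], axis=1)  # columns appended, rows aligned

print(stacked.index.tolist())         # row labels of d1 followed by those of d2
print(side_by_side.columns.tolist())  # column labels of d1 followed by those of d2
print(sorted(side_by_side.index))     # union of both indexes
```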
Two (or Three) Series with axis=0 (stacked)
When combining pandas.Series along axis=0, we get back a pandas.Series. The name of the resulting Series will be None unless all Series being combined have the same name. Pay attention to the 'Name: A' when we print out the resulting Series. When it isn't present, we can assume the Series name is None.
| | | pd.concat(
| pd.concat( | pd.concat( | [s1.rename('A'),
pd.concat( | [s1.rename('A'), | [s1.rename('A'), | s2.rename('B'),
[s1, s2]) | s2]) | s2.rename('A')]) | s3.rename('A')])
-------------- | --------------------- | ---------------------- | ----------------------
2 1 | 2 1 | 2 1 | 2 1
3 2 | 3 2 | 3 2 | 3 2
1 3 | 1 3 | 1 3 | 1 3
2 4 | 2 4 | 2 4 | 2 4
dtype: int64 | dtype: int64 | Name: A, dtype: int64 | 1 5
| | | 3 6
| | | dtype: int64
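The naming rule in the table above can be checked directly through the name attribute. This is a minimal sketch; the variable names same, mixed and unnamed are only for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])

same = pd.concat([s1.rename('A'), s2.rename('A')])   # all names agree
mixed = pd.concat([s1.rename('A'), s2.rename('B')])  # names disagree
unnamed = pd.concat([s1.rename('A'), s2])            # one Series has no name

print(same.name, mixed.name, unnamed.name)
```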
Two (or Three) Series with axis=1 (side by side)
When combining pandas.Series along axis=1, it is the name attribute that we refer to in order to infer a column name in the resulting pandas.DataFrame.
| | pd.concat(
| pd.concat( | [s1.rename('X'),
pd.concat( | [s1.rename('X'), | s2.rename('Y'),
[s1, s2], axis=1) | s2], axis=1) | s3.rename('Z')], axis=1)
---------------------- | --------------------- | ------------------------------
0 1 | X 0 | X Y Z
1 NaN 3.0 | 1 NaN 3.0 | 1 NaN 3.0 5.0
2 1.0 4.0 | 2 1.0 4.0 | 2 1.0 4.0 NaN
3 2.0 NaN | 3 2.0 NaN | 3 2.0 NaN 6.0
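The same three cases, reduced to their column labels. This is a sketch that mirrors the table above; unnamed Series receive integer labels in order of appearance.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])

none_named = pd.concat([s1, s2], axis=1)
partly_named = pd.concat([s1.rename('X'), s2], axis=1)
all_named = pd.concat([s1.rename('X'), s2.rename('Y')], axis=1)

print(none_named.columns.tolist())
print(partly_named.columns.tolist())
print(all_named.columns.tolist())
```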
Mixed Series and DataFrame with axis=0 (stacked)
When performing a concatenation of a Series and DataFrame along axis=0, we convert all Series to single column DataFrames.
Take special note that this is a concatenation along axis=0; that means extending the index (rows) while aligning the columns. In the examples below, we see the index becomes [2, 3, 2, 3], which is an indiscriminate appending of indices. The columns do not overlap unless I force the naming of the Series column with the argument to to_frame:
pd.concat( |
[s1.to_frame(), d1]) | pd.concat([s1, d1])
------------------------- | ---------------------
0 A B C | 0 A B C
2 1.0 NaN NaN NaN | 2 1.0 NaN NaN NaN
3 2.0 NaN NaN NaN | 3 2.0 NaN NaN NaN
2 NaN 0.1 0.2 0.3 | 2 NaN 0.1 0.2 0.3
3 NaN 0.1 0.2 0.3 | 3 NaN 0.1 0.2 0.3
You can see the results of pd.concat([s1, d1]) are the same as if I had performed the to_frame myself.
However, I can control the name of the resulting column with a parameter to to_frame. Renaming the Series with the rename method does not control the column name in the resulting DataFrame.
# Effectively renames | |
# `s1` but does not align | # Does not rename. So | # Renames to something
# with columns in `d1` | # Pandas defaults to `0` | # that does align with `d1`
pd.concat( | pd.concat( | pd.concat(
[s1.to_frame('X'), d1]) | [s1.rename('X'), d1]) | [s1.to_frame('B'), d1])
---------------------------- | -------------------------- | ----------------------------
A B C X | 0 A B C | A B C
2 NaN NaN NaN 1.0 | 2 1.0 NaN NaN NaN | 2 NaN 1.0 NaN
3 NaN NaN NaN 2.0 | 3 2.0 NaN NaN NaN | 3 NaN 2.0 NaN
2 0.1 0.2 0.3 NaN | 2 NaN 0.1 0.2 0.3 | 2 0.1 0.2 0.3
3 0.1 0.2 0.3 NaN | 3 NaN 0.1 0.2 0.3 | 3 0.1 0.2 0.3
Mixed Series and DataFrame with axis=1 (side by side)
This is fairly intuitive. The Series column name defaults to an enumeration of such Series objects when a name attribute is not available.
| pd.concat(
pd.concat( | [s1.rename('X'),
[s1, d1], | s2, s3, d1],
axis=1) | axis=1)
------------------- | -------------------------------
0 A B C | X 0 1 A B C
2 1 0.1 0.2 0.3 | 1 NaN 3.0 5.0 NaN NaN NaN
3 2 0.1 0.2 0.3 | 2 1.0 4.0 NaN 0.1 0.2 0.3
| 3 2.0 NaN 6.0 0.1 0.2 0.3
join
The third argument is join, which describes whether the resulting merge should be an outer merge (default) or an inner merge.
join: {'inner', 'outer'}, default 'outer'
How to handle indexes on other axis(es).
It turns out there is no left or right option, as pd.concat can handle more than just two objects to merge.
In the case of d1 and d2, the options look like:
outer
pd.concat([d1, d2], axis=1, join='outer')
A B C B C D
1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN
inner
pd.concat([d1, d2], axis=1, join='inner')
A B C B C D
2 0.1 0.2 0.3 0.4 0.5 0.6
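To make the difference concrete, here is a short sketch comparing the row labels kept by each option. The outer index is sorted before printing because its order may vary between pandas versions.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

outer = pd.concat([d1, d2], axis=1, join='outer')  # union of row labels
inner = pd.concat([d1, d2], axis=1, join='inner')  # intersection of row labels

print(sorted(outer.index))
print(inner.index.tolist())
```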
join_axes
The fourth argument is the thing that allows us to do our left merge and more.
join_axes: list of Index objects
Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic.
Left Merge
pd.concat([d1, d2, d3], axis=1, join_axes=[d1.index])
A B C B C D A B D
2 0.1 0.2 0.3 0.4 0.5 0.6 NaN NaN NaN
3 0.1 0.2 0.3 NaN NaN NaN 0.7 0.8 0.9
Right Merge
pd.concat([d1, d2, d3], axis=1, join_axes=[d3.index])
A B C B C D A B D
1 NaN NaN NaN 0.4 0.5 0.6 0.7 0.8 0.9
3 0.1 0.2 0.3 NaN NaN NaN 0.7 0.8 0.9
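To my knowledge, join_axes was deprecated and then removed in later pandas releases, so the calls above may not run on a current installation. A sketch of an equivalent approach, under that assumption, is to concatenate with the default outer join and then reindex the rows to the index you want to keep:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

wide = pd.concat([d1, d2, d3], axis=1)

left = wide.reindex(index=d1.index)   # keep only the rows of d1 ("left merge")
right = wide.reindex(index=d3.index)  # keep only the rows of d3 ("right merge")

print(left)
print(right)
```

Reindexing only touches the row axis, so the duplicated column labels are not a problem here.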
ignore_index
ignore_index: boolean, default False
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
For example, when I stack d1 on top of d2 and don't care about the index values, I can reset them or ignore them.
| pd.concat( | pd.concat(
| [d1, d2], | [d1, d2]
pd.concat([d1, d2]) | ignore_index=True) | ).reset_index(drop=True)
--------------------- | ----------------------- | -------------------------
A B C D | A B C D | A B C D
2 0.1 0.2 0.3 NaN | 0 0.1 0.2 0.3 NaN | 0 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN | 1 0.1 0.2 0.3 NaN | 1 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6 | 2 NaN 0.4 0.5 0.6 | 2 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6 | 3 NaN 0.4 0.5 0.6 | 3 NaN 0.4 0.5 0.6
And when using axis=1:
| pd.concat(
| [d1, d2], axis=1,
pd.concat([d1, d2], axis=1) | ignore_index=True)
------------------------------- | -------------------------------
A B C B C D | 0 1 2 3 4 5
1 NaN NaN NaN 0.4 0.5 0.6 | 1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6 | 2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN | 3 0.1 0.2 0.3 NaN NaN NaN
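A brief sketch confirming that ignore_index relabels only the concatenation axis and leaves the other axis alone:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

rows = pd.concat([d1, d2], ignore_index=True)          # row labels replaced
cols = pd.concat([d1, d2], axis=1, ignore_index=True)  # column labels replaced

print(rows.index.tolist())
print(cols.columns.tolist())
print(sorted(cols.index))  # row labels are still the original ones
```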
keys
We can pass a list of scalar values or tuples in order to assign tuple or scalar values to the corresponding MultiIndex. The length of the passed list must be the same as the number of items being concatenated.
keys: sequence, default None
If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level
axis=0
When concatenating Series objects along axis=0 (extending the index), those keys become a new initial level of a MultiIndex object in the index attribute.
# length 3 length 3 # length 2 length 2
# /--------\ /-----------\ # /----\ /------\
pd.concat([s1, s2, s3], keys=['A', 'B', 'C']) pd.concat([s1, s2], keys=['A', 'B'])
---------------------------------------------- -------------------------------------
A 2 1 A 2 1
3 2 3 2
B 1 3 B 1 3
2 4 2 4
C 1 5 dtype: int64
3 6
dtype: int64
However, we can use more than scalar values in the keys argument to create an even deeper MultiIndex. Here we pass tuples of length 2 to prepend two new levels of a MultiIndex:
pd.concat(
[s1, s2, s3],
keys=[('A', 'X'), ('A', 'Y'), ('B', 'X')])
-----------------------------------------------
A X 2 1
3 2
Y 1 3
2 4
B X 1 5
3 6
dtype: int64
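One practical payoff of tuple keys is that the outer levels can then be used for selection. The following is a small sketch; the variable name nested is only for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=[2, 3])
s2 = pd.Series([3, 4], index=[1, 2])
s3 = pd.Series([5, 6], index=[1, 3])

nested = pd.concat(
    [s1, s2, s3],
    keys=[('A', 'X'), ('A', 'Y'), ('B', 'X')])

print(nested.index.nlevels)  # two key levels plus the original index
print(nested.loc['A'])       # everything stored under the outer key 'A'
```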
axis=1
It's a bit different when extending along columns. When we used axis=0 (see above), our keys acted as MultiIndex levels in addition to the existing index. For axis=1, we are referring to an axis that Series objects don't have, namely the columns attribute.
Series with axis=1
Notice that naming s1 and s2 matters so long as no keys are passed, but it gets overridden if keys are passed.
| | | pd.concat(
| pd.concat( | pd.concat( | [s1.rename('U'),
pd.concat( | [s1, s2], | [s1.rename('U'), | s2.rename('V')],
[s1, s2], | axis=1, | s2.rename('V')], | axis=1,
axis=1) | keys=['X', 'Y']) | axis=1) | keys=['X', 'Y'])
-------------- | --------------------- | ---------------------- | ----------------------
0 1 | X Y | U V | X Y
1 NaN 3.0 | 1 NaN 3.0 | 1 NaN 3.0 | 1 NaN 3.0
2 1.0 4.0 | 2 1.0 4.0 | 2 1.0 4.0 | 2 1.0 4.0
3 2.0 NaN | 3 2.0 NaN | 3 2.0 NaN | 3 2.0 NaN
MultiIndex with Series and axis=1
pd.concat(
[s1, s2],
axis=1,
keys=[('W', 'X'), ('W', 'Y')])
-----------------------------------
W
X Y
1 NaN 3.0
2 1.0 4.0
3 2.0 NaN
Two DataFrames with axis=1
As with the axis=0 examples, keys add levels to a MultiIndex, but this time to the object stored in the columns attribute.
pd.concat( | pd.concat(
[d1, d2], | [d1, d2],
axis=1, | axis=1,
keys=['X', 'Y']) | keys=[('First', 'X'), ('Second', 'X')])
------------------------------- | --------------------------------------------
X Y | First Second
A B C B C D | X X
1 NaN NaN NaN 0.4 0.5 0.6 | A B C B C D
2 0.1 0.2 0.3 0.4 0.5 0.6 | 1 NaN NaN NaN 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN | 2 0.1 0.2 0.3 0.4 0.5 0.6
| 3 0.1 0.2 0.3 NaN NaN NaN
Series and DataFrame with axis=1
This is tricky. In this case, a scalar key value cannot act as the only level of index for the Series object when it becomes a column while also acting as the first level of a MultiIndex for the DataFrame. So Pandas will again use the name attribute of the Series object as the source of the column name.
pd.concat( | pd.concat(
[s1, d1], | [s1.rename('Z'), d1],
axis=1, | axis=1,
keys=['X', 'Y']) | keys=['X', 'Y'])
--------------------- | --------------------------
X Y | X Y
0 A B C | Z A B C
2 1 0.1 0.2 0.3 | 2 1 0.1 0.2 0.3
3 2 0.1 0.2 0.3 | 3 2 0.1 0.2 0.3
Limitations of keys and MultiIndex inference
Pandas only seems to infer column names from the Series name, but it will not fill in the blanks when doing an analogous concatenation among data frames with a different number of column levels.
d1_ = pd.concat(
[d1], axis=1,
keys=['One'])
d1_
One
A B C
2 0.1 0.2 0.3
3 0.1 0.2 0.3
Then concatenate this with another data frame that has only one level in the columns object. Pandas will not try to build a proper MultiIndex; instead it combines all data frames as if the columns were a single level of objects, mixing scalars and tuples.
pd.concat([d1_, d2], axis=1)
(One, A) (One, B) (One, C) B C D
1 NaN NaN NaN 0.4 0.5 0.6
2 0.1 0.2 0.3 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN NaN NaN
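One way to sidestep this limitation, sketched here as a suggestion rather than something from the original answer, is to give every frame the same number of column levels by passing keys for all of them in a single call. The key 'Two' is a hypothetical label chosen for illustration.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

# Both frames receive an outer key, so the columns form a two-level MultiIndex.
both = pd.concat([d1, d2], axis=1, keys=['One', 'Two'])

print(both.columns.nlevels)
print(both.columns.tolist())
```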
Passing a dict instead of a list
When passing a dictionary, pandas.concat will use the keys from the dictionary as the keys parameter.
# axis=0 | # axis=1
pd.concat( | pd.concat(
{0: d1, 1: d2}) | {0: d1, 1: d2}, axis=1)
----------------------- | -------------------------------
A B C D | 0 1
0 2 0.1 0.2 0.3 NaN | A B C B C D
3 0.1 0.2 0.3 NaN | 1 NaN NaN NaN 0.4 0.5 0.6
1 1 NaN 0.4 0.5 0.6 | 2 0.1 0.2 0.3 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6 | 3 0.1 0.2 0.3 NaN NaN NaN
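The equivalence between a dict and an explicit keys list can be checked with a short sketch. The labels 'x' and 'y' are arbitrary and are given in sorted order, so the result does not depend on whether a given pandas version sorts dict keys.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

from_dict = pd.concat({'x': d1, 'y': d2})
from_list = pd.concat([d1, d2], keys=['x', 'y'])

print(from_dict.index.get_level_values(0).tolist())
print(from_dict.equals(from_list))
```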
levels
This is used in conjunction with the keys argument. When levels is left as its default value of None, Pandas will take the unique values of each level of the resulting MultiIndex and use them as the object stored in the resulting index.levels attribute.
levels: list of sequences, default None
Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
If Pandas already infers what these levels should be, what advantage is there to specifying them ourselves? I'll show one example and leave it up to you to think up other reasons why this might be useful.
Example
Per the documentation, the levels argument is a list of sequences. This means that we can use another pandas.Index as one of those sequences.
Consider the data frame df that is the concatenation of d1, d2 and d3:
df = pd.concat(
[d1, d2, d3], axis=1,
keys=['First', 'Second', 'Fourth'])
df
First Second Fourth
A B C B C D A B D
1 NaN NaN NaN 0.4 0.5 0.6 0.7 0.8 0.9
2 0.1 0.2 0.3 0.4 0.5 0.6 NaN NaN NaN
3 0.1 0.2 0.3 NaN NaN NaN 0.7 0.8 0.9
The levels of the columns object are:
print(*df.columns.levels, sep='\n')
Index(['First', 'Second', 'Fourth'], dtype='object')
Index(['A', 'B', 'C', 'D'], dtype='object')
If we use sum within a groupby we get:
df.groupby(axis=1, level=0).sum()
First Fourth Second
1 0.0 2.4 1.5
2 0.6 0.0 1.5
3 0.6 2.4 0.0
But what if, instead of ['First', 'Second', 'Fourth'], there were additional missing categories named Third and Fifth, and I wanted them included in the results of a groupby aggregation? We can do this if we have a pandas.CategoricalIndex, and we can specify that ahead of time with the levels argument.
So instead, let's define df as:
cats = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
lvl = pd.CategoricalIndex(cats, categories=cats, ordered=True)
df = pd.concat(
[d1, d2, d3], axis=1,
keys=['First', 'Second', 'Fourth'],
levels=[lvl]
)
df
    First           Second           Fourth
        A    B    C      B    C    D      A    B    D
1     NaN  NaN  NaN    0.4  0.5  0.6    0.7  0.8  0.9
2     0.1  0.2  0.3    0.4  0.5  0.6    NaN  NaN  NaN
3     0.1  0.2  0.3    NaN  NaN  NaN    0.7  0.8  0.9
But the first level of the columns object is:
df.columns.levels[0]
CategoricalIndex(
['First', 'Second', 'Third', 'Fourth', 'Fifth'],
categories=['First', 'Second', 'Third', 'Fourth', 'Fifth'],
ordered=True, dtype='category')
And our groupby summation looks like:
df.groupby(axis=1, level=0).sum()
First Second Third Fourth Fifth
1 0.0 1.5 0.0 2.4 0.0
2 0.6 1.5 0.0 0.0 0.0
3 0.6 0.0 0.0 2.4 0.0
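Returning to the frames from the question, the following sketch shows the two behaviours of levels discussed there with a plain list: an extra value is kept as an unused level, and a missing value raises a ValueError. Calling remove_unused_levels is one way to drop the extra entry again; the variable names are only for illustration.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])
d3 = pd.DataFrame(dict(A=.7, B=.8, D=.9), index=[1, 3])

df = pd.concat(
    [d1, d2, d3],
    keys=['d1', 'd2', 'd3'],
    levels=[['d1', 'd2', 'd3', 'd4']])

print(list(df.index.levels[0]))           # 'd4' is present but unused
trimmed = df.index.remove_unused_levels()
print(list(trimmed.levels[0]))            # 'd4' is gone

message = None
try:
    # 'd3' is used as a key but is missing from the supplied levels.
    pd.concat([d1, d2, d3], keys=['d1', 'd2', 'd3'], levels=[['d1', 'd2']])
except ValueError as err:
    message = str(err)
print(message)
```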
names
This is used to name the levels of a resulting MultiIndex. The length of the names list should match the number of levels in the resulting MultiIndex.
names: list, default None
Names for the levels in the resulting hierarchical index
# axis=0 | # axis=1
pd.concat( | pd.concat(
[d1, d2], | [d1, d2],
keys=[0, 1], | axis=1, keys=[0, 1],
names=['lvl0', 'lvl1']) | names=['lvl0', 'lvl1'])
----------------------------- | ----------------------------------
A B C D | lvl0 0 1
lvl0 lvl1 | lvl1 A B C B C D
0 2 0.1 0.2 0.3 NaN | 1 NaN NaN NaN 0.4 0.5 0.6
3 0.1 0.2 0.3 NaN | 2 0.1 0.2 0.3 0.4 0.5 0.6
1 1 NaN 0.4 0.5 0.6 | 3 0.1 0.2 0.3 NaN NaN NaN
2 NaN 0.4 0.5 0.6 |
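Named levels are mostly useful because later operations can refer to them by name instead of by position. A minimal sketch, using xs as one such operation:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

named = pd.concat([d1, d2], keys=[0, 1], names=['lvl0', 'lvl1'])

print(list(named.index.names))
# Select the rows that came from the second frame by level name.
second = named.xs(1, level='lvl0')
print(second)
```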
verify_integrity
The documentation is self-explanatory:
verify_integrity: boolean, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
Because the resulting index from concatenating d1 and d2 is not unique, it would fail the integrity check.
pd.concat([d1, d2])
A B C D
2 0.1 0.2 0.3 NaN
3 0.1 0.2 0.3 NaN
1 NaN 0.4 0.5 0.6
2 NaN 0.4 0.5 0.6
And
pd.concat([d1, d2], verify_integrity=True)
> ValueError: Indexes have overlapping values: [2]
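As a sketch of how this check interacts with keys: the plain concatenation fails because the label 2 appears twice, while adding keys makes every row label a unique tuple, so the same check passes. The key labels 'x' and 'y' are arbitrary.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=.1, B=.2, C=.3), index=[2, 3])
d2 = pd.DataFrame(dict(B=.4, C=.5, D=.6), index=[1, 2])

try:
    pd.concat([d1, d2], verify_integrity=True)
    failed = False
except ValueError as err:
    failed = True
    print(err)

# With keys, the resulting MultiIndex has no duplicate entries.
checked = pd.concat([d1, d2], keys=['x', 'y'], verify_integrity=True)
print(checked.index.is_unique)
```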