Python 如何旋转数据框
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/47152691/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
How to pivot a dataframe
提问by piRSquared
- What is pivot?
- How do I pivot?
- Is this a pivot?
- Long format to wide format?
- 什么是枢轴?
- 我如何旋转?
- 这是一个支点吗?
- 长格式到宽格式?
I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting....
我见过很多关于数据透视表的问题。即使他们不知道他们在询问数据透视表,他们通常也是。几乎不可能写出一个规范的问题和答案,涵盖旋转的所有方面......
... But I'm going to give it a go.
......但我要试一试。
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)
现有问题和答案的问题在于,问题通常集中在 OP 难以概括的细微差别上,以便使用许多现有的好答案。但是,没有一个答案试图给出全面的解释(因为这是一项艰巨的任务)
Look a few examples from my google search
从我的谷歌搜索中看一些例子
- How to pivot a dataframe in Pandas?
- Good question and answer. But the answer only answers the specific question with little explanation.
- pandas pivot table to data frame
- In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
- pandas pivoting a dataframe, duplicate rows
- Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
- Another decent question but the answer focuses on one method, namely
- 如何在 Pandas 中透视数据框?
- 很好的问答。但答案只回答了具体问题,几乎没有解释。
- 熊猫数据透视表到数据框
- 在这个问题中,OP 与枢轴的输出有关。即列的外观。OP 希望它看起来像 R。这对 Pandas 用户不是很有帮助。
- 大熊猫旋转数据框,重复行
- 另一个不错的问题,但答案集中在一种方法上,即
pd.DataFrame.pivot
- 另一个不错的问题,但答案集中在一种方法上,即
So whenever someone searches for pivot
they get sporadic results that are likely not going to answer their specific question.
因此,每当有人搜索时,pivot
他们都会得到零星的结果,这些结果可能不会回答他们的具体问题。
Setup
设置
You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.
您可能会注意到,我显着地命名了我的列和相关列值,以对应我将如何在下面的答案中进行透视。
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Question(s)
问题)
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
How do I pivot
df
such that thecol
values are columns,row
values are the index, and mean ofval0
are the values?col col0 col1 col2 col3 col4 row row0 0.77 0.605 NaN 0.860 0.65 row2 0.13 NaN 0.395 0.500 0.25 row3 NaN 0.310 NaN 0.545 NaN row4 NaN 0.100 0.395 0.760 0.24
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
Can I get something other than
mean
, like maybesum
?col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
Can I do more that one aggregation at a time?
sum mean col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24
Can I aggregate over multiple value columns?
val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
Can Subdivide by multiple columns?
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
Or
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,
np.random.seed([3, 1415]) df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)}) df2 A B 0 a 0 1 a 11 2 a 2 3 a 11 4 b 10 5 b 10 6 b 14 7 c 7
The expected should would look something like
a b c 0 0.0 10.0 7.0 1 11.0 10.0 NaN 2 2.0 14.0 NaN 3 11.0 NaN NaN
How do I flatten the multiple index to single index after
pivot
From
1 2 1 1 2 a 2 1 1 b 2 1 0 c 1 0 0
To
1|1 2|1 2|2 a 2 1 1 b 2 1 0 c 1 0 0
为什么我得到
ValueError: Index contains duplicate entries, cannot reshape
我如何旋转
df
,使得col
值是列,row
值是索引,而值是平均值val0
?col col0 col1 col2 col3 col4 row row0 0.77 0.605 NaN 0.860 0.65 row2 0.13 NaN 0.395 0.500 0.25 row3 NaN 0.310 NaN 0.545 NaN row4 NaN 0.100 0.395 0.760 0.24
我如何旋转
df
,使得col
值是列,row
值是索引,平均值val0
是值,缺失值是0
?col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
我能得到其他东西吗
mean
,比如也许sum
?col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
我可以一次做更多的聚合吗?
sum mean col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24
我可以聚合多个值列吗?
val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
可以按多列细分吗?
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
或者
item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
我可以汇总列和行一起出现的频率,也就是“交叉表”吗?
col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
如何通过仅旋转两列将 DataFrame 从长转换为宽?鉴于,
np.random.seed([3, 1415]) df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)}) df2 A B 0 a 0 1 a 11 2 a 2 3 a 11 4 b 10 5 b 10 6 b 14 7 c 7
预期应该看起来像
a b c 0 0.0 10.0 7.0 1 11.0 10.0 NaN 2 2.0 14.0 NaN 3 11.0 NaN NaN
之后如何将多个索引展平为单个索引
pivot
从
1 2 1 1 2 a 2 1 1 b 2 1 0 c 1 0 0
到
1|1 2|1 2|2 a 2 1 1 b 2 1 0 c 1 0 0
采纳答案by piRSquared
We start by answering the first question:
我们首先回答第一个问题:
Question 1
问题 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
为什么我得到
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
发生这种情况是因为 Pandas 试图用重复的条目重新索引 acolumns
或index
object。可以使用不同的方法来执行数据透视。其中一些不太适合当要求它在其中旋转的键有重复时。例如。考虑pd.DataFrame.pivot
。我知道有重复的条目共享row
和col
值:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
所以当我pivot
使用
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
我收到上面提到的错误。事实上,当我尝试执行相同的任务时,我遇到了同样的错误:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot
这是我们可以用来旋转的习语列表
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
pd.crosstab
- This a specialized version of
pivot_table
and in it's purest form is the most intuitive way to perform several tasks.
- This a specialized version of
pd.factorize
+np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
pd.DataFrame.groupby
+pd.DataFrame.unstack
- 几乎可以进行任何类型的枢轴操作的良好通用方法
- 您指定将构成一组中的透视行级别和列级别的所有列。您可以通过选择要聚合的剩余列和要执行聚合的函数来遵循此操作。最后,您
unstack
希望在列索引中的级别。
pd.DataFrame.pivot_table
groupby
具有更直观 API 的美化版本。对于许多人来说,这是首选方法。并且是开发人员的预期方法。- 指定行级别、列级别、要聚合的值以及执行聚合的函数。
pd.DataFrame.set_index
+pd.DataFrame.unstack
- 对某些人(包括我自己)来说方便且直观。无法处理重复的分组键。
- 与
groupby
范式类似,我们指定最终将成为行级或列级的所有列,并将它们设置为索引。然后unstack
我们在列中设置我们想要的级别。如果剩余的索引级别或列级别不是唯一的,则此方法将失败。
pd.DataFrame.pivot
- 非常相似
set_index
,因为它共享重复密钥限制。API 也非常有限。它只需要index
,columns
, 的标量值values
。 - 类似于
pivot_table
我们选择要旋转的行、列和值的方法。但是,我们无法聚合,如果行或列不是唯一的,则此方法将失败。
- 非常相似
pd.crosstab
- 这是一个专门的版本,
pivot_table
它最纯粹的形式是执行多项任务的最直观的方式。
- 这是一个专门的版本,
pd.factorize
+np.bincount
- 这是一种非常先进的技术,非常晦涩,但速度非常快。它不是在所有情况下都可以使用的,但是当它可以使用并且您使用起来很舒服时,您将获得性能奖励。
pd.get_dummies
+pd.DataFrame.dot
- 我用它来巧妙地执行交叉制表。
Examples
例子
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
对于每个后续的答案和问题,我要做的是使用pd.DataFrame.pivot_table
. 然后我将提供执行相同任务的替代方案。
Question 3
问题 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
我如何旋转
df
,使得col
值是列,row
值是索引,平均值val0
是值,缺失值是0
?
pd.DataFrame.pivot_table
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='mean') col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='mean').fillna(0)
pd.DataFrame.pivot_table
fill_value
默认情况下未设置。我倾向于适当地设置它。在这种情况下,我将其设置为0
. 请注意,我跳过了问题 2,因为它与没有fill_value
aggfunc='mean'
是默认值,我不必设置它。我包括它是为了明确。df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='mean') col col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 row2 0.13 0.000 0.395 0.500 0.25 row3 0.00 0.310 0.000 0.545 0.00 row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='mean').fillna(0)
Question 4
问题 4
Can I get something other than
mean
, like maybesum
?
我能得到其他东西吗
mean
,比如也许sum
?
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='sum') col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='sum').fillna(0)
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc='sum') col col0 col1 col2 col3 col4 row row0 0.77 1.21 0.00 0.86 0.65 row2 0.13 0.00 0.79 0.50 0.50 row3 0.00 0.31 0.00 1.09 0.00 row4 0.00 0.10 0.79 1.52 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc='sum').fillna(0)
Question 5
问题 5
Can I do more that one aggregation at a time?
我可以一次做更多的聚合吗?
Notice that for pivot_table
and crosstab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
请注意, forpivot_table
和crosstab
I 需要传递可调用列表。另一方面,groupby.agg
能够为有限数量的特殊函数取字符串。 groupby.agg
也会采用我们传递给其他函数的相同可调用对象,但利用字符串函数名称通常更有效,因为可以提高效率。
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc=[np.size, np.mean]) size mean col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65 row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25 row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00 row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns='col', fill_value=0, aggfunc=[np.size, np.mean]) size mean col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65 row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25 row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00 row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab( index=df['row'], columns=df['col'], values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
问题 6
Can I aggregate over multiple value columns?
我可以聚合多个值列吗?
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table( values=['val0', 'val1'], index='row', columns='col', fill_value=0, aggfunc='mean') val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
pd.DataFrame.pivot_table
我们通过了,values=['val0', 'val1']
但我们可以完全忽略它df.pivot_table( values=['val0', 'val1'], index='row', columns='col', fill_value=0, aggfunc='mean') val0 val1 col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4 row row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02 row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79 row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00 row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
问题 7
Can Subdivide by multiple columns?
可以按多列细分吗?
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
pd.DataFrame.groupby
df.groupby( ['row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index='row', columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 row row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65 row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00 row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00
pd.DataFrame.groupby
df.groupby( ['row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
问题 8
Can Subdivide by multiple columns?
可以按多列细分吗?
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index=['key', 'row'], columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
pd.DataFrame.groupby
df.groupby( ['key', 'row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index( ['key', 'row', 'item', 'col'] )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
pd.DataFrame.pivot_table
df.pivot_table( values='val0', index=['key', 'row'], columns=['item', 'col'], fill_value=0, aggfunc='mean') item item0 item1 item2 col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4 key row key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00 row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00 row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00 key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65 row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13 row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00 row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00 row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00 row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00
pd.DataFrame.groupby
df.groupby( ['key', 'row', 'item', 'col'] )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
pd.DataFrame.set_index
因为这组键对于行和列都是唯一的df.set_index( ['key', 'row', 'item', 'col'] )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
问题 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
我可以汇总列和行一起出现的频率,也就是“交叉表”吗?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size') col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.crosstab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r` # for column `'row'` i, r = pd.factorize(df['row'].values) # get integer factorization `j` and unique values `c` # for column `'col'` j, c = pd.factorize(df['col'].values) # `n` will be the number of rows # `m` will be the number of columns n, m = r.size, c.size # `i * m + j` is a clever way of counting the # factorization bins assuming a flat array of length # `n * m`. Which is why we subsequently reshape as `(n, m)` b = np.bincount(i * m + j, minlength=n * m).reshape(n, m) # BTW, whenever I read this, I think 'Bean, Rice, and Cheese' pd.DataFrame(b, r, c) col3 col2 col0 col1 col4 row3 2 0 0 1 0 row2 1 2 1 0 2 row0 1 0 1 2 1 row4 2 2 0 1 1
pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col'])) col0 col1 col2 col3 col4 row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size') col col0 col1 col2 col3 col4 row row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.crosstab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r` # for column `'row'` i, r = pd.factorize(df['row'].values) # get integer factorization `j` and unique values `c` # for column `'col'` j, c = pd.factorize(df['col'].values) # `n` will be the number of rows # `m` will be the number of columns n, m = r.size, c.size # `i * m + j` is a clever way of counting the # factorization bins assuming a flat array of length # `n * m`. Which is why we subsequently reshape as `(n, m)` b = np.bincount(i * m + j, minlength=n * m).reshape(n, m) # BTW, whenever I read this, I think 'Bean, Rice, and Cheese' pd.DataFrame(b, r, c) col3 col2 col0 col1 col4 row3 2 0 0 1 0 row2 1 2 1 0 2 row0 1 0 1 2 1 row4 2 2 0 1 1
pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col'])) col0 col1 col2 col3 col4 row0 1 2 0 1 1 row2 1 0 2 1 2 row3 0 1 0 2 0 row4 0 1 2 2 1
Question 10
问题 10
How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?
如何通过仅旋转两列将 DataFrame 从长转换为宽?
The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount
:
第一步是为每一行分配一个数字——这个数字将是旋转结果中该值的行索引。这是使用GroupBy.cumcount
以下方法完成的:
df2.insert(0, 'count', df.groupby('A').cumcount())
df2
count A B
0 0 a 0
1 1 a 11
2 2 a 2
3 3 a 11
4 0 b 10
5 1 b 10
6 2 b 14
7 0 c 7
The second step is to use the newly created column as the index to call DataFrame.pivot
.
第二步是使用新创建的列作为索引来调用DataFrame.pivot
。
df2.pivot(*df)
# df.pivot(index='count', columns='A', values='B')
A a b c
count
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
Question 11
问题 11
How do I flatten the multiple index to single index after
pivot
之后如何将多个索引展平为单个索引
pivot
If columns
type object
with string join
如果columns
输入object
字符串join
df.columns = df.columns.map('|'.join)
else format
别的 format
df.columns = df.columns.map('{0[0]}|{0[1]}'.format)