pandas: how to run a pivot with a multi-index?
Note: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must keep the same license, cite the original address, and attribute it to the original authors (not me): StackOverFlow
Original address: http://stackoverflow.com/questions/35414625/
Asked by Pythonista anonymous
I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.
The only way I managed to get this to work was to combine the two fields into one, then separate them again. Is there a better way?
Minimal code copied below. Thanks a lot!
PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.
import pandas as pd
import numpy as np
df = pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)
df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))
# This doesn't work:
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')
# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')
# This below works but is not ideal:
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']
mypiv = df.pivot(index='new field', columns='item', values='value').reset_index()
mypiv['year'] = mypiv['new field'] // 100  # integer division recovers the year
mypiv['month'] = mypiv['new field'] % 100
Accepted answer by Alexander
You can group and then unstack.
>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item        item 1  item 2
year month
2004 1          33     250
     2          44     224
     3          41     268
     4          29     232
     5          57     252
     6          61     255
     7          28     254
     8          15     229
     9          29     258
     10         49     207
     11         36     254
     12         23     209
Or use pivot_table:
>>> df.pivot_table(
...     values='value',
...     index=['year', 'month'],
...     columns='item',
...     aggfunc='sum')
item        item 1  item 2
year month
2004 1          33     250
     2          44     224
     3          41     268
     4          29     232
     5          57     252
     6          61     255
     7          28     254
     8          15     229
     9          29     258
     10         49     207
     11         36     254
     12         23     209
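As a quick check on the two approaches above, here is a minimal sketch on a small deterministic frame. The four values are made up for illustration and are not the data from the question; the point is only that both calls produce a (year, month) MultiIndex with one column per item.

```python
import pandas as pd

# Small frame shaped like the one in the question (illustrative values only).
df = pd.DataFrame({
    "year":  [2004, 2004, 2004, 2004],
    "month": [1, 2, 1, 2],
    "item":  ["item 1", "item 1", "item 2", "item 2"],
    "value": [10, 20, 210, 220],
})

# Approach 1: aggregate per (year, month, item), then move 'item' to the columns.
via_groupby = df.groupby(["year", "month", "item"])["value"].sum().unstack("item")

# Approach 2: let pivot_table do the grouping and reshaping in one call.
via_pivot_table = df.pivot_table(values="value", index=["year", "month"],
                                 columns="item", aggfunc="sum")

print(list(via_groupby.index.names))           # ['year', 'month']
print(via_groupby.loc[(2004, 1), "item 2"])    # 210
```

Both results hold the same numbers here, so the choice between them is mostly a matter of taste and speed.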
Answer by Ajean
I believe if you include item in your MultiIndex, then you can just unstack:
df.set_index(['year', 'month', 'item']).unstack(level=-1)
This yields:
            value
item       item 1 item 2
year month
2004 1         21    277
     2         43    244
     3         12    262
     4         80    201
     5         22    287
     6         52    284
     7         90    249
     8         14    229
     9         52    205
     10        76    207
     11        88    259
     12        90    200
It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
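One practical difference worth keeping in mind when choosing between these: set_index followed by unstack has no aggregation step, so it only works when each (year, month, item) combination appears once. The sketch below uses a hypothetical frame with one repeated combination to show the contrast with the groupby approach.

```python
import pandas as pd

# Hypothetical data: ('item 1', 2004, 1) appears twice.
dup = pd.DataFrame({
    "year":  [2004, 2004, 2004],
    "month": [1, 1, 1],
    "item":  ["item 1", "item 1", "item 2"],
    "value": [10, 5, 210],
})

# groupby sums the repeated rows before reshaping.
summed = dup.groupby(["year", "month", "item"])["value"].sum().unstack("item")
print(summed.loc[(2004, 1), "item 1"])  # 15

# set_index + unstack cannot decide which duplicate to keep and raises.
try:
    dup.set_index(["year", "month", "item"]).unstack(level=-1)
    unstack_failed = False
except ValueError:
    unstack_failed = True
print(unstack_failed)  # True
```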
Answer by moshevi
Thanks to gmoutso's comment, you can use this:
def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df
Usage:
df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
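For comparison, and as far as I know, pandas 1.1 and later let DataFrame.pivot take a list for index directly, which may make a helper like this unnecessary on current versions. The sketch below assumes such a version and uses made-up values rather than the data from the question.

```python
import pandas as pd

# Illustrative frame with unique (year, month, item) combinations.
df = pd.DataFrame({
    "year":  [2004, 2004, 2004, 2004],
    "month": [1, 2, 1, 2],
    "item":  ["item 1", "item 1", "item 2", "item 2"],
    "value": [10, 20, 210, 220],
})

# Assumes pandas >= 1.1, where `index` may be a list of column names.
wide = df.pivot(index=["year", "month"], columns="item", values="value")

print(list(wide.index.names))          # ['year', 'month']
print(wide.loc[(2004, 2), "item 1"])   # 20
```

Like the helper, plain pivot does not aggregate, so it still expects each index/column combination to be unique.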
If you want a simple flat column structure, with columns of their intended type, simply add this:
(df
    .infer_objects()             # coerce to the intended column type
    .rename_axis(None, axis=1))  # flatten column headers
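To make the effect of the flattening step concrete, here is a small sketch on illustrative data (not the values from the question). The reset_index call is an addition of mine, for the case where year and month should become ordinary columns again.

```python
import pandas as pd

df = pd.DataFrame({
    "year":  [2004, 2004, 2004, 2004],
    "month": [1, 2, 1, 2],
    "item":  ["item 1", "item 1", "item 2", "item 2"],
    "value": [10, 20, 210, 220],
})

wide = df.pivot_table(values="value", index=["year", "month"],
                      columns="item", aggfunc="sum")
print(wide.columns.name)  # item

flat = (wide
        .rename_axis(None, axis=1)  # drop the 'item' label on the columns axis
        .reset_index())             # turn year/month back into regular columns

print(list(flat.columns))  # ['year', 'month', 'item 1', 'item 2']
```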