Disclaimer: this page is a Chinese-English side-by-side translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/35414625/

Time: 2020-08-19 16:25:05  Source: igfitidea

pandas: how to run a pivot with a multi-index?

Tags: python, pandas, pivot, multi-index

Asked by Pythonista anonymous

I would like to run a pivot on a pandas DataFrame, with the index being two columns, not one. For example, one field for the year, one for the month, an 'item' field which shows 'item 1' and 'item 2' and a 'value' field with numerical values. I want the index to be year + month.

The only way I managed to get this to work was to combine the two fields into one, then separate them again. Is there a better way?

Minimal code copied below. Thanks a lot!

PS Yes, I am aware there are other questions with the keywords 'pivot' and 'multi-index', but I did not understand if/how they can help me with this question.

import pandas as pd
import numpy as np

df= pd.DataFrame()
month = np.arange(1, 13)
values1 = np.random.randint(0, 100, 12)
values2 = np.random.randint(200, 300, 12)


df['month'] = np.hstack((month, month))
df['year'] = 2004
df['value'] = np.hstack((values1, values2))
df['item'] = np.hstack((np.repeat('item 1', 12), np.repeat('item 2', 12)))

# This doesn't work: 
# ValueError: Wrong number of items passed 24, placement implies 2
# mypiv = df.pivot(['year', 'month'], 'item', 'value')

# This doesn't work, either:
# df.set_index(['year', 'month'], inplace=True)
# ValueError: cannot label index with a null key
# mypiv = df.pivot(columns='item', values='value')

# This below works but is not ideal: 
# I have to first concatenate then separate the fields I need
df['new field'] = df['year'] * 100 + df['month']

mypiv = df.pivot(index='new field', columns='item', values='value').reset_index()
mypiv['year'] = mypiv['new field'] // 100  # integer division keeps the year an int
mypiv['month'] = mypiv['new field'] % 100

Accepted answer by Alexander

You can group and then unstack.

>>> df.groupby(['year', 'month', 'item'])['value'].sum().unstack('item')
item        item 1  item 2
year month                
2004 1          33     250
     2          44     224
     3          41     268
     4          29     232
     5          57     252
     6          61     255
     7          28     254
     8          15     229
     9          29     258
     10         49     207
     11         36     254
     12         23     209

Or use pivot_table:

>>> df.pivot_table(
...     values='value',
...     index=['year', 'month'],
...     columns='item',
...     aggfunc='sum')
item        item 1  item 2
year month                
2004 1          33     250
     2          44     224
     3          41     268
     4          29     232
     5          57     252
     6          61     255
     7          28     254
     8          15     229
     9          29     258
     10         49     207
     11         36     254
     12         23     209
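
The two approaches above should produce the same table. The sketch below checks this on a rebuilt copy of the question's data; the fixed random seed and the variable names `by_groupby` and `by_pivot_table` are assumptions added here for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the question's frame. The seed is an assumption added for
# reproducibility; the original data is unseeded random.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": 2004,
    "month": np.tile(np.arange(1, 13), 2),
    "item": np.repeat(["item 1", "item 2"], 12),
    "value": np.concatenate([rng.integers(0, 100, 12),
                             rng.integers(200, 300, 12)]),
})

# Approach 1: group on all three keys, aggregate, then move 'item' to columns.
by_groupby = (df.groupby(["year", "month", "item"])["value"]
                .sum()
                .unstack("item"))

# Approach 2: let pivot_table do the grouping and reshaping in one call.
by_pivot_table = df.pivot_table(values="value",
                                index=["year", "month"],
                                columns="item",
                                aggfunc="sum")

# Compare the cell values of the two results.
print((by_groupby.values == by_pivot_table.values).all())
```

Both variants aggregate, so they also tolerate repeated (year, month, item) combinations, which a plain pivot does not.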

Answer by Ajean

I believe if you include 'item' in your MultiIndex, then you can just unstack:

df.set_index(['year', 'month', 'item']).unstack(level=-1)

This yields:

                value      
item       item 1 item 2
year month              
2004 1         21    277
     2         43    244
     3         12    262
     4         80    201
     5         22    287
     6         52    284
     7         90    249
     8         14    229
     9         52    205
     10        76    207
     11        88    259
     12        90    200
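
Note that the output above keeps 'value' as an outer column level. A minimal sketch of how one might avoid that, using a three-month subset of the values shown above (the names `indexed`, `wide_nested`, `wide_flat` and `dup` are illustrative only). It also shows a caveat: unstack does not aggregate, so duplicated index entries raise an error.

```python
import numpy as np
import pandas as pd

# Small frame using the first three months of the output shown above.
df = pd.DataFrame({
    "year": 2004,
    "month": np.tile(np.arange(1, 4), 2),
    "item": np.repeat(["item 1", "item 2"], 3),
    "value": [21, 43, 12, 277, 244, 262],
})

indexed = df.set_index(["year", "month", "item"])

# Unstacking the whole frame keeps 'value' as an outer column level.
wide_nested = indexed.unstack(level=-1)
print(wide_nested.columns.nlevels)  # 2

# Selecting the column first yields single-level columns.
wide_flat = indexed["value"].unstack(level=-1)
print(list(wide_flat.columns))  # ['item 1', 'item 2']

# Caveat: with a duplicated (year, month, item) entry, unstack cannot
# reshape and raises ValueError; groupby/pivot_table would aggregate instead.
dup = pd.concat([df, df.iloc[[0]]]).set_index(["year", "month", "item"])
try:
    dup["value"].unstack(level=-1)
except ValueError as exc:
    print("unstack failed:", exc)
```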

It's a bit faster than using pivot_table, and about the same speed or slightly slower than using groupby.
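
Relative speed depends on the pandas version and the data size, so it may be worth measuring on your own data. A rough sketch using the standard-library timeit module; the seed, the repeat counts, and the `candidates` and `timings` names are assumptions made here, and no particular ranking is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the question's frame (the seed is an assumption for reproducibility).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": 2004,
    "month": np.tile(np.arange(1, 13), 2),
    "item": np.repeat(["item 1", "item 2"], 12),
    "value": np.concatenate([rng.integers(0, 100, 12),
                             rng.integers(200, 300, 12)]),
})
keys = ["year", "month", "item"]

# The three approaches discussed in the answers, wrapped as callables.
candidates = {
    "set_index + unstack": lambda: df.set_index(keys)["value"].unstack("item"),
    "groupby + unstack": lambda: df.groupby(keys)["value"].sum().unstack("item"),
    "pivot_table": lambda: df.pivot_table(values="value",
                                          index=["year", "month"],
                                          columns="item",
                                          aggfunc="sum"),
}

# Best of 3 runs, 20 calls per run, for each approach.
calls = 20
timings = {name: min(timeit.repeat(fn, number=calls, repeat=3))
           for name, fn in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds / calls * 1e3:.3f} ms per call")
```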

Answer by moshevi

Thanks to gmoutso's comment, you can use this:

def multiindex_pivot(df, index=None, columns=None, values=None):
    if index is None:
        names = list(df.index.names)
        df = df.reset_index()
    else:
        names = index
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index] # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    return df

Usage:

df.pipe(multiindex_pivot, index=['idx_column1', 'idx_column2'], columns='foo', values='bar')
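
On newer pandas releases a helper like this may not be needed. As far as I know, DataFrame.pivot has accepted a list of column names for index since pandas 1.1, so the call that failed in the question can work directly when written with keyword arguments. A minimal sketch, assuming pandas 1.1 or later and a small made-up subset of the question's data:

```python
import numpy as np
import pandas as pd

# Three months of illustrative data in the question's layout.
df = pd.DataFrame({
    "year": 2004,
    "month": np.tile(np.arange(1, 4), 2),
    "item": np.repeat(["item 1", "item 2"], 3),
    "value": [33, 44, 41, 250, 224, 268],
})

# Assumes pandas >= 1.1, where `index` may be a list of column names.
# Keyword arguments are used because newer versions require them.
mypiv = df.pivot(index=["year", "month"], columns="item", values="value")

print(mypiv.index.names)
print(mypiv)
```

Unlike pivot_table, pivot does not aggregate, so it raises an error if a (year, month, item) combination appears more than once.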

If you want a simple flat column structure, with each column of its intended type, add this:

(df
   .infer_objects()  # coerce to the intended column type
   .rename_axis(None, axis=1))  # flatten column headers
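
To make the effect of that chain concrete, here is a small sketch on a pivoted frame. The data is made up, and the final reset_index() call is an addition of mine (not in the answer) that turns the year/month index levels back into ordinary columns.

```python
import numpy as np
import pandas as pd

# Three months of illustrative data in the question's layout.
df = pd.DataFrame({
    "year": 2004,
    "month": np.tile(np.arange(1, 4), 2),
    "item": np.repeat(["item 1", "item 2"], 3),
    "value": [33, 44, 41, 250, 224, 268],
})

wide = df.pivot_table(values="value", index=["year", "month"],
                      columns="item", aggfunc="sum")
print(wide.columns.name)  # item

flat = (wide
        .infer_objects()            # coerce object columns to better dtypes
        .rename_axis(None, axis=1)  # drop the 'item' label on the column axis
        .reset_index())             # assumption: also flatten the row index
print(flat.columns.name)   # None
print(list(flat.columns))  # ['year', 'month', 'item 1', 'item 2']
```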