Python: Flatten DataFrame with multi-index columns

Notice: This page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/27576795/

Date: 2020-08-19 01:57:53  Source: igfitidea

Flatten DataFrame with multi-index columns

python pandas pivot-table

Asked by orange

I'd like to convert a Pandas DataFrame that is derived from a pivot table into a row representation as shown below.

This is where I'm at:

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'goods': ['a', 'a', 'b', 'b', 'b'],
    'stock': [5, 10, 30, 40, 10],
    'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
    'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06', '2014-02-09', '2014-03-09'])
})
# we don't care about year in this example
df['month'] = df['date'].map(lambda x: x.month)
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
print(piv)

which results in

stock            
goods        a       b    
category    c1  c2  c1  c2
month                     
1            5   0  30   0
2            5  10  30  40
3            5  10  10  40

And this is where I want to get to.

goods category month stock
    a       c1     1     5
    a       c1     2     0
    a       c1     3     0
    a       c2     1     0
    a       c2     2    10
    a       c2     3     0
    b       c1     1    30
    b       c1     2     0
    b       c1     3    10
    b       c2     1     0
    b       c2     2    40
    b       c2     3     0

Previously, I used

piv = piv.stack()
piv = piv.reset_index()
print(piv)

to get rid of the multi-index, but because I now pivot on two columns (["goods", "category"]), this leaves one column level in place:

      month category stock    
goods                    a   b
0         1       c1     5  30
1         1       c2     0   0
2         2       c1     5  30
3         2       c2    10  40
4         3       c1     5  10
5         3       c2    10  40

Does anyone know how I can get rid of the multi-index in the column and get the result into a DataFrame of the exemplified format?

Accepted answer by behzad.nouri

>>> piv.unstack().reset_index().drop('level_0', axis=1)
   goods category  month   0
0      a       c1      1   5
1      a       c1      2   5
2      a       c1      3   5
3      a       c2      1   0
4      a       c2      2  10
5      a       c2      3  10
6      b       c1      1  30
7      b       c1      2  30
8      b       c1      3  10
9      b       c2      1   0
10     b       c2      2  40
11     b       c2      3  40

Then all you need is to change the last column name from 0 to stock.
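
As a minimal sketch of that rename step, assuming piv looks like the frame printed in the question (it is rebuilt by hand here so the snippet runs on its own, and the name flat is only illustrative):

```python
import pandas as pd

# Rebuild the pivoted frame from the question by hand
# (values copied from its printout; this is an assumption, not the original pipeline).
columns = pd.MultiIndex.from_product(
    [["stock"], ["a", "b"], ["c1", "c2"]], names=[None, "goods", "category"]
)
piv = pd.DataFrame(
    [[5, 0, 30, 0], [5, 10, 30, 40], [5, 10, 10, 40]],
    index=pd.Index([1, 2, 3], name="month"),
    columns=columns,
)

# unstack() turns the frame into a Series indexed by (column levels..., month);
# reset_index() then exposes every index level as an ordinary column.
flat = (
    piv.unstack()
    .reset_index()
    .drop("level_0", axis=1)        # unnamed top level that only holds 'stock'
    .rename(columns={0: "stock"})   # the value column is named 0 by default
)
print(flat)
```

The rename could also be done by assigning to flat.columns; rename(columns=...) is used here only because it chains cleanly.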

Answer by Andy Hayden

It seems to me that melt (aka unpivot) is very close to what you want to do:

In [11]: pd.melt(piv)
Out[11]:
      NaN goods category  value
0   stock     a       c1      5
1   stock     a       c1      5
2   stock     a       c1      5
3   stock     a       c2      0
4   stock     a       c2     10
5   stock     a       c2     10
6   stock     b       c1     30
7   stock     b       c1     30
8   stock     b       c1     10
9   stock     b       c2      0
10  stock     b       c2     40
11  stock     b       c2     40

There's a rogue column (stock) that appears here because that column header level is constant in piv. If we drop it first, the melt works out of the box:

In [12]: piv.columns = piv.columns.droplevel(0)

In [13]: pd.melt(piv)
Out[13]:
   goods category  value
0      a       c1      5
1      a       c1      5
2      a       c1      5
3      a       c2      0
4      a       c2     10
5      a       c2     10
6      b       c1     30
7      b       c1     30
8      b       c1     10
9      b       c2      0
10     b       c2     40
11     b       c2     40


Edit: The above actually drops the index; you need to make it a column with reset_index:

In [21]: pd.melt(piv.reset_index(), id_vars=['month'], value_name='stock')
Out[21]:
    month goods category  stock
0       1     a       c1      5
1       2     a       c1      5
2       3     a       c1      5
3       1     a       c2      0
4       2     a       c2     10
5       3     a       c2     10
6       1     b       c1     30
7       2     b       c1     30
8       3     b       c1     10
9       1     b       c2      0
10      2     b       c2     40
11      3     b       c2     40
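
A variation on the same idea, sketched under the assumption that pandas 1.1 or later is available: melt accepts ignore_index=False, which keeps the month index through the melt so it can be reset afterwards. The frame below is rebuilt by hand from the printed piv (with the stock level already dropped), and the name long_df is only illustrative.

```python
import pandas as pd

# piv after droplevel(0), rebuilt by hand from the question's printout.
columns = pd.MultiIndex.from_product(
    [["a", "b"], ["c1", "c2"]], names=["goods", "category"]
)
piv = pd.DataFrame(
    [[5, 0, 30, 0], [5, 10, 30, 40], [5, 10, 10, 40]],
    index=pd.Index([1, 2, 3], name="month"),
    columns=columns,
)

# ignore_index=False keeps 'month' as the (repeated) index of the melted frame,
# so a plain reset_index() afterwards turns it into a column.
long_df = pd.melt(piv, value_name="stock", ignore_index=False).reset_index()

# Reorder the columns to match the layout asked for in the question.
long_df = long_df[["goods", "category", "month", "stock"]]
print(long_df)
```

This avoids passing id_vars against MultiIndex columns, which may be convenient if that combination behaves differently across pandas versions.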

Answer by Philipe Riskalla Leal

I know that the question has already been answered, but for my dataset's multi-index column problem, the provided solution was inefficient. So here I am posting another solution for unpivoting multi-index columns using pandas.

Here is the problem I had:

enter image description here

As one can see, the DataFrame has a three-level row index and two levels of multi-index columns.

The desired dataframe format was:

enter image description here

When I tried the options given above, pd.melt did not let me use more than one column in the var_name argument. Therefore, every time I tried a melt, I ended up losing some attribute from my table.

The solution I found was to apply stack twice to my DataFrame.

Before the code, it is worth noting that the desired name for the value column of my unpivoted table was "Populacao residente em domicilios particulares ocupados" (see the code below). All value entries should therefore be stacked into this newly created column.

Here is a code snippet:

import pandas as pd

# reading my table

# read_excel has no sep or encoding parameters, and squeeze was removed in pandas 2.0
df = pd.read_excel(r'my_table.xls', header=[2, 3],
                   index_col=[0, 1, 2], na_values=['-', ' ', '*']).fillna(0)

df.index.names = ['COD_MUNIC_7', 'NOME_MUN', 'TIPO']
df.columns.names = ['sexo', 'faixa_etaria']


df.head()


# making the stacking:

df = pd.DataFrame(pd.Series(df.stack(level=0).stack(), name='Populacao residente em domicilios particulares ocupados')).reset_index()


df.head()

Another solution I found was to first apply a stacking function over the dataframe and then apply the melt.

Here is the alternative code:

df = df.stack('faixa_etaria').reset_index().melt(id_vars=['COD_MUNIC_7', 'NOME_MUN','TIPO', 'faixa_etaria'],
                  value_vars=['Homens', 'Mulheres'],
                  value_name='Populacao residente em domicilios particulares ocupados', 
                  var_name='sexo')

df.head()
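
To check that the double stack and the stack-then-melt variants yield the same rows, here is a small sketch on invented data. The municipality codes and names, the age bands, the counts, and the short column name populacao are all made up for illustration; only the index and column level names come from the answer.

```python
import pandas as pd

VALUE = "populacao"  # hypothetical short name for the value column

# Invented table: three index levels, two column levels (sexo, faixa_etaria).
index = pd.MultiIndex.from_tuples(
    [(1, "Mun A", "rural"), (1, "Mun A", "urbana"), (2, "Mun B", "urbana")],
    names=["COD_MUNIC_7", "NOME_MUN", "TIPO"],
)
columns = pd.MultiIndex.from_product(
    [["Homens", "Mulheres"], ["0-4", "5-9"]], names=["sexo", "faixa_etaria"]
)
df = pd.DataFrame(
    [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]],
    index=index,
    columns=columns,
)

# Approach 1: stack both column levels, one after the other.
double = df.stack(level=0).stack().rename(VALUE).reset_index()

# Approach 2: stack one level, then melt the remaining one.
id_cols = ["COD_MUNIC_7", "NOME_MUN", "TIPO", "faixa_etaria"]
via_melt = df.stack("faixa_etaria").reset_index().melt(
    id_vars=id_cols,
    value_vars=["Homens", "Mulheres"],
    var_name="sexo",
    value_name=VALUE,
)

# Row order can differ between the two, so sort on the key columns before comparing.
keys = id_cols + ["sexo"]
a = double.sort_values(keys).reset_index(drop=True)[keys + [VALUE]]
b = via_melt.sort_values(keys).reset_index(drop=True)[keys + [VALUE]]
print(a)
print(b)
```

On pandas 2.1 and later, stack may emit a FutureWarning about its new implementation; with complete data like this, the rows produced should be the same either way.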

Sincerely yours,

Philipe Riskalla Leal
