Unstacking data with Pandas

Note: this page reproduces a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/20847508/

Tags: python, pandas

Asked by JD Long

I have some data that I'm converting from 'long' to 'wide'. I have no problem using unstack to make the data wide, but I end up with what looks like an index that I can't get rid of. Here's a dummy example:

## set up some dummy data
import pandas as pd
d = {'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
     'year': [1, 1, 1, 1, 2, 2, 2, 2],
     'description': ['thing1', 'thing1', 'thing1', 'thing2', 'thing2', 'thing2', 'thing1', 'thing2'],
     'value': [1., 2., 3., 4., 1., 2., 3., 4.]}
df = pd.DataFrame(d)

## now that we have dummy data, do the long-to-wide conversion
dfGrouped = df.groupby(['state', 'year', 'description']).value.sum()
dfUnstacked = dfGrouped.unstack('description')
print(dfUnstacked)


description  thing1  thing2
state year                 
a     1           4     NaN
      2           3       1
b     1           2       4
      2         NaN       6

So that looks like what I would expect. Now I'd like an unindexed data frame with columns 'state', 'year', 'thing1', 'thing2'. So it seems I should do thus:

dfUnstackedNoIndex = dfUnstacked.reset_index()
print(dfUnstackedNoIndex)

description state  year  thing1  thing2
0               a     1       4     NaN
1               a     2       3       1
2               b     1       2       4
3               b     2     NaN       6

Ok, that's close. But I don't want description carried forward. So let's select out only the columns I want:

print(dfUnstackedNoIndex[['state', 'year', 'thing1', 'thing2']])

description state  year  thing1  thing2
0               a     1       4     NaN
1               a     2       3       1
2               b     1       2       4
3               b     2     NaN       6

So what's up with 'description'? Why does it hang around even though I reset the index and selected only a few columns? Clearly I'm not grokking something.

FWIW, my Pandas version is 0.12

Accepted answer by unutbu

description is the name of the columns (that is, the value of dfUnstackedNoIndex.columns.name). You can get rid of it like this:

In [74]: dfUnstackedNoIndex.columns.name = None

In [75]: dfUnstackedNoIndex
Out[75]: 
  state  year  thing1  thing2
0     a     1       4     NaN
1     a     2       3       1
2     b     1       2       4
3     b     2     NaN       6
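
As a minimal sketch of the same fix end to end (assuming a recent pandas on Python 3, not the 0.12 release from the question), the snippet below rebuilds the example and clears the columns name. The rename_axis call is an alternative added here for illustration; it returns a new frame instead of modifying one in place. The names wide and flat are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})

# Long to wide, then move 'state' and 'year' back into ordinary columns.
wide = (df.groupby(['state', 'year', 'description'])['value']
          .sum()
          .unstack('description')
          .reset_index())
leftover = wide.columns.name
print(leftover)            # description

# Alternative (an assumption, not from the answer): rename_axis returns
# a new frame whose columns have no name; 'wide' itself is left untouched.
flat = wide.rename_axis(None, axis=1)
print(flat.columns.name)   # None
print(list(flat.columns))  # ['state', 'year', 'thing1', 'thing2']

# The answer's approach: clear the name in place.
wide.columns.name = None
print(wide.columns.name)   # None
```

Either way the data is unchanged; only the label attached to the columns axis goes away.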


The purpose of column names perhaps becomes clearer when you look at what happens when you unstack twice:

In [107]: dfUnstacked2 = dfUnstacked.unstack('state')
In [108]: dfUnstacked2
Out[108]: 
description  thing1      thing2   
state             a   b       a  b
year                              
1                 4   2     NaN  4
2                 3 NaN       1  6

Now dfUnstacked2.columns is a MultiIndex. Each level has a name, which corresponds to the name of the index level that was converted into a column level.

In [111]: dfUnstacked2.columns
Out[111]: 
MultiIndex(levels=[[u'thing1', u'thing2'], [u'a', u'b']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'description', u'state'])
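
The repr above comes from an old pandas release, and newer versions print a MultiIndex differently. As a hedged sketch (assuming a recent pandas on Python 3; wide2 is an illustrative name), the level names can be read from attributes rather than from the printed form:

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})
grouped = df.groupby(['state', 'year', 'description'])['value'].sum()

# Unstack two index levels one after the other, as in the answer.
wide2 = grouped.unstack('description').unstack('state')

print(wide2.columns.nlevels)      # 2
print(list(wide2.columns.names))  # ['description', 'state']
print(list(wide2.index.names))    # ['year']

# Each column level keeps the name of the index level it came from,
# so it can be addressed by that name.
state_labels = list(wide2.columns.get_level_values('state'))
print(state_labels)
```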

Column names and index names show up in the same place in the string representation of a DataFrame, so it can be hard to tell which is which. You can figure it out by inspecting df.index.names and df.columns.names.
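
For instance, a short check along these lines (a sketch assuming a recent pandas on Python 3; wide and after_reset are illustrative names) shows which label belongs to which axis, before and after reset_index:

```python
import pandas as pd

df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})
wide = (df.groupby(['state', 'year', 'description'])['value']
          .sum()
          .unstack('description'))

# Row axis: the levels that stayed in the index.
print(list(wide.index.names))    # ['state', 'year']
# Column axis: the level that unstack moved into the columns.
print(list(wide.columns.names))  # ['description']

# reset_index only touches the row axis, so the columns name survives.
after_reset = wide.reset_index()
print(list(after_reset.index.names))    # [None]
print(list(after_reset.columns.names))  # ['description']
```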
