Python: group by index + column in pandas

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me). Original StackOverflow URL: http://stackoverflow.com/questions/30925079/

Time: 2020-08-19 09:10:45  Source: igfitidea

Group by index + column in pandas

python pandas

Asked by vumaasha

I have a dataframe that has the columns

  1. user_id
  2. item_bought

Here user_id is the index of the df. I want to group by both user_id and item_bought and get the item-wise count for each user. How do I do that?

Thanks

Answered by howMuchCheeseIsTooMuchCheese

import pandas as pd
import numpy as np

In [11]: df = pd.DataFrame()

In [12]: df['user_id'] = ['b','b','b','c']

In [13]: df['item_bought'] = ['x','x','y','y']

In [14]: df['ct'] = 1

In [15]: df
Out[15]:
  user_id item_bought  ct
0       b           x   1
1       b           x   1
2       b           y   1
3       c           y   1

In [16]: pd.pivot_table(df, values='ct', index=['user_id','item_bought'], aggfunc=np.sum)
Out[16]:
user_id  item_bought
b        x              2
         y              1
c        y              1
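
As a possible alternative to the pivot_table approach above, the same counts can be obtained with groupby and size(), which avoids the helper 'ct' column. This is a minimal sketch using the same toy data as the answer; it is a different technique from the one the answer shows, not a restatement of it.

```python
import pandas as pd

# Same toy data as in the answer above, without the helper 'ct' column.
df = pd.DataFrame({
    'user_id': ['b', 'b', 'b', 'c'],
    'item_bought': ['x', 'x', 'y', 'y'],
})

# size() counts the rows in each (user_id, item_bought) group.
counts = df.groupby(['user_id', 'item_bought']).size()
print(counts)
```

Here user_id is an ordinary column, as in the answer. If it is the index, as in the question, it has to be referenced as an index level or turned into a column first.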

Answered by kekert

This should work:

>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df

           col1  col2
ind1 ind2            
A    0        3     2
     1        2     0
     2        2     3
B    3        2     4
C    4        3     1
     5        0     0


>>> df.groupby([df.index.get_level_values(0),'col1']).count()

           col2
ind1 col1      
A    2        2
     3        1
B    2        1
C    0        1
     3        1

I had the same problem when grouping by one of the levels of a MultiIndex. With a MultiIndex, you cannot use df.index.levels[0], since it holds only the distinct values of that index level and will most likely differ in length from the whole dataframe.

See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html for get_level_values: "Return vector of label values for requested level, equal to the length of the index"
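
To make the length difference concrete, here is a small sketch. It rebuilds the dataframe from the values printed in the output above (the original used random numbers, so these fixed values are an assumption made for reproducibility) and compares the two ways of reading the first index level.

```python
import pandas as pd

# Fixed values copied from the printed example above (originally random).
index = pd.MultiIndex.from_arrays(
    [list('AAABCC'), list(range(6))], names=['ind1', 'ind2'])
df = pd.DataFrame({'col1': [3, 2, 2, 2, 3, 0],
                   'col2': [2, 0, 3, 4, 1, 0]}, index=index)

distinct_labels = df.index.levels[0]        # one entry per distinct label
row_labels = df.index.get_level_values(0)   # one entry per row

print(len(distinct_labels), len(row_labels), len(df))

# Only the row-aligned labels can be used as a grouping key.
counts = df.groupby([row_labels, 'col1']).count()
print(counts)
```

Because row_labels has one label per row, groupby can align it with the 'col1' column; distinct_labels is shorter than the dataframe and cannot be aligned that way.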

Answered by jezrael

From version 0.20.1 it is simpler:

Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
                   'B': np.arange(8)}, index=index)

print (df)

              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

print (df.groupby(['second', 'A']).sum())
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
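
Applied to the original question, this might look like the sketch below. The data is hypothetical, shaped like the question (user_id as the index, item_bought as a column), and the use of size() for counting is an assumption about what "item-wise count" means. It relies on pandas 0.20.1 or later.

```python
import pandas as pd

# Hypothetical data shaped like the question: user_id is the index.
df = pd.DataFrame(
    {'item_bought': ['x', 'x', 'y', 'y']},
    index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'),
)

# 'user_id' is resolved as an index level name, 'item_bought' as a column.
counts = df.groupby(['user_id', 'item_bought']).size()
print(counts)
```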

Answered by Burgertron

I had the same problem: I imported a bunch of data and wanted to group by a field that was the index. I didn't have a multi-index or any of that jazz, and neither do you.

I figured the problem was that the field I wanted is the index, so at first I just reset the index, but that gave me a useless index field that I didn't want. So now I do the following (two levels of grouping):

grouped = df.reset_index().groupby(by=['Field1','Field2'])

Then I can use 'grouped' in a bunch of ways for different reports:

grouped[['Field3','Field4']].agg([np.mean, np.std])

(which was what I wanted, giving me the Field3 and Field4 means and standard deviations, grouped by Field1 (the index) and Field2)
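
A self-contained sketch of this pattern follows. The data is made up for illustration; the field names come from the answer, and the string names 'mean' and 'std' are used in place of the NumPy functions.

```python
import pandas as pd

# Hypothetical data: Field1 is the index; Field2..Field4 are columns.
df = pd.DataFrame(
    {'Field2': ['p', 'p', 'q', 'q'],
     'Field3': [1.0, 3.0, 5.0, 7.0],
     'Field4': [10.0, 20.0, 30.0, 50.0]},
    index=pd.Index(['a', 'a', 'b', 'b'], name='Field1'),
)

# reset_index() turns Field1 into a regular column so it can be a group key.
grouped = df.reset_index().groupby(by=['Field1', 'Field2'])

# 'mean' and 'std' are pandas' built-in aggregation names.
report = grouped[['Field3', 'Field4']].agg(['mean', 'std'])
print(report)
```

The result has one row per (Field1, Field2) pair and a two-level column index such as ('Field3', 'mean').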

For you, if you just want the count of items per user, in one simple line using groupby, the code could be:

df.reset_index().groupby(by=['user_id']).count()

If you want to do more, you can (like me) create 'grouped' and then reuse it. As a beginner, I find it easier to follow that way.

Please note that reset_index is not 'in place' here, so it will not mess up your original dataframe.
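
A quick way to check this is sketched below with hypothetical data shaped like the question: reset_index() returns a new dataframe, and the original keeps user_id as its index.

```python
import pandas as pd

# Hypothetical data shaped like the question: user_id is the index.
df = pd.DataFrame(
    {'item_bought': ['x', 'x', 'y', 'y']},
    index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'),
)

flat = df.reset_index()   # returns a new dataframe; df itself is unchanged

print(list(flat.columns))   # user_id is now a column of the copy
print(df.index.name)        # the original still has user_id as its index
print(list(df.columns))

# Count of rows per user, computed on the flattened copy.
per_user = flat.groupby(by=['user_id']).count()
print(per_user)
```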