Python: group by index + column in pandas

Question

Asked by vumaasha

I have a dataframe that has the columns

user_id
item_bought

Here user_id is the index of the df. I want to group by both user_id and item_bought and get the item-wise count for each user. How do I do that?

Thanks

Answer 1

Answered by howMuchCheeseIsTooMuchCheese

import pandas as pd

import numpy as np

In [11]:

df = pd.DataFrame()

In [12]:

df['user_id'] = ['b','b','b','c']

In [13]:

df['item_bought'] = ['x','x','y','y']

In [14]:

df['ct'] = 1

In [15]:

df

Out[15]:
  user_id item_bought  ct
0       b           x   1
1       b           x   1
2       b           y   1
3       c           y   1
In [16]:

pd.pivot_table(df,values='ct',index=['user_id','item_bought'],aggfunc=np.sum)

Out[16]:

user_id  item_bought
b        x              2
         y              1
c        y              1
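
The helper column ct is not strictly required for a plain count. A minimal sketch, assuming the same toy data as above, that counts rows per (user_id, item_bought) pair with groupby and size():

```python
import pandas as pd

# Same toy data as in the answer above, without the constant 'ct' column.
df = pd.DataFrame({
    "user_id": ["b", "b", "b", "c"],
    "item_bought": ["x", "x", "y", "y"],
})

# size() counts the rows in each group, so no helper column is needed.
counts = df.groupby(["user_id", "item_bought"]).size()
print(counts)
```

The result is a Series indexed by (user_id, item_bought), the same shape as the pivot_table output.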

Answer 2

Answered by kekert

This should work:

>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df

           col1  col2
ind1 ind2            
A    0        3     2
     1        2     0
     2        2     3
B    3        2     4
C    4        3     1
     5        0     0


>>> df.groupby([df.index.get_level_values(0),'col1']).count()

           col2
ind1 col1      
A    2        2
     3        1
B    2        1
C    0        1
     3        1

I had the same problem using one of the levels of a MultiIndex. With a MultiIndex you cannot use df.index.levels[0], since it holds only the distinct values of that index level and will most likely differ in length from the whole dataframe.

See http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html: get_level_values will "Return vector of label values for requested level, equal to the length of the index".
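
To make the length difference concrete, here is a small sketch that assumes fixed values for col1 (the answer above uses random ones) and compares the two accessors before grouping:

```python
import pandas as pd

# Fixed values chosen for illustration; the original example used random data.
index = pd.MultiIndex.from_arrays(
    [list("AAABCC"), list(range(6))], names=["ind1", "ind2"]
)
df = pd.DataFrame({"col1": [3, 2, 2, 2, 3, 0]}, index=index)

distinct = df.index.levels[0]           # unique labels of the level only
per_row = df.index.get_level_values(0)  # one label per row of the frame

print(len(distinct), len(per_row), len(df))

# per_row lines up with the rows, so it can be mixed with a column name.
result = df.groupby([per_row, "col1"]).size()
print(result)
```

Because per_row has one entry per row, groupby can align it with the col1 column; distinct has only three entries and cannot serve as a grouping key for six rows.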

Answer 3

Answered by jezrael

From version 0.20.1 it is simpler:

Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names.

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
                   'B': np.arange(8)}, index=index)

print (df)

              A  B
first second      
bar   one     1  0
      two     1  1
baz   one     1  2
      two     1  3
foo   one     2  4
      two     2  5
qux   one     3  6
      two     3  7

print (df.groupby(['second', 'A']).sum())
          B
second A   
one    1  2
       2  4
       3  6
two    1  4
       2  5
       3  7
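
Applied to the situation in the question, the same idea might look like the sketch below. It assumes pandas 0.20.1 or newer, an index named user_id, and invented sample rows; the unstack step is an optional extra for a user-by-item table and is not part of the original answer.

```python
import pandas as pd

# Invented sample data: user_id is the index, item_bought is a column.
df = pd.DataFrame(
    {"item_bought": ["x", "x", "y", "y"]},
    index=pd.Index(["b", "b", "b", "c"], name="user_id"),
)

# 'user_id' refers to the index level name, 'item_bought' to a column name.
counts = df.groupby(["user_id", "item_bought"]).size()
print(counts)

# Optional: one row per user, one column per item, 0 where nothing was bought.
table = counts.unstack(fill_value=0)
print(table)
```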

Answer 4

Answered by Burgertron

I had the same problem: I imported a bunch of data and wanted to group by a field that was the index. I didn't have a multi-index or any of that jazz, and neither do you.

I figured the problem is that the field I want is the index, so at first I just reset the index, but this gave me a useless index field that I didn't want. So now I do the following (two levels of grouping):

grouped = df.reset_index().groupby(by=['Field1','Field2'])

Then I can use 'grouped' in a bunch of ways for different reports:

grouped[['Field3','Field4']].agg([np.mean, np.std])

(which was what I wanted, giving me the Field3 and Field4 averages and standard deviations, grouped by Field1 (the index) and Field2)
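
A runnable sketch of that report, with invented values for the hypothetical Field1 to Field4 names. It passes the aggregations as the strings "mean" and "std" rather than NumPy callables, since recent pandas versions may emit a deprecation warning for the latter; that substitution is an assumption, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: Field1 is the index, Field2..Field4 are ordinary columns.
df = pd.DataFrame(
    {
        "Field2": ["p", "p", "q", "q"],
        "Field3": [1.0, 3.0, 5.0, 7.0],
        "Field4": [10.0, 20.0, 30.0, 50.0],
    },
    index=pd.Index(["a", "a", "a", "a"], name="Field1"),
)

# Move the index into a column, then group on both fields.
grouped = df.reset_index().groupby(by=["Field1", "Field2"])

# One (mean, std) pair of result columns per selected field.
report = grouped[["Field3", "Field4"]].agg(["mean", "std"])
print(report)
```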

For you, if you just want the count of items per user in one simple line using groupby, the code could be:

df.reset_index().groupby(by=['user_id']).count()

If you want to do more things then you can (like me) create 'grouped' and then use that. As a beginner, I find it easier to follow that way.

Please note that reset_index is not 'in place', so it will not mess up your original dataframe.
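
A short sketch, using invented sample rows, that checks this claim and also groups on both user_id and item_bought, which yields the per-user, per-item count the question asks for:

```python
import pandas as pd

# Invented sample data: user_id is the index, item_bought is a column.
df = pd.DataFrame(
    {"item_bought": ["x", "x", "y", "y"]},
    index=pd.Index(["b", "b", "b", "c"], name="user_id"),
)

# reset_index() returns a new frame; user_id becomes an ordinary column there.
flat = df.reset_index()
counts = flat.groupby(by=["user_id", "item_bought"]).size()

print(df.index.name)       # the original frame still has user_id as its index
print(list(flat.columns))  # the copy has user_id as a column
print(counts)
```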
