Pandas groupby 和 value_counts

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/51799818/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 05:55:35  来源:igfitidea点击:

Pandas groupby and value_counts

pythonpandaspandas-groupby

提问by Susensio

I want to count distinct values per column (with pd.value_countsI guess) grouping data by some level in MultiIndex. The multiindex is taken care of with groupby(level=parameter, but applyraises a ValueError

我想计算每列的不同值(pd.value_counts我猜)按 MultiIndex 中的某个级别对数据进行分组。多索引由groupby(level=参数处理,但apply引发了ValueError

Original dataframe:

原始数据框:

>>> df = pd.DataFrame(np.random.choice(list('ABC'), size=(10,5)),
                 columns=['c1','c2','c3','c4','c5'], 
                 index=pd.MultiIndex.from_product([['foo', 'bar'], 
                                                   ['w','y','x','y','z']]))



      c1 c2 c3 c4 c5
foo w  C  C  B  A  A
    y  A  A  C  B  A
    x  A  B  C  C  C
    y  A  B  C  C  C
    z  A  C  B  C  B
bar w  B  C  C  A  C
    y  A  A  C  A  A
    x  A  B  B  B  A
    y  A  A  C  A  B
    z  A  B  B  C  B

What I want:

我想要的是:

       c1  c2  c3  c4  c5
foo A   4   2   0   3   2
    B   1   2   2   1   2
    C   0   1   3   1   1
bar A   4   1   0   1   2
    B   0   2   2   1   1
    C   1   2   3   3   2

I try to do:

我尝试做:

>>> df.groupby(level=0).apply(pd.value_counts)

ValueError: could not broadcast input array from shape (5,5) into shape (5)

I can do it myself manually, but I think it must be a more obvious way.

我可以自己手动完成,但我认为它必须是一种更明显的方式。

groups = [g.apply(pd.value_counts).fillna(0) for n, g in df.groupby(level=0)]
index = df.index.get_level_values(0).unique()
correct_result = pd.concat(groups, keys=index)   # THIS WORKS AS EXPECTED

I mean, this isn't that long to write, but I feel like I'm reinventing the wheel. Aren't this kind of operations done by groupby function?

我的意思是,写这篇文章的时间并不长,但我觉得我正在重新发明轮子。这种操作不是groupby函数完成的吗?

Is there a more straightforward way of doing this, other than doing the split-apply-combine myself?

除了自己进行 split-apply-combine 之外,有没有更直接的方法来做到这一点?

回答by jezrael

Use stackfor MultiIndex Series, then SeriesGroupBy.value_countsand last unstackfor DataFrame:

使用stackfor MultiIndex Series, thenSeriesGroupBy.value_counts和 last unstackfor DataFrame

np.random.seed(123)

df = pd.DataFrame(np.random.choice(list('ABC'), size=(10,5)),
                 columns=['c1','c2','c3','c4','c5'], 
                 index=pd.MultiIndex.from_product([['foo', 'bar'], 
                                                   ['w','y','x','y','z']]))
print (df)
      c1 c2 c3 c4 c5
foo w  C  B  C  C  A
    y  C  C  B  C  B
    x  C  B  A  B  C
    y  B  A  C  A  B
    z  C  B  A  A  A
bar w  A  B  C  A  C
    y  A  A  B  A  B
    x  A  A  A  C  B
    y  B  C  C  C  B
    z  A  A  C  B  A

df1 = df.stack().groupby(level=[0,2]).value_counts().unstack(1, fill_value=0)
print (df1)
       c1  c2  c3  c4  c5
bar A   4   3   1   2   1
    B   1   1   1   1   3
    C   0   1   3   2   1
foo A   0   1   2   2   2
    B   1   3   1   1   2
    C   4   1   2   2   1