Original question: http://stackoverflow.com/questions/51799818/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Pandas groupby and value_counts
Asked by Susensio
I want to count the occurrences of each value per column (with pd.value_counts, I guess), grouping the data by one level of a MultiIndex. The MultiIndex is handled by the level= parameter of groupby, but apply raises a ValueError.
Original dataframe:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.choice(list('ABC'), size=(10, 5)),
...                   columns=['c1', 'c2', 'c3', 'c4', 'c5'],
...                   index=pd.MultiIndex.from_product([['foo', 'bar'],
...                                                     ['w', 'y', 'x', 'y', 'z']]))
>>> df
      c1 c2 c3 c4 c5
foo w  C  C  B  A  A
    y  A  A  C  B  A
    x  A  B  C  C  C
    y  A  B  C  C  C
    z  A  C  B  C  B
bar w  B  C  C  A  C
    y  A  A  C  A  A
    x  A  B  B  B  A
    y  A  A  C  A  B
    z  A  B  B  C  B
What I want:
       c1 c2 c3 c4 c5
foo A   4  1  0  1  2
    B   0  2  2  1  1
    C   1  2  3  3  2
bar A   4  2  0  3  2
    B   1  2  2  1  2
    C   0  1  3  1  1
What I tried:
>>> df.groupby(level=0).apply(pd.value_counts)
ValueError: could not broadcast input array from shape (5,5) into shape (5)
I can do it manually, but I think there must be a more obvious way.
# Key each per-group result by its own group name: groupby sorts the groups
# ('bar' before 'foo'), so a separately built key list can mislabel them.
groups = {n: g.apply(pd.value_counts).fillna(0) for n, g in df.groupby(level=0)}
correct_result = pd.concat(groups)  # THIS WORKS AS EXPECTED
I mean, this isn't that long to write, but I feel like I'm reinventing the wheel. Isn't this the kind of operation that groupby is supposed to handle?
Is there a more straightforward way of doing this, other than doing the split-apply-combine myself?
Answered by jezrael
Use stack to get a Series with a MultiIndex, then SeriesGroupBy.value_counts, and finally unstack to get back a DataFrame:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.choice(list('ABC'), size=(10, 5)),
                  columns=['c1', 'c2', 'c3', 'c4', 'c5'],
                  index=pd.MultiIndex.from_product([['foo', 'bar'],
                                                    ['w', 'y', 'x', 'y', 'z']]))
print(df)
      c1 c2 c3 c4 c5
foo w  C  B  C  C  A
    y  C  C  B  C  B
    x  C  B  A  B  C
    y  B  A  C  A  B
    z  C  B  A  A  A
bar w  A  B  C  A  C
    y  A  A  B  A  B
    x  A  A  A  C  B
    y  B  C  C  C  B
    z  A  A  C  B  A
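# After stack, index level 0 is foo/bar and level 2 holds the column names;
# unstack(1) turns those column names back into columns.
df1 = df.stack().groupby(level=[0, 2]).value_counts().unstack(1, fill_value=0)
print(df1)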
       c1  c2  c3  c4  c5
bar A   4   3   1   2   1
    B   1   1   1   1   3
    C   0   1   3   2   1
foo A   0   1   2   2   2
    B   1   3   1   1   2
    C   4   1   2   2   1
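
To see what each step of the one-liner does, here is a minimal sketch that splits it into its three stages on a tiny hand-made frame. The data and the variable names (stacked, counts, result, manual) are made up for illustration and are not part of the original answer. The last statement builds the same table with an explicit split-apply-combine as a cross-check; it uses pd.Series.value_counts because, as far as I know, newer pandas releases warn that the top-level pd.value_counts is deprecated.

```python
import pandas as pd

# Small hand-made frame (hypothetical data, not the random frame above).
index = pd.MultiIndex.from_product([["foo", "bar"], ["w", "x"]])
df = pd.DataFrame({"c1": ["A", "A", "B", "C"],
                   "c2": ["B", "B", "B", "A"]}, index=index)

# Step 1: stack moves the column labels into a third index level,
# giving one long Series indexed by (outer, inner, column).
stacked = df.stack()

# Step 2: group by the outer level (0) and the column level (2),
# then count each value within every (outer, column) pair.
counts = stacked.groupby(level=[0, 2]).value_counts()

# Step 3: move level 1 of the counts (the column names) back into columns,
# filling (group, value) combinations that never occur with 0.
result = counts.unstack(1, fill_value=0)

# Cross-check: explicit split-apply-combine, keyed by each group's own name.
manual = pd.concat({name: g.apply(pd.Series.value_counts)
                    for name, g in df.groupby(level=0)}).fillna(0).astype(int)

print(result)
print(result.sort_index().to_dict() == manual.sort_index().to_dict())
```

In this toy frame, foo has two A values in c1 and none in c2, so the row (foo, A) holds 2 under c1 and 0 under c2. A value that never occurs anywhere in a group (C in foo here) gets no row at all, because fill_value only fills missing columns for rows that already exist.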