Python: groupby value counts on a pandas DataFrame

Notice: this page reproduces a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/39132742/

Time: 2020-08-19 21:54:46  Source: igfitidea

Groupby value counts on the dataframe pandas

python pandas dataframe crosstab pandas-groupby

Asked by Salvador Dali

I have the following dataframe:

import pandas as pd

df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

I want to group it by id and group and count the occurrences of each term for every (id, group) pair.

So in the end I am going to get something like this:

[image: desired output table]

I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand, and there are about 10 of them.)

It looks like I have to group by and then count values, so I tried df.groupby(['id', 'group']).value_counts(), which does not work because value_counts operates on a groupby series and not on a dataframe.

Is there any way I can achieve this without looping?

Answered by piRSquared

I use groupby and size:

df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
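The one-liner above can be unpacked into its two steps. This is only a sketch on the question's sample data; the intermediate names counts and wide are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# Step 1: one entry per (id, group, term) combination that occurs,
# holding the number of rows in that combination.
counts = df.groupby(['id', 'group', 'term']).size()

# Step 2: move the innermost index level ('term') into columns.
# Combinations that never occur are filled with 0 instead of NaN.
wide = counts.unstack(fill_value=0)

print(counts)
print(wide)
```

Passing fill_value=0 to unstack keeps the result as integers, whereas filling afterwards with fillna(0) would first introduce NaN and turn the columns into floats.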

[image: resulting table]

Timing

[image: timing results]

1,000,000 rows

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),
                       group=np.random.choice(20, 1000000),
                       term=np.random.choice(10, 1000000)))

[image: timing results for 1,000,000 rows]

Answered by MaxU

Using the pivot_table() method:

In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
Out[22]:
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
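The question mentions that the full list of terms is known beforehand. If a column is wanted for every known term, including ones that never occur in the data, the pivoted table can be reindexed against that list. A minimal sketch, assuming a hypothetical vocabulary all_terms that contains an extra 'term4':

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# Hypothetical full vocabulary; 'term4' never occurs in the sample data.
all_terms = ['term1', 'term2', 'term3', 'term4']

# Count rows per (id, group) and term, as in the answer above.
table = df.pivot_table(index=['id', 'group'], columns='term',
                       aggfunc='size', fill_value=0)

# Add a zero-filled column for every known term that is missing.
table = table.reindex(columns=all_terms, fill_value=0)

print(table)
```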

Timing against a 700K-row DataFrame:

In [24]: df = pd.concat([df] * 10**5, ignore_index=True)

In [25]: df.shape
Out[25]: (700000, 3)

In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 226 ms per loop

In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 236 ms per loop

In [5]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 355 ms per loop

In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 232 ms per loop

In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 231 ms per loop

Timing against a 7M-row DataFrame:

In [9]: df = pd.concat([df] * 10, ignore_index=True)

In [10]: df.shape
Out[10]: (7000000, 3)

In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 2.27 s per loop

In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 2.3 s per loop

In [13]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 3.37 s per loop

In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 2.28 s per loop

In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 1.89 s per loop
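Outside IPython, a comparison like the one above can be approximated with the standard-library timeit module. The sketch below uses a much smaller frame (7,000 rows) so that it runs quickly; the candidates and timings dictionaries are illustrative names, and the numbers it prints depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# 7 rows repeated 1,000 times: far smaller than the 700K and 7M rows timed above.
df = pd.concat([base] * 1000, ignore_index=True)

# The three approaches being compared, wrapped as zero-argument callables.
candidates = {
    'groupby/size/unstack': lambda: df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0),
    'pivot_table': lambda: df.pivot_table(index=['id', 'group'], columns='term',
                                          aggfunc='size', fill_value=0),
    'crosstab': lambda: pd.crosstab([df.id, df.group], df.term),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats with 5 calls each, reported as seconds per call.
    timings[name] = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f'{name}: {timings[name] * 1e3:.2f} ms per call')
```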

Answered by A.Kot

Instead of remembering lengthy solutions, how about the one that pandas has built in for you:

df.groupby(['id', 'group', 'term']).size()
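When every column is a grouping key, count() and size() behave differently, which matters for a one-liner like this. count() tallies the non-null values in the columns that are not keys, and with this dataframe no such columns remain. size() counts the rows in each group. A small sketch on the question's sample data (by_count and by_size are illustrative names):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

grouped = df.groupby(['id', 'group', 'term'])

# count() works on the non-key columns; all three columns are keys here,
# so the result has an index but no columns.
by_count = grouped.count()

# size() returns the number of rows per group as a Series.
by_size = grouped.size()

print(by_count.shape)
print(by_size)
```

The Series from size() is in long form; adding .unstack(fill_value=0) turns it into the wide table the question asks for.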

Answered by jezrael

You can use crosstab:

print(pd.crosstab([df.id, df.group], df.term))
term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0

Another solution uses groupby with the size aggregation, then reshapes with unstack:

df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)

term      term1  term2  term3
id group                     
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
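As a sanity check, the approaches can be compared on the sample data. The sketch below also includes a variant that follows the question's original idea: value_counts becomes available once a single column is selected from the groupby. The variable names are illustrative, and the explicit sort_index calls are only a precaution so that all tables share the same row and column order.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame([
    (1, 1, 'term1'),
    (1, 2, 'term2'),
    (1, 1, 'term1'),
    (1, 1, 'term2'),
    (2, 2, 'term3'),
    (2, 3, 'term1'),
    (2, 2, 'term1')
], columns=['id', 'group', 'term'])

# groupby on all three columns, count rows, pivot 'term' into columns.
by_size = df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

# Cross-tabulation of (id, group) against term.
by_crosstab = pd.crosstab([df.id, df.group], df.term)

# value_counts on the selected 'term' column of each (id, group) group.
by_value_counts = (
    df.groupby(['id', 'group'])['term']
      .value_counts()
      .unstack(fill_value=0)
      .sort_index()
      .sort_index(axis=1)
)

# Compare the raw counts of the three tables.
same = (
    by_size.to_numpy().tolist()
    == by_crosstab.to_numpy().tolist()
    == by_value_counts.to_numpy().tolist()
)
print(same)
```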

Timings:

df = pd.concat([df]*10000).reset_index(drop=True)

In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))
100 loops, best of 3: 12.4 ms per loop

In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0))
100 loops, best of 3: 12.2 ms per loop