Groupby value counts on the dataframe pandas

Disclaimer: this page is a translation of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me): StackOverFlow

Original question: http://stackoverflow.com/questions/39132742/
Asked by Salvador Dali
I have the following dataframe:
df = pd.DataFrame([
(1, 1, 'term1'),
(1, 2, 'term2'),
(1, 1, 'term1'),
(1, 1, 'term2'),
(2, 2, 'term3'),
(2, 3, 'term1'),
(2, 2, 'term1')
], columns=['id', 'group', 'term'])
I want to group it by id and group and calculate the number of each term for this (id, group) pair.
So in the end I am going to get something like this:
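(The desired output is not preserved in this copy; reconstructed from the answers below, it looks like this:)

term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0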
I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand and there are ~10 of them.)
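For reference, a minimal sketch of the kind of iterrows() loop described above (a hypothetical reconstruction, assuming the df defined earlier; the variable names are mine, not the asker's):

terms = ['term1', 'term2', 'term3']  # the terms are known beforehand
counts = {}
for _, row in df.iterrows():  # one Python-level iteration per row -- the slow part
    key = (row['id'], row['group'])
    counts.setdefault(key, dict.fromkeys(terms, 0))
    counts[key][row['term']] += 1

result = pd.DataFrame.from_dict(counts, orient='index')
result.index = pd.MultiIndex.from_tuples(result.index, names=['id', 'group'])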
It looks like I have to group by and then count values, so I tried df.groupby(['id', 'group']).value_counts(), which does not work because value_counts operates on the groupby series and not a dataframe.
Is there any way I can achieve this without looping?
Answered by piRSquared
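(The body of this answer is not preserved in this copy; judging from the benchmarks below, which time it repeatedly, it presumably showed the groupby/size/unstack one-liner — a reconstruction, not the author's verbatim text:)

df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)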
Answered by MaxU
Using the pivot_table() method:
In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
Out[22]:
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
Timing against a 700K-row DF:
In [24]: df = pd.concat([df] * 10**5, ignore_index=True)
In [25]: df.shape
Out[25]: (700000, 3)
In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 226 ms per loop
In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 236 ms per loop
In [5]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 355 ms per loop
In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 232 ms per loop
In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 231 ms per loop
Timing against a 7M-row DF:
In [9]: df = pd.concat([df] * 10, ignore_index=True)
In [10]: df.shape
Out[10]: (7000000, 3)
In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
1 loop, best of 3: 2.27 s per loop
In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0)
1 loop, best of 3: 2.3 s per loop
In [13]: %timeit pd.crosstab([df.id, df.group], df.term)
1 loop, best of 3: 3.37 s per loop
In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int)
1 loop, best of 3: 2.28 s per loop
In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)
1 loop, best of 3: 1.89 s per loop
Answered by A.Kot
Instead of remembering lengthy solutions, how about the one that pandas has built in for you:
df.groupby(['id', 'group', 'term']).count()
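A caveat (my note, not part of the original answer): since all three columns serve as grouping keys here, .count() has no remaining columns to count and returns a frame with no value columns; .size() counts the grouped rows directly, and can be unstacked as in the other answers:

df.groupby(['id', 'group', 'term']).size()  # Series of row counts per (id, group, term)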
Answered by jezrael
You can use crosstab:
print (pd.crosstab([df.id, df.group], df.term))
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
Another solution: groupby with aggregating size, reshaped by unstack:
df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0)
term      term1  term2  term3
id group
1  1          2      1      0
   2          0      1      0
2  2          1      0      1
   3          1      0      0
Timings:
df = pd.concat([df]*10000).reset_index(drop=True)
In [48]: %timeit (df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0))
100 loops, best of 3: 12.4 ms per loop
In [49]: %timeit (df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0))
100 loops, best of 3: 12.2 ms per loop