pandas: Enumerate each row for each group in a DataFrame
Original question: http://stackoverflow.com/questions/17228215/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Greg Reda
In pandas, how can I add a new column which enumerates rows based on a given grouping?
For instance, assume the following DataFrame:
import pandas as pd
import numpy as np
a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df
  col_a  col_b
0     A      0
1     B      1
2     C      2
3     A      3
4     A      4
5     C      5
6     B      6
7     B      7
8     A      8
9     C      9
I'd like to add a col_c that gives me the Nth row of the "group" based on a grouping of col_a and sorting of col_b.
Desired output:
  col_a  col_b  col_c
0     A      0      1
3     A      3      2
4     A      4      3
8     A      8      4
1     B      1      1
6     B      6      2
7     B      7      3
2     C      2      1
5     C      5      2
9     C      9      3
I'm struggling to get to col_c. You can get the proper grouping and sorting with .sort_index(by=['col_a', 'col_b']) (written as .sort_values(['col_a', 'col_b']) in current pandas versions); it's now a matter of getting to that new column and labeling each row.
Answered by Andy Hayden
There's cumcount, for precisely this case:
df['col_c'] = df.groupby('col_a').cumcount()
As it says in the docs:
Number each item in each group from 0 to the length of that group - 1.
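To connect cumcount to the desired output in the question, here is a minimal sketch that assumes the question's example DataFrame. It sorts by both columns first (using sort_values, the current replacement for the older sort calls) and adds 1 because cumcount starts at 0.

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Order rows by group, then by col_b within each group.
df = df.sort_values(['col_a', 'col_b'])

# cumcount() numbers rows 0..n-1 within each group; add 1 for 1-based labels.
df['col_c'] = df.groupby('col_a').cumcount() + 1
```

Because cumcount follows the current row order, the sort has to happen before the groupby if the numbering should follow col_b.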
Original answer (written before cumcount existed):
You could create a helper function to do this:
def add_col_c(x):
    x['col_c'] = np.arange(len(x))
    return x
First sort by column col_a:
In [11]: df.sort_values('col_a', inplace=True)  # formerly df.sort, which was removed in pandas 0.20
then apply this function across each group:
In [12]: g = df.groupby('col_a', as_index=False)
In [13]: g.apply(add_col_c)
Out[13]:
  col_a  col_b  col_c
3     A      3      0
8     A      8      1
0     A      0      2
4     A      4      3
6     B      6      0
1     B      1      1
7     B      7      2
9     C      9      0
2     C      2      1
5     C      5      2
In order to get 1, 2, ... you could use np.arange(1, len(x) + 1).
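As a rough sketch of the same helper idea without mutating each group inside apply, the 1-based np.arange can be passed through transform instead. This is a variation on the answer, not the author's code, and it assumes the question's example DataFrame, sorted by col_a and col_b first.

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df = df.sort_values(['col_a', 'col_b'])

# For each group, produce 1..n; transform aligns the result back to df's index.
df['col_c'] = df.groupby('col_a')['col_b'].transform(
    lambda s: np.arange(1, len(s) + 1)
)
```

This still calls a Python function once per group, so it has the same per-group overhead as the apply version.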
Answered by andrew
The given answers both involve calling a Python function for each group; if you have many groups, a vectorized approach should be faster (I haven't checked).
Here is my pure NumPy suggestion:
In [5]: df.sort_values(['col_a', 'col_b'], inplace=True, ascending=[False, False])
In [6]: sizes = df.groupby('col_a', sort=False).size().values
In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)
In [8]: print(df)
  col_a  col_b  col_c
9     C      9      0
5     C      5      1
2     C      2      2
7     B      7      0
6     B      6      1
1     B      1      2
8     A      8      0
4     A      4      1
3     A      3      2
0     A      0      3
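The arithmetic in In [7] may be easier to follow when broken into steps. The sketch below uses the group sizes from the output above (C, B, A in that order); the intermediate names starts, positions and within_group are introduced here for illustration only.

```python
import numpy as np

# Group sizes in the sorted row order shown above: C has 3 rows, B has 3, A has 4.
sizes = np.array([3, 3, 4])

# Position of each group's first row in the sorted frame: [0, 3, 6].
starts = sizes.cumsum() - sizes

# Global row positions 0..9.
positions = np.arange(sizes.sum())

# Subtract each row's group start to get its position within the group.
within_group = positions - np.repeat(starts, sizes)
```

This only works when the rows of each group are contiguous, which is why the frame is sorted by col_a before the sizes are computed.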
Answered by waitingkuo
You could define your own function to deal with that:
In [58]: def func(x):
   ....:     x['col_c'] = x['col_a'].argsort() + 1
   ....:     return x
   ....:
In [59]: df.groupby('col_a').apply(func)
Out[59]:
  col_a  col_b  col_c
0     A      0      1
3     A      3      2
4     A      4      3
8     A      8      4
1     B      1      1
6     B      6      2
7     B      7      3
2     C      2      1
5     C      5      2
9     C      9      3
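If the numbering should follow the values of col_b rather than the current row order, one possible alternative (not taken from the answers above) is a grouped rank. This sketch assumes the question's example DataFrame and that col_b has no missing values, since the ranks are cast to integers.

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Rank col_b within each col_a group; method='first' breaks ties by row order.
df['col_c'] = df.groupby('col_a')['col_b'].rank(method='first').astype(int)
```

No prior sort is needed here; sorting by col_a and col_b afterwards gives the layout shown in the question.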

