Pandas: assign an index to each group identified by groupby

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/41594703/

Date: 2020-09-14 02:46:06  Source: igfitidea

Pandas: assign an index to each group identified by groupby

python pandas

Asked by user2667066

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have

>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
   a  b
0  1  1
1  1  1
2  1  2
3  2  1
4  2  1
5  2  2

How can I get a DataFrame like

   a  b  idx
0  1  1  1
1  1  1  1
2  1  2  2
3  2  1  3
4  2  1  3
5  2  2  4

(the order of the idx indexes doesn't matter)

Accepted answer by JohnE

Here's a concise way using drop_duplicates and merge to get a unique identifier.

group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )

   a  b  index
0  1  1      0
1  1  1      0
2  1  2      2
3  2  1      3
4  2  1      3
5  2  2      5

The identifier in this case goes 0, 2, 3, 5 (just a residue of the original index), but this could easily be changed to 0, 1, 2, 3 with an additional reset_index(drop=True).
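
As a sketch of that note (my addition, not part of the original answer), chaining two resets gives consecutive 0..n-1 labels; the column name idx here is my own choice:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
group_vars = ['a', 'b']

# Deduplicate the grouping columns, renumber the survivors 0..n-1,
# then merge those labels back onto every row.
keys = (df.drop_duplicates(group_vars)
          .reset_index(drop=True)   # forget the original row positions
          .reset_index()            # materialize 0..n-1 as a column
          .rename(columns={'index': 'idx'}))
out = df.merge(keys, on=group_vars)
print(out['idx'].tolist())  # [0, 0, 1, 2, 2, 3]
```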

Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method, as noted in a comment on the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.

Answered by Calum You

Here is the solution using ngroup, from a comment above by Constantino, for those still looking for this function (the equivalent of dplyr::group_indices in R, or egen group() in Stata, if you were trying to search with those keywords like me). This is also about 25% faster than the solution given by maxliving, according to my own timing.

>>> import pandas as pd
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df['idx'] = df.groupby(['a', 'b']).ngroup()
>>> df
   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3

>>> %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
1.83 ms ± 67.2 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['idx'] = df.groupby(['a', 'b']).ngroup()
1.38 ms ± 30 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
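
A side note from me (not in the original answer): ngroup numbers groups in sorted key order by default; passing sort=False to groupby numbers them in order of first appearance instead, which is handy since the question says the order of idx doesn't matter:

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 2, 1, 1], 'b': [2, 1, 2, 1]})

# Default sort=True: group ids follow the sorted key order
print(df.groupby(['a', 'b']).ngroup().tolist())              # [3, 2, 1, 0]

# sort=False: group ids follow the order of first appearance
print(df.groupby(['a', 'b'], sort=False).ngroup().tolist())  # [0, 1, 2, 3]
```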

Answered by foglerit

A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:

df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df

    a   b   idx
0   1   1   0
1   1   1   0
2   1   2   1
3   2   1   2
4   2   1   2
5   2   2   3

Edit: changed the labels property to codes, as the former seems to be deprecated

Edit2: Added a separator as suggested by Authman Apatira
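
To illustrate why the separator matters (my own toy example, not from the answer): without one, distinct key pairs can concatenate to the same string and collapse into a single group:

```python
import pandas as pd

# ('1', '12') and ('11', '2') both concatenate to '112' without a separator
df = pd.DataFrame({'a': [1, 11], 'b': [12, 2]})

no_sep = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
with_sep = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes

print(len(set(no_sep)))    # 1 -- the two distinct groups collide
print(len(set(with_sep)))  # 2 -- the separator keeps them apart
```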

Answered by Marjan Moderc

Definitely not the most straightforward solution, but here is what I would do (comments in the code):

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# create a dummy grouper id by just joining the desired columns
df["idx"] = df[["a", "b"]].astype(str).apply(lambda x: "".join(x), axis=1)

print(df)

That would generate a unique idx for each combination of a and b.

   a  b idx
0  1  1  11
1  1  1  11
2  1  2  12
3  2  1  21
4  2  1  21
5  2  2  22

But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:

# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))

# switch keys and values, so you can use the dict in the .replace method
dict_idx = {y: x for x, y in dict_idx.items()}

# replace values with the generated dict
df["idx"].replace(dict_idx, inplace=True)

print(df)

That would produce the desired output:

   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
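
The dictionary round-trip above can be collapsed with pd.factorize, which hands out integer codes in order of first appearance; this shortcut is my addition, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# Same dummy string key as above, then factorize it straight to 0..n-1
key = df[["a", "b"]].astype(str).apply("".join, axis=1)
df["idx"] = pd.factorize(key)[0]
print(df["idx"].tolist())  # [0, 0, 1, 2, 2, 3]
```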

Answered by maxliving

A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):

def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought 
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
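
A quick sanity check (mine, not the answerer's) that this reproduces the idx column from the question; note the ids start at 1, and the assignment aligns on the index even though the function sorts the frame in place:

```python
import pandas as pd

def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # Rows equal to the previous row (on the grouping cols) are duplicates;
    # each non-duplicate marks the start of a new group.
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    return (~duplicated).cumsum()

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
df['idx'] = create_index_usingduplicated(df)
print(df['idx'].tolist())  # [1, 1, 2, 3, 3, 4]
```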

Timing results:

a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})

In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop

In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop

Answered by Ted Petrou

I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts by the grouping columns and then checks whether each row differs from the previous row, accumulating by 1 whenever it does. Check further below for an answer with string data.

df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)

Output

0    1
1    1
2    2
3    3
4    3
5    4
dtype: int64

So breaking this up into steps, let's see the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks if each row is different from the previous row. Any non-zero entry indicates a new group.

     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  1.0
3  1.0 -1.0
4  0.0  0.0
5  0.0  1.0

A new group only needs a single column to differ, so this is what .ne(0).any(1) checks - not equal to 0 for any of the columns. Then a cumulative sum keeps track of the groups.

Answer for columns as strings

#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])

Output of df1

   a  b
0  a  a
1  a  a
4  a  a
3  b  a
2  b  b
5  c  c
6  c  d
8  c  d
7  d  d

Take a similar approach by checking if the group has changed

df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)

0    1
1    1
4    1
3    2
2    3
5    4
6    5
8    5
7    6