Python Pandas GroupBy get list of groups

Notice: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow. Original URL: http://stackoverflow.com/questions/28844535/

Date: 2020-08-19 03:49:19  Source: igfitidea

python pandas

Asked by user3745115

I have a line of code:

g = x.groupby('Color')

The colors are Red, Blue, Green, Yellow, Purple, Orange, and Black. How do I return this list? For similar attributes, I use x.Attribute and it works fine, but x.Color doesn't behave the same way.

Accepted answer by Yanqi Ma

There is a much easier way of doing it:

g = x.groupby('Color')

g.groups.keys()

Calling groupby() returns a GroupBy object whose groups attribute is a dict that maps each group name to the index labels of the rows in that group. You can easily get the keys of this dict with Python's built-in dict method keys().
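
As a minimal sketch of the approach above, assuming a small made-up DataFrame named x with a Color column (the data and the Value column are hypothetical, not the asker's):

```python
import pandas as pd

# Hypothetical data standing in for the asker's DataFrame.
x = pd.DataFrame({
    "Color": ["Red", "Blue", "Red", "Green", "Blue"],
    "Value": [1, 2, 3, 4, 5],
})

g = x.groupby("Color")

# g.groups maps each group name to the index labels of its rows.
group_names = list(g.groups.keys())
print(group_names)

# Index labels of the rows that belong to the "Red" group.
red_labels = list(g.groups["Red"])
print(red_labels)
```

Wrapping keys() in list() turns the dict view into a plain list, which is usually what "return this list" calls for.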

Answer by ericmjl

Here's how to do it.

groups = list()
for g, data in x.groupby('Color'):
    print(g, data)
    groups.append(g)

The core idea here is this: if you iterate over a DataFrame GroupBy object, you get back two-tuples of (group name, filtered data frame), where the filtered data frame contains only the records corresponding to that group.
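
To make the two-tuple structure concrete, here is a small sketch on made-up data (the frame x and its Value column are assumptions for illustration). It records each group name together with the number of rows in its filtered frame:

```python
import pandas as pd

# Hypothetical data for illustration only.
x = pd.DataFrame({
    "Color": ["Red", "Blue", "Red", "Green"],
    "Value": [1, 2, 3, 4],
})

groups = []
sizes = {}
for name, sub_df in x.groupby("Color"):
    # name is the group key; sub_df holds only the rows for that key.
    groups.append(name)
    sizes[name] = len(sub_df)
    # Every row in sub_df carries the same Color value as the key.
    assert (sub_df["Color"] == name).all()

print(groups)
print(sizes)
```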

Answer by Zythyr

It is my understanding that you have a DataFrame which contains multiple columns. One of the columns is "Color", which holds different colors. You want to return a list of the unique colors that exist.

colorGroups = df.groupby('Color')
for c in colorGroups.groups:
    print(c)

The above code will give you every color that exists, without repeating color names. Thus, you should get output such as the following (the order may differ):

Red
Blue
Green
Yellow
Purple
Orange
Black

An alternative is the unique() function, which returns an array of all unique values in a Series. Thus, to get an array of all unique colors, you would do:

df['Color'].unique()

The output is an array with values in order of first appearance, so for example print(df['Color'].unique()[3]) would give you Yellow.
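
A short sketch of the unique() route, assuming a made-up df whose Color column repeats some values (the data is hypothetical):

```python
import pandas as pd

# Hypothetical data: some colors appear more than once.
df = pd.DataFrame({
    "Color": ["Red", "Blue", "Red", "Green", "Yellow", "Blue"],
})

# unique() keeps the order in which each value first appears.
unique_colors = df["Color"].unique()

# Convert to a plain Python list if a list is needed.
as_list = list(unique_colors)
print(as_list)

# Positional indexing works on the returned array.
fourth = unique_colors[3]
print(fourth)
```

Note that no groupby is involved here, so this only helps when the group names themselves are all that is needed.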

Answer by Erik Swan

If you do not care about the order of the groups, Yanqi Ma's answer will work fine:

g = x.groupby('Color')
g.groups.keys()
list(g.groups) # or this

However, note that g.groups is a dictionary, so its key order is not guaranteed to match the group order! (Before Python 3.7, dict keys were inherently unordered.) This is the case even if you use sort=True on the groupby method to sort the groups, which is the default.

This actually bit me hard when it resulted in a different order on two platforms, especially since I was using list(g.groups), so it wasn't obvious at first that g.groups was a dict.

In my opinion, the best way to do this is to take advantage of the fact that the GroupBy object has an iterator, and use a list comprehension to return the groups in the order they exist in the GroupBy object:

g = x.groupby('Color')
groups = [name for name, unused_df in g]

It's a little less readable, but this will always return the groups in the correct order.
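
The ordering behaviour can be checked with a small sketch on made-up data (the frame x here is hypothetical). With sort=True, iteration yields groups in sorted key order; with sort=False, it yields them in order of first appearance:

```python
import pandas as pd

# Hypothetical data; "Red" appears first, then "Blue", then "Green".
x = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue"],
    "Value": [1, 2, 3, 4, 5],
})

# sort=True (the default): groups come out in sorted key order.
sorted_names = [name for name, _ in x.groupby("Color", sort=True)]

# sort=False: groups come out in order of first appearance.
appearance_names = [name for name, _ in x.groupby("Color", sort=False)]

print(sorted_names)
print(appearance_names)
```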

Answer by Itai Roth

I compared runtime for the solutions above (with my data):

In [443]: d = df3.groupby("IND")

In [444]: %timeit groups = [name for name,unused_df in d]
377 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [445]: %timeit list(d.groups)
1.08 μs ± 47.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [446]: %timeit d.groups.keys()
708 ns ± 7.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [447]: %timeit df3['IND'].unique()
5.33 ms ± 128 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It seems that d.groups.keys() is the fastest method.
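
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The synthetic df3 below (10,000 rows, integer labels in an IND column) is an assumption and is not the data used for the timings above, so the numbers will differ:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows spread over up to 100 group labels.
rng = np.random.default_rng(0)
df3 = pd.DataFrame({"IND": rng.integers(0, 100, size=10_000)})
d = df3.groupby("IND")

# The four approaches compared in this answer.
candidates = {
    "iterate": lambda: [name for name, _ in d],
    "list(d.groups)": lambda: list(d.groups),
    "d.groups.keys()": lambda: d.groups.keys(),
    "unique()": lambda: df3["IND"].unique(),
}

# Time each approach over a handful of runs.
timings = {}
for label, fn in candidates.items():
    timings[label] = timeit.timeit(fn, number=5)
    print(f"{label}: {timings[label]:.6f} s for 5 runs")

# Sanity check: every approach should report the same set of groups.
results = {label: set(fn()) for label, fn in candidates.items()}
```

Keep in mind that keys() returns a view of the dict rather than building a new list, which is part of why it is so cheap, and that speed alone says nothing about the ordering concern raised in the previous answer.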