Python Pandas GroupBy get list of groups

Disclaimer: This page is a Chinese-English parallel translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me): StackOverflow

Original URL: http://stackoverflow.com/questions/28844535/
Asked by user3745115
I have a line of code:
g = x.groupby('Color')
The colors are Red, Blue, Green, Yellow, Purple, Orange, and Black. How do I return this list? For similar attributes, I use x.Attribute and it works fine, but x.Color doesn't behave the same way.
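For concreteness, here is a minimal sketch of the kind of DataFrame the snippets below assume (the column names and values are made up for illustration):

import pandas as pd

x = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Purple', 'Orange', 'Black', 'Red'],
    'Value': [1, 2, 3, 4, 5, 6, 7, 8],
})
g = x.groupby('Color')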
Accepted answer by Yanqi Ma
There is a much easier way of doing it:
g = x.groupby('Color')
g.groups.keys()
When you call groupby(), pandas gives you a GroupBy object whose groups attribute is a dict mapping each group name to the row labels in that group. You can easily get the list of group names from this dict with the built-in keys() method.
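As a rough illustration on the made-up x from above (the alphabetical ordering comes from groupby's default sort=True):

g = x.groupby('Color')
print(list(g.groups.keys()))
# e.g. ['Black', 'Blue', 'Green', 'Orange', 'Purple', 'Red', 'Yellow']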
Answered by ericmjl
Here's how to do it.
groups = list()
for g, data in x.groupby('Color'):
    print(g, data)
    groups.append(g)
The core idea here is this: if you iterate over a DataFrame GroupBy iterator, you'll get back a two-tuple of (group name, filtered data frame), where the filtered data frame contains only the records corresponding to that group.
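A variation on the same idea (again assuming the made-up x from above): since the iterator yields (name, sub-DataFrame) pairs, you can also keep the filtered frames themselves, for example in a dict keyed by group name:

# collect each group's rows under its color name
frames = {name: sub_df for name, sub_df in x.groupby('Color')}
print(frames['Red'])  # only the rows where Color == 'Red'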
Answered by Zythyr
It is my understanding that you have a DataFrame which contains multiple columns. One of the columns is "Color", which holds different types of colors. You want to return a list of the unique colors that exist.
colorGroups = df.groupby(['Color'])
for c in colorGroups.groups:
    print(c)
The above code will give you all the colors that exist without repeating the color names. Thus, you should get an output such as:
Red
Blue
Green
Yellow
Purple
Orange
Black
An alternative is the unique() function, which returns an array of all unique values in a Series. Thus, to get an array of all unique colors, you would do:
df['Color'].unique()
The output is an array, so for example print(df['Color'].unique()[3]) would give you Yellow.
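A difference worth noting between the two approaches (a small sketch; df here stands for the same kind of DataFrame as the made-up x earlier): Series.unique() returns the values in order of first appearance, while groupby sorts the group keys by default, so the two lists may be ordered differently.

print(df['Color'].unique())              # values in order of first appearance
print(list(df.groupby('Color').groups))  # group keys sorted, since sort=True by default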
Answered by Erik Swan
If you do not care about the order of the groups, Yanqi Ma's answer will work fine:
g = x.groupby('Color')
g.groups.keys()
list(g.groups) # or this
However, note that g.groups is a dictionary, so the keys are inherently unordered! This is the case even if you use sort=True on the groupby method to sort the groups, which is true by default.
This actually bit me hard when it resulted in a different order on two platforms, especially since I was using list(g.groups), so it wasn't obvious at first that g.groups was a dict.
In my opinion, the best way to do this is to take advantage of the fact that the GroupBy object has an iterator, and use a list comprehension to return the groups in the order they exist in the GroupBy object:
g = x.groupby('Color')
groups = [name for name,unused_df in g]
It's a little less readable, but this will always return the groups in the correct order.
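As a small follow-up sketch (assuming the made-up x from earlier): passing sort=False makes the same list comprehension return the groups in the order the colors first appear in the frame instead of alphabetically:

g = x.groupby('Color', sort=False)
groups = [name for name, _ in g]
print(groups)  # e.g. ['Red', 'Blue', 'Green', 'Yellow', 'Purple', 'Orange', 'Black'] for the x above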
Answered by Itai Roth
I compared runtime for the solutions above (with my data):
In [443]: d = df3.groupby("IND")
In [444]: %timeit groups = [name for name,unused_df in d]
377 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [445]: %timeit list(d.groups)
1.08 μs ± 47.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [446]: %timeit d.groups.keys()
708 ns ± 7.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [447]: %timeit df3['IND'].unique()
5.33 ms ± 128 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It seems that 'd.groups.keys()' is the best method.
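For reference, a rough sketch of how a comparison like this could be reproduced outside IPython with the standard timeit module; df3 and the 'IND' column are the answerer's data, so the DataFrame built here is just a made-up stand-in and the absolute numbers will differ:

import timeit

import numpy as np
import pandas as pd

# made-up stand-in for the answerer's df3 / 'IND' column
df3 = pd.DataFrame({'IND': np.random.randint(0, 1000, size=100_000)})
d = df3.groupby('IND')

print(timeit.timeit(lambda: [name for name, _ in d], number=10))      # iterate the GroupBy
print(timeit.timeit(lambda: list(d.groups), number=10))               # list of the groups dict keys
print(timeit.timeit(lambda: list(d.groups.keys()), number=10))        # explicit keys() call
print(timeit.timeit(lambda: df3['IND'].unique(), number=10))          # unique values of the column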