pandas groupby group and subgroup level analysis
Source: http://stackoverflow.com/questions/35095128/
Note: this page is adapted from a popular Stack Overflow question and is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Asked by Siraj S.
On a multi-column groupby object, how do I access only the outer column? For example, in the data below I can access the inner column (entertainment content) with df.get_group(('media', 'entertainment content')). I would also like to be able to run something like df.get_group(('media')), but it throws an error: "ValueError: must supply a tuple to get_group with multiple grouping keys"
[('media', 'entertainment content'),('media', 'internet media')]
df.get_group(('media', 'entertainment content'))
                                      lasts      vol        prev ticker
industry sub_industry
media    entertainment content   379.200012  1828139  354.000000  suntv
         entertainment content   420.049988  2675741  404.600006      z
temp.get_group(('media'))
ValueError: must supply a tuple to get_group with multiple grouping keys
Answered by Siraj S.
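If you grouped by a single key and just want to access 'media', you don't need the extra set of parentheses when you call get_group, so it would just be get_group('media'). Note that in Python ('media') is the same as 'media'; parentheses without a comma do not create a tuple.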
The extra set of parentheses matters when you grouped by multiple keys: it creates a tuple holding one value per key, which identifies a single group. For instance: get_group(('media','pizza'))
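To make the point about parentheses concrete, here is a minimal sketch on a small made-up frame. The rows, the tickers "x" and "y", and the name `grouped` are assumptions for illustration, not the asker's data. It shows that parentheses alone do not build a tuple, and that with two grouping keys a two-element tuple picks out one group.

```python
import pandas as pd

# Hypothetical data standing in for the asker's frame.
df = pd.DataFrame({
    "industry": ["media", "media", "media", "food"],
    "sub_industry": ["entertainment content", "entertainment content",
                     "internet media", "pizza"],
    "ticker": ["suntv", "z", "x", "y"],
})

# Parentheses alone do not make a tuple; the trailing comma does.
print(type(("media")).__name__)   # str
print(type(("media",)).__name__)  # tuple

grouped = df.groupby(["industry", "sub_industry"])

# One value per grouping key selects a single (industry, sub_industry) group.
one_group = grouped.get_group(("media", "entertainment content"))
print(one_group)
```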
Answered by Alberto
Since it looks like get_group cannot access a single key after grouping by more than one key, I suggest the following alternative method.
Generating the data frame:
import pandas as pd
import numpy as np
rand = np.random.RandomState(1)
df = pd.DataFrame({'A': ['foo', 'bar'] * 12,
                   'B': rand.randn(24),
                   'C': rand.randint(0, 20, 24),
                   'D': ['aaa','bbb','ccc'] * 8})
Grouping by multiple keys ('A' and 'D') and using GroupBy.ngroup to assign a group number, storing it in a new column:
df["grouping_AandD"] = df.groupby(["A", "D"]).ngroup()
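To see which number each ('A', 'D') pair received, a small sketch like the following may help. It rebuilds only the 'A' and 'D' columns of the frame, and the name `mapping` is an assumption introduced here. With the default sorted grouping, the numbers follow the sorted order of the key pairs.

```python
import pandas as pd

# Same key columns as the frame in this answer, without the random columns.
df = pd.DataFrame({"A": ["foo", "bar"] * 12,
                   "D": ["aaa", "bbb", "ccc"] * 8})
df["grouping_AandD"] = df.groupby(["A", "D"]).ngroup()

# One row per distinct (A, D) pair, ordered by the group number it received.
mapping = (df.drop_duplicates(["A", "D"])
             .sort_values("grouping_AandD")
             .reset_index(drop=True))
print(mapping)
```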
Using the just created column to display all combinations in a loop but show only those containing the 'wanted key' ('foo' in this case):
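wanted_key = "foo"
# Walk through every (A, D) group number and print the groups that belong to the wanted key.
for i in range(df.grouping_AandD.nunique()):
    grouped_df = df[df.grouping_AandD == i]
    # Compare element-wise first, then reduce with all().
    if (grouped_df.A == wanted_key).all():
        print(grouped_df)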
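If the loop over group numbers feels heavy, one possible alternative (a sketch, not part of the original answer) is to filter on the outer key first and then group by the inner key only. The name `subgroups` is an assumption introduced here.

```python
import numpy as np
import pandas as pd

rand = np.random.RandomState(1)
df = pd.DataFrame({"A": ["foo", "bar"] * 12,
                   "B": rand.randn(24),
                   "C": rand.randint(0, 20, 24),
                   "D": ["aaa", "bbb", "ccc"] * 8})

wanted_key = "foo"

# Keep only the rows for the outer key, then split them on the inner key.
subgroups = {inner: g for inner, g in df[df["A"] == wanted_key].groupby("D")}

for inner, g in subgroups.items():
    print(inner)
    print(g)
```

This yields one sub-frame per value of 'D' within 'foo', without needing the helper column.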
Answered by Mike Müller
Just do what the error message says and use a tuple:
temp.get_group(('media',))
Note the trailing comma.
Answered by Arjun Varshney
I was trying to do something similar (creating columns for each subgroup). As far as I know, the approach below worked for me and should help you as well. I looked for a solution in the cookbook that the pandas documentation provides, but it didn't help. Here is what I would suggest:
# Group by both columns; multiple keys must be passed as a list.
grp = df.groupby(['industry', 'sub_industry'])
values = []
# Only visit sub-industries that occur under 'media', so every key exists.
for sub_ind in df.loc[df.industry == 'media', 'sub_industry'].unique():
    values.append(grp.get_group(('media', sub_ind)))
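A variation on this idea, sketched on made-up data, is to iterate over the groupby object itself and keep every group whose first key is 'media'. The rows, the tickers "x" and "y", and the names `media_groups` and `media_all` are assumptions for illustration. It assumes 'industry' and 'sub_industry' are ordinary columns.

```python
import pandas as pd

# Hypothetical data standing in for the asker's frame.
df = pd.DataFrame({
    "industry": ["media", "media", "media", "food"],
    "sub_industry": ["entertainment content", "entertainment content",
                     "internet media", "pizza"],
    "ticker": ["suntv", "z", "x", "y"],
})

grp = df.groupby(["industry", "sub_industry"])

# Each key is an (industry, sub_industry) tuple; keep the 'media' ones.
media_groups = {key[1]: g for key, g in grp if key[0] == "media"}

# Stitch the sub-groups back together to get the whole outer group.
media_all = pd.concat(list(media_groups.values()))
print(media_all)
```

The dictionary gives access to each sub-group by name, while `media_all` holds every row of the outer group in one frame.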