Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Original question: http://stackoverflow.com/questions/19804282/
In pandas is there something like a GroupBy.get_group, but with an optional default value?
Asked by Zach Dwiel
I have a DataFrame df that I have grouped with groupby. I'm looking for a function similar to get_group(name) that, rather than raising a KeyError when the name doesn't exist, returns an empty DataFrame (or some other value), much like dict.get:
g = df.groupby('x')
# doesn't work, but would be nice:
i = g.get_group(1, default=[])
# does work, but is hard to read:
i = g.obj.take(g.indices.get(1, []), g.axis)
Is there already a function which provides this?
Edit:
In many ways, the GroupBy object is represented by a dict (.indices, .groups), and this 'get with default' functionality is core enough to the concept of a dict that it is included in the Python language itself. If a dict-like thing doesn't have a get with default, maybe I'm not understanding it correctly? Why would a dict-like thing not have a 'get with default'?
An abbreviated example of what I want to do is:
df1_bymid = df1.groupby('mid')
df2_bymid = df2.groupby('mid')
for mid in set(df1_bymid.groups) | set(df2_bymid.groups):
    # wished-for signature: get_group does not accept a default
    rows1 = df1_bymid.get_group(mid, [])
    rows2 = df2_bymid.get_group(mid, [])
    for row1, row2 in itertools.product(rows1, rows2):
        yield row1, row2
Of course I could create a function, and I might. It just seemed that if I have to go this far out of my way, maybe I'm not using the GroupBy object the way it was intended:
def get_group(df, name, obj=None, default=None):
    # df is the GroupBy object; obj is the frame to take rows from
    if obj is None:
        obj = df.obj
    try:
        inds = df.indices[name]
    except KeyError:
        if default is None:
            raise
        inds = default
    return obj.take(inds, axis=df.axis)
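For reference, here is a minimal sketch of how the pairing loop from the question could be made to run today without any default argument on get_group. It is only one possible reading of the intent: the function name paired_rows and the sample frames are made up for illustration, and it assumes the goal is to pair actual rows (iterating a DataFrame directly yields column names, so iterrows() is used instead).

```python
import itertools

import pandas as pd


def paired_rows(df1, df2, key="mid"):
    """Yield every (row1, row2) pair whose rows share the same key value."""
    g1 = df1.groupby(key)
    g2 = df2.groupby(key)
    for mid in set(g1.groups) | set(g2.groups):
        # .groups maps each key to index labels; a missing key falls back to []
        rows1 = df1.loc[g1.groups.get(mid, [])]
        rows2 = df2.loc[g2.groups.get(mid, [])]
        for (_, row1), (_, row2) in itertools.product(rows1.iterrows(), rows2.iterrows()):
            yield row1, row2


# Hypothetical sample data
df1 = pd.DataFrame({"mid": [1, 1, 2], "v": [10, 11, 12]})
df2 = pd.DataFrame({"mid": [1, 3], "w": [20, 21]})
pairs = list(paired_rows(df1, df2))
```

Keys that appear in only one frame contribute no pairs, because the product with an empty frame is empty.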
Answered by waitingkuo
I might define my own get_group() as follows:
In [55]: def get_group(g, key):
   ....:     if key in g.groups: return g.get_group(key)
   ....:     return pd.DataFrame()
   ....:
In [52]: get_group(g, 's1')
Out[52]:
Mt Sp Value count
0 s1 a 1 3
1 s1 b 2 2
In [54]: get_group(g, 's4')
Out[54]:
Empty DataFrame
Columns: []
Index: []
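A slightly more general variant of the helper above, as a rough sketch: it takes the fallback as a parameter, which is closer to what dict.get does. The name get_group_or_default and the sample data are assumptions for illustration, not part of pandas.

```python
import pandas as pd


def get_group_or_default(grouped, key, default=None):
    """Return the group for `key`, or `default` when the key is absent."""
    if key in grouped.groups:
        return grouped.get_group(key)
    # With no explicit default, fall back to an empty DataFrame
    return pd.DataFrame() if default is None else default


# Hypothetical sample data loosely modelled on the output shown above
df = pd.DataFrame({"Mt": ["s1", "s1", "s2"], "Sp": ["a", "b", "c"], "Value": [1, 2, 3]})
g = df.groupby("Mt")

present = get_group_or_default(g, "s1")
missing = get_group_or_default(g, "s4")
```

Note that the empty fallback has no columns, so code that later selects a column from the result would still need to handle that case.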
Answered by Mike
It is not as pretty but you could do something like this:
Setup:
>>> df = pandas.DataFrame([[1,2,3],[4,5,6],[1,8,9]], columns=['a','b','c'])
>>> df
a b c
0 1 2 3
1 4 5 6
2 1 8 9
>>> g = df.groupby('a')
Now g.get_group requires that the key passed exists in the underlying groups dict, but you can access that member yourself, and in fact it is a normal Python dict. It maps each group value to the collection of index labels for that group:
>>> g.groups
{1: Int64Index([0, 2], dtype='int64'), 4: Int64Index([1], dtype='int64')}
>>> type(g.groups)
<type 'dict'>
If you pass these returned indices to the dataframe's index location function, loc, you get your groups out the same way get_group would:
>>> df.loc[g.groups[1]]
a b c
0 1 2 3
2 1 8 9
Since groups is a dict, you can use the get method. Without a default value, get returns None for a missing key, which causes loc to raise an exception. But loc will accept an empty list:
>>> df.loc[g.groups.get(1, [])]
a b c
0 1 2 3
2 1 8 9
>>> df.loc[g.groups.get(2, [])]
Empty DataFrame
Columns: [a, b, c]
Index: []
It is not as clean as supplying a default value to get_group (maybe they should add that feature in a future version), but it works.
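The same idea can be wrapped in a small helper, sketched below. The name rows_for is made up for illustration; the DataFrame is the one from the setup above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 8, 9]], columns=["a", "b", "c"])
g = df.groupby("a")


def rows_for(frame, grouped, key):
    """Select the rows of `frame` that belong to group `key`, or no rows."""
    # .groups holds index labels, so .loc (label-based) is the matching accessor
    labels = grouped.groups.get(key, [])
    return frame.loc[labels]


hit = rows_for(df, g, 1)   # rows with a == 1
miss = rows_for(df, g, 2)  # no such group: empty frame that keeps the columns
```

Unlike a bare pd.DataFrame() fallback, the empty result here keeps the original columns, so downstream column access keeps working.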
Answered by Phil
You can use a defaultdict to achieve this.
Let's say you have a groupby object that splits the data on whether a column is greater than zero. The values could all fall on one side of zero, so you cannot be sure whether the groupby holds one DataFrame or two.
g_df = df.groupby(df.some_column.gt(0))
Then there are two approaches:
df_dict = defaultdict(pd.DataFrame, {i:i_df for i,i_df in g_df} )
df_dict[True]
df_dict[False]
Or:
df_dict = defaultdict(list, g_df.groups)
df.loc[df_dict[True]]
df.loc[df_dict[False]]
I haven't tested which is more efficient. The second approach only builds a defaultdict over the index labels rather than over the DataFrames themselves, so it could well be the cheaper of the two.
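Here is a self-contained sketch of both approaches on a tiny frame where every value is positive, so the False group does not exist. The sample data and the variable names frames and labels are assumptions for illustration.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical data: all values are > 0, so only the True group exists
df = pd.DataFrame({"some_column": [1, 2, 3]})
g_df = df.groupby(df.some_column.gt(0))

# Approach 1: map each key to its sub-frame; missing keys give pd.DataFrame()
frames = defaultdict(pd.DataFrame, {key: sub for key, sub in g_df})
positive = frames[True]
non_positive = frames[False]  # empty frame with no columns

# Approach 2: map each key to its index labels; missing keys give []
labels = defaultdict(list, g_df.groups)
positive_rows = df.loc[labels[True]]
non_positive_rows = df.loc[labels[False]]  # empty frame that keeps the columns
```

One thing to keep in mind is that looking up a missing key in a defaultdict also inserts it, so the dict grows as absent keys are queried.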

