Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/16975318/

Time: 2020-09-13 20:53:09  Source: igfitidea

Pandas: aggregate when column contains numpy arrays

Tags: python, numpy, pandas, aggregation

Asked by pteehan

I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'.

For example:

import pandas as pd
import numpy as np

DF = pd.DataFrame([[1, np.array([10, 20, 30])],
                   [1, np.array([40, 50, 60])],
                   [2, np.array([20, 30, 40])]],
                  columns=['category', 'arraydata'])

This works the way I would expect it to:

DF.groupby('category').agg(sum)

output:

             arraydata
category 1   [50 70 90]
         2   [20 30 40]

However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate on, and I have to select it manually. Here is one approach I tried:

g=DF.groupby('category')
g.agg({'arraydata':sum})

Here is another:

g=DF.groupby('category')
g['arraydata'].agg(sum)

Both give the same output:

Exception: Must produce aggregated value

However, if I have a column that uses numeric rather than array data, it works fine. I can work around this, but it's confusing, and I'm wondering whether this is a bug or whether I'm doing something wrong. I feel like the use of arrays here might be a bit of an edge case, and indeed I wasn't sure if they were supported. Ideas?

Thanks

Answered by Jeff Tratner

One, perhaps clunkier, way to do it would be to iterate over the GroupBy object (it generates (grouping_value, df_subgroup) tuples). For example, to achieve what you want here, you could do:

grouped = DF.groupby("category")
aggregate = list((k, v["arraydata"].sum()) for k, v in grouped)
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")

This is very similar to what pandas is doing under the hood anyways [groupby, then do some aggregation, then merge back in], so you aren't really losing out on much.
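
The iteration above can also be wrapped in a small helper. The following is only a sketch: the function name sum_arrays_by_group is made up for illustration, it uses np.stack instead of Series.sum, and it assumes that all arrays within a group have the same shape.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame(
    [[1, np.array([10, 20, 30])],
     [1, np.array([40, 50, 60])],
     [2, np.array([20, 30, 40])]],
    columns=["category", "arraydata"],
)


def sum_arrays_by_group(df, key, column):
    """Element-wise sum of the arrays in `column` for each value of `key`.

    Hypothetical helper (not part of pandas); assumes every array in a
    group has the same shape so that np.stack can combine them.
    """
    rows = []
    for group_value, subframe in df.groupby(key):
        # Stack the group's arrays into a 2-D array, then sum down the rows.
        stacked = np.stack(subframe[column].tolist())
        rows.append((group_value, stacked.sum(axis=0)))
    return pd.DataFrame(rows, columns=[key, column]).set_index(key)


new_df = sum_arrays_by_group(DF, "category", "arraydata")
print(new_df)
```

Because the summing happens outside of agg, the ndarray type check discussed in this answer is never reached.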

Diving into the Internals

The problem here is that pandas explicitly checks that the output is not an ndarray, because it wants to intelligently reshape your array, as you can see in this snippet from _aggregate_named, where the error occurs.

def _aggregate_named(self, func, *args, **kwargs):
    result = {}

    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)

    return result
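
To see the effect of that check in isolation, here is a toy stand-in written in plain Python. It is not pandas' code: the function aggregate_named and the dictionaries of groups are invented for illustration, and only the isinstance test is taken from the snippet above.

```python
import numpy as np


def aggregate_named(groups, func):
    """Toy imitation of the type check in the quoted snippet (not pandas code)."""
    result = {}
    for name, values in groups.items():
        output = func(values)
        # Same test as in the quoted source: any ndarray result is rejected.
        if isinstance(output, np.ndarray):
            raise Exception("Must produce aggregated value")
        result[name] = output
    return result


scalar_groups = {1: [1, 2], 2: [3]}
array_groups = {
    1: [np.array([10, 20, 30]), np.array([40, 50, 60])],
    2: [np.array([20, 30, 40])],
}

# Scalars pass the check.
scalar_result = aggregate_named(scalar_groups, sum)

# Summing arrays yields an ndarray, so the check raises.
try:
    aggregate_named(array_groups, sum)
    error_message = None
except Exception as exc:
    error_message = str(exc)

# Wrapping the result in a list slips past the isinstance test.
list_result = aggregate_named(array_groups, lambda vals: list(sum(vals)))
```

The check looks only at the type of the returned object, which is why returning a list instead of an ndarray is enough to get past it.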

My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly. As an aside, you can sidestep this problem by converting your output to a list, like this:

DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})

Pandas doesn't complain, because the function now returns a Python list rather than an ndarray (though this is really just cheating around the type check). And if you want to convert back to an array, just apply np.array to it.

result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)

How you want to resolve this issue really depends on why you have columns of ndarray and whether you want to aggregate anything else at the same time. That said, you can always iterate over the GroupBy object as I've shown above.

Answered by Andy Hayden

Pandas works much more efficiently if you don't do this (e.g. using numeric data, as you suggest). Another alternative is to use a Panel object for this kind of multidimensional data.
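
As one possible reading of the "numeric data" suggestion, the arrays can be spread across ordinary numeric columns before grouping. This is a sketch rather than the answerer's own code: it assumes every array has the same length, and the variable names wide and summed are invented here. (Panel itself was later deprecated and removed from pandas, as of version 0.25, so the sketch sticks to a plain DataFrame.)

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame(
    [[1, np.array([10, 20, 30])],
     [1, np.array([40, 50, 60])],
     [2, np.array([20, 30, 40])]],
    columns=["category", "arraydata"],
)

# Turn the column of arrays into a 2-D array with one column per element.
# Assumption: all arrays have the same length.
wide = pd.DataFrame(np.stack(DF["arraydata"].tolist()), index=DF.index)
wide.insert(0, "category", DF["category"])

# Every value column is now plain numeric, so the built-in sum applies.
summed = wide.groupby("category").sum()
print(summed)
```

Each row of summed holds the element-wise sum for one category, and no Python objects are stored in the cells.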

That said, this looks like a bug: the Exception is being raised purely because the result is an array:

Exception: Must produce aggregated value

In [11]: %debug
> /Users/234BroadWalk/pandas/pandas/core/groupby.py(1511)_aggregate_named()
   1510             if isinstance(output, np.ndarray):
-> 1511                 raise Exception('Must produce aggregated value')
   1512             result[name] = self._try_cast(output, group)

ipdb> output
array([50, 70, 90])

If you were to recklessly remove these two lines from the source code, it works as expected:

In [99]: g.agg(sum)
Out[99]:
             arraydata
category
1         [50, 70, 90]
2         [20, 30, 40]

Note: They're almost certainly in there for a reason...
