Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/27439023/

Date: 2020-09-13 22:45:40  Source: igfitidea

Pandas Groupby Agg Function Does Not Reduce

python pandas

Asked by Woody Pride

I am using an aggregation function that I have used in my work for a long time. The idea is that if the Series passed to the function has length 1 (i.e. the group has only one observation), then that observation is returned. If the Series has more than one element, the observations are returned as a list.

This may seem odd to some, but it is not an XY problem: I have good reasons for wanting to do this, and they are not relevant to this question.

This is the function that I have been using:

def MakeList(x):
    """ This function is used to aggregate data that needs to be kept distinct within multi day
        observations for later use and transformation. It makes a list of the data and if the list is of length 1
        then there is only one line/day observation in that group so the single element of the list is returned. 
        If the list is longer than one then there are multiple line/day observations and the list itself is 
        returned."""
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]
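
As a quick illustration of the intended behaviour, the function can be called directly on small Series, outside of any groupby. This is only a sketch; the two sample Series below are made up for the example.

```python
import pandas as pd

def MakeList(x):
    # Same logic as the function above: a list for several values,
    # the bare value when the group holds a single observation.
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]

# Two observations in the group: the whole list comes back.
print(MakeList(pd.Series([7.76, 25.564])))  # [7.76, 25.564]

# One observation in the group: the single element comes back.
print(MakeList(pd.Series([9.55])))  # 9.55
```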

Now for some reason, with the current data set I am working on I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:

import pandas as pd
DF = pd.DataFrame({'date': ['2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02'],
                    'line_code':   ['401101',
                                    '401101',
                                    '401102',
                                    '401103',
                                    '401104',
                                    '401105',
                                    '401105',
                                    '401106',
                                    '401106',
                                    '401107'],
                    's.m.v.': [ 7.760,
                                25.564,
                                25.564,
                                9.550,
                                4.870,
                                7.760,
                                25.564,
                                5.282,
                                25.564,
                                5.282]})
DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

While trying to debug this, I added print statements to the effect of print(L) and print(x.index), and the output was as follows:

[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')

For some reason it appears that agg is passing the Series to the function twice. As far as I know this is not normal at all, and it is presumably the reason why my function is not reducing.

For example if I write a function like this:

def test_func(x):
    print(x.index)
    return x.iloc[0]

This runs without a problem, and the printed output is:

DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})

Int64Index([0, 1], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([3], dtype='int64')
Int64Index([4], dtype='int64')
Int64Index([5, 6], dtype='int64')
Int64Index([7, 8], dtype='int64')
Int64Index([9], dtype='int64')

This indicates that each group is passed to the function only once, as a Series.

Can anyone help me understand why this is failing? I have used this function successfully on many of the data sets I work with.

Thanks

Answered by paulo.filip3

I can't really explain why, but in my experience lists in a pandas.DataFrame don't work all that well.

I usually use tuple instead. This will work:

def MakeList(x):
    T = tuple(x)
    if len(T) > 1:
        return T
    else:
        return T[0]

DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

         date line_code           s.m.v.
0  2013-04-02    401101   (7.76, 25.564)
1  2013-04-02    401102           25.564
2  2013-04-02    401103             9.55
3  2013-04-02    401104             4.87
4  2013-04-02    401105   (7.76, 25.564)
5  2013-04-02    401106  (5.282, 25.564)
6  2013-04-02    401107            5.282

Answered by Nik Bates-Haus

This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:

def _aggregate_series_pure_python(self, obj, func):

    group_index, _, ngroups = self.group_info

    counts = np.zeros(ngroups, dtype=int)
    result = None

    splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

    for label, group in splitter:
        res = func(group)
        if result is None:
            if (isinstance(res, (Series, Index, np.ndarray)) or
                    isinstance(res, list)):
                raise ValueError('Function does not reduce')
            result = np.empty(ngroups, dtype='O')

        counts[label] = group.shape[0]
        result[label] = res
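
To see why only the first group matters, the check can be mimicked in plain Python. The sketch below is a simplified stand-in for the loop above, not pandas code: aggregate_like_old_groupby and make_list are hypothetical names, and the groups are plain Python lists rather than Series.

```python
def aggregate_like_old_groupby(groups, func):
    """Hypothetical stand-in for the quoted loop; not part of pandas."""
    result = None
    for label, group in enumerate(groups):
        res = func(group)
        if result is None:
            # This branch runs for the first group only, so only the
            # first group's result is ever tested for being a list.
            if isinstance(res, list):
                raise ValueError('Function does not reduce')
            result = [None] * len(groups)
        result[label] = res
    return result

def make_list(values):
    # List for several observations, bare value for a single one.
    return values if len(values) > 1 else values[0]

# First group has one observation: the later list slips through.
print(aggregate_like_old_groupby([[25.564], [7.76, 25.564]], make_list))

# First group has two observations: the check fires.
try:
    aggregate_like_old_groupby([[7.76, 25.564], [25.564]], make_list)
except ValueError as exc:
    print(exc)
```

The same aggregator succeeds or fails depending purely on which group happens to come first, which matches the behaviour described in this answer.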

Notice the checks if result is None and isinstance(res, list): the list test is applied only to the result for the first group. Your options are:

  1. Fake out groupby().agg(), so it doesn't see a list for the first group, or

  2. Do the aggregation yourself, using code like that above but without the erroneous test.
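
Option 2 could look something like the sketch below. It is only one possible approach, assuming it is acceptable to loop over the groups in Python: aggregate_manually and make_list are hypothetical helper names, and the small DataFrame is a shortened version of the test data from the question.

```python
import pandas as pd

def make_list(x):
    # List for several observations, bare value for a single one.
    values = x.tolist()
    return values if len(values) > 1 else values[0]

def aggregate_manually(df, keys, column, func):
    """Hypothetical helper: apply func per group with no 'reduce' check."""
    out = {k: [] for k in keys}
    out[column] = []
    for key, group in df.groupby(keys):
        # Group keys arrive as a tuple when grouping by a list of columns;
        # wrap a scalar key so both cases are handled the same way.
        key = key if isinstance(key, tuple) else (key,)
        for name, value in zip(keys, key):
            out[name].append(value)
        out[column].append(func(group[column]))
    return pd.DataFrame(out)

DF = pd.DataFrame({
    'date': ['2013-04-02'] * 4,
    'line_code': ['401101', '401101', '401102', '401103'],
    's.m.v.': [7.760, 25.564, 25.564, 9.550],
})

DF_Agg = aggregate_manually(DF, ['date', 'line_code'], 's.m.v.', make_list)
print(DF_Agg)
```

Because the loop never inspects the type of what func returns, it does not matter whether the first group yields a list or a scalar. The trade-off is that it bypasses any optimized aggregation path that agg might otherwise use.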
