Pandas Groupby Agg Function Does Not Reduce

Question

Asked by Woody Pride

I am using an aggregation function that I have used in my work for a long time. The idea is that if the Series passed to the function has length 1 (i.e. the group has only one observation), that observation is returned. If the Series has more than one element, the observations are returned as a list.

This may seem odd to some, but this is not an XY problem. I have good reasons for wanting to do this, and they are not relevant to this question.

This is the function that I have been using:

def MakeList(x):
    """ This function is used to aggregate data that needs to be kept distinct within multi-day
        observations for later use and transformation. It makes a list of the data and if the list is of length 1
        then there is only one line/day observation in that group so the single element of the list is returned.
        If the list is longer than one then there are multiple line/day observations and the list itself is
        returned."""
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]
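
To make the intended behavior concrete, here is a minimal sketch that calls a condensed copy of the function directly on two hand-made Series, outside of any groupby. The sample values are illustrative only.

```python
import pandas as pd

def MakeList(x):
    # Condensed copy of the function above, repeated so the snippet runs on its own.
    L = x.tolist()
    return L if len(L) > 1 else L[0]

# A group with a single observation: the bare value comes back.
single = MakeList(pd.Series([25.564]))

# A group with two observations: the whole list comes back.
multiple = MakeList(pd.Series([7.76, 25.564]))

print(single)
print(multiple)
```

The first call returns a plain float and the second a two-element list, which is the mixed return type that matters once the function is handed to agg.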

Now for some reason, with the current data set I am working on I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:

import pandas as pd
DF = pd.DataFrame({'date': ['2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02',
                            '2013-04-02'],
                    'line_code':   ['401101',
                                    '401101',
                                    '401102',
                                    '401103',
                                    '401104',
                                    '401105',
                                    '401105',
                                    '401106',
                                    '401106',
                                    '401107'],
                    's.m.v.': [ 7.760,
                                25.564,
                                25.564,
                                9.550,
                                4.870,
                                7.760,
                                25.564,
                                5.282,
                                25.564,
                                5.282]})
DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

While trying to debug this, I added print statements for L and x.index, and the output was as follows:

[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')

For some reason it appears that agg is passing the Series to the function twice. As far as I know this is not normal at all, and it is presumably the reason why my function is not reducing.

For example if I write a function like this:

def test_func(x):
    print(x.index)
    return x.iloc[0]

This runs without problem and the print statements are:

DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})

Int64Index([0, 1], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([3], dtype='int64')
Int64Index([4], dtype='int64')
Int64Index([5, 6], dtype='int64')
Int64Index([7, 8], dtype='int64')
Int64Index([9], dtype='int64')

Which indicates that each group is only being passed once as a Series to the function.

Can anyone help me understand why this is failing? I have used this function successfully on many of the data sets I work with.

Thanks

Answer 1

Answered by paulo.filip3

I can't really explain why, but in my experience lists in a pandas.DataFrame don't work all that well.

I usually use tuple instead. This works:

def MakeList(x):
    T = tuple(x)
    if len(T) > 1:
        return T
    else:
        return T[0]

DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})

     date line_code           s.m.v.
0  2013-04-02    401101   (7.76, 25.564)
1  2013-04-02    401102           25.564
2  2013-04-02    401103             9.55
3  2013-04-02    401104             4.87
4  2013-04-02    401105   (7.76, 25.564)
5  2013-04-02    401106  (5.282, 25.564)
6  2013-04-02    401107            5.282
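
If later steps expect lists rather than tuples, the tuples can be converted back once the aggregation is done. This is a minimal sketch, assuming a frame shaped like the output above; it is rebuilt by hand here (with fewer rows) so the snippet is self-contained, and the helper name tuple_to_list is hypothetical.

```python
import pandas as pd

# Hand-built stand-in for part of the aggregated frame shown above.
DF_Agg = pd.DataFrame({
    'date': ['2013-04-02', '2013-04-02', '2013-04-02'],
    'line_code': ['401101', '401102', '401105'],
    's.m.v.': [(7.76, 25.564), 25.564, (7.76, 25.564)],
})

def tuple_to_list(value):
    # Convert tuples back to lists; leave single observations untouched.
    return list(value) if isinstance(value, tuple) else value

DF_Agg['s.m.v.'] = DF_Agg['s.m.v.'].map(tuple_to_list)
print(DF_Agg)
```

The conversion happens after agg has returned, so the aggregation itself never sees a list.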

Answer 2

Answered by Nik Bates-Haus

This is a misfeature in DataFrame. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:

def _aggregate_series_pure_python(self, obj, func):

    group_index, _, ngroups = self.group_info

    counts = np.zeros(ngroups, dtype=int)
    result = None

    splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

    for label, group in splitter:
        res = func(group)
        if result is None:
            if (isinstance(res, (Series, Index, np.ndarray)) or
                    isinstance(res, list)):
                raise ValueError('Function does not reduce')
            result = np.empty(ngroups, dtype='O')

        counts[label] = group.shape[0]
        result[label] = res

Notice the if result is None and isinstance(res, list) checks: the ValueError is raised only when the result for the first group is a list. Your options are:

Fake out groupby().agg(), so it doesn't see a list for the first group, or
Do the aggregation yourself, using code like that above but without the erroneous test.
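
A minimal sketch of the second option, assuming a plain loop over the groups is acceptable for your data size. It sidesteps agg() altogether rather than copying the internal pandas code; make_list is a condensed version of the function from the question, and the sample frame is a shortened stand-in for the question's data.

```python
import pandas as pd

def make_list(x):
    """Return the single value for a one-row group, else a list of the values."""
    values = x.tolist()
    return values if len(values) > 1 else values[0]

# Shortened sample in the shape of the question's data.
df = pd.DataFrame({
    'date': ['2013-04-02'] * 4,
    'line_code': ['401101', '401101', '401102', '401103'],
    's.m.v.': [7.760, 25.564, 25.564, 9.550],
})

# Iterate over the groups directly instead of calling agg(), so the
# "does not reduce" check is never reached.
rows = []
for (date, line_code), group in df.groupby(['date', 'line_code']):
    rows.append({'date': date,
                 'line_code': line_code,
                 's.m.v.': make_list(group['s.m.v.'])})

df_agg = pd.DataFrame(rows)
print(df_agg)
```

The result has one row per (date, line_code) pair, with an object column that holds either a single value or a list, which is what the original function was meant to produce.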
