Source: Stack Overflow question, http://stackoverflow.com/questions/20670726/. Content is licensed under CC BY-SA 4.0; if you reuse it, you must keep the same license and attribute the original authors.

Computing np.diff in Pandas after using groupby leads to unexpected result

python pandas diff

Asked by 8one6

I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case). But I noticed one weird thing along the way. Can you help me make sense of it?

Here is some data that has the right structure (code modeled on an answer here):

import pandas as pd
import numpy as np
import random
from itertools import product

random.seed(1)       # so you can play along at home
np.random.seed(2)    # ditto

# make a list of dates for a few periods
dates = pd.date_range(start='2013-10-01', periods=4).strftime('%Y-%m-%d')
# make a list of tickers
tickers = ['ticker_%d' % i for i in range(3)]
# make a list of all the possible (date, ticker) tuples
pairs = list(product(dates, tickers))
# put them in a random order
random.shuffle(pairs)
# exclude a few possible pairs
pairs = pairs[:-3]
# make some data for all of our selected (date, ticker) tuples
values = np.random.rand(len(pairs))

mydates, mytickers = zip(*pairs)
data = pd.DataFrame({'date': mydates, 'ticker': mytickers, 'value':values})

Ok, great. This gives me a frame like so:

     date        ticker      value
0    2013-10-03  ticker_2    0.435995
1    2013-10-04  ticker_2    0.025926
2    2013-10-02  ticker_1    0.549662
3    2013-10-01  ticker_0    0.435322
4    2013-10-02  ticker_2    0.420368
5    2013-10-03  ticker_0    0.330335
6    2013-10-04  ticker_1    0.204649
7    2013-10-02  ticker_0    0.619271
8    2013-10-01  ticker_2    0.299655

My goal is to add a new column to this dataframe that will contain sequential changes. The data needs to be sorted to do this, but the ordering and the differencing need to be done "ticker-wise" so that gaps in another ticker don't cause NAs for a given ticker. I want to do this without perturbing the dataframe in any other way (i.e. I do not want the resulting DataFrame to be reordered based on what was necessary to do the differencing). The following code works:

data1 = data.copy() #let's leave the original data alone for later experiments
data1.sort_values(['ticker', 'date'], inplace=True)
data1['diffs'] = data1.groupby(['ticker'])['value'].transform(lambda x: x.diff())
data1.sort_index(inplace=True)
data1

and returns:

     date        ticker      value       diffs
0    2013-10-03  ticker_2    0.435995    0.015627
1    2013-10-04  ticker_2    0.025926   -0.410069
2    2013-10-02  ticker_1    0.549662    NaN
3    2013-10-01  ticker_0    0.435322    NaN
4    2013-10-02  ticker_2    0.420368    0.120713
5    2013-10-03  ticker_0    0.330335   -0.288936
6    2013-10-04  ticker_1    0.204649   -0.345014
7    2013-10-02  ticker_0    0.619271    0.183949
8    2013-10-01  ticker_2    0.299655    NaN

So far, so good. If I replace the middle line above with the more concise code shown here, everything still works:

data2 = data.copy()
data2.sort_values(['ticker', 'date'], inplace=True)
data2['diffs'] = data2.groupby('ticker')['value'].diff()
data2.sort_index(inplace=True)
data2
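
The equivalence of the lambda form and the direct `.diff()` call can be sketched on a small frame. The `toy` data below is hypothetical (it is not the question's data), and `sort_values` assumes a reasonably recent pandas. Because both results keep the original row labels, assigning them back to the unsorted frame aligns on the index, so in this sketch no re-sorting is needed afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, deliberately out of order.
toy = pd.DataFrame({
    'date':   ['2013-10-03', '2013-10-01', '2013-10-02', '2013-10-02', '2013-10-01'],
    'ticker': ['a', 'a', 'b', 'a', 'b'],
    'value':  [3.0, 1.0, 5.0, 2.0, 4.0],
})

# Sort a copy so differences are taken in date order within each ticker.
ordered = toy.sort_values(['ticker', 'date'])
grouped = ordered.groupby('ticker')['value']

via_lambda = grouped.transform(lambda x: x.diff())
via_method = grouped.diff()

# Both Series carry the original row labels, so assignment aligns on the
# index and leaves the row order of `toy` untouched.
toy['diffs_lambda'] = via_lambda
toy['diffs_method'] = via_method

print(via_lambda.equals(via_method))
print(toy)
```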

A quick check shows that, in fact, data1 is equal to data2. However, if I do this:

data3 = data.copy()
data3.sort_values(['ticker', 'date'], inplace=True)
data3['diffs'] = data3.groupby('ticker')['value'].transform(np.diff)
data3.sort_index(inplace=True)
data3

I get a strange result:

     date        ticker     value       diffs
0    2013-10-03  ticker_2    0.435995    0
1    2013-10-04  ticker_2    0.025926   NaN
2    2013-10-02  ticker_1    0.549662   NaN
3    2013-10-01  ticker_0    0.435322   NaN
4    2013-10-02  ticker_2    0.420368   NaN
5    2013-10-03  ticker_0    0.330335    0
6    2013-10-04  ticker_1    0.204649   NaN
7    2013-10-02  ticker_0    0.619271   NaN
8    2013-10-01  ticker_2    0.299655    0

What's going on here? When you call the .diff method on a Pandas object, is it not just calling np.diff? I know there's a diff method on the DataFrame class, but I couldn't figure out how to pass that to transform without the lambda function syntax I used to make data1 work. Am I missing something? Why is the diffs column in data3 screwy? How can I call the Pandas diff method within transform without needing to write a lambda to do it?

Accepted answer by Jeff

Nice, easy-to-reproduce example!! More questions should be like this!

Just pass a lambda to transform (this is tantamount to passing a function object, e.g. np.diff (or Series.diff), directly). So this is equivalent to data1/data2:

In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(pd.Series.diff)

In [34]: data3.sort_index(inplace=True)

In [25]: data3
Out[25]: 
         date    ticker     value     diffs
0  2013-10-03  ticker_2  0.435995  0.015627
1  2013-10-04  ticker_2  0.025926 -0.410069
2  2013-10-02  ticker_1  0.549662       NaN
3  2013-10-01  ticker_0  0.435322       NaN
4  2013-10-02  ticker_2  0.420368  0.120713
5  2013-10-03  ticker_0  0.330335 -0.288936
6  2013-10-04  ticker_1  0.204649 -0.345014
7  2013-10-02  ticker_0  0.619271  0.183949
8  2013-10-01  ticker_2  0.299655       NaN

[9 rows x 4 columns]
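
As a self-contained illustration of passing the unbound method instead of a lambda, here is a minimal sketch. The `toy` frame is hypothetical, and it assumes `pd.Series.diff` is what the answer means by `Series.diff`.

```python
import pandas as pd

# Hypothetical data; rows of each ticker are already in the desired order.
toy = pd.DataFrame({
    'ticker': ['a', 'a', 'b', 'a', 'b'],
    'value':  [1.0, 3.0, 10.0, 6.0, 14.0],
})

# transform calls pd.Series.diff once per group, passing the group's Series,
# so this behaves like transform(lambda x: x.diff()).
toy['diffs'] = toy.groupby('ticker')['value'].transform(pd.Series.diff)

print(toy)
```

The first row of each ticker comes out as NaN, and the remaining rows hold the change from the previous row of the same ticker.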

I believe that np.diff doesn't follow numpy's own ufunc guidelines to process array inputs (whereby it tries various methods to coerce input and send output, e.g. __array__ on input and __array_wrap__ on output). I am not really sure why; see a bit more info here. So the bottom line is that np.diff is not dealing with the index properly and is doing its own calculation (which in this case is wrong).

Pandas has a lot of methods where they don't just call the numpy function, mainly because they handle different dtypes, handle NaNs, and in this case, handle 'special' diffs. For example, you can pass a time frequency to a datelike-index where it calculates how many n to actually diff.
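
A rough sketch of two things the pandas method does beyond a plain adjacent difference: a configurable lag via `periods`, and datetime input producing timedeltas. The values are made up for illustration, and this does not cover the frequency-based case mentioned above.

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0])

lag1 = s.diff()           # difference with the previous row; first row is NaN
lag2 = s.diff(periods=2)  # difference with the row two back; first two rows are NaN

# On datetime data, diff returns timedeltas, with NaT for the first row.
stamps = pd.Series(pd.to_datetime(['2013-10-01', '2013-10-02', '2013-10-04']))
gaps = stamps.diff()

print(lag1.tolist())
print(lag2.tolist())
print(gaps)
```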

Answer by Andy Hayden

You can see that the Series .diff() method is different from np.diff():

In [11]: data.value.diff()  # Note the NaN
Out[11]: 
0         NaN
1   -0.410069
2    0.523736
3   -0.114340
4   -0.014955
5   -0.090033
6   -0.125686
7    0.414622
8   -0.319616
Name: value, dtype: float64

In [12]: np.diff(data.value.values)  # the values array of the column
Out[12]: 
array([-0.41006867,  0.52373625, -0.11434009, -0.01495459, -0.09003298,
       -0.12568619,  0.41462233, -0.31961629])

In [13]: np.diff(data.value) # on the column (Series)
Out[13]: 
0   NaN
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8   NaN
Name: value, dtype: float64

In [14]: np.diff(data.value.index)  # er... on the index
Out[14]: Int64Index([8], dtype=int64)

In [15]: np.diff(data.value.index.values)
Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1])