Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/25695986/

Pandas: why is pandas.Series.std() different from numpy.std()?

Tags: python, numpy, pandas, group-by, statistics

Asked by osa

Another update: resolved (see comments and my own answer).

Update: this is what I am trying to explain.

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

Answer: this is explained by Bessel's correction, that is, N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
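To make the two denominators concrete, here is a minimal sketch that computes both versions by hand with only the standard library. The variable names (`prices`, `population_std`, `sample_std`) are made up for illustration and do not come from the original post.

```python
import math

# The four prices from the example above.
prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((p - mean) ** 2 for p in prices)

# Denominator N: the population formula, which is what np.std uses by default.
population_std = math.sqrt(squared_deviations / n)

# Denominator N-1 (Bessel's correction): what pandas .std() uses by default.
sample_std = math.sqrt(squared_deviations / (n - 1))

print(population_std)  # roughly 6.26
print(sample_std)      # roughly 7.23
```

The only difference between the two results is the divisor, so the gap shrinks as the number of observations grows.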

There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):

>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Question: r.mi.groupby('restaurant_id')['price'].mean() returns the mean price for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same thing.

>>> df.groupby('restaurant_id')['price'].std()

Same thing.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

Same thing.

However, this works:

for restaurant_id, group in df.groupby('restaurant_id'):
    # np.std divides by N by default (ddof=0)
    print(restaurant_id, np.std(group['price']))

Question: is there a proper way to aggregate the dataframe so that I get a new Series with the standard deviation for each restaurant?

Answered by osa

I see. Pandas is using Bessel's correction by default -- that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
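The same `ddof` argument can be passed to the grouped aggregation, which gives one standard deviation per restaurant without an explicit loop. The following is a sketch rather than the original poster's code: it rebuilds the single-restaurant frame from the question and assumes a pandas version whose groupby `std` accepts `ddof`. The names `population_std` and `sample_std` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (one restaurant, four prices).
df = pd.DataFrame(
    {"restaurant_id": [10407, 10407, 10407, 10407], "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# ddof=0 divides by N, matching numpy's default.
population_std = df.groupby("restaurant_id")["price"].std(ddof=0)

# The pandas default (ddof=1) divides by N-1.
sample_std = df.groupby("restaurant_id")["price"].std()

print(population_std)
print(sample_std)
```

Going the other way, `np.std([7, 20, 22, 22], ddof=1)` makes numpy use the N-1 denominator, so either library can follow either convention.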