Pandas: why is pandas.Series.std() different from numpy.std()?
Original source: http://stackoverflow.com/questions/25695986/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by osa
Another update: resolved (see comments and my own answer).
Update: this is what I am trying to explain.
>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575
Answer: this is explained by Bessel's correction: N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
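The two numbers above can be reproduced by hand from the two denominators. Here is a minimal sketch in plain Python; the helper name `std_dev` is made up for illustration and is not part of pandas or numpy.

```python
import math

def std_dev(values, ddof=0):
    """Standard deviation with denominator N - ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))

prices = [7, 20, 22, 22]
population_std = std_dev(prices, ddof=0)  # divide by N, the numpy default
sample_std = std_dev(prices, ddof=1)      # divide by N - 1, the pandas default

print(round(population_std, 6))  # 6.259992
print(round(sample_std, 6))      # 7.228416
```

With only four values the gap between dividing by 4 and dividing by 3 is large, which is why the two results look so different here.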
There is a related discussion here, but their suggestions do not work either.
I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):
>>> df
    restaurant_id  price
id
1           10407      7
3           10407     20
6           10407     22
13          10407     22
Question: r.mi.groupby('restaurant_id')['price'].mean() returns the mean price for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.
As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:
>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575
We can get the same (correct) values with:
>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64
(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.
>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id
10407          7.228416
What?! 7.228416 is not 6.259992.
Let's try again.
>>> df.groupby('restaurant_id').std()
Same thing.
>>> df.groupby('restaurant_id')['price'].std()
Same thing.
>>> df.groupby('restaurant_id').apply(lambda x: x.std())
Same thing.
However, this works:
for restaurant_id, group in df.groupby('restaurant_id'):
    # np.std divides by N by default, so this prints 6.2599920127744575
    print(restaurant_id, np.std(group['price']))
Question: is there a proper way to aggregate the dataframe so that I get a new series with the standard deviation for each restaurant?
Answered by osa
I see. Pandas uses Bessel's correction by default, that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,
pd.Series([7,20,22,22]).std(ddof=0) == np.std([7,20,22,22])
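Applied to the original groupby problem, the same ddof argument can be passed to the grouped std() call. A minimal sketch, assuming a reasonably recent pandas; the second restaurant (id 10408) and its prices are invented here purely to show more than one group.

```python
import numpy as np
import pandas as pd

# Restaurant 10407 is the data from the question; 10408 is made up.
df = pd.DataFrame({
    "restaurant_id": [10407, 10407, 10407, 10407, 10408, 10408, 10408],
    "price": [7, 20, 22, 22, 10, 10, 16],
})

# ddof=0 divides by N, matching numpy's default.
population_std = df.groupby("restaurant_id")["price"].std(ddof=0)
print(population_std)

# Going the other way, numpy matches the pandas default when given ddof=1.
sample_std = np.std([7, 20, 22, 22], ddof=1)
print(sample_std)
```

The result is a Series indexed by restaurant_id with one standard deviation per restaurant; for 10407 it is approximately 6.259992, the value np.std gave in the question.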

