Pandas: Why pandas.Series.std() differs from numpy.std()

Question

Asked by osa

Another update: resolved (see comments and my own answer).

Update: this is what I am trying to explain.

>>> pd.Series([7,20,22,22]).std()
7.2284161474004804
>>> np.std([7,20,22,22])
6.2599920127744575

Answer: this is explained by Bessel's correction, N-1 instead of N in the denominator of the standard deviation formula. I wish Pandas used the same convention as numpy.
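
To make the difference concrete, here is a minimal sketch that computes both versions by hand for the four prices above, using only the standard library. The variable names are illustrative and not part of either library.

```python
import math

prices = [7, 20, 22, 22]
n = len(prices)
mean = sum(prices) / n  # 17.75

# Sum of squared deviations from the mean.
sum_sq_dev = sum((p - mean) ** 2 for p in prices)  # 156.75

# Divide by N: the population formula (numpy's default, ddof=0).
population_std = math.sqrt(sum_sq_dev / n)

# Divide by N-1: Bessel's correction (pandas' default, ddof=1).
sample_std = math.sqrt(sum_sq_dev / (n - 1))

print(population_std)  # approximately 6.26, matching np.std
print(sample_std)      # approximately 7.23, matching pd.Series.std
```

The only difference between the two results is the denominator, 4 versus 3, so the gap shrinks as the number of observations grows.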

There is a related discussion here, but their suggestions do not work either.

I have data about many different restaurants. Here is my dataframe (imagine more than one restaurant, but the effect is reproduced with just one):

>>> df
    restaurant_id  price
id                      
1           10407      7
3           10407     20
6           10407     22
13          10407     22

Question: r.mi.groupby('restaurant_id')['price'].mean() returns price means for each restaurant. I want to get the standard deviations. However, r.mi.groupby('restaurant_id')['price'].std() returns wrong values.

As you can see, for simplicity I have extracted just one restaurant with four items. I want to find the standard deviation of the price. Just to make sure:

>>> np.mean([7,20,22,22])
17.75
>>> np.std([7,20,22,22])
6.2599920127744575

We can get the same (correct) values with

>>> np.mean(df)
restaurant_id    10407.00
price               17.75
dtype: float64
>>> np.std(df)
restaurant_id    0.000000
price            6.259992
dtype: float64

(Of course, disregard the mean restaurant id.) Obviously, np.std(df) is not a solution when I have more than one restaurant. So I am using groupby.

>>> df.groupby('restaurant_id').agg('std')
                  price
restaurant_id          
10407          7.228416

What?! 7.228416 is not 6.259992.

Let's try again.

>>> df.groupby('restaurant_id').std()

Same thing.

>>> df.groupby('restaurant_id')['price'].std()

Same thing.

>>> df.groupby('restaurant_id').apply(lambda x: x.std())

Same thing.

However, this works:

for restaurant_id, group in df.groupby('restaurant_id'):
    # np.std divides by N by default (ddof=0)
    print(restaurant_id, np.std(group['price']))

Question: is there a proper way to aggregate the dataframe, so I will get a new time series with the standard deviations for each restaurant?

Answer 1

Answered by osa

I see. Pandas is using Bessel's correction by default, that is, the standard deviation formula with N-1 instead of N in the denominator. As behzad.nouri has pointed out in the comments,

pd.Series([7,20,22,22]).std(ddof=0)==np.std([7,20,22,22])
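
Applied to the original groupby question, the same ddof argument can be passed to the grouped std() call. The following is a sketch that rebuilds the one-restaurant dataframe from the question (the construction code is an assumption, since the question only shows the printed frame) and also shows the reverse direction, asking numpy for the N-1 version.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (one restaurant, four prices).
df = pd.DataFrame(
    {"restaurant_id": [10407, 10407, 10407, 10407], "price": [7, 20, 22, 22]},
    index=pd.Index([1, 3, 6, 13], name="id"),
)

# ddof=0 divides by N, so each group's result agrees with np.std's default.
population_std = df.groupby("restaurant_id")["price"].std(ddof=0)
print(population_std)

# Going the other way: ddof=1 makes numpy use N-1, like the pandas default.
sample_std_numpy = np.std(df["price"].to_numpy(), ddof=1)
print(sample_std_numpy)
```

The result of the grouped call is a Series indexed by restaurant_id, which is the per-restaurant aggregate the question asks for.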
