pandas.DataFrame.describe() 与 numpy.percentile() NaN 处理

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/20614536/

Date: 2020-09-13 21:27:20 · Source: igfitidea

pandas.DataFrame.describe() vs numpy.percentile() NaN handling

Tags: python-2.7, numpy, pandas, percentile

Asked by tnknepp

I noticed a difference in how pandas.DataFrame.describe() and numpy.percentile() handle NaN values. e.g.


import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.rand(100000),columns=['A'])

>>> a.describe()           
              A
count  100000.000000
mean        0.499713
std         0.288722
min         0.000009
25%         0.249372
50%         0.498889
75%         0.749249
max         0.999991

>>> np.percentile(a,[25,50,75])
[0.24937217017643742, 0.49888913303316823, 0.74924862428575034] # Same as a.describe()

# Add in NaN values
a.iloc[1:99999:3] = np.nan  # original used a.ix and pd.np.NaN, both removed in modern pandas

>>> a.describe()
                  A
count  66667.000000
mean       0.499698
std        0.288825
min        0.000031
25%        0.249285
50%        0.500110
75%        0.750201
max        0.999991

>>> np.percentile(a,[25,50,75])
[0.37341740173545901, 0.75020053461424419, nan] # Not the same as a.describe()

# Remove NaN's
b = a[pd.notnull(a.A)]

>>> np.percentile(b,[25,50,75])
[0.2492848255776256, 0.50010992119477615, 0.75020053461424419] # Now in agreement with describe()

Pandas ignores NaN values in percentile calculations, while numpy does not. Is there any compelling reason to include NaNs in percentile calculations? It seems pandas handles this correctly, so I wonder why numpy would not make a similar implementation.

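As it turns out, NumPy later gained `np.nanpercentile` (added in NumPy 1.9, after this question was asked), which ignores NaNs the way pandas does. A minimal sketch of the difference:

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# np.percentile propagates NaN into the result
print(np.percentile(a, 50))     # nan (recent NumPy also emits a RuntimeWarning)

# np.nanpercentile drops NaNs first, matching pandas' behavior
print(np.nanpercentile(a, 50))  # 3.0 -- the median of [1, 2, 4, 5]
```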

Begin Edit


Per Jeff's comment, this becomes an issue when resampling data. If I have a time series that contains NaN values and want to resample it to percentiles (per this post), then


upper = df.resample('1A',how=lambda x: np.percentile(x,q=75))

will include NaN values in the calculation (as numpy does). To avoid this, you must instead write


upper = tmp.resample('1A',how=lambda x: np.percentile(x[pd.notnull(x.sample_value)],q=75))

Perhaps a numpy feature request is in order. Personally, I do not see any reason to include NaNs in percentile calculations. In my opinion, describe() and np.percentile should return exactly the same values (I think this is the expected behavior), but the fact that they do not is easily missed (it is not mentioned in the documentation for np.percentile), which can skew the stats. That is my concern.


End Edit


Answered by DSM

For your edited use case, I think I'd stay in pandas and use Series.quantile instead of np.percentile:


>>> df = pd.DataFrame(np.random.rand(100000),columns=['A'], 
...                   index=pd.date_range("Jan 1 2013", freq="H", periods=100000))
>>> df.iloc[1:99999:3] = np.nan
>>> 
>>> upper_np = df.resample('1A',how=lambda x: np.percentile(x,q=75))
>>> upper_np.describe()
        A
count   0
mean  NaN
std   NaN
min   NaN
25%   NaN
50%   NaN
75%   NaN
max   NaN

[8 rows x 1 columns]
>>> upper_pd = df.resample('1A',how=lambda x: x.quantile(0.75))
>>> upper_pd.describe()
               A
count  12.000000
mean    0.745648
std     0.004889
min     0.735160
25%     0.744723
50%     0.747492
75%     0.748965
max     0.750341

[8 rows x 1 columns]
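A small sketch (not from the original answer) to convince yourself that Series.quantile and NumPy's NaN-aware variant agree on data containing NaNs:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# Both skip the NaN and linearly interpolate over [1, 2, 4, 5]
print(s.quantile(0.75))         # 4.25
print(np.nanpercentile(s, 75))  # 4.25
```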