pandas.DataFrame.describe() 与 numpy.percentile() NaN 处理

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must follow the same CC BY-SA license, link to the original, and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/20614536/

Date: 2020-09-13 21:27:20 · Source: igfitidea

pandas.DataFrame.describe() vs numpy.percentile() NaN handling

Tags: python-2.7, numpy, pandas, percentile

Asked by tnknepp

I noticed a difference in how pandas.DataFrame.describe() and numpy.percentile() handle NaN values. e.g.


import numpy as np
import pandas as pd

a = pd.DataFrame(np.random.rand(100000),columns=['A'])

>>> a.describe()           
              A
count  100000.000000
mean        0.499713
std         0.288722
min         0.000009
25%         0.249372
50%         0.498889
75%         0.749249
max         0.999991

>>> np.percentile(a,[25,50,75])
[0.24937217017643742, 0.49888913303316823, 0.74924862428575034] # Same as a.describe()

# Add in NaN values
a.iloc[1:99999:3] = np.nan  # original used a.ix and pd.np.NaN, both removed in modern pandas

>>> a.describe()
                  A
count  66667.000000
mean       0.499698
std        0.288825
min        0.000031
25%        0.249285
50%        0.500110
75%        0.750201
max        0.999991

>>> np.percentile(a,[25,50,75])
[0.37341740173545901, 0.75020053461424419, nan] # Not the same as a.describe()

# Remove NaN's
b = a[pd.notnull(a.A)]

>>> np.percentile(b,[25,50,75])
[0.2492848255776256, 0.50010992119477615, 0.75020053461424419] # Now in agreement with describe()

Pandas ignores NaN values in percentile calculations, while numpy does not. Is there any compelling reason to include NaNs in percentile calculations? It seems pandas handles this correctly, so I wonder why numpy would not make a similar implementation.

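As it turns out, NumPy later gained `np.nanpercentile` (added in NumPy 1.9, after this question was asked), which ignores NaNs the way pandas does. A minimal sketch of the difference:

```python
import numpy as np

a = np.array([1.0, 2.0, np.nan, 4.0, 5.0])

# np.percentile propagates NaN into the result
print(np.percentile(a, 50))     # nan (recent NumPy also emits a RuntimeWarning)

# np.nanpercentile drops NaNs first, matching pandas' behavior
print(np.nanpercentile(a, 50))  # 3.0 -- the median of [1, 2, 4, 5]
```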

Begin Edit


Per Jeff's comment, this becomes an issue when resampling data. If I have a time series that contains NaN values and want to resample it to percentiles (per this post), then


upper = df.resample('1A',how=lambda x: np.percentile(x,q=75))

will include NaN values in the calculation (as numpy does). To avoid this, you must instead write


upper = tmp.resample('1A',how=lambda x: np.percentile(x[pd.notnull(x.sample_value)],q=75))

Perhaps a numpy feature request is in order. Personally, I do not see any reason to include NaNs in percentile calculations. In my opinion, describe() and np.percentile should return exactly the same values (I think this is the expected behavior), but the fact that they do not is easily missed (it is not mentioned in the documentation for np.percentile), which can skew the stats. That is my concern.


End Edit


Answered by DSM

For your edited use case, I think I'd stay in pandas and use Series.quantile instead of np.percentile:


>>> df = pd.DataFrame(np.random.rand(100000),columns=['A'], 
...                   index=pd.date_range("Jan 1 2013", freq="H", periods=100000))
>>> df.iloc[1:99999:3] = np.nan
>>> 
>>> upper_np = df.resample('1A',how=lambda x: np.percentile(x,q=75))
>>> upper_np.describe()
        A
count   0
mean  NaN
std   NaN
min   NaN
25%   NaN
50%   NaN
75%   NaN
max   NaN

[8 rows x 1 columns]
>>> upper_pd = df.resample('1A',how=lambda x: x.quantile(0.75))
>>> upper_pd.describe()
               A
count  12.000000
mean    0.745648
std     0.004889
min     0.735160
25%     0.744723
50%     0.747492
75%     0.748965
max     0.750341

[8 rows x 1 columns]
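A small sketch (not from the original answer) to convince yourself that Series.quantile and NumPy's NaN-aware variant agree on data containing NaNs:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

# Both skip the NaN and linearly interpolate over [1, 2, 4, 5]
print(s.quantile(0.75))         # 4.25
print(np.nanpercentile(s, 75))  # 4.25
```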