pandas.DataFrame.describe() vs numpy.percentile() NaN handling
Note: this page is based on a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. If you reuse or share it, you must do so under the same license, link the original, and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/20614536/
Asked by tnknepp
I noticed a difference in how pandas.DataFrame.describe() and numpy.percentile() handle NaN values. e.g.
import numpy as np
import pandas as pd
a = pd.DataFrame(np.random.rand(100000),columns=['A'])
>>> a.describe()
A
count 100000.000000
mean 0.499713
std 0.288722
min 0.000009
25% 0.249372
50% 0.498889
75% 0.749249
max 0.999991
>>> np.percentile(a,[25,50,75])
[0.24937217017643742, 0.49888913303316823, 0.74924862428575034] # Same as a.describe()
# Add in NaN values
a.iloc[1:99999:3] = np.nan  # .ix and the pd.np alias have been removed from modern pandas
>>> a.describe()
A
count 66667.000000
mean 0.499698
std 0.288825
min 0.000031
25% 0.249285
50% 0.500110
75% 0.750201
max 0.999991
>>> np.percentile(a,[25,50,75])
[0.37341740173545901, 0.75020053461424419, nan] # Not the same as a.describe()
# Remove NaN's
b = a[pd.notnull(a.A)]
>>> np.percentile(b,[25,50,75])
[0.2492848255776256, 0.50010992119477615, 0.75020053461424419] # Now in agreement with describe()
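The discrepancy is easy to reproduce deterministically on a tiny hand-built array (a sketch; the values are chosen so the quartiles come out exact under the default linear interpolation):

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0, np.nan]
s = pd.Series(data)

# pandas drops the NaN first, so the quantiles are taken over [1, 2, 3, 4]
print(s.quantile([0.25, 0.5, 0.75]).tolist())  # [1.75, 2.5, 3.25]

# numpy propagates the NaN instead: every requested percentile comes back nan
# (recent NumPy versions also emit a RuntimeWarning here)
print(np.percentile(data, [25, 50, 75]))
```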
Pandas ignores NaN values in percentile calculations, while numpy does not. Is there any compelling reason to include NaNs in a percentile calculation? It seems pandas handles this correctly, so I wonder why numpy does not offer a similar implementation.
Begin Edit
per Jeff's comment, this becomes an issue when resampling data. If I have a time series that contains NaN values and want to resample to percentiles (per this post)
upper = df.resample('1A').apply(lambda x: np.percentile(x, q=75))  # the old how= keyword has since been removed from resample
will include NaN values in calculation (as numpy does). To avoid this, you must instead put
upper = df.resample('1A').apply(lambda x: np.percentile(x[pd.notnull(x)], q=75))
Perhaps a numpy feature request is in order. Personally, I do not see any reason to include NaNs in percentile calculations. In my opinion, DataFrame.describe() and np.percentile should return exactly the same values (I think this is the expected behavior), but the fact that they do not is easy to miss (the np.percentile documentation does not mention it), and that can skew the stats. That is my concern.
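As it turns out, NumPy later added exactly this: np.nanpercentile (NumPy 1.9+) ignores NaNs the same way describe() does, so the filtering lambda is unnecessary on reasonably recent versions. A quick sketch:

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan, 3.0, 4.0])

# np.percentile propagates the NaN; np.nanpercentile drops it before computing,
# matching what pandas' describe()/quantile() report
print(np.nanpercentile(data, [25, 50, 75]))  # -> 1.75, 2.5, 3.25
```

On current pandas the resample then becomes something like `df.resample('1A').apply(lambda x: np.nanpercentile(x, q=75))`.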
End Edit
Answered by DSM
For your edited use case, I think I'd stay in pandas and use Series.quantile instead of np.percentile:
>>> df = pd.DataFrame(np.random.rand(100000),columns=['A'],
... index=pd.date_range("Jan 1 2013", freq="H", periods=100000))
>>> df.iloc[1:99999:3] = np.nan
>>>
>>> upper_np = df.resample('1A').apply(lambda x: np.percentile(x, q=75))
>>> upper_np.describe()
A
count 0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
[8 rows x 1 columns]
>>> upper_pd = df.resample('1A').apply(lambda x: x.quantile(0.75))
>>> upper_pd.describe()
A
count 12.000000
mean 0.745648
std 0.004889
min 0.735160
25% 0.744723
50% 0.747492
75% 0.748965
max 0.750341
[8 rows x 1 columns]
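The same pattern works on current pandas, where the Resampler exposes .quantile directly and it skips NaNs per bin just like Series.quantile. A sketch with made-up hourly data (the index, frequencies, and values here are illustrative only):

```python
import numpy as np
import pandas as pd

# two days of hourly data with a NaN punched into every third slot
idx = pd.date_range("2013-01-01", periods=48, freq="h")
s = pd.Series(np.arange(48, dtype=float), index=idx)
s.iloc[::3] = np.nan

# NaNs are excluded per daily bin before the quantile is taken,
# just as DataFrame.describe() would do
upper = s.resample("D").quantile(0.75)
print(upper.tolist())  # [17.5, 41.5]
```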

