Pandas quantile failing with NaNs present

Question

Asked by tnknepp

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:

import numpy as np
import pandas as pd

# pd.np was removed from pandas, so NumPy is imported directly
index = pd.date_range('2014-01-01', periods=10, freq='D')
data = np.random.randint(0, 100, (10, 5))
data = pd.DataFrame(index=index, data=data)

data
Out[90]: 
             0   1   2   3   4
2014-01-01  33  31  82   3  26
2014-01-02  46  59   0  34  48
2014-01-03  71   2  56  67  54
2014-01-04  90  18  71  12   2
2014-01-05  71  53   5  56  65
2014-01-06  42  78  34  54  40
2014-01-07  80   5  76  12  90
2014-01-08  60  90  84  55  78
2014-01-09  33  11  66  90   8
2014-01-10  40   8  35  36  98

# test for q1 values (this works)
data.quantile(0.25)
Out[111]: 
0    40.50
1     8.75
2    34.25
3    17.50
4    29.50

# break it by inserting a row of NaNs
# (cast to float first so the integer columns can hold NaN)
data = data.astype(float)
data.iloc[-1] = float('nan')

data.quantile(0.25)
Out[115]: 
0    42
1    11
2    34
3    12
4    26

The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.

med = data.median()
q1  = data[data<med].median()
q1
Out[119]: 
0    37.5
1     8.0
2    19.5
3    12.0
4    17.0

It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).

I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.

I would like to use quantile to calculate the rolling q1/q3 values in my dataframe; however, this will not work with NaNs present. Can anyone provide a solution to this issue?

Answer 1

Answered by TomAugspurger

Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs, you're essentially left with the array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column.

Calculating np.percentile(array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]), 25) gives 42.
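
The equivalence described above can be checked directly. This is a minimal sketch, assuming a reasonably recent NumPy and pandas; the values are the first column from the question, and the names col, q1_pandas and q1_numpy are only illustrative.

```python
import numpy as np
import pandas as pd

# First column of the question's frame after the last row was set to NaN.
col = pd.Series([33., 46., 71., 90., 71., 42., 80., 60., 33., np.nan])

# pandas skips the NaN and interpolates over the nine remaining values.
q1_pandas = col.quantile(0.25)

# The same number comes out of NumPy once the NaN is dropped by hand.
q1_numpy = np.percentile(col.dropna().to_numpy(), 25)

print(q1_pandas, q1_numpy)  # both 42.0
```

So the 42 reported in the question is the first quartile of the nine non-null values, not a failure to handle the NaN.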

From the docstring:

Given a vector V of length N, the qth percentile of V is the qth ranked value in a sorted copy of V. A weighted average of the two nearest neighbors is used if the normalized ranking does not match q exactly. The same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.
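
The rule in that docstring can be worked through by hand. The sketch below assumes NumPy's default linear interpolation, where the fractional rank in the sorted data is (N - 1) * q / 100; linear_percentile is a hypothetical helper written for illustration, not part of NumPy or pandas.

```python
import math
import numpy as np

def linear_percentile(values, q):
    """Hypothetical helper: q-th percentile (0-100) by linear interpolation."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0   # fractional rank in the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo                        # weight given to the upper neighbor
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

# First column from the question, without and with the last value (40).
nine_values = [33., 46., 71., 90., 71., 42., 80., 60., 33.]
ten_values = nine_values + [40.]

print(linear_percentile(nine_values, 25))  # 42.0  (rank 2.0 lands exactly on 42)
print(linear_percentile(ten_values, 25))   # 40.5  (rank 2.25 lies between 40 and 42)
```

With nine values the rank falls exactly on a data point, so no averaging takes place. That is why the result differs from the "median of the lower half" calculation in the question, which is a different definition of the first quartile.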
