pandas standard deviation returns NaN

Question

Asked by edesz

I have the following Pandas Dataframe in Python 2.7.

CODE:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,6),columns=list('ABCDEF'))
df.insert(0,'Category',['A','C','D','D','B','E','F','F','G','H'])
print df.groupby('Category').std()

Here is df:

Category         A         B         C         D         E         F
       A  0.500200  0.791039  0.498083  0.360320  0.965992  0.537068
       C  0.295330  0.638823  0.133570  0.272600  0.647285  0.737942
       D  0.912966  0.051288  0.055766  0.906490  0.078384  0.928538
       D  0.416582  0.441684  0.605967  0.516580  0.458814  0.823692
       B  0.714371  0.636975  0.153347  0.936872  0.000649  0.692558
       E  0.639271  0.486151  0.860172  0.870838  0.831571  0.404813
       F  0.375279  0.555228  0.020599  0.120947  0.896505  0.424233
       F  0.952112  0.299520  0.150623  0.341139  0.186734  0.807519
       G  0.384157  0.858391  0.278563  0.677627  0.998458  0.829019
       H  0.109465  0.085861  0.440557  0.925500  0.767791  0.626924

I am looking to perform a GROUP_BY and then calculate the average and standard deviation. Some groups contain only 1 row, so the standard deviation for those groups divides by N-1 = 0, which prints NaN.
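
A minimal sketch of why a one-row group yields NaN, assuming a reasonably recent pandas. The Series name `single` is only illustrative and is not part of the question's code. The standard deviation divides the sum of squared deviations by N - ddof, and pandas defaults to ddof=1, so a single observation leaves a zero denominator.

```python
import math
import pandas as pd

# Hypothetical one-element Series standing in for a group with a single row.
single = pd.Series([0.500200])

sample_std = single.std()            # default ddof=1: denominator N - 1 = 0
population_std = single.std(ddof=0)  # denominator N = 1

print(math.isnan(sample_std))  # True
print(population_std)          # 0.0
```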

Here is the output of the above code:

OUTPUT:

                A         B         C         D         E         F
Category                                                            
A              NaN       NaN       NaN       NaN       NaN       NaN
B              NaN       NaN       NaN       NaN       NaN       NaN
C              NaN       NaN       NaN       NaN       NaN       NaN
D         0.350996  0.276052  0.389051  0.275708  0.269004  0.074137
E              NaN       NaN       NaN       NaN       NaN       NaN
F         0.407882  0.180813  0.091941  0.155699  0.501884  0.271025
G              NaN       NaN       NaN       NaN       NaN       NaN
H              NaN       NaN       NaN       NaN       NaN       NaN

For the cases where I am performing the GROUP_BY over 1 row, is there a way to skip the standard deviation and just return the value itself? For example, I am looking to get this:

DESIRED OUTPUT

                 A         B         C         D         E         F
Category                                                            
A         0.500200  0.791039  0.498083  0.360320  0.965992  0.537068
B         0.714371  0.636975  0.153347  0.936872  0.000649  0.692558
C         0.295330  0.638823  0.133570  0.272600  0.647285  0.737942
D         0.350996  0.276052  0.389051  0.275708  0.269004  0.074137
E         0.639271  0.486151  0.860172  0.870838  0.831571  0.404813
F         0.407882  0.180813  0.091941  0.155699  0.501884  0.271025
G         0.384157  0.858391  0.278563  0.677627  0.998458  0.829019
H         0.109465  0.085861  0.440557  0.925500  0.767791  0.626924

Is it possible to do this with Pandas?

EDIT: To create the exact Pandas Dataframe above, select it, copy to clipboard and then use this:

import pandas as pd
df = pd.read_clipboard(index_col='Category')
print df
print df.groupby('Category').std()

Answer 1

Accepted answer by chrisb

You could use fillna to replace the missing values, passing in a DataFrame with the last value of each group.

In [86]: (df.groupby('Category').std()
    ...:    .fillna(df.groupby('Category').last()))

Out[86]: 
                 A         B         C         D         E         F
Category                                                            
A         0.500200  0.791039  0.498083  0.360320  0.965992  0.537068
B         0.714371  0.636975  0.153347  0.936872  0.000649  0.692558
C         0.295330  0.638823  0.133570  0.272600  0.647285  0.737942
D         0.350996  0.276052  0.389051  0.275708  0.269005  0.074137
E         0.639271  0.486151  0.860172  0.870838  0.831571  0.404813
F         0.407883  0.180813  0.091941  0.155699  0.501884  0.271024
G         0.384157  0.858391  0.278563  0.677627  0.998458  0.829019
H         0.109465  0.085861  0.440557  0.925500  0.767791  0.626924
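
The same idea as a self-contained sketch on a small made-up frame, so the result can be checked by hand. The data and the names `grouped` and `result` are illustrative assumptions, not the question's values.

```python
import pandas as pd

# Made-up data: group 'x' has two rows, group 'y' has only one.
df = pd.DataFrame({
    'Category': ['x', 'x', 'y'],
    'A': [1.0, 3.0, 5.0],
    'B': [2.0, 2.0, 7.0],
})

grouped = df.groupby('Category')

# std() is NaN for the single-row group 'y'. fillna aligns the second frame
# on index and columns and fills only the NaN cells, so 'x' keeps its std.
result = grouped.std().fillna(grouped.last())
print(result)
```

Here group 'x' keeps its computed standard deviations (including the legitimate 0.0 in column B), while group 'y' falls back to its single row. For a one-row group, first() or mean() would presumably serve equally well as the fill source.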

Answer 2

Answer by Mike T

Not exactly what was asked in the question, but if you want to avoid NaN values, calculate the population standard deviation with zero delta degrees of freedom (i.e. std(ddof=0)), which divides by N instead of N-1:

>>> print(df.groupby('Category').std(ddof=0))
                 A         B         C         D         E         F
Category                                                            
A         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
B         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
C         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
D         0.248192  0.195198  0.275101  0.194955  0.190215  0.052423
E         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
F         0.288417  0.127854  0.065012  0.110096  0.354885  0.191643
G         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
H         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000

A zero means no variance: either the group contains only one value, or all of its values are identical.

(Note that the default ddof for numpy.var is zero, which differs from pandas' default of 1.)
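
A short sketch of that difference in defaults, using column A of group D from the question (values as printed there, so rounded to six decimals). The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = [0.912966, 0.416582]  # column A, group D, as printed in the question

pandas_std = pd.Series(values).std()        # pandas default: ddof=1
numpy_std = np.std(values)                  # NumPy default: ddof=0
numpy_sample_std = np.std(values, ddof=1)   # matches pandas once ddof is set

# Roughly 0.350996 (ddof=1) versus 0.248192 (ddof=0), as in the tables above.
print(pandas_std, numpy_std, numpy_sample_std)
```

Passing ddof explicitly on both sides is probably the safest habit when results from the two libraries are compared.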
