Pandas Standard Deviation returns NaN
Original question: http://stackoverflow.com/questions/32130954/
Source: Stack Overflow. This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors and keep it under the same license.
Asked by edesz
I have the following Pandas DataFrame in Python 2.7.
CODE:
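import numpy as np
import pandas as pd

# 10 rows of random values in six columns, A to F
df = pd.DataFrame(np.random.rand(10, 6), columns=list('ABCDEF'))
# Categories A, B, C, E, G and H occur once; D and F occur twice
df.insert(0, 'Category', ['A', 'C', 'D', 'D', 'B', 'E', 'F', 'F', 'G', 'H'])
print(df.groupby('Category').std())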
Here is df:
Category         A         B         C         D         E         F
A         0.500200  0.791039  0.498083  0.360320  0.965992  0.537068
C         0.295330  0.638823  0.133570  0.272600  0.647285  0.737942
D         0.912966  0.051288  0.055766  0.906490  0.078384  0.928538
D         0.416582  0.441684  0.605967  0.516580  0.458814  0.823692
B         0.714371  0.636975  0.153347  0.936872  0.000649  0.692558
E         0.639271  0.486151  0.860172  0.870838  0.831571  0.404813
F         0.375279  0.555228  0.020599  0.120947  0.896505  0.424233
F         0.952112  0.299520  0.150623  0.341139  0.186734  0.807519
G         0.384157  0.858391  0.278563  0.677627  0.998458  0.829019
H         0.109465  0.085861  0.440557  0.925500  0.767791  0.626924
I am looking to perform a GROUP_BY and then calculate the average and standard deviation. Some groups contain only 1 row, so the standard deviation is sometimes calculated over a single value. Dividing by N-1 is then a division by 0, which prints NaN.
Here is the output of the above code:
OUTPUT:
                 A         B         C         D         E         F
Category
A              NaN       NaN       NaN       NaN       NaN       NaN
B              NaN       NaN       NaN       NaN       NaN       NaN
C              NaN       NaN       NaN       NaN       NaN       NaN
D         0.350996  0.276052  0.389051  0.275708  0.269004  0.074137
E              NaN       NaN       NaN       NaN       NaN       NaN
F         0.407882  0.180813  0.091941  0.155699  0.501884  0.271025
G              NaN       NaN       NaN       NaN       NaN       NaN
H              NaN       NaN       NaN       NaN       NaN       NaN
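To make the division by N-1 concrete, here is a minimal sketch of the sample standard deviation written by hand. The helper name `sample_std` is hypothetical and is not part of pandas. Returning NaN whenever N minus ddof is not positive is an assumption chosen to mirror the NaN rows shown above.

```python
import math

import pandas as pd


def sample_std(values, ddof=1):
    """Hand-written standard deviation; hypothetical helper for illustration."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    denominator = n - ddof
    if denominator <= 0:
        # With one value and ddof=1 this would be 0 / 0, so report NaN
        return float("nan")
    return math.sqrt(squared_deviations / denominator)


single = [0.500200]            # category A, column A: one row only
pair = [0.912966, 0.416582]    # category D, column A: two rows

print(sample_std(single), pd.Series(single).std())
print(sample_std(pair), pd.Series(pair).std())
```

The first line should print NaN for both the hand-written helper and pandas, while the second line should agree with the value reported for category D in column A.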
For the cases where I am performing the GROUP_BY over 1 row, is there a way to skip the standard deviation and just return the value itself? For example, I am looking to get this:
DESIRED OUTPUT
                 A         B         C         D         E         F
Category
A         0.500200  0.791039  0.498083  0.360320  0.965992  0.537068
B         0.714371  0.636975  0.153347  0.936872  0.000649  0.692558
C         0.295330  0.638823  0.133570  0.272600  0.647285  0.737942
D         0.350996  0.276052  0.389051  0.275708  0.269004  0.074137
E         0.639271  0.486151  0.860172  0.870838  0.831571  0.404813
F         0.407882  0.180813  0.091941  0.155699  0.501884  0.271025
G         0.384157  0.858391  0.278563  0.677627  0.998458  0.829019
H         0.109465  0.085861  0.440557  0.925500  0.767791  0.626924
Is it possible to do this with Pandas?
EDIT: To create the exact Pandas Dataframe above, select it, copy to clipboard and then use this:
import pandas as pd

# Parse the table copied to the clipboard, using 'Category' as the index
df = pd.read_clipboard(index_col='Category')
print(df)
print(df.groupby('Category').std())
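Where a clipboard is not available (for example in a script or on a headless machine), the same table can be parsed from a string instead. This is only a sketch: the `TABLE` constant is a name introduced here for illustration, and grouping with `level="Category"` is a choice made here because the category ends up in the index.

```python
import io

import pandas as pd

# The table from the question, pasted as whitespace-separated text
TABLE = """\
Category A B C D E F
A 0.500200 0.791039 0.498083 0.360320 0.965992 0.537068
C 0.295330 0.638823 0.133570 0.272600 0.647285 0.737942
D 0.912966 0.051288 0.055766 0.906490 0.078384 0.928538
D 0.416582 0.441684 0.605967 0.516580 0.458814 0.823692
B 0.714371 0.636975 0.153347 0.936872 0.000649 0.692558
E 0.639271 0.486151 0.860172 0.870838 0.831571 0.404813
F 0.375279 0.555228 0.020599 0.120947 0.896505 0.424233
F 0.952112 0.299520 0.150623 0.341139 0.186734 0.807519
G 0.384157 0.858391 0.278563 0.677627 0.998458 0.829019
H 0.109465 0.085861 0.440557 0.925500 0.767791 0.626924
"""

# Split on runs of whitespace and use the first column as the index
df = pd.read_csv(io.StringIO(TABLE), sep=r"\s+", index_col="Category")

# 'Category' is now the index, so group by index level
std = df.groupby(level="Category").std()
print(std)
```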
Accepted answer by chrisb
You could use fillna to replace the missing values, passing in a DataFrame with the last value of each group.
In [86]: (df.groupby('Category').std()
...: .fillna(df.groupby('Category').last()))
Out[86]:
                 A         B         C         D         E         F
Category
A         0.500200  0.791039  0.498083  0.360320  0.965992  0.537068
B         0.714371  0.636975  0.153347  0.936872  0.000649  0.692558
C         0.295330  0.638823  0.133570  0.272600  0.647285  0.737942
D         0.350996  0.276052  0.389051  0.275708  0.269005  0.074137
E         0.639271  0.486151  0.860172  0.870838  0.831571  0.404813
F         0.407883  0.180813  0.091941  0.155699  0.501884  0.271024
G         0.384157  0.858391  0.278563  0.677627  0.998458  0.829019
H         0.109465  0.085861  0.440557  0.925500  0.767791  0.626924
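For readers who want to run this outside an IPython session, the following is a self-contained sketch of the same fillna idea. The seeded random generator and the variable names `grouped`, `std` and `result` are assumptions added here so the example is reproducible, which means the numbers will differ from the tables in the question.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption, not in the question)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 6)), columns=list("ABCDEF"))
df.insert(0, "Category", ["A", "C", "D", "D", "B", "E", "F", "F", "G", "H"])

grouped = df.groupby("Category")
std = grouped.std()                     # NaN for single-row groups
result = std.fillna(grouped.last())     # fill those NaN cells with the group's last value
print(result)
```

Note that fillna replaces every NaN in the result, so a NaN with any other cause, such as missing values in the input, would be filled in the same way.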
Answer by Mike T
Not exactly what was asked in the question, but if you want to avoid NaN values, calculate the population standard deviation with 0 degrees of freedom (i.e. std(ddof=0)), which divides by just N:
>>> print(df.groupby('Category').std(ddof=0))
                 A         B         C         D         E         F
Category
A         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
B         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
C         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
D         0.248192  0.195198  0.275101  0.194955  0.190215  0.052423
E         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
F         0.288417  0.127854  0.065012  0.110096  0.354885  0.191643
G         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
H         0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
Zero means no variance: the group contains either a single value or only identical values.
(Note that the default ddof for numpy.var is zero, which differs from pandas' default of 1.)
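The difference in defaults can be checked directly. The sketch below uses the two category D values of column A from the question; the variable name `values` is introduced here for illustration. It relies on numpy.std sharing numpy.var's default of ddof=0.

```python
import numpy as np
import pandas as pd

values = [0.912966, 0.416582]  # category D, column A

# NumPy divides by N unless told otherwise (ddof=0)
numpy_default = np.std(values)
# pandas divides by N-1 unless told otherwise (ddof=1)
pandas_default = pd.Series(values).std()

print(numpy_default, pandas_default)

# Passing ddof explicitly makes the two libraries agree
print(np.std(values, ddof=1), pd.Series(values).std(ddof=0))

# A single value: 0.0 with ddof=0, NaN with pandas' default ddof=1
print(np.std([0.5]), pd.Series([0.5]).std())
```

The two defaults should correspond to the category D entries for column A in the ddof=0 output and in the question's original output, respectively.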

