Python: how to zscore normalize a pandas column with NaNs?
Disclaimer: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/23451244/
how to zscore normalize pandas column with nans?
Asked by
I have a pandas dataframe with a column of real values that I want to zscore normalize:
>> a
array([ nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307,
       0.6599, 0.1065, 0.0508])
>> df = pandas.DataFrame({"a": a})
The problem is that a single nan value makes the whole array nan:
>> from scipy.stats import zscore
>> zscore(df["a"])
array([ nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
What's the correct way to apply zscore (or an equivalent function not from scipy) to a column of a pandas dataframe and have it ignore the nan values? I'd like it to be the same dimension as the original column, with np.nan for values that can't be normalized.
edit: maybe the best solution is to use scipy.stats.nanmean and scipy.stats.nanstd? I don't see why the degrees of freedom need to be changed for std for this purpose:
zscore = lambda x: (x - scipy.stats.nanmean(x)) / scipy.stats.nanstd(x)
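Note that scipy.stats.nanmean and scipy.stats.nanstd were later removed from SciPy; a minimal sketch of the same idea using NumPy's nan-aware equivalents (np.nanmean / np.nanstd, which default to ddof=0 like scipy's zscore) might look like this:

import numpy as np
import pandas as pd

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307,
              0.6599, 0.1065, 0.0508])
df = pd.DataFrame({"a": a})

# nan-aware z-score: the mean/std ignore NaN, and NaN entries stay NaN in the result
nan_zscore = lambda x: (x - np.nanmean(x)) / np.nanstd(x)
df["zscore"] = nan_zscore(df["a"])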
Accepted answer by Karl D.
Well, the pandas versions of mean and std will handle the NaN, so you could just compute it that way (to get the same result as scipy's zscore I think you need to use ddof=0 on std):
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
print(df)
        a    zscore
0     NaN       NaN
1  0.0767 -1.148329
2  0.4383  0.071478
3  0.7866  1.246419
4  0.8091  1.322320
5  0.1954 -0.747912
6  0.6307  0.720512
7  0.6599  0.819014
8  0.1065 -1.047803
9  0.0508 -1.235699
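The same expression also generalizes to every numeric column at once, since pandas' mean and std skip NaN by default; a minimal sketch (assuming df holds only numeric columns):

# z-score each column, NaN-aware, matching scipy's ddof=0 convention
zscores = df.apply(lambda col: (col - col.mean()) / col.std(ddof=0))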
Answered by atomh33ls
You could ignore nans using isnan.
z = a.copy() # initialise array for zscores (copy so the original a is left untouched)
z[~np.isnan(a)] = zscore(a[~np.isnan(a)])
pandas.DataFrame({'a':a,'Zscore':z})
     Zscore       a
0       NaN     NaN
1 -1.148329  0.0767
2  0.071478  0.4383
3  1.246419  0.7866
4  1.322320  0.8091
5 -0.747912  0.1954
6  0.720512  0.6307
7  0.819014  0.6599
8 -1.047803  0.1065
9 -1.235699  0.0508
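A self-contained sketch of the same masking idea, wrapped in a helper (the function name is just illustrative) that copies its input so the original array is untouched:

import numpy as np
from scipy.stats import zscore

def zscore_ignore_nan(x):
    # z-score the non-NaN entries only; NaN positions stay NaN
    x = np.asarray(x, dtype=float)
    z = np.full_like(x, np.nan)
    mask = ~np.isnan(x)
    z[mask] = zscore(x[mask])
    return z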
Answered by Toby Petty
Another alternative solution to this problem is to fill the NaNs in a DataFrame with the column means when calculating the z-score. This will result in the NaNs being calculated as having a z-score of 0, which can then be masked out using notna on the original df.
You can create a DataFrame of the same dimensions as the original df, containing the z-scores of the original df's values and NaNs in the same places, in one line:
zscore_df = pd.DataFrame(scipy.stats.zscore(df.fillna(df.mean())), index=df.index, columns=df.columns).where(df.notna())
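A minimal end-to-end sketch of this approach on the question's single-column frame (the imports below are assumed):

import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})

# fill NaNs with the column mean, z-score everything, then put the NaNs back
filled = df.fillna(df.mean())
zscore_df = pd.DataFrame(scipy.stats.zscore(filled),
                         index=df.index, columns=df.columns).where(df.notna())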
Answered by Lenz
I am not sure since when this parameter has existed, because I have not been working with Python for long. But you can simply use the parameter nan_policy='omit' and NaNs are ignored in the calculation:
import numpy as np
from scipy import stats

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307,
              0.6599, 0.1065, 0.0508])
ZScore_a = stats.zscore(a, nan_policy='omit')
print(ZScore_a)
[nan -1.14832945 0.07147776 1.24641928 1.3223199 -0.74791154
0.72051236 0.81901449 -1.0478033 -1.23569949]
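The same keyword also works when the data lives in a DataFrame column; a small sketch (assuming the df from the question and a SciPy version recent enough to support nan_policy, roughly 1.5+):

# df = pd.DataFrame({"a": a}) as in the question
df["zscore"] = stats.zscore(df["a"], nan_policy='omit')
print(df)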

