How to zscore normalize a pandas column with NaNs?
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link the original, and attribute it to the original authors (not the translator): StackOverFlow
Original question: http://stackoverflow.com/questions/23451244/
Asked by
I have a pandas dataframe with a column of real values that I want to zscore normalize:
>>> a
array([   nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307,
       0.6599, 0.1065, 0.0508])
>>> df = pandas.DataFrame({"a": a})
The problem is that a single nan value makes the whole array come out as nan:
>>> from scipy.stats import zscore
>>> zscore(df["a"])
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])
What's the correct way to apply zscore (or an equivalent function not from scipy) to a column of a pandas dataframe and have it ignore the nan values? I'd like the result to have the same dimensions as the original column, with np.nan for the values that can't be normalized.
edit: maybe the best solution is to use scipy.stats.nanmean and scipy.stats.nanstd? I don't see why the degrees of freedom need to be changed for std for this purpose:
zscore = lambda x: (x - scipy.stats.nanmean(x)) / scipy.stats.nanstd(x)
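A note for later readers: scipy.stats.nanmean and scipy.stats.nanstd were deprecated and later removed from SciPy, so the same lambda is better written with the NumPy equivalents. A sketch using the array from the question (np.nanstd defaults to ddof=0, matching scipy.stats.zscore):

```python
import numpy as np

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

# NaN-aware z-score: NaN entries stay NaN, the rest are normalized
# against the mean/std of the non-NaN values only
nan_zscore = lambda x: (x - np.nanmean(x)) / np.nanstd(x)
print(nan_zscore(a))
```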
Accepted answer by Karl D.
Well the pandas versions of mean and std will handle the NaN, so you could just compute that way (to get the same as scipy zscore I think you need to use ddof=0 on std):
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
print(df)
a zscore
0 NaN NaN
1 0.0767 -1.148329
2 0.4383 0.071478
3 0.7866 1.246419
4 0.8091 1.322320
5 0.1954 -0.747912
6 0.6307 0.720512
7 0.6599 0.819014
8 0.1065 -1.047803
9 0.0508 -1.235699
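The same pattern extends to every numeric column at once with apply; a sketch (the two-column frame here is made up for illustration, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 0.0767, 0.4383, 0.7866],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# mean()/std() skip NaN by default; ddof=0 matches scipy.stats.zscore
zscores = df.apply(lambda col: (col - col.mean()) / col.std(ddof=0))
print(zscores)
```

NaNs stay NaN in the result, and each column is normalized against its own non-NaN values.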
Answer by atomh33ls
You could ignore nans using isnan.
z = np.copy(a)  # initialise array for z-scores; copy so the original a is left intact
z[~np.isnan(a)] = zscore(a[~np.isnan(a)])
pandas.DataFrame({'a':a,'Zscore':z})
Zscore a
0 NaN NaN
1 -1.148329 0.0767
2 0.071478 0.4383
3 1.246419 0.7866
4 1.322320 0.8091
5 -0.747912 0.1954
6 0.720512 0.6307
7 0.819014 0.6599
8 -1.047803 0.1065
9 -1.235699 0.0508
Answer by Toby Petty
Another alternative is to fill the NaNs in a DataFrame with the column means when calculating the z-score. The filled entries then come out with a z-score of 0, which can be masked back out using notna on the original df.
You can create a DataFrame of the same dimensions as the original df, containing the z-scores of the original df's values and NaNs in the same places, in one line with:
zscore_df = pd.DataFrame(scipy.stats.zscore(df.fillna(df.mean())), index=df.index, columns=df.columns).where(df.notna())
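A self-contained sketch of that one-liner, using the array from the question. Note that because the filled value enlarges the sample the std is computed over, the resulting z-scores differ slightly in scale from the NaN-skipping approaches above:

```python
import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})

# Fill NaNs with the column mean (so they get z-score 0),
# z-score the filled frame, then mask the filled cells back to NaN
zscore_df = pd.DataFrame(
    scipy.stats.zscore(df.fillna(df.mean())),
    index=df.index, columns=df.columns,
).where(df.notna())
print(zscore_df)
```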
Answer by Lenz
I am not sure since when this parameter has existed, because I have not been working with Python for long, but you can simply pass nan_policy='omit' and NaNs are ignored in the calculation:
import numpy as np
from scipy import stats

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
ZScore_a = stats.zscore(a, nan_policy='omit')
print(ZScore_a)
[nan -1.14832945 0.07147776 1.24641928 1.3223199 -0.74791154
0.72051236 0.81901449 -1.0478033 -1.23569949]