How to zscore normalize a pandas column with NaNs?
Note: this page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same license, link the original, and attribute it to the original authors (not the translator): StackOverFlow
Original question: http://stackoverflow.com/questions/23451244/
Asked by
I have a pandas dataframe with a column of real values that I want to zscore normalize:
>>> a
array([   nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307,
       0.6599, 0.1065, 0.0508])
>>> df = pandas.DataFrame({"a": a})
The problem is that a single nan value makes the whole array come out as nan:
>>> from scipy.stats import zscore
>>> zscore(df["a"])
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])
What's the correct way to apply zscore (or an equivalent function not from scipy) to a column of a pandas dataframe and have it ignore the nan values? I'd like the result to have the same dimensions as the original column, with np.nan for the values that can't be normalized.
edit: maybe the best solution is to use scipy.stats.nanmean and scipy.stats.nanstd? I don't see why the degrees of freedom need to be changed for std for this purpose:
zscore = lambda x: (x - scipy.stats.nanmean(x)) / scipy.stats.nanstd(x)
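A note for later readers: scipy.stats.nanmean and scipy.stats.nanstd were deprecated and later removed from SciPy, so the same lambda is better written with the NumPy equivalents. A sketch using the array from the question (np.nanstd defaults to ddof=0, matching scipy.stats.zscore):

```python
import numpy as np

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])

# NaN-aware z-score: NaN entries stay NaN, the rest are normalized
# against the mean/std of the non-NaN values only
nan_zscore = lambda x: (x - np.nanmean(x)) / np.nanstd(x)
print(nan_zscore(a))
```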
Accepted answer by Karl D.
Well the pandas versions of mean and std will handle the NaN, so you could just compute that way (to get the same as scipy zscore I think you need to use ddof=0 on std):
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
print(df)
a zscore
0 NaN NaN
1 0.0767 -1.148329
2 0.4383 0.071478
3 0.7866 1.246419
4 0.8091 1.322320
5 0.1954 -0.747912
6 0.6307 0.720512
7 0.6599 0.819014
8 0.1065 -1.047803
9 0.0508 -1.235699
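The same pattern extends to every numeric column at once with apply; a sketch (the two-column frame here is made up for illustration, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [np.nan, 0.0767, 0.4383, 0.7866],
    "b": [1.0, 2.0, np.nan, 4.0],
})

# mean()/std() skip NaN by default; ddof=0 matches scipy.stats.zscore
zscores = df.apply(lambda col: (col - col.mean()) / col.std(ddof=0))
print(zscores)
```

NaNs stay NaN in the result, and each column is normalized against its own non-NaN values.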
Answer by atomh33ls
You could ignore nans using isnan.
z = np.copy(a)  # initialise array for z-scores; copy so the original a is left intact
z[~np.isnan(a)] = zscore(a[~np.isnan(a)])
pandas.DataFrame({'a':a,'Zscore':z})
Zscore a
0 NaN NaN
1 -1.148329 0.0767
2 0.071478 0.4383
3 1.246419 0.7866
4 1.322320 0.8091
5 -0.747912 0.1954
6 0.720512 0.6307
7 0.819014 0.6599
8 -1.047803 0.1065
9 -1.235699 0.0508
Answer by Toby Petty
Another alternative is to fill the NaNs in a DataFrame with the column means when calculating the z-score. The filled entries then come out with a z-score of 0, which can be masked back out using notna on the original df.
You can create a DataFrame of the same dimensions as the original df, containing the z-scores of the original df's values and NaNs in the same places, in one line with:
zscore_df = pd.DataFrame(scipy.stats.zscore(df.fillna(df.mean())), index=df.index, columns=df.columns).where(df.notna())
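A self-contained sketch of that one-liner, using the array from the question. Note that because the filled value enlarges the sample the std is computed over, the resulting z-scores differ slightly in scale from the NaN-skipping approaches above:

```python
import numpy as np
import pandas as pd
import scipy.stats

df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})

# Fill NaNs with the column mean (so they get z-score 0),
# z-score the filled frame, then mask the filled cells back to NaN
zscore_df = pd.DataFrame(
    scipy.stats.zscore(df.fillna(df.mean())),
    index=df.index, columns=df.columns,
).where(df.notna())
print(zscore_df)
```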
Answer by Lenz
I am not sure since when this parameter has existed, because I have not been working with Python for long, but you can simply pass nan_policy='omit' and NaNs are ignored in the calculation:
import numpy as np
from scipy import stats

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
ZScore_a = stats.zscore(a, nan_policy='omit')
print(ZScore_a)
[nan -1.14832945 0.07147776 1.24641928 1.3223199 -0.74791154
0.72051236 0.81901449 -1.0478033 -1.23569949]