为什么在 Pandas 数据框中使用 Z-score 进行归一化会生成 NaN 列?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/50194821/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-09-14 05:32:04  来源:igfitidea点击:

Why normalization using Z-score in Pandas dataframe generates NaN columns?

pythonpython-3.xpandasstatistics

提问by steve

I use Z-score from scipyto normalize my dataset as the following:

我使用 Z-score fromscipy来规范化我的数据集,如下所示:

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore

df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers

print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())

However, I get some columns changed to be NaN, particularly mta_taxand trip_typeas you see below, yet they were numerical before applying Z-score normalization. Is this a bug in my code or Z-score can generate NaN?

不过,我得到了一些列改为NaN,特别是mta_taxtrip_type你见下文,但他们采用的Z分数标准化前的数值。这是我的代码中的错误还是 Z-score 可以生成NaN

Before Normalization:

标准化前:

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean   1.794324e+00  1.000000e+00  1.106734e+02  1.285285e+02   
std    4.041947e-01  4.353414e-04  7.541486e+01  7.729142e+01   
min    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
25%    2.000000e+00  1.000000e+00  4.900000e+01  6.100000e+01   
50%    2.000000e+00  1.000000e+00  8.200000e+01  1.290000e+02   
75%    2.000000e+00  1.000000e+00  1.660000e+02  1.930000e+02   
max    2.000000e+00  2.000000e+00  2.650000e+02  2.650000e+02   

       passenger_count  trip_distance   fare_amount         extra     mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07  10552857.0   
mean      1.140647e+00   2.399851e+00  1.082419e+01  3.607218e-01         0.5   
std       4.436568e-01   2.014673e+00  6.464638e+00  3.797668e-01         0.0   
min       0.000000e+00   0.000000e+00  0.000000e+00 -6.700000e-01         0.5   
25%       1.000000e+00   1.000000e+00  6.000000e+00  0.000000e+00         0.5   
50%       1.000000e+00   1.700000e+00  9.000000e+00  5.000000e-01         0.5   
75%       1.000000e+00   3.120000e+00  1.350000e+01  5.000000e-01         0.5   
max       4.000000e+00   1.117000e+01  4.100000e+01  1.000000e+00         0.5   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07           1.055286e+07  1.055286e+07   
mean   1.028691e+00  5.512108e-02           3.000000e-01  1.312541e+01   
std    1.510206e+00  5.524008e-01           4.110357e-11  7.370554e+00   
min    0.000000e+00  0.000000e+00           3.000000e-01  0.000000e+00   
25%    0.000000e+00  0.000000e+00           3.000000e-01  7.800000e+00   
50%    0.000000e+00  0.000000e+00           3.000000e-01  1.080000e+01   
75%    1.860000e+00  0.000000e+00           3.000000e-01  1.630000e+01   
max    7.660000e+00  8.000000e+00           3.000000e-01  4.877000e+01   

       payment_type   trip_type  
count  1.055286e+07  10552857.0  
mean   1.501672e+00         1.0  
std    5.061254e-01         0.0  
min    1.000000e+00         1.0  
25%    1.000000e+00         1.0  
50%    1.000000e+00         1.0  
75%    2.000000e+00         1.0  
max    3.000000e+00         1.0 

After Normalization:

归一化后:

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean  -1.235870e-12  1.006184e-13 -3.819625e-14 -1.004818e-14   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00   
25%    5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01   
50%    5.088537e-01 -4.353414e-04 -3.802090e-01  6.100220e-03   
75%    5.088537e-01 -4.353414e-04  7.336298e-01  8.341353e-01   
max    5.088537e-01  2.297048e+03  2.046369e+00  1.765675e+00   

       passenger_count  trip_distance   fare_amount         extra  mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07      0.0   
mean     -3.942620e-14  -3.206434e-14 -4.744100e-13 -1.042732e-12      NaN   
std       1.000000e+00   1.000000e+00  1.000000e+00  1.000000e+00      NaN   
min      -2.571013e+00  -1.191187e+00 -1.674370e+00 -2.714092e+00      NaN   
25%      -3.170185e-01  -6.948283e-01 -7.462435e-01 -9.498508e-01      NaN   
50%      -3.170185e-01  -3.473773e-01 -2.821804e-01  3.667467e-01      NaN   
75%      -3.170185e-01   3.574519e-01  4.139144e-01  3.667467e-01      NaN   
max       6.444966e+00   4.353138e+00  4.667827e+00  1.683344e+00      NaN   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07             10552857.0  1.055286e+07   
mean   3.152945e-13 -2.877092e-14                   -1.0  2.081611e-14   
std    1.000000e+00  1.000000e+00                    0.0  1.000000e+00   
min   -6.811593e-01 -9.978459e-02                   -1.0 -1.780791e+00   
25%   -6.811593e-01 -9.978459e-02                   -1.0 -7.225258e-01   
50%   -6.811593e-01 -9.978459e-02                   -1.0 -3.155007e-01   
75%    5.504607e-01 -9.978459e-02                   -1.0  4.307119e-01   
max    4.390996e+00  1.438246e+01                   -1.0  4.836080e+00   

       payment_type  trip_type  
count  1.055286e+07        0.0  
mean   1.387184e-12        NaN  
std    1.000000e+00        NaN  
min   -9.912012e-01        NaN  
25%   -9.912012e-01        NaN  
50%   -9.912012e-01        NaN  
75%    9.845937e-01        NaN  
max    2.960389e+00        NaN 

Thank you

谢谢

回答by coffeinjunky

To follow up on the comments, have a look at the following:

要跟进评论,请查看以下内容:

df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })

Note how 'b' and 'd' are constant, which implies a standard deviation of 0. Applying the zscore-function means subtracting the mean and dividing by the standard error. If you divide a number by 0, the result is not defined, and Pandas will show NaN.

请注意“b”和“d”如何是常数,这意味着标准偏差为 0。应用zscore- 函数意味着减去平均值并除以标准误差。如果将数字除以 0,则结​​果未定义,Pandas 将显示 NaN。

df.apply(zscore)
Out[8]: 
          a   b         c   d
0 -1.224745 NaN -1.224745 NaN
1  0.000000 NaN  0.000000 NaN
2  1.224745 NaN  1.224745 NaN

You either need to apply the zscorefunction to selected columns, or change the function so that it omits constant columns. To apply the function only to selected columns, you can do the following:

您要么需要将该zscore函数应用于选定的列,要么更改该函数以使其省略常量列。要将函数仅应用于选定的列,您可以执行以下操作:

df[['a','c']] = df[['a','c']].apply(zscore)

df
Out[9]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8

To let the function check if the column is constant, let's use a lambdafunction that returns the column unchanged if the column has a standard deviation of 0, or the standardized column otherwise.

为了让函数检查列是否为常量,让我们使用一个lambda函数,如果列的标准偏差为 0,则返回列不变,否则返回标准化列。

df.apply(lambda x: x if np.std(x) == 0 else zscore(x))
Out[15]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8