为什么在 Pandas 数据框中使用 Z-score 进行归一化会生成 NaN 列？

Question

提问by steve

I use Z-score from scipyto normalize my dataset as the following:

我使用 Z-score fromscipy来规范化我的数据集，如下所示：

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore

df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers

print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())

However, I get some columns changed to be NaN, particularly mta_taxand trip_typeas you see below, yet they were numerical before applying Z-score normalization. Is this a bug in my code or Z-score can generate NaN?

不过，我得到了一些列改为NaN，特别是mta_tax和trip_type你见下文，但他们采用的Z分数标准化前的数值。这是我的代码中的错误还是 Z-score 可以生成NaN？

Before Normalization:

标准化前：

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean   1.794324e+00  1.000000e+00  1.106734e+02  1.285285e+02   
std    4.041947e-01  4.353414e-04  7.541486e+01  7.729142e+01   
min    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
25%    2.000000e+00  1.000000e+00  4.900000e+01  6.100000e+01   
50%    2.000000e+00  1.000000e+00  8.200000e+01  1.290000e+02   
75%    2.000000e+00  1.000000e+00  1.660000e+02  1.930000e+02   
max    2.000000e+00  2.000000e+00  2.650000e+02  2.650000e+02   

       passenger_count  trip_distance   fare_amount         extra     mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07  10552857.0   
mean      1.140647e+00   2.399851e+00  1.082419e+01  3.607218e-01         0.5   
std       4.436568e-01   2.014673e+00  6.464638e+00  3.797668e-01         0.0   
min       0.000000e+00   0.000000e+00  0.000000e+00 -6.700000e-01         0.5   
25%       1.000000e+00   1.000000e+00  6.000000e+00  0.000000e+00         0.5   
50%       1.000000e+00   1.700000e+00  9.000000e+00  5.000000e-01         0.5   
75%       1.000000e+00   3.120000e+00  1.350000e+01  5.000000e-01         0.5   
max       4.000000e+00   1.117000e+01  4.100000e+01  1.000000e+00         0.5   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07           1.055286e+07  1.055286e+07   
mean   1.028691e+00  5.512108e-02           3.000000e-01  1.312541e+01   
std    1.510206e+00  5.524008e-01           4.110357e-11  7.370554e+00   
min    0.000000e+00  0.000000e+00           3.000000e-01  0.000000e+00   
25%    0.000000e+00  0.000000e+00           3.000000e-01  7.800000e+00   
50%    0.000000e+00  0.000000e+00           3.000000e-01  1.080000e+01   
75%    1.860000e+00  0.000000e+00           3.000000e-01  1.630000e+01   
max    7.660000e+00  8.000000e+00           3.000000e-01  4.877000e+01   

       payment_type   trip_type  
count  1.055286e+07  10552857.0  
mean   1.501672e+00         1.0  
std    5.061254e-01         0.0  
min    1.000000e+00         1.0  
25%    1.000000e+00         1.0  
50%    1.000000e+00         1.0  
75%    2.000000e+00         1.0  
max    3.000000e+00         1.0

After Normalization:

归一化后：

           VendorID    RatecodeID  PULocationID  DOLocationID  \
count  1.055286e+07  1.055286e+07  1.055286e+07  1.055286e+07   
mean  -1.235870e-12  1.006184e-13 -3.819625e-14 -1.004818e-14   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00   
25%    5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01   
50%    5.088537e-01 -4.353414e-04 -3.802090e-01  6.100220e-03   
75%    5.088537e-01 -4.353414e-04  7.336298e-01  8.341353e-01   
max    5.088537e-01  2.297048e+03  2.046369e+00  1.765675e+00   

       passenger_count  trip_distance   fare_amount         extra  mta_tax  \
count     1.055286e+07   1.055286e+07  1.055286e+07  1.055286e+07      0.0   
mean     -3.942620e-14  -3.206434e-14 -4.744100e-13 -1.042732e-12      NaN   
std       1.000000e+00   1.000000e+00  1.000000e+00  1.000000e+00      NaN   
min      -2.571013e+00  -1.191187e+00 -1.674370e+00 -2.714092e+00      NaN   
25%      -3.170185e-01  -6.948283e-01 -7.462435e-01 -9.498508e-01      NaN   
50%      -3.170185e-01  -3.473773e-01 -2.821804e-01  3.667467e-01      NaN   
75%      -3.170185e-01   3.574519e-01  4.139144e-01  3.667467e-01      NaN   
max       6.444966e+00   4.353138e+00  4.667827e+00  1.683344e+00      NaN   

         tip_amount  tolls_amount  improvement_surcharge  total_amount  \
count  1.055286e+07  1.055286e+07             10552857.0  1.055286e+07   
mean   3.152945e-13 -2.877092e-14                   -1.0  2.081611e-14   
std    1.000000e+00  1.000000e+00                    0.0  1.000000e+00   
min   -6.811593e-01 -9.978459e-02                   -1.0 -1.780791e+00   
25%   -6.811593e-01 -9.978459e-02                   -1.0 -7.225258e-01   
50%   -6.811593e-01 -9.978459e-02                   -1.0 -3.155007e-01   
75%    5.504607e-01 -9.978459e-02                   -1.0  4.307119e-01   
max    4.390996e+00  1.438246e+01                   -1.0  4.836080e+00   

       payment_type  trip_type  
count  1.055286e+07        0.0  
mean   1.387184e-12        NaN  
std    1.000000e+00        NaN  
min   -9.912012e-01        NaN  
25%   -9.912012e-01        NaN  
50%   -9.912012e-01        NaN  
75%    9.845937e-01        NaN  
max    2.960389e+00        NaN

Thank you

谢谢

Answer 1

回答by coffeinjunky

To follow up on the comments, have a look at the following:

要跟进评论，请查看以下内容：

df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })

Note how 'b' and 'd' are constant, which implies a standard deviation of 0. Applying the zscore-function means subtracting the mean and dividing by the standard error. If you divide a number by 0, the result is not defined, and Pandas will show NaN.

请注意“b”和“d”如何是常数，这意味着标准偏差为 0。应用zscore- 函数意味着减去平均值并除以标准误差。如果将数字除以 0，则结果未定义，Pandas 将显示 NaN。

df.apply(zscore)
Out[8]: 
          a   b         c   d
0 -1.224745 NaN -1.224745 NaN
1  0.000000 NaN  0.000000 NaN
2  1.224745 NaN  1.224745 NaN

You either need to apply the zscorefunction to selected columns, or change the function so that it omits constant columns. To apply the function only to selected columns, you can do the following:

您要么需要将该zscore函数应用于选定的列，要么更改该函数以使其省略常量列。要将函数仅应用于选定的列，您可以执行以下操作：

df[['a','c']] = df[['a','c']].apply(zscore)

df
Out[9]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8

To let the function check if the column is constant, let's use a lambdafunction that returns the column unchanged if the column has a standard deviation of 0, or the standardized column otherwise.

为了让函数检查列是否为常量，让我们使用一个lambda函数，如果列的标准偏差为 0，则返回列不变，否则返回标准化列。

df.apply(lambda x: x if np.std(x) == 0 else zscore(x))
Out[15]: 
          a  b         c  d
0 -1.224745  2 -1.224745  8
1  0.000000  2  0.000000  8
2  1.224745  2  1.224745  8

为什么在 Pandas 数据框中使用 Z-score 进行归一化会生成 NaN 列？

提问by steve

Before Normalization:

标准化前：

After Normalization:

归一化后：

回答by coffeinjunky

相关推荐

最近更新

标签

为什么在 Pandas 数据框中使用 Z-score 进行归一化会生成 NaN 列？

提问by steve

Before Normalization:

标准化前：

After Normalization:

归一化后：

回答by coffeinjunky

相关推荐

从 Pandas 数据框中绘制堆积条形图

pandas 从具有不同长度的列表生成数据帧

pandas XML 到 CSV Python

pandas 熊猫读取带有部分通配符的csv文件

相关推荐

最近更新

标签