为什么在 Pandas 数据框中使用 Z-score 进行归一化会生成 NaN 列?
声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow
原文地址: http://stackoverflow.com/questions/50194821/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me):
StackOverFlow
Why normalization using Z-score in Pandas dataframe generates NaN columns?
提问by steve
I use Z-score from scipy
to normalize my dataset as the following:
我使用 Z-score fromscipy
来规范化我的数据集,如下所示:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import zscore
df = pd.DataFrame(pd.read_csv('dataset.csv', sep=','))
df = df.dropna(how='any') # drop nan entries
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] # remove outliers
print(df.describe())
df = df.apply(zscore) # Normalization
print(df.describe())
However, I get some columns changed to be NaN
, particularly mta_tax
and trip_type
as you see below, yet they were numerical before applying Z-score normalization. Is this a bug in my code or Z-score can generate NaN
?
不过,我得到了一些列改为NaN
,特别是mta_tax
和trip_type
你见下文,但他们采用的Z分数标准化前的数值。这是我的代码中的错误还是 Z-score 可以生成NaN
?
Before Normalization:
标准化前:
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.794324e+00 1.000000e+00 1.106734e+02 1.285285e+02
std 4.041947e-01 4.353414e-04 7.541486e+01 7.729142e+01
min 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
25% 2.000000e+00 1.000000e+00 4.900000e+01 6.100000e+01
50% 2.000000e+00 1.000000e+00 8.200000e+01 1.290000e+02
75% 2.000000e+00 1.000000e+00 1.660000e+02 1.930000e+02
max 2.000000e+00 2.000000e+00 2.650000e+02 2.650000e+02
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 10552857.0
mean 1.140647e+00 2.399851e+00 1.082419e+01 3.607218e-01 0.5
std 4.436568e-01 2.014673e+00 6.464638e+00 3.797668e-01 0.0
min 0.000000e+00 0.000000e+00 0.000000e+00 -6.700000e-01 0.5
25% 1.000000e+00 1.000000e+00 6.000000e+00 0.000000e+00 0.5
50% 1.000000e+00 1.700000e+00 9.000000e+00 5.000000e-01 0.5
75% 1.000000e+00 3.120000e+00 1.350000e+01 5.000000e-01 0.5
max 4.000000e+00 1.117000e+01 4.100000e+01 1.000000e+00 0.5
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean 1.028691e+00 5.512108e-02 3.000000e-01 1.312541e+01
std 1.510206e+00 5.524008e-01 4.110357e-11 7.370554e+00
min 0.000000e+00 0.000000e+00 3.000000e-01 0.000000e+00
25% 0.000000e+00 0.000000e+00 3.000000e-01 7.800000e+00
50% 0.000000e+00 0.000000e+00 3.000000e-01 1.080000e+01
75% 1.860000e+00 0.000000e+00 3.000000e-01 1.630000e+01
max 7.660000e+00 8.000000e+00 3.000000e-01 4.877000e+01
payment_type trip_type
count 1.055286e+07 10552857.0
mean 1.501672e+00 1.0
std 5.061254e-01 0.0
min 1.000000e+00 1.0
25% 1.000000e+00 1.0
50% 1.000000e+00 1.0
75% 2.000000e+00 1.0
max 3.000000e+00 1.0
After Normalization:
归一化后:
VendorID RatecodeID PULocationID DOLocationID \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07
mean -1.235870e-12 1.006184e-13 -3.819625e-14 -1.004818e-14
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.965201e+00 -4.353414e-04 -1.454268e+00 -1.649970e+00
25% 5.088537e-01 -4.353414e-04 -8.177886e-01 -8.736870e-01
50% 5.088537e-01 -4.353414e-04 -3.802090e-01 6.100220e-03
75% 5.088537e-01 -4.353414e-04 7.336298e-01 8.341353e-01
max 5.088537e-01 2.297048e+03 2.046369e+00 1.765675e+00
passenger_count trip_distance fare_amount extra mta_tax \
count 1.055286e+07 1.055286e+07 1.055286e+07 1.055286e+07 0.0
mean -3.942620e-14 -3.206434e-14 -4.744100e-13 -1.042732e-12 NaN
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 NaN
min -2.571013e+00 -1.191187e+00 -1.674370e+00 -2.714092e+00 NaN
25% -3.170185e-01 -6.948283e-01 -7.462435e-01 -9.498508e-01 NaN
50% -3.170185e-01 -3.473773e-01 -2.821804e-01 3.667467e-01 NaN
75% -3.170185e-01 3.574519e-01 4.139144e-01 3.667467e-01 NaN
max 6.444966e+00 4.353138e+00 4.667827e+00 1.683344e+00 NaN
tip_amount tolls_amount improvement_surcharge total_amount \
count 1.055286e+07 1.055286e+07 10552857.0 1.055286e+07
mean 3.152945e-13 -2.877092e-14 -1.0 2.081611e-14
std 1.000000e+00 1.000000e+00 0.0 1.000000e+00
min -6.811593e-01 -9.978459e-02 -1.0 -1.780791e+00
25% -6.811593e-01 -9.978459e-02 -1.0 -7.225258e-01
50% -6.811593e-01 -9.978459e-02 -1.0 -3.155007e-01
75% 5.504607e-01 -9.978459e-02 -1.0 4.307119e-01
max 4.390996e+00 1.438246e+01 -1.0 4.836080e+00
payment_type trip_type
count 1.055286e+07 0.0
mean 1.387184e-12 NaN
std 1.000000e+00 NaN
min -9.912012e-01 NaN
25% -9.912012e-01 NaN
50% -9.912012e-01 NaN
75% 9.845937e-01 NaN
max 2.960389e+00 NaN
Thank you
谢谢
回答by coffeinjunky
To follow up on the comments, have a look at the following:
要跟进评论,请查看以下内容:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,2,2], 'c': [5,6,7], 'd':[8,8,8] })
Note how 'b' and 'd' are constant, which implies a standard deviation of 0. Applying the zscore
-function means subtracting the mean and dividing by the standard error. If you divide a number by 0, the result is not defined, and Pandas will show NaN.
请注意“b”和“d”如何是常数,这意味着标准偏差为 0。应用zscore
- 函数意味着减去平均值并除以标准误差。如果将数字除以 0,则结果未定义,Pandas 将显示 NaN。
df.apply(zscore)
Out[8]:
a b c d
0 -1.224745 NaN -1.224745 NaN
1 0.000000 NaN 0.000000 NaN
2 1.224745 NaN 1.224745 NaN
You either need to apply the zscore
function to selected columns, or change the function so that it omits constant columns. To apply the function only to selected columns, you can do the following:
您要么需要将该zscore
函数应用于选定的列,要么更改该函数以使其省略常量列。要将函数仅应用于选定的列,您可以执行以下操作:
df[['a','c']] = df[['a','c']].apply(zscore)
df
Out[9]:
a b c d
0 -1.224745 2 -1.224745 8
1 0.000000 2 0.000000 8
2 1.224745 2 1.224745 8
To let the function check if the column is constant, let's use a lambda
function that returns the column unchanged if the column has a standard deviation of 0, or the standardized column otherwise.
为了让函数检查列是否为常量,让我们使用一个lambda
函数,如果列的标准偏差为 0,则返回列不变,否则返回标准化列。
df.apply(lambda x: x if np.std(x) == 0 else zscore(x))
Out[15]:
a b c d
0 -1.224745 2 -1.224745 8
1 0.000000 2 0.000000 8
2 1.224745 2 1.224745 8