pandas ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing data

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/47767162/


ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data

python, pandas, machine-learning, scikit-learn, data-science

Asked by Parthapratim Neog

I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).


I start the process by replacing the NaN values with some huge value corresponding to each column. Then I use LabelEncoder to remove the text data and convert it into numerical data. Now, when I try to run OneHotEncoder on the categorical data, I get the error below. I tried feeding the columns one by one into the OneHotEncoder constructor, but I get the same error for every column.


Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How do I solve this issue?


I am using Python 3.6 with Pandas and Sklearn for data processing.


Code


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing NaN values here
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
# 
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================


# Encoding categorical data, i.e. text data, since the calculations work on numbers only;
# having text like country names or purchase status will cause trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])


# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================


# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

Error


Traceback (most recent call last):

  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Answered by Parthapratim Neog

I went through the dataset again after posting the question and found another column with a NaN. I can't believe I wasted so much time on this when I could have just used the Pandas function to get the list of columns that contain NaN. Using the following code, I found that I had missed three columns. I had been searching for NaN visually when I could have just used this function. After handling these new NaNs, the code worked properly.


pd.isnull(train_data).sum() > 0

Result


portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
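
A short follow-up sketch (assuming train_data is loaded as in the question) that lists only the offending column names instead of printing the full boolean Series:

import pandas as pd

train_data = pd.read_csv('train.csv')

# Names of the columns that still contain at least one NaN
nan_cols = train_data.columns[train_data.isnull().any()].tolist()
print(nan_cols)  # here: ['sold', 'libor_rate', 'bought']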

Answered by Vivek Kumar

The error is in your other features, the ones you are treating as non-categorical.


Columns like 'hedge_value', 'indicator_code', etc. contain mixed-type data: TRUE and FALSE from the original CSV plus the 2.0 from your fillna() call. The OneHotEncoder is not able to process them.

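A quick sketch (assuming train_data is loaded as in the question) to see that mix of value types for yourself in one of the suspect columns:

# Distinct raw values and their Python types in the 'hedge_value' column
print(train_data['hedge_value'].unique())
print(train_data['hedge_value'].map(type).value_counts())
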

As mentioned in the OneHotEncoder fit() documentation:


 fit(X, y=None)

    Fit OneHotEncoder to X.
    Parameters: 

    X : array-like, shape [n_samples, n_feature]

        Input array of type int.

You can see that it requires all of X to be of a numerical type (int, but float will do).


As a workaround you can do this to encode your categorical features:


X_train_categorical = x_train[:, [0,1,2,3,6,8,14]]
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(X_train_categorical).toarray()

And then concatenate this with your non-categorical features.

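A minimal sketch of that concatenation step, assuming the categorical columns are the ones at indices [0, 1, 2, 3, 6, 8, 14] used in the question and that X_train_categorical is the encoded array from the snippet above:

import numpy as np

categorical_idx = [0, 1, 2, 3, 6, 8, 14]
non_categorical_idx = [i for i in range(x_train.shape[1]) if i not in categorical_idx]

# Keep the untouched non-categorical columns and append the one-hot block on the right
x_train_final = np.hstack([x_train[:, non_categorical_idx], X_train_categorical])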

Answered by Kohn1001

To use it in production, the best practice is to use an Imputer and then save it in a pkl file together with the model.


This is a workaround:


import numpy as np

df[df == np.inf] = np.nan           # replace infinities with NaN
df.fillna(df.mean(), inplace=True)  # then fill every NaN with the column mean

Still, it is better to use the Imputer approach described above.

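A minimal sketch of that production pattern, assuming a recent scikit-learn where SimpleImputer replaces the old Imputer, and using joblib for the pkl file (the x_train_numeric / x_new_numeric names are placeholders):

import joblib
from sklearn.impute import SimpleImputer

# Fit the imputer on the numeric training columns
imputer = SimpleImputer(strategy='mean')
x_train_imputed = imputer.fit_transform(x_train_numeric)

# Persist the fitted imputer next to the model so the exact same
# imputation is applied at prediction time
joblib.dump(imputer, 'imputer.pkl')

# Later, in production
imputer = joblib.load('imputer.pkl')
x_new_imputed = imputer.transform(x_new_numeric)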