pandas ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing data

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse it you must follow the same CC BY-SA license and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/47767162/


ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while preprocessing Data

python, pandas, machine-learning, scikit-learn, data-science

Asked by Parthapratim Neog

I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).


I start the process by replacing the NaN values with some huge value corresponding to each column. Then I use LabelEncoder to remove the text data and convert it into numerical data. Now, when I try to run OneHotEncoder on the categorical data, I get the error below. I tried feeding the columns one by one into the OneHotEncoder constructor, but I get the same error for every column.


Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How do I solve this issue?


I am using Python 3.6 with Pandas and Sklearn for data processing.


Code


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replacing NaN values here
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
# 
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================


# Encoding categorical data, i.e. text data, since the calculations work on numbers only;
# having text like country names or purchase status will cause trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])


# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================


# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

Error


Traceback (most recent call last):

  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)

  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Answered by Parthapratim Neog

I went through the dataset again after posting the question and found another column with a NaN. I can't believe I wasted so much time on this when I could have just used the Pandas function to get the list of columns that contain NaN. Using the following code, I found that I had missed three columns. I had been searching for NaN visually when I could have just used this function. After handling these new NaNs, the code worked properly.


pd.isnull(train_data).sum() > 0

Result


portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
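
A short follow-up sketch (assuming train_data is loaded as in the question) that lists only the offending column names instead of printing the full boolean Series:

import pandas as pd

train_data = pd.read_csv('train.csv')

# Names of the columns that still contain at least one NaN
nan_cols = train_data.columns[train_data.isnull().any()].tolist()
print(nan_cols)  # here: ['sold', 'libor_rate', 'bought']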

Answered by Vivek Kumar

The error is in your other features, the ones you are treating as non-categorical.


Columns like 'hedge_value', 'indicator_code', etc. contain mixed-type data: TRUE and FALSE from the original CSV plus the 2.0 from your fillna() call. The OneHotEncoder is not able to process them.

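A quick sketch (assuming train_data is loaded as in the question) to see that mix of value types for yourself in one of the suspect columns:

# Distinct raw values and their Python types in the 'hedge_value' column
print(train_data['hedge_value'].unique())
print(train_data['hedge_value'].map(type).value_counts())
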

As mentioned in the OneHotEncoder fit() documentation:


 fit(X, y=None)

    Fit OneHotEncoder to X.
    Parameters: 

    X : array-like, shape [n_samples, n_feature]

        Input array of type int.

You can see that it requires all of X to be of a numerical type (int, but float will do).


As a workaround you can do this to encode your categorical features:


X_train_categorical = x_train[:, [0,1,2,3,6,8,14]]
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(X_train_categorical).toarray()

And then concatenate this with your non-categorical features.

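A minimal sketch of that concatenation step, assuming the categorical columns are the ones at indices [0, 1, 2, 3, 6, 8, 14] used in the question and that X_train_categorical is the encoded array from the snippet above:

import numpy as np

categorical_idx = [0, 1, 2, 3, 6, 8, 14]
non_categorical_idx = [i for i in range(x_train.shape[1]) if i not in categorical_idx]

# Keep the untouched non-categorical columns and append the one-hot block on the right
x_train_final = np.hstack([x_train[:, non_categorical_idx], X_train_categorical])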

Answered by Kohn1001

To use it in production, the best practice is to use an Imputer and then save it in a pkl file together with the model.


This is a workaround:


import numpy as np

df[df == np.inf] = np.nan           # replace infinities with NaN
df.fillna(df.mean(), inplace=True)  # then fill every NaN with the column mean

Still, it is better to use the Imputer approach described above.

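A minimal sketch of that production pattern, assuming a recent scikit-learn where SimpleImputer replaces the old Imputer, and using joblib for the pkl file (the x_train_numeric / x_new_numeric names are placeholders):

import joblib
from sklearn.impute import SimpleImputer

# Fit the imputer on the numeric training columns
imputer = SimpleImputer(strategy='mean')
x_train_imputed = imputer.fit_transform(x_train_numeric)

# Persist the fitted imputer next to the model so the exact same
# imputation is applied at prediction time
joblib.dump(imputer, 'imputer.pkl')

# Later, in production
imputer = joblib.load('imputer.pkl')
x_new_imputed = imputer.transform(x_new_numeric)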