Pandas OneHotEncoder.fit(dataframe) returns ValueError: invalid literal for long() with base 10

Question

Asked by dukebody

I'm trying to convert a Pandas dataframe to a NumPy array to create a model with Sklearn. I'll simplify the problem here.

>>> mydf.head(10)
IdVisita
445                                  latam
446                                    NaN
447                                 grados
448                                 grados
449                                eventos
450                                eventos
451         Reescribe-medios-clases-online
454                             postgrados
455                             postgrados
456                             postgrados
Name: cat1, dtype: object

>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit(mydf)

Traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-74-f581ab15cbed> in <module>()
      2 mydf.head(10)
      3 enc = preprocessing.OneHotEncoder()
----> 4 enc.fit(mydf)

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y)
    996         self
    997         """
--> 998         self.fit_transform(X)
    999         return self
   1000 

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
   1052         """
   1053         return _transform_selected(X, self._fit_transform,
-> 1054                                    self.categorical_features, copy=True)
   1055 
   1056     def _transform(self, X):

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
    870     """
    871     if selected == "all":
--> 872         return transform(X)
    873 
    874     X = atleast2d_or_csc(X, copy=copy)

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X)
   1001     def _fit_transform(self, X):
   1002         """Assumes X contains only categorical features."""
-> 1003         X = check_arrays(X, sparse_format='dense', dtype=np.int)[0]
   1004         if np.any(X < 0):
   1005             raise ValueError("X needs to contain only non-negative integers.")

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
    279                     array = np.ascontiguousarray(array, dtype=dtype)
    280                 else:
--> 281                     array = np.asarray(array, dtype=dtype)
    282                 if not allow_nans:
    283                     _assert_all_finite(array)

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460 
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463 
    464 def asanyarray(a, dtype=None, order=None):

ValueError: invalid literal for long() with base 10: 'postgrados'

Note that IdVisita is the index here, and its values may not be consecutive.

Any clues?

Answer 1

Accepted answer by EdChum

The problem is that you are calling OneHotEncoder, whose documentation states:

The input to this transformer should be a matrix of integers

but your df has a single column 'cat1' of dtype object, which in fact holds strings.
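
The failure can be reproduced outside scikit-learn. This is a minimal sketch, assuming a small hypothetical array standing in for the 'cat1' values; it only shows that NumPy cannot cast such strings to integers, which is the conversion the traceback ends in. On Python 3 the message mentions int() rather than long().

```python
import numpy as np

# Hypothetical stand-in for the string values in the 'cat1' column.
values = np.array(["latam", "grados", "postgrados"], dtype=object)

failed = False
try:
    # The old OneHotEncoder attempted an equivalent integer cast internally.
    np.asarray(values, dtype=int)
except ValueError as exc:
    failed = True
    print("conversion failed:", exc)
```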

You should use LabelEncoder:

In [13]:

le = preprocessing.LabelEncoder()
le.fit(df.dropna().values)
le.classes_
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\sklearn\preprocessing\label.py:108: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[13]:
array(['Reescribe-medios-clases-online', 'eventos', 'grados', 'latam',
       'postgrados'], dtype=object)

Note that I had to drop the NaN row, because it introduces a mixed dtype that cannot be ordered; for example, float > str will not work.
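
To go from the integer labels to an actual one-hot matrix, the two encoders can be chained. The following is a sketch under assumptions: the sample Series is hypothetical, and the values are passed as a 1-D array so the DataConversionWarning shown above should not appear.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical sample mirroring the 'cat1' column from the question.
cat1 = pd.Series(
    ["latam", np.nan, "grados", "grados", "eventos", "postgrados"],
    name="cat1",
)

# Drop missing values first: NaN is a float and cannot be sorted with strings.
clean = cat1.dropna()

# Step 1: map each string to an integer label (1-D input).
le = LabelEncoder()
codes = le.fit_transform(clean.values)

# Step 2: one-hot encode the integer labels; the encoder expects a 2-D column.
enc = OneHotEncoder()
onehot = enc.fit_transform(codes.reshape(-1, 1)).toarray()

print(le.classes_)
print(onehot)
```

Each row of onehot has a single 1 in the column of its label, and le.inverse_transform can map the integer codes back to the original strings. Newer scikit-learn releases also accept string columns in OneHotEncoder directly, so the LabelEncoder step may be unnecessary there.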

Answer 2

Answer by elachell

A simpler approach is to use DictVectorizer, which does the conversion to integers and the one-hot encoding in the same step.

Using it as DictVectorizer(sparse=False) makes fit_transform return a dense array, which can be wrapped back into a DataFrame to keep working with Pandas.
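
A possible way to apply this is sketched below. The small DataFrame is a hypothetical stand-in for mydf, and rows with missing values are assumed to have been dropped beforehand.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical stand-in for mydf, indexed by IdVisita.
mydf = pd.DataFrame(
    {"cat1": ["latam", "grados", "grados", "eventos"]},
    index=pd.Index([445, 447, 448, 449], name="IdVisita"),
)

# DictVectorizer works on a list of {column: value} dicts, one per row.
records = mydf.to_dict(orient="records")

vec = DictVectorizer(sparse=False)
dense = vec.fit_transform(records)  # dense NumPy array, one column per category

# Wrap the array in a DataFrame, reusing the original index.
encoded = pd.DataFrame(dense, columns=vec.feature_names_, index=mydf.index)
print(encoded)
```

The column names take the form 'cat1=value', so each one-hot column can be traced back to the category it represents.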