Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license and attribute it to the original authors (not me). Original Stack Overflow question: http://stackoverflow.com/questions/27617078/

Pandas OneHotEncoder.fit(dataframe) returns ValueError: invalid literal for long() with base 10

Tags: python, numpy, pandas, scikit-learn

Asked by dukebody

I'm trying to convert a Pandas dataframe to a NumPy array to create a model with Sklearn. I'll simplify the problem here.

>>> mydf.head(10)
IdVisita
445                                  latam
446                                    NaN
447                                 grados
448                                 grados
449                                eventos
450                                eventos
451         Reescribe-medios-clases-online
454                             postgrados
455                             postgrados
456                             postgrados
Name: cat1, dtype: object

>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit(mydf)

Traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-74-f581ab15cbed> in <module>()
      2 mydf.head(10)
      3 enc = preprocessing.OneHotEncoder()
----> 4 enc.fit(mydf)

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y)
    996         self
    997         """
--> 998         self.fit_transform(X)
    999         return self
   1000 

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
   1052         """
   1053         return _transform_selected(X, self._fit_transform,
-> 1054                                    self.categorical_features, copy=True)
   1055 
   1056     def _transform(self, X):

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
    870     """
    871     if selected == "all":
--> 872         return transform(X)
    873 
    874     X = atleast2d_or_csc(X, copy=copy)

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X)
   1001     def _fit_transform(self, X):
   1002         """Assumes X contains only categorical features."""
-> 1003         X = check_arrays(X, sparse_format='dense', dtype=np.int)[0]
   1004         if np.any(X < 0):
   1005             raise ValueError("X needs to contain only non-negative integers.")

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
    279                     array = np.ascontiguousarray(array, dtype=dtype)
    280                 else:
--> 281                     array = np.asarray(array, dtype=dtype)
    282                 if not allow_nans:
    283                     _assert_all_finite(array)

/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
    460 
    461     """
--> 462     return array(a, dtype, copy=False, order=order)
    463 
    464 def asanyarray(a, dtype=None, order=None):

ValueError: invalid literal for long() with base 10: 'postgrados'

Note that IdVisita is the index here, and its numbers might not all be consecutive.

Any clues?

Accepted answer by EdChum

The error is that you are calling OneHotEncoder, whose documentation states:

The input to this transformer should be a matrix of integers

but your df has a single column, 'cat1', of dtype object, which in fact holds strings.
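
The failing step in the traceback is the conversion of the string values to an integer array. A minimal sketch of that step in isolation, assuming a few sample values taken from the question (on Python 3 the message mentions int() rather than long()):

```python
import numpy as np

# Sample values standing in for the 'cat1' column (dtype object, holding strings).
values = np.array(["latam", "grados", "postgrados"], dtype=object)

try:
    # This mirrors what the old OneHotEncoder did internally: force an integer dtype.
    np.asarray(values, dtype=int)
    message = None
except ValueError as exc:
    # Each string is passed to int(), which cannot parse a word such as 'latam'.
    message = str(exc)

print(message)
```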

You should use LabelEncoder:

In [13]:

le = preprocessing.LabelEncoder()
le.fit(df.dropna().values)
le.classes_
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\sklearn\preprocessing\label.py:108: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Out[13]:
array(['Reescribe-medios-clases-online', 'eventos', 'grados', 'latam',
       'postgrados'], dtype=object)

Note that I had to drop the NaN row, because NaN is a float and mixing it with strings produces a mixed dtype that cannot be sorted (e.g. float > str will not work).
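
As a rough sketch of the full path from strings to a one-hot matrix, the snippet below passes a 1-D array to LabelEncoder (which avoids the DataConversionWarning shown above) and then feeds the integer codes to OneHotEncoder. The Series `cat1` is a hypothetical stand-in built from a few rows of the question's data, and calling `.toarray()` assumes the encoder returns a sparse matrix, which is its default.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical stand-in for the question's data (index IdVisita, column cat1).
cat1 = pd.Series(
    ["latam", np.nan, "grados", "grados", "eventos", "postgrados"],
    index=[445, 446, 447, 448, 449, 454],
    name="cat1",
)

# Drop the NaN row first so only strings remain.
clean = cat1.dropna()

# LabelEncoder expects a 1-D array and maps each string to an integer code.
le = preprocessing.LabelEncoder()
codes = le.fit_transform(clean.values)
classes = list(le.classes_)

# OneHotEncoder expects a 2-D input: one column, one row per sample.
enc = preprocessing.OneHotEncoder()
onehot = enc.fit_transform(codes.reshape(-1, 1)).toarray()

print(classes)
print(onehot.shape)
```

The original index survives in `clean.index`, so the rows of `onehot` can be matched back to their IdVisita values.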

Answer by elachell

A simpler approach is to use DictVectorizer, which performs the conversion to integers and the one-hot encoding in a single step.

Using it as DictVectorizer(sparse=False) makes fit_transform return a dense array rather than a sparse matrix, which can then be wrapped in a DataFrame to keep working with Pandas.
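
A minimal sketch of this approach, assuming a small hypothetical `mydf` built from a few rows of the question's data. The NaN row is dropped first, since DictVectorizer would otherwise treat it as a numeric feature; the column names are read from the fitted `feature_names_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical stand-in for the question's data (index IdVisita, column cat1).
mydf = pd.DataFrame(
    {"cat1": ["latam", np.nan, "grados", "grados", "eventos", "postgrados"]},
    index=[445, 446, 447, 448, 449, 454],
)

clean = mydf.dropna()

# DictVectorizer takes one dict per row, e.g. {"cat1": "latam"}.
records = clean.to_dict(orient="records")

vec = DictVectorizer(sparse=False)
dense = vec.fit_transform(records)  # dense NumPy array of 0.0 / 1.0 values

# Wrap the array in a DataFrame, keeping the original index.
onehot_df = pd.DataFrame(dense, index=clean.index, columns=vec.feature_names_)

print(onehot_df)
```

Each string value becomes its own column named "cat1=<value>", so the result can be joined back to other columns by index.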