Original question: http://stackoverflow.com/questions/27617078/
This content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
Pandas OneHotEncoder.fit(dataframe) returns ValueError: invalid literal for long() with base 10
Asked by dukebody
I'm trying to convert a Pandas dataframe to a NumPy array to create a model with Sklearn. I'll simplify the problem here.
>>> mydf.head(10)
IdVisita
445 latam
446 NaN
447 grados
448 grados
449 eventos
450 eventos
451 Reescribe-medios-clases-online
454 postgrados
455 postgrados
456 postgrados
Name: cat1, dtype: object
>>> from sklearn import preprocessing
>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit(mydf)
Traceback:
ValueError Traceback (most recent call last)
<ipython-input-74-f581ab15cbed> in <module>()
2 mydf.head(10)
3 enc = preprocessing.OneHotEncoder()
----> 4 enc.fit(mydf)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit(self, X, y)
996 self
997 """
--> 998 self.fit_transform(X)
999 return self
1000
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in fit_transform(self, X, y)
1052 """
1053 return _transform_selected(X, self._fit_transform,
-> 1054 self.categorical_features, copy=True)
1055
1056 def _transform(self, X):
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _transform_selected(X, transform, selected, copy)
870 """
871 if selected == "all":
--> 872 return transform(X)
873
874 X = atleast2d_or_csc(X, copy=copy)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/preprocessing/data.pyc in _fit_transform(self, X)
1001 def _fit_transform(self, X):
1002 """Assumes X contains only categorical features."""
-> 1003 X = check_arrays(X, sparse_format='dense', dtype=np.int)[0]
1004 if np.any(X < 0):
1005 raise ValueError("X needs to contain only non-negative integers.")
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_arrays(*arrays, **options)
279 array = np.ascontiguousarray(array, dtype=dtype)
280 else:
--> 281 array = np.asarray(array, dtype=dtype)
282 if not allow_nans:
283 _assert_all_finite(array)
/home/dukebody/Apps/Anaconda/lib/python2.7/site-packages/numpy/core/numeric.pyc in asarray(a, dtype, order)
460
461 """
--> 462 return array(a, dtype, copy=False, order=order)
463
464 def asanyarray(a, dtype=None, order=None):
ValueError: invalid literal for long() with base 10: 'postgrados'
Note that IdVisita is the index here, and its values may not be consecutive.
Any clues?
Accepted answer by EdChum
The problem is that you are calling OneHotEncoder, which according to the docs requires integer input:
"The input to this transformer should be a matrix of integers"
but your df has a single column, 'cat1', of dtype object, which in fact holds strings.
You should use LabelEncoder:
In [13]:
le = preprocessing.LabelEncoder()
le.fit(df.dropna().values)
le.classes_
C:\WinPython-64bit-3.3.3.2\python-3.3.3.amd64\lib\site-packages\sklearn\preprocessing\label.py:108: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Out[13]:
array(['Reescribe-medios-clases-online', 'eventos', 'grados', 'latam',
'postgrados'], dtype=object)
Note that I had to drop the NaN row: NaN is a float, so keeping it would introduce a mixed dtype that cannot be sorted, because comparisons such as float > str do not work.
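The LabelEncoder approach above can be sketched end to end as follows. This is a minimal illustration, not the answerer's exact session: the sample Series is a hypothetical stand-in for the question's data, and the final np.eye step is an assumed plain-NumPy way of turning the integer codes into one-hot rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the question's data: a string column with one NaN.
cat1 = pd.Series(
    ["latam", np.nan, "grados", "grados", "eventos", "postgrados"],
    index=[445, 446, 447, 448, 449, 454],
    name="cat1",
)

# Drop the NaN row first: NaN is a float and cannot be sorted against strings.
clean = cat1.dropna()

# Passing a 1-D Series (rather than a one-column DataFrame) avoids the
# DataConversionWarning shown in the session above.
le = LabelEncoder()
codes = le.fit_transform(clean)

# Keep the original index so the codes can be joined back to the DataFrame.
encoded = pd.Series(codes, index=clean.index, name="cat1_code")

# Assumed extra step: expand the integer codes into one-hot rows with NumPy.
one_hot = np.eye(len(le.classes_), dtype=int)[codes]
```

Here le.classes_ holds the sorted category names, so column j of one_hot corresponds to le.classes_[j].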
Answer by elachell
A simpler approach is to use DictVectorizer, which converts the strings to integers and performs the one-hot encoding in a single step.
Constructing it as DictVectorizer(sparse=False) makes fit_transform return a dense array, which can then be wrapped in a DataFrame to keep working with Pandas.
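A rough sketch of the DictVectorizer route, assuming a small hypothetical DataFrame shaped like the question's (with the NaN row already dropped). The conversion to per-row dicts via to_dict(orient="records") and the wrapping of the result in a DataFrame are illustrative choices, not something the answer spells out.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical sample mirroring the question's 'cat1' column (NaN row dropped).
df = pd.DataFrame(
    {"cat1": ["latam", "grados", "grados", "eventos", "postgrados"]},
    index=[445, 447, 448, 449, 454],
)

# DictVectorizer expects one mapping per sample, e.g. {"cat1": "latam"}.
records = df.to_dict(orient="records")

# sparse=False makes fit_transform return a dense NumPy array.
dv = DictVectorizer(sparse=False)
dense = dv.fit_transform(records)

# feature_names_ lists the generated columns, named "<column>=<category>".
onehot = pd.DataFrame(dense, columns=dv.feature_names_, index=df.index)
```

Each string value becomes its own indicator column, so no separate LabelEncoder pass is needed.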

