pandas / SciPy hstack results in "TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))"
Original question: http://stackoverflow.com/questions/22273242/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow.
Scipy hstack results in "TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))"
Asked by Simon Kiely
I am trying to run hstack to join a column of integer values to a list of columns created by a TF-IDF (so I can eventually use all of these columns/features in a classifier).
I'm reading in the column using pandas, checking for any NA values, and converting them to the largest value in the dataframe, like so:
  # read the single column, treating '?' as missing
  OtherColumn = p.read_csv('file.csv', delimiter=";", na_values=['?'])[["OtherColumn"]]
  # replace missing values with the column maximum
  OtherColumn = OtherColumn.fillna(OtherColumn.max())
  # coerce to a numeric dtype (convert_objects is deprecated in later pandas;
  # pd.to_numeric is the modern replacement)
  OtherColumn = OtherColumn.convert_objects(convert_numeric=True)
Then I read in my text column and run TF-IDF to create loads of features:
  X = list(np.array(p.read_csv('file.csv', delimiter=";"))[:,2])
  tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
                        analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 2),
                        use_idf=1, smooth_idf=1, sublinear_tf=1)
  tfv.fit(X)
Finally, I want to join them all together. This is where the error occurs and the program cannot run; I am also unsure whether I am using the StandardScaler appropriately here:
  X =  sp.sparse.hstack((X, OtherColumn.values)) #error here
  sc = preprocessing.StandardScaler().fit(X)
  X = sc.transform(X)
  X_test = sc.transform(X_test)
Full error message:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-79d1e70bc1bc> in <module>()
---> 47 X =  sp.sparse.hstack((X, OtherColumn.values))
     48 sc = preprocessing.StandardScaler().fit(X)
     49 X = sc.transform(X)
C:\Users\Simon\Anaconda\lib\site-packages\scipy\sparse\construct.pyc in hstack(blocks, format, dtype)
    421 
    422     """
--> 423     return bmat([blocks], format=format, dtype=dtype)
    424 
    425 
C:\Users\Simon\Anaconda\lib\site-packages\scipy\sparse\construct.pyc in bmat(blocks, format, dtype)
    537     nnz = sum([A.nnz for A in blocks[block_mask]])
    538     if dtype is None:
--> 539         dtype = upcast(*tuple([A.dtype for A in blocks[block_mask]]))
    540 
    541     row_offsets = np.concatenate(([0], np.cumsum(brow_lengths)))
C:\Users\Simon\Anaconda\lib\site-packages\scipy\sparse\sputils.pyc in upcast(*args)
     58             return t
     59 
---> 60     raise TypeError('no supported conversion for types: %r' % (args,))
     61 
     62 
TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))
Answered by hpaulj
As discussed in Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do, you may need to explicitly cast the inputs to sparse.hstack. The sparse code is not as robust as the core numpy code.
If X is a sparse array with dtype=float, and A is dense with dtype=object, several options are possible:
sparse.hstack((X, A))                 # error: no supported conversion for (float64, object)
sparse.hstack((X.astype(object), A))  # cast X to object; result has dtype object
sparse.hstack((X, A.astype(float)))   # cast A to float; result has dtype float
np.hstack((X.A, A))                   # make X dense; result will be a dense array of dtype object
A.astype(float) will work if A contains some NaN. See http://pandas.pydata.org/pandas-docs/stable/gotchas.html regarding NaN. If A is object for some other reason (e.g. ragged lists), then we'll have to revisit the issue.
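A minimal, self-contained sketch of the cast-to-float option (the shapes and values below are invented for illustration; only the casting step mirrors the answer):

  import numpy as np
  from scipy import sparse

  # X: a sparse float matrix standing in for the TF-IDF output (3 rows, 4 features)
  X = sparse.csr_matrix(np.arange(12, dtype=np.float64).reshape(3, 4))

  # A: a dense column that came back from pandas with dtype=object
  A = np.array([[1.0], [2.5], [3.0]], dtype=object)

  # sparse.hstack((X, A)) would raise the TypeError from the question;
  # casting A to float first lets the blocks share a common dtype
  combined = sparse.hstack((X, A.astype(float)))
  print(combined.shape, combined.dtype)   # (3, 5) float64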
Another possibility is to use Pandas's concat: http://pandas.pydata.org/pandas-docs/stable/merging.html. I assume Pandas has paid more attention to these issues than the sparse coders.
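A rough sketch of that alternative, assuming the feature block is small enough to hold as a dense DataFrame (the column names and values are invented for illustration, and it uses the modern pd.to_numeric rather than convert_objects):

  import pandas as pd

  # small, invented stand-ins for the TF-IDF features and the extra column
  tfidf_df = pd.DataFrame([[0.1, 0.0], [0.0, 0.7]], columns=['tok_a', 'tok_b'])
  other = pd.DataFrame({'OtherColumn': ['3', '?']})

  # make the extra column numeric before combining (errors='coerce' turns '?' into NaN)
  other['OtherColumn'] = pd.to_numeric(other['OtherColumn'], errors='coerce')
  other = other.fillna(other.max())

  # column-wise concatenation; every column keeps a numeric dtype
  combined = pd.concat([tfidf_df, other], axis=1)
  print(combined.dtypes)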

