Python Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do

Question

Asked by Simon Kiely

I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.

Unfortunately, I get the error in the title when I try to run hstack to add this single column to my other numpy array.

Here is my code :

  #reading in test/train data for TF-IDF
  traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
  testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

  #reading in labels for training
  y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

  #reading in single integer column to join
  AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
  AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
  AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None) #Classifier
  X_all = traindata + testdata #adding test and train data to put into tf-idf
  lentrain = len(traindata) #find length of train data
  tfv.fit(X_all) #fit tf-idf on all our text
  X_all = tfv.transform(X_all) #transform it
  X = X_all[:lentrain] #reduce to size of training set
  AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
  X_test = X_all[lentrain:] #keep the remaining rows as the test set

  #printing debug info, output below : 
  print "X.shape => " + str(X.shape)
  print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
  print "X_all.shape => " + str(X_all.shape)

  #line we get error on
  X = np.hstack((X, AllAlexaAndGoogleInfo))

Below is the output and error message :

X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
     31 print "X_all.shape => " + str(X_all.shape)
     32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
     34 sc = preprocessing.StandardScaler().fit(X)
     35 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
    271     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    272     if arrs[0].ndim == 1:
--> 273         return _nx.concatenate(arrs, 0)
    274     else:
    275         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns? What have I misunderstood?

Thank you.

Edit :

Using the method in the answer below gets the following error :

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
     37 sc = preprocessing.StandardScaler().fit(X)
     38 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
    294             arr = array(arr,copy=False,subok=True,ndmin=2).T
    295         arrays.append(arr)
--> 296     return _nx.concatenate(arrays,1)
    297 
    298 def dstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Interestingly, I tried to print the dtype of X and this worked fine:

X.dtype => float64

However, trying to print the dtype of AllAlexaAndGoogleInfo like so:

print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)

produces :

'DataFrame' object has no attribute 'dtype'

Answer 1

Accepted answer by YS-L

As X is a sparse array, use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is rather misleading here.

This minimal example illustrates the situation:

import numpy as np
from scipy import sparse

X = sparse.rand(10, 10000)
xt = np.random.random((10, 1))
print 'X shape:', X.shape
print 'xt shape:', xt.shape
print 'Stacked shape:', np.hstack((X,xt)).shape
#print 'Stacked shape:', sparse.hstack((X,xt)).shape #This works

Based on the following output

X shape: (10, 10000)
xt shape: (10, 1)

one may expect the np.hstack call in the line that follows to work, but in fact it throws this error:

ValueError: all the input arrays must have same number of dimensions

So, use scipy.sparse.hstack when you have a sparse array to stack.
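
For reference, here is a small sketch of the working call, written for Python 3 rather than the Python 2 used elsewhere on this page. The matrix sizes and the names X, xt and stacked are only stand-ins for the TF-IDF matrix and the extra column. Wrapping the dense column in csr_matrix first is a cautious choice made for this sketch, not something the error requires.

```python
import numpy as np
from scipy import sparse

# Stand-ins (assumed sizes) for the TF-IDF matrix and the extra feature column.
X = sparse.random(10, 1000, density=0.01, format="csr")
xt = np.arange(10, dtype=np.float64).reshape(-1, 1)  # dense column, shape (10, 1)

# Convert the dense column to sparse explicitly, then stack horizontally.
stacked = sparse.hstack((X, sparse.csr_matrix(xt)), format="csr")

print("X shape:", X.shape)
print("xt shape:", xt.shape)
print("Stacked shape:", stacked.shape)
```

The result stays sparse, which matters for a matrix with hundreds of thousands of TF-IDF columns.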

In fact, I already answered this in a comment on another of your questions, and you mentioned that a different error message then pops up:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

First of all, AllAlexaAndGoogleInfo does not have a dtype because it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values, then check the dtype of that array. Based on the error message, it has a dtype of object, which means that it might contain non-numerical elements such as strings.
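
A quick way to see this, sketched for Python 3 with a made-up DataFrame (the values below are hypothetical, not taken from the asker's CSV):

```python
import pandas as pd

# Hypothetical stand-in for the "alexarank" column; one entry is a string.
df = pd.DataFrame({"alexarank": [10, 250, "unknown", 42]})

# A DataFrame exposes per-column `dtypes`, not a single `dtype`.
print(df.dtypes)

# The underlying numpy array does have a dtype; here it is object
# because the column mixes integers and a string.
arr = df.values
print(arr.shape, arr.dtype)
```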

This is a minimal example that reproduces this situation:

X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print 'X:', X.shape, X.dtype
print 'xt:', xt.shape, xt.dtype
print 'Stacked shape:', sparse.hstack((X,xt)).shape

The error message:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

So, check whether there are any non-numerical values in AllAlexaAndGoogleInfo and repair them before doing the stacking.
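
One possible way to do that check and repair, sketched for Python 3. The data is made up, and filling bad entries with 0 is an arbitrary choice for illustration; whether that suits the real alexarank column is an assumption you would need to verify.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical data: a sparse feature matrix and a rank column with one bad entry.
X = sparse.random(4, 50, density=0.1, format="csr")
df = pd.DataFrame({"alexarank": ["10", "250", "unknown", "42"]})

# Coerce anything non-numeric to NaN, count the casualties, then fill them.
rank = pd.to_numeric(df["alexarank"], errors="coerce")
n_bad = int(rank.isna().sum())
rank = rank.fillna(0).astype(np.float64)  # 0 is an arbitrary placeholder

# Reshape to a single column and stack it onto the sparse matrix.
col = sparse.csr_matrix(rank.values.reshape(-1, 1))
stacked = sparse.hstack((X, col), format="csr")

print("non-numeric entries:", n_bad)
print("Stacked:", stacked.shape, stacked.dtype)
```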

Answer 2

Answer by Drewness

Use np.column_stack, like so:

X = np.column_stack((X, AllAlexaAndGoogleInfo))

From the docs:

Take a sequence of 1-D arrays and stack them as columns to make a single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
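
A small illustration of that behaviour with dense arrays (Python 3; the arrays are made up). It covers ordinary numpy arrays only and says nothing about sparse input.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
m = np.ones((3, 2))

# 1-D arrays become columns of a 2-D result.
cols = np.column_stack((a, b))
# A 2-D array is kept as-is and the 1-D array is appended as a new column.
wide = np.column_stack((m, a))

print(cols.shape, wide.shape)
```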

Answer 3

Answer by hpaulj

Try:

X = np.hstack((X, AllAlexaAndGoogleInfo.values))

I don't have a working Pandas installation, so I can't test this. But the DataFrame documentation describes values as the "Numpy representation of NDFrame". np.hstack is a numpy function, and as such knows nothing about the internal structure of the DataFrame.
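
A sketch of what this looks like when both inputs are dense (Python 3, made-up data; X_dense is a hypothetical name). Note the assumption: this only applies if X is an ordinary numpy array, and a sparse X would still need scipy.sparse.hstack.

```python
import numpy as np
import pandas as pd

# Hypothetical dense feature matrix and a one-column DataFrame.
X_dense = np.zeros((3, 4))
df = pd.DataFrame({"alexarank": [10, 250, 42]})

# .values hands np.hstack a plain (3, 1) numpy array.
stacked = np.hstack((X_dense, df.values))

print(stacked.shape, stacked.dtype)
```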
