
Note: this page is a translation of a popular Stack Overflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must likewise follow the CC BY-SA terms and attribute the original authors (not the translator). Original question: http://stackoverflow.com/questions/28314337/


TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0] while using RF classifier?

python, numpy, machine-learning, nlp, scikit-learn

Asked by tumbleweed

I am learning about random forests in scikit-learn and, as an example, I would like to use a random forest classifier for text classification with my own dataset. So first I vectorized the text with tf-idf (a sketch of that step follows the snippet below) and then ran the classification:


from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10) 
classifier.fit(X_train, y_train)           
prediction = classifier.predict(X_test)
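(For completeness: the tf-idf step itself is not shown in the question. A minimal sketch, assuming the raw documents live in hypothetical lists train_texts and test_texts, might look like this:)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)  # learn the vocabulary and idf weights on the training text
X_test = tfidf.transform(test_texts)        # reuse the same vocabulary for the test text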

When I ran the classification I got this:


TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Then I used .toarray() for X_train and I got the following:


TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

From a previous question, as I understood it, I need to reduce the dimensionality of the array, so I did the same:


from sklearn.decomposition.truncated_svd import TruncatedSVD        
pca = TruncatedSVD(n_components=300)                                
X_reduced_train = pca.fit_transform(X_train)               

from sklearn.ensemble import RandomForestClassifier                 
classifier=RandomForestClassifier(n_estimators=10)                  
classifier.fit(X_reduced_train, y_train)                            
prediction = classifier.predict(X_testing) 

Then I got this exception:


  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__
    raise TypeError("sparse matrix length is ambiguous; use getnnz()"
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

Then I tried the following:


prediction = classifier.predict(X_train.getnnz()) 

And got this:


  File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict
    n_samples = len(X)
TypeError: object of type 'int' has no len()

Two questions arise from this: how can I use random forests to classify correctly, and what is happening with X_train?


Then I tried the following:


df = pd.read_csv('/path/file.csv',
header=0, sep=',', names=['id', 'text', 'label'])



X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values



from sklearn.decomposition.truncated_svd import TruncatedSVD
pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report
print '\nscore:', classifier.score(a_test, b_test)   # score on the held-out split, not the training split
print '\nprecision:', precision_score(b_test, prediction)
print '\nrecall:', recall_score(b_test, prediction)
print '\nconfusion matrix:\n', confusion_matrix(b_test, prediction)
print '\nclassification report:\n', classification_report(b_test, prediction)

Accepted answer by JAB

It is a bit unclear if you are passing the same data structure (type and shape) to the fit method and predict method of the classifier. Random forests will take a long time to run with a large number of features, hence the suggestion to reduce the dimensionality in the post you link to.


You should apply the SVD to both the training and test data so the classifier is trained on input of the same shape as the data you wish to predict for. Check that the input to fit and the input to predict have the same number of features, and that both are arrays rather than sparse matrices.

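As a sketch of that check (the variable names are illustrative only, assuming X_train and X_test are the sparse tf-idf matrices and y_train the labels):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300)
X_train_reduced = svd.fit_transform(X_train)   # learn the components on the training data only
X_test_reduced = svd.transform(X_test)         # project the test data with the same components

# both inputs are now dense arrays with the same number of columns
assert X_train_reduced.shape[1] == X_test_reduced.shape[1]

classifier.fit(X_train_reduced, y_train)
prediction = classifier.predict(X_test_reduced)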

Updated with an example (updated to use a DataFrame):


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

df = pd.DataFrame({'text': ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': [0, 0, 0, 1, 1, 1, 0, 3]})

X = tfidf_vect.fit_transform(df['text'].values)
y = df['class'].values

# reduce the sparse tf-idf matrix to a dense, low-dimensional array
pca = TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X)

# split the reduced (dense) data, so fit and predict see the same kind of input
a_train, a_test, b_train, b_test = train_test_split(X_reduced, y, test_size=0.33, random_state=42)

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

Note that the SVD happens before the split into training and test sets, so the array passed to predict has the same number of features as the array the fit method is called on.

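(If genuinely new text had to be classified later, the same fitted vectorizer and SVD would be reused. A minimal sketch with a made-up document:)

new_docs = ['blue angel eyes']               # hypothetical unseen text
X_new = tfidf_vect.transform(new_docs)       # same vocabulary and idf weights as training
X_new_reduced = pca.transform(X_new)         # same SVD components as training
new_prediction = classifier.predict(X_new_reduced)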

Answered by hpaulj

I don't know much about sklearn, though I vaguely recall some earlier issues triggered by a switch to using sparse matrices. Internally some of the matrices had to be replaced by m.toarray() or m.todense().


But to give you an idea of what the error message was about, consider


In [905]: import numpy as np
In [906]: from scipy import sparse
In [907]: A=np.array([[0,1],[3,4]])
In [908]: M=sparse.coo_matrix(A)
In [909]: len(A)
Out[909]: 2
In [910]: len(M)
...
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

In [911]: A.shape[0]
Out[911]: 2
In [912]: M.shape[0]
Out[912]: 2

len() is usually used in Python to count the number of first-level items of a list. When applied to a 2d array, it is the number of rows. But A.shape[0] is a better way of counting the rows, and M.shape[0] is the same. In this case you aren't interested in .getnnz(), which is the number of nonzero terms of a sparse matrix. A doesn't have this method, though the count can be derived from A.nonzero().

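Continuing the same session to illustrate what .getnnz() actually counts (the prompt numbers are arbitrary):

In [913]: M.getnnz()             # number of stored non-zero values, not the number of rows
Out[913]: 3
In [914]: len(A.nonzero()[0])    # the equivalent count for the dense array
Out[914]: 3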