Python sklearn classifier raises ValueError: bad input shape

Question

Asked by Mithril

I have a CSV file whose columns are CAT1, CAT2, TITLE, URL, CONTENT. CAT1, CAT2, TITLE and CONTENT are in Chinese.

I want to train LinearSVC or MultinomialNB with X (TITLE) and feature (CAT1, CAT2); both raise this error. Below is my code:

PS: I wrote the code below by following this example: scikit-learn text_analytics

import numpy as np
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
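from sklearn.pipeline import Pipeline
# train_test_split lives in sklearn.cross_validation before scikit-learn 0.18
from sklearn.model_selection import train_test_split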

label_list = []

def label_map_target(label):
    ''' map chinese feature name to integer  '''
    try:
        idx = label_list.index(label)
    except ValueError:
        idx = len(label_list)
        label_list.append(label)

    return idx


c1_list = []
c2_list = []
title_list = []
with open(csv_file, 'r') as f:
    # row_from_csv is a placeholder helper, used to keep this example short
    for row in row_from_csv(f):
        c1_list.append(label_map_target(row[0]))
        c2_list.append(label_map_target(row[1]))
        title_list.append(row[2])

data = np.array(title_list)
target = np.array([c1_list, c2_list])
print(target.shape)
# (2, 4405)
target = target.reshape(4405,2)
print(target.shape)
# (4405, 2)

docs_train, docs_test, y_train, y_test = train_test_split(
   data, target, test_size=0.25, random_state=None)

# vect = TfidfVectorizer(tokenizer=jieba_tokenizer, min_df=3, max_df=0.95)
# using a custom Chinese tokenizer gives the same error
vect = TfidfVectorizer(min_df=3, max_df=0.95)
docs_train = vect.fit_transform(docs_train)

clf = LinearSVC()
clf.fit(docs_train, y_train)

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-904eb9af02cd> in <module>()
      1 clf = LinearSVC()
----> 2 clf.fit(docs_train, y_train)

C:\Python27\lib\site-packages\sklearn\svm\classes.pyc in fit(self, X, y)
    198 
    199         X, y = check_X_y(X, y, accept_sparse='csr',
--> 200                          dtype=np.float64, order="C")
    201         self.classes_ = np.unique(y)
    202 

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
    447                         dtype=None)
    448     else:
--> 449         y = column_or_1d(y, warn=True)
    450         _assert_all_finite(y)
    451     if y_numeric and y.dtype.kind == 'O':

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in column_or_1d(y, warn)
    483         return np.ravel(y)
    484 
--> 485     raise ValueError("bad input shape {0}".format(shape))
    486 
    487 

ValueError: bad input shape (3303, 2)

Answer 1

Accepted answer by Mithril

Thanks to @meelo, I solved this problem. As he said: in my code, data is a feature vector and target is the target value. I had mixed up the two.

I learned that TfidfVectorizer turns the data into a [data, feature] matrix, and each data item (each row) should map to just one target.
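To make the [data, feature] layout concrete, here is a minimal sketch. The three titles are hypothetical stand-ins, since the original CSV is not shown, and the vectorizer is used with its default settings rather than the min_df/max_df values from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in titles; the real TITLE column is not shown in the post.
titles = ["red apple pie", "green apple tart", "blue cheese pie"]

vect = TfidfVectorizer()
X = vect.fit_transform(titles)

# X is a sparse matrix: one row per document, one column per distinct term.
n_samples, n_features = X.shape
print(n_samples, n_features)
```

Each row of X is one sample, so the target passed to fit needs exactly one label per row.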

If I want to predict two kinds of target, I need two distinct target vectors:

target_C1 with all C1 values
target_C2 with all C2 values

Then use the two targets and the original data to train two classifiers, one for each target.
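The approach described above might look like the following sketch. The titles and label values are hypothetical toy data, and the names cat1, cat2, clf_c1 and clf_c2 are illustrative, not taken from the original code.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for TITLE, CAT1 and CAT2.
titles = ["cheap flight deals", "hotel booking tips",
          "football match report", "tennis final recap"]
cat1 = np.array([0, 0, 1, 1])   # coarse category, one label per title
cat2 = np.array([0, 1, 2, 3])   # finer category, one label per title

X = TfidfVectorizer().fit_transform(titles)

# Stacking both label columns gives shape (n_samples, 2), which LinearSVC rejects.
target = np.column_stack([cat1, cat2])
try:
    LinearSVC().fit(X, target)
    raised = False
except ValueError:
    raised = True
print(raised)

# Training one classifier per 1-D target works.
clf_c1 = LinearSVC().fit(X, cat1)
clf_c2 = LinearSVC().fit(X, cat2)
print(clf_c1.predict(X).shape, clf_c2.predict(X).shape)
```

Both classifiers share the same feature matrix X; only the target vector differs.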

Answer 2

Answered by eslam samy

I had the same issue.

So if you are facing the same problem, check the shapes of the arguments passed to clf.fit(X, y):

X : Training vector {array-like, sparse matrix}, shape (n_samples, n_features).

y : Target vector relative to X, array-like, shape (n_samples,).

As you can see, y should be one-dimensional. To make sure your target vector is shaped correctly, check

y.shape

which should be (n_samples,).
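As a small illustration of that check, the sketch below builds a hypothetical column-shaped target and flattens it with np.ravel. This only helps when y really holds a single label per sample; a target with several label columns has to be split instead.

```python
import numpy as np

# Hypothetical target stored as a column, shape (n_samples, 1).
y = np.array([[0], [1], [1], [0]])
print(y.shape)

# np.ravel returns a 1-D view or copy with shape (n_samples,).
y_flat = np.ravel(y)
print(y_flat.shape)
```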

In my case, I was concatenating 3 separate vectors from 3 different vectorizers and using the result as my final training vector. The problem was that each vector had the ['Label'] column in it, so the final training vector contained 3 ['Label'] columns. Then, when I used final_trainingVect['Label'] as my target vector, its shape was (n_samples, 3).
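The situation described above can be reproduced with a short sketch, assuming the vectors were pandas DataFrames (the answer does not say so explicitly). The frame names and feature columns f1, f2, f3 are hypothetical.

```python
import pandas as pd

# Three hypothetical feature frames, each carrying its own 'Label' column.
a = pd.DataFrame({"f1": [0.1, 0.2], "Label": [0, 1]})
b = pd.DataFrame({"f2": [0.3, 0.4], "Label": [0, 1]})
c = pd.DataFrame({"f3": [0.5, 0.6], "Label": [0, 1]})

# Concatenating side by side keeps all three 'Label' columns.
combined = pd.concat([a, b, c], axis=1)
print(combined["Label"].shape)   # 2-D: one column per duplicate name

# Take the label once and drop it from every frame before concatenating.
y = a["Label"]
X = pd.concat([df.drop(columns="Label") for df in (a, b, c)], axis=1)
print(X.shape, y.shape)
```

Selecting a duplicated column name returns a DataFrame rather than a Series, which is how a 2-D target can reach fit without any obvious mistake in the calling code.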
