Python CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Question

Asked by ashu

I have a one-dimensional array with a large string in each element. I am trying to use a CountVectorizer to convert the text data into numerical vectors. However, I am getting this error:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

mealarray contains a large string in each element. There are 5000 such samples. I am trying to vectorize it as shown below:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),  # (1, 1) is the default
    dtype='double',
)
data = vectorizer.fit_transform(mealarray)

The full stack trace:

File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 748, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Answer 1

Answered by Warren Weckesser

Check the shape of mealarray. If the argument to fit_transform is an array of strings, it must be a one-dimensional array. (That is, mealarray.shape must be of the form (n,).) For example, you'll get the "no attribute" error if mealarray has a shape such as (n, 1).

You could try something like:

data = vectorizer.fit_transform(mealarray.ravel())
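
To see why the shape matters, here is a minimal sketch. The three sample strings are made up and only stand in for the asker's mealarray. Iterating over an array of shape (n, 1) yields one-element arrays rather than strings, so the vectorizer ends up calling .lower() on an ndarray.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for mealarray: three documents stored as a column, shape (3, 1).
mealarray = np.array([["chicken rice soup"], ["beef rice"], ["tomato soup"]])
print(mealarray.shape)

vectorizer = CountVectorizer(stop_words="english")

# Each "document" seen by the vectorizer is a 1-element ndarray, not a str.
try:
    vectorizer.fit_transform(mealarray)
    failed = False
except AttributeError as exc:
    failed = True
    print("2-D input failed:", exc)

# ravel() flattens the column to shape (3,), so every element is a string.
flat = mealarray.ravel()
data = vectorizer.fit_transform(flat)
print(flat.shape, data.shape)
```

If mealarray.shape already prints as (n,), the shape is not the problem and the element types are worth checking instead.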

Answer 2

Answered by ashu

Got the answer to my question. Basically, CountVectorizer takes a list (of strings) as its argument rather than an array. Passing a list solved my problem.
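
One way to build such a list is sketched below. The sample data is invented, and the array is assumed to have shape (n, 1) as discussed in the first answer. Note that calling .tolist() directly on an (n, 1) array would give nested lists, so the array is flattened first.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data with the problematic (3, 1) shape.
mealarray = np.array([["chicken rice soup"], ["beef rice"], ["tomato soup"]])

# Flatten, then convert every element to a plain Python str.
meals = [str(doc) for doc in mealarray.ravel()]

vectorizer = CountVectorizer(stop_words="english")
data = vectorizer.fit_transform(meals)
print(sorted(vectorizer.vocabulary_))
```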

Answer 3

Answered by Max Kleiner

A better solution is to explicitly select a pandas Series and pass it to CountVectorizer:

>>> tex = df4['Text']
>>> type(tex)
<class 'pandas.core.series.Series'>
>>> X_train_counts = count_vect.fit_transform(tex)

The next one won't work, because it is a DataFrame and not a Series:

>>> tex2 = df4.iloc[0:, [11]]  # a list of columns returns a DataFrame
>>> type(tex2)
<class 'pandas.core.frame.DataFrame'>
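
The difference can be reproduced with a small made-up frame; df and its single "Text" column are assumptions for illustration, not the answerer's df4. Iterating over a DataFrame yields its column labels, so in the scikit-learn versions I know of the vectorizer may not raise at all and instead quietly treats the column name as the only document.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical data frame with one text column.
df = pd.DataFrame({"Text": ["chicken rice soup", "beef rice", "tomato soup"]})

series = df["Text"]    # single brackets: pandas Series of strings
frame = df[["Text"]]   # double brackets: one-column DataFrame

# The Series is iterated row by row, giving one document per row.
count_vect = CountVectorizer()
good = count_vect.fit_transform(series)
print(type(series).__name__, good.shape[0])

# The DataFrame is iterated by column label, so the row count is wrong.
bad = CountVectorizer().fit_transform(frame)
print(type(frame).__name__, bad.shape[0])
```

Comparing the number of rows in the result with len(df) is a quick sanity check that the right object was passed.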

Answer 4

Answered by Mr. Sigma.

The error message itself should be enough to track down the bug. Check whether your DataFrame or Series contains any non-string elements. Also check specifically for NaN values.
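
A rough sketch of such a check with pandas follows. The sample Series is invented, and dropping the missing rows is only one possible cleanup; filling them with an empty string would be another.

```python
import numpy as np
import pandas as pd

# Hypothetical text column containing a NaN and a stray integer.
texts = pd.Series(["chicken rice soup", np.nan, 42, "tomato soup"])

# Flag every element that is not a Python str (NaN is a float, so it is flagged too).
is_str = texts.map(lambda v: isinstance(v, str))
print(texts[~is_str])

# Count missing values specifically.
print("missing values:", texts.isna().sum())

# One possible cleanup: drop missing rows, then cast the rest to str.
clean = texts.dropna().astype(str)
print(clean.tolist())
```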
