TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/39303912/
Asked by boltthrower
I'm using TfidfVectorizer from scikit-learn to do some feature extraction from text data. I have a CSV file with a Score (can be +1 or -1) and a Review (text). I pulled this data into a DataFrame so I can run the Vectorizer.
This is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv("train_new.csv",
names = ['Score', 'Review'], sep=',')
# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])
This is the traceback for the error I get:
Traceback (most recent call last):
File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
x = v.fit_transform(df['Review'])
File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
self.fixed_vocabulary_)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
for feature in analyze(doc):
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.
I checked the CSV file and the DataFrame for anything being read as NaN, but I can't find anything. There are 18,000 rows, none of which returns True for isnan.
This is what df['Review'].head() looks like:
0 This book is such a life saver. It has been s...
1 I bought this a few times for my older son and...
2 This is great for basics, but I wish the space...
3 This book is perfect! I'm a first time new mo...
4 During your postpartum stay at the hospital th...
Name: Review, dtype: object
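One likely reason the check in the commented-out code above comes up empty: `x = df['Review'] == np.nan` is always False, because NaN compares unequal to everything, including itself. A minimal sketch of checks that do find such rows (the DataFrame here is made-up stand-in data, not the poster's CSV):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV: pandas reads an empty field as NaN (a float)
df = pd.DataFrame({'Score': [1, -1, 1],
                   'Review': ['great book', np.nan, 'very useful']})

# Comparing against np.nan never matches, because NaN != NaN
print((df['Review'] == np.nan).any())   # False, even though a NaN is present

# Count missing entries on the column itself
print(df['Review'].isnull().sum())      # 1

# Or list every row whose Review is not actually a string
bad = df[~df['Review'].apply(lambda v: isinstance(v, str))]
print(bad)
```

`isnull()` (or checking the element's type) is the reliable way to spot the offending rows before vectorizing.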
Answered by Nickil Maveli
You need to convert the dtype object to unicode string, as is clearly mentioned in the traceback.
x = v.fit_transform(df['Review'].values.astype('U')) ## Even astype(str) would work
From the Doc page of TFIDF Vectorizer:
fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
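The fix can be sketched end to end with a tiny stand-in DataFrame (the data below is made up for illustration, not the poster's CSV). Note that astype('U') turns each NaN into the literal string 'nan', which then gets vectorized as a word:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in data: one Review is missing and comes through as NaN
df = pd.DataFrame({'Review': ['good book', np.nan, 'bad book']})

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')

# Without the conversion this raises:
#   ValueError: np.nan is an invalid document, expected byte or unicode string.
# .values.astype('U') coerces every element to a numpy unicode string,
# so the NaN becomes the text 'nan' and is treated as a document.
x = v.fit_transform(df['Review'].values.astype('U'))

print(x.shape)                 # (3, 4)
print(sorted(v.vocabulary_))   # ['bad', 'book', 'good', 'nan']
```

If the spurious 'nan' token is unwanted, calling df['Review'].fillna('') (or dropping the missing rows) before vectorizing avoids it.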
Answered by Andy Ma
I found a more efficient way to solve this problem.
x = v.fit_transform(df['Review'].apply(lambda x: np.str_(x)))
Of course you can use df['Review'].values.astype('U') to convert the entire Series. But I found that this consumes much more memory if the Series you want to convert is really big. (I tested this with a Series of 800,000 rows of data, and astype('U') consumed about 96 GB of memory.)
Instead, if you use a lambda expression to convert the data in the Series from str to numpy.str_, the result will also be accepted by the fit_transform function. This is faster and does not increase memory usage.
I'm not sure why this works, because according to the Doc page of TFIDF Vectorizer:
fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
But in practice this iterable must yield np.str_ instead of str.
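A likely explanation for why the analyzer accepts the lambda's output: numpy.str_ is a subclass of the built-in str, so any type check that accepts str also accepts it, and converting NaN through np.str_ yields the text 'nan' rather than a float. A quick illustrative check (not part of the original answer):

```python
import numpy as np

# np.str_ subclasses the built-in str, so it passes any isinstance(s, str) check
s = np.str_(np.nan)

print(isinstance(s, str))  # True
print(s)                   # nan  (the float NaN was converted to its text form)
```

So the doc's "str, unicode or file objects" requirement is still satisfied; the conversion simply guarantees that no raw float NaN reaches the vectorizer.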