TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document
Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same license and attribute it to the original authors (not me): StackOverflow
Original: http://stackoverflow.com/questions/39303912/
Asked by boltthrower
I'm using TfidfVectorizer from scikit-learn to do some feature extraction from text data. I have a CSV file with a Score (can be +1 or -1) and a Review (text). I pulled this data into a DataFrame so I can run the Vectorizer.
This is my code:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv("train_new.csv",
names = ['Score', 'Review'], sep=',')
# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])
This is the traceback for the error I get:
Traceback (most recent call last):
File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
x = v.fit_transform(df['Review'])
File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
self.fixed_vocabulary_)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
for feature in analyze(doc):
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.
I checked the CSV file and the DataFrame for anything being read as NaN, but I can't find anything. There are 18,000 rows, none of which returns True for isnan.
This is what df['Review'].head() looks like:
0 This book is such a life saver. It has been s...
1 I bought this a few times for my older son and...
2 This is great for basics, but I wish the space...
3 This book is perfect! I'm a first time new mo...
4 During your postpartum stay at the hospital th...
Name: Review, dtype: object
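One likely reason the check in the commented-out code above comes up empty: `x = df['Review'] == np.nan` is always False, because NaN compares unequal to everything, including itself. A minimal sketch of checks that do find such rows (the DataFrame here is made-up stand-in data, not the poster's CSV):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the CSV: pandas reads an empty field as NaN (a float)
df = pd.DataFrame({'Score': [1, -1, 1],
                   'Review': ['great book', np.nan, 'very useful']})

# Comparing against np.nan never matches, because NaN != NaN
print((df['Review'] == np.nan).any())   # False, even though a NaN is present

# Count missing entries on the column itself
print(df['Review'].isnull().sum())      # 1

# Or list every row whose Review is not actually a string
bad = df[~df['Review'].apply(lambda v: isinstance(v, str))]
print(bad)
```

`isnull()` (or checking the element's type) is the reliable way to spot the offending rows before vectorizing.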
Answered by Nickil Maveli
You need to convert the dtype object to unicode string, as is clearly mentioned in the traceback.
x = v.fit_transform(df['Review'].values.astype('U')) ## Even astype(str) would work
From the Doc page of TFIDF Vectorizer:
fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
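The fix can be sketched end to end with a tiny stand-in DataFrame (the data below is made up for illustration, not the poster's CSV). Note that astype('U') turns each NaN into the literal string 'nan', which then gets vectorized as a word:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in data: one Review is missing and comes through as NaN
df = pd.DataFrame({'Review': ['good book', np.nan, 'bad book']})

v = TfidfVectorizer(decode_error='replace', encoding='utf-8')

# Without the conversion this raises:
#   ValueError: np.nan is an invalid document, expected byte or unicode string.
# .values.astype('U') coerces every element to a numpy unicode string,
# so the NaN becomes the text 'nan' and is treated as a document.
x = v.fit_transform(df['Review'].values.astype('U'))

print(x.shape)                 # (3, 4)
print(sorted(v.vocabulary_))   # ['bad', 'book', 'good', 'nan']
```

If the spurious 'nan' token is unwanted, calling df['Review'].fillna('') (or dropping the missing rows) before vectorizing avoids it.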
Answered by Andy Ma
I found a more efficient way to solve this problem.
x = v.fit_transform(df['Review'].apply(lambda x: np.str_(x)))
Of course you can use df['Review'].values.astype('U') to convert the entire Series. But I found that this consumes much more memory if the Series you want to convert is really big. (I tested this with a Series of 800,000 rows of data, and astype('U') consumed about 96 GB of memory.)
Instead, if you use a lambda expression to convert the data in the Series from str to numpy.str_, the result will also be accepted by the fit_transform function. This is faster and does not increase memory usage.
I'm not sure why this works, because according to the Doc page of TFIDF Vectorizer:
fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
But in practice this iterable must yield np.str_ instead of str.
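A likely explanation for why the analyzer accepts the lambda's output: numpy.str_ is a subclass of the built-in str, so any type check that accepts str also accepts it, and converting NaN through np.str_ yields the text 'nan' rather than a float. A quick illustrative check (not part of the original answer):

```python
import numpy as np

# np.str_ subclasses the built-in str, so it passes any isinstance(s, str) check
s = np.str_(np.nan)

print(isinstance(s, str))  # True
print(s)                   # nan  (the float NaN was converted to its text form)
```

So the doc's "str, unicode or file objects" requirement is still satisfied; the conversion simply guarantees that no raw float NaN reaches the vectorizer.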