pandas: raise ValueError("np.nan is an invalid document, expected byte or "
Original question: http://stackoverflow.com/questions/49259305/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
raise ValueError("np.nan is an invalid document, expected byte or "
Asked by Sadhana Singh
I am using CountVectorizer in scikit-learn to vectorize a feature sequence. I am stuck on the following error: ValueError: np.nan is an invalid document, expected byte or unicode string.
I am working with an example CSV dataset that has two columns, CONTENT and sentiment. My code is below:
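import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv",encoding = "ISO-8859-1")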
X, y = df.CONTENT, df.sentiment
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print X_train, y_train
vect = CountVectorizer(ngram_range=(1,3), analyzer='word', encoding = "ISO-8859-1")
print vect
X=vect.fit_transform(X_train, y_train)
y=vect.fit(X_test)
print vect.get_feature_names()
The error I got is:
  File "C:/Users/HP/cntVect.py", line 28, in <module>
    X=vect.fit_transform(X_train, y_train)
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
    for feature in analyze(doc):
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\ProgramData\Anaconda2\lib\site-packages\sklearn\feature_extraction\text.py", line 121, in decode
    raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.
Answered by MaxU
Replace the NaNs with spaces; this should make CountVectorizer happy:
X, y = df.CONTENT.fillna(' '), df.sentiment
Answered by A H
You are not handling the NaN ("not a number") values properly. Use pandas' fillna() method to replace the missing or NaN values in your DataFrame with a suitable value of your choice.
Hence, instead of:
X, y = df.CONTENT, df.sentiment
Use:
X, y = df.CONTENT.fillna(' '), df.sentiment
Here the NaNs are replaced by spaces.
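As a rough illustration of the fillna approach, here is a minimal sketch. The small DataFrame is a hypothetical stand-in for train.csv (only the column names CONTENT and sentiment come from the question), and CountVectorizer is used with its default settings rather than the asker's parameters.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for train.csv: one CONTENT value is missing (NaN).
df = pd.DataFrame({
    "CONTENT": ["good movie", np.nan, "bad movie"],
    "sentiment": [1, 0, 0],
})

# How many documents are missing? A non-zero count explains the ValueError.
n_missing = int(df["CONTENT"].isnull().sum())
print(n_missing)

# Replace missing text with a blank string so every document is a str.
X = df["CONTENT"].fillna(" ")
y = df["sentiment"]

vect = CountVectorizer()
counts = vect.fit_transform(X)

# The filled row is kept but contributes no tokens.
vocab = sorted(vect.vocabulary_)
print(vocab)
print(counts.shape)
```

The filled row stays in the data as an empty document, so the number of rows in X still matches y.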
Answered by Durgaprasad Nagarkatte
My guess from your question is that some fields in the CONTENT column are empty. You can either use the fillna method or drop the affected rows with df[df["CONTENT"].notnull()]. This gives you a dataset with no NaN values.
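A minimal sketch of the row-dropping alternative, again on a hypothetical stand-in for train.csv. It assumes the filter is applied to the whole DataFrame before X and y are separated, so that each remaining text still lines up with its label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.csv: one CONTENT value is missing (NaN).
df = pd.DataFrame({
    "CONTENT": ["good movie", np.nan, "bad movie"],
    "sentiment": [1, 0, 0],
})

# Keep only the rows whose CONTENT is present. Filtering the DataFrame
# (rather than the CONTENT column alone) keeps text and labels aligned.
clean = df[df["CONTENT"].notnull()]
X, y = clean["CONTENT"], clean["sentiment"]

print(len(df), len(clean))
print(X.index.equals(y.index))
```

Unlike fillna, this discards the labels of the affected rows as well, so it may be the less attractive option when many rows are missing text.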