Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/20928769/
Python TfidfVectorizer throwing: "empty vocabulary; perhaps the documents only contain stop words"
Asked by Max Song
I'm trying to use Python's TfidfVectorizer to transform a corpus of text. However, when I try to fit_transform it, I get ValueError: empty vocabulary; perhaps the documents only contain stop words.
In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
1217 vectors : array, [n_samples, n_features]
1218 """
-> 1219 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1220 self._tfidf.fit(X)
1221 # X is already a transformed view of raw_documents so
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
778 max_features = self.max_features
779
--> 780 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
781 X = X.tocsc()
782
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
725 vocabulary = dict(vocabulary)
726 if not vocabulary:
--> 727 raise ValueError("empty vocabulary; perhaps the documents only"
728 " contain stop words")
729
ValueError: empty vocabulary; perhaps the documents only contain stop words
I read through the SO question here: Problems using a custom vocabulary for TfidfVectorizer scikit-learn and tried ogrisel's suggestion of using TfidfVectorizer(**params).build_analyzer()(dataset2) to check the results of the text analysis step, and that seems to be working as expected; snippet below:
In [68]: TfidfVectorizer().build_analyzer()(smallcorp)
Out[68]:
[u'due',
u'to',
u'lack',
u'of',
u'personal',
u'biggest',
u'education',
u'and',
u'husband',
u'to',
Is there something else that I am doing wrong? The corpus I am feeding it is just one giant long string punctuated by newlines.
Thanks!
Answered by herrfz
I guess it's because you just have one string. Try splitting it into a list of strings, e.g.:
In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'
In [52]: tf = TfidfVectorizer()
In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]:
<6x28 sparse matrix of type '<type 'numpy.float64'>'
with 31 stored elements in Compressed Sparse Row format>
Answered by Andreas Mueller
In version 0.12, we set the minimum document frequency to 2, which means that only words that appear at least twice will be considered. For your example to work, you need to set min_df=1. Since 0.13, this is the default setting.
So I guess you are using 0.12, right?
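A minimal sketch of that fix (on 0.13 and later, min_df=1 is already the default, so passing it explicitly is harmless):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is the first document", "this is the second document"]

# On scikit-learn 0.12 the default min_df=2 silently dropped every term
# appearing in fewer than two documents; min_df=1 keeps them all.
tf = TfidfVectorizer(min_df=1)
X = tf.fit_transform(docs)
print(sorted(tf.vocabulary_))
```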
Answered by iparjono
You can alternatively put your single string in a tuple, if you insist on having only one string. Instead of having:
smallcorp = "your text"
you'd rather put it within a tuple.
In [22]: smallcorp = ("your text",)
In [23]: tf.fit_transform(smallcorp)
Out[23]:
<1x2 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
Answered by Victoria Stuart
I encountered a similar error while running a TF-IDF Python 3 script over a large corpus. Some small files (apparently) lacked keywords, throwing an error message.
I tried several solutions (e.g., adding dummy strings to my filtered list if len(filtered) == 0, ...) that did not help. The simplest solution was to add a try: ... except ... continue expression.
from sklearn.feature_extraction.text import CountVectorizer

pattern = r"(?u)\b[\w-]+\b"  # raw string, so \b is a regex word boundary, not backspace
cv = CountVectorizer(token_pattern=pattern)

# filtered is a list of tokens for the current document
filtered = [w for w in filtered if w not in my_stopwords and not w.isdigit()]

# ValueError:
#   cv.fit(filtered)
#   File "tfidf-sklearn.py", line 1675, in tfidf
#     cv.fit(filtered)
#   File "/home/victoria/venv/py37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1024, in fit
#     self.fit_transform(raw_documents)
#   ...
#   ValueError: empty vocabulary; perhaps the documents only contain stop words

# Did not help (https://stackoverflow.com/a/20933883/1904943):
#   if len(filtered) == 0:
#       filtered = ['xxx', 'yyy', 'zzz']

# Solution (inside the loop over corpus files, so continue is valid):
try:
    cv.fit(filtered)
    doc_freq_term_matrix = cv.transform(filtered)
except ValueError:
    continue
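An alternative to catching the error (a sketch with a hypothetical raw_docs list): drop empty or whitespace-only documents before vectorizing, so fit never sees an empty corpus in the first place:

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["first file text", "", "   ", "second file text"]  # hypothetical corpus
docs = [d for d in raw_docs if d.strip()]  # keep only non-empty documents

cv = CountVectorizer()
X = cv.fit_transform(docs)
print(X.shape)  # one row per surviving document
```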

