Declaration: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me): StackOverflow
Original question: http://stackoverflow.com/questions/20928769/
Python TfidfVectorizer throwing: "empty vocabulary; perhaps the documents only contain stop words"
Asked by Max Song
I'm trying to use Python's TfidfVectorizer to transform a corpus of text. However, when I try to fit_transform it, I get ValueError: empty vocabulary; perhaps the documents only contain stop words.
In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
1217 vectors : array, [n_samples, n_features]
1218 """
-> 1219 X = super(TfidfVectorizer, self).fit_transform(raw_documents)
1220 self._tfidf.fit(X)
1221 # X is already a transformed view of raw_documents so
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
778 max_features = self.max_features
779
--> 780 vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
781 X = X.tocsc()
782
/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
725 vocabulary = dict(vocabulary)
726 if not vocabulary:
--> 727 raise ValueError("empty vocabulary; perhaps the documents only"
728 " contain stop words")
729
ValueError: empty vocabulary; perhaps the documents only contain stop words
I read through the SO question here: Problems using a custom vocabulary for TfidfVectorizer scikit-learn and tried ogrisel's suggestion of using TfidfVectorizer(**params).build_analyzer()(dataset2) to check the results of the text analysis step, and that seems to be working as expected; snippet below:
In [68]: TfidfVectorizer().build_analyzer()(smallcorp)
Out[68]:
[u'due',
u'to',
u'lack',
u'of',
u'personal',
u'biggest',
u'education',
u'and',
u'husband',
u'to',
Is there something else that I am doing wrong? The corpus I am feeding it is just one giant long string punctuated by newlines.
Thanks!
Answered by herrfz
I guess it's because you just have one string. Try splitting it into a list of strings, e.g.:
In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'
In [52]: tf = TfidfVectorizer()
In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]:
<6x28 sparse matrix of type '<type 'numpy.float64'>'
with 31 stored elements in Compressed Sparse Row format>
Answered by Andreas Mueller
In version 0.12, we set the minimum document frequency to 2, which means that only words that appear at least twice will be considered. For your example to work, you need to set min_df=1. Since 0.13, this is the default setting.
So I guess you are using 0.12, right?
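A minimal sketch of that fix (on 0.13 and later, min_df=1 is already the default, so passing it explicitly is harmless):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is the first document", "this is the second document"]

# On scikit-learn 0.12 the default min_df=2 silently dropped every term
# appearing in fewer than two documents; min_df=1 keeps them all.
tf = TfidfVectorizer(min_df=1)
X = tf.fit_transform(docs)
print(sorted(tf.vocabulary_))
```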
Answered by iparjono
You can alternatively put your single string in a tuple, if you insist on having only one string. Instead of having:
smallcorp = "your text"
you'd rather put it within a tuple.
In [22]: smallcorp = ("your text",)
In [23]: tf.fit_transform(smallcorp)
Out[23]:
<1x2 sparse matrix of type '<type 'numpy.float64'>'
with 2 stored elements in Compressed Sparse Row format>
Answered by Victoria Stuart
I encountered a similar error while running a TF-IDF Python 3 script over a large corpus. Some small files (apparently) lacked keywords, throwing an error message.
I tried several solutions (e.g., adding dummy strings to my filtered list if len(filtered) == 0, ...) that did not help. The simplest solution was to add a try: ... except ... continue expression.
from sklearn.feature_extraction.text import CountVectorizer

pattern = r"(?u)\b[\w-]+\b"  # raw string, so \b is a regex word boundary, not backspace
cv = CountVectorizer(token_pattern=pattern)

# filtered is a list of tokens for the current document
filtered = [w for w in filtered if w not in my_stopwords and not w.isdigit()]

# ValueError:
#   cv.fit(filtered)
#   File "tfidf-sklearn.py", line 1675, in tfidf
#     cv.fit(filtered)
#   File "/home/victoria/venv/py37/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 1024, in fit
#     self.fit_transform(raw_documents)
#   ...
#   ValueError: empty vocabulary; perhaps the documents only contain stop words

# Did not help (https://stackoverflow.com/a/20933883/1904943):
#   if len(filtered) == 0:
#       filtered = ['xxx', 'yyy', 'zzz']

# Solution (inside the loop over corpus files, so continue is valid):
try:
    cv.fit(filtered)
    doc_freq_term_matrix = cv.transform(filtered)
except ValueError:
    continue
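An alternative to catching the error (a sketch with a hypothetical raw_docs list): drop empty or whitespace-only documents before vectorizing, so fit never sees an empty corpus in the first place:

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_docs = ["first file text", "", "   ", "second file text"]  # hypothetical corpus
docs = [d for d in raw_docs if d.strip()]  # keep only non-empty documents

cv = CountVectorizer()
X = cv.fit_transform(docs)
print(X.shape)  # one row per surviving document
```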

