Python: Understanding min_df and max_df in scikit CountVectorizer

Note: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must attribute it to the original authors (not the translator). Original question: http://stackoverflow.com/questions/27697766/


Understanding min_df and max_df in scikit CountVectorizer

python, machine-learning, scikit-learn, nlp

Asked by moeabdol

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance, what does the min/max document frequency exactly mean? Is it the frequency of a word in its particular text file, or is it the frequency of the word in the entire corpus (all 5 text files)?

How is it different when min_df and max_df are provided as integers or as floats?


The documentation doesn't seem to provide a thorough explanation, nor does it supply an example demonstrating the use of min_df and/or max_df. Could someone provide an explanation or example demonstrating min_df or max_df?

Accepted answer by Kevin Markham

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example:

  • max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
  • max_df = 25 means "ignore terms that appear in more than 25 documents".

The default max_df is 1.0, which means "ignore terms that appear in more than 100% of the documents". Thus, the default setting does not ignore any terms.
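
A minimal sketch of max_df in action (the corpus and threshold below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# A hypothetical four-document corpus; "apple" appears in every document.
docs = [
    "apple banana",
    "apple cherry",
    "apple banana cherry",
    "apple date",
]

# max_df=0.75 means: ignore terms that appear in more than 75% of the documents.
cv = CountVectorizer(max_df=0.75)
cv.fit(docs)
print(sorted(cv.vocabulary_))  # ['banana', 'cherry', 'date'] -- 'apple' is pruned
print(cv.stop_words_)          # {'apple'} -- terms removed by the df thresholds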



min_df is used for removing terms that appear too infrequently. For example:

  • min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
  • min_df = 5 means "ignore terms that appear in less than 5 documents".

The default min_df is 1, which means "ignore terms that appear in less than 1 document". Thus, the default setting does not ignore any terms.
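
And a matching sketch for min_df, using the same made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer

# Same hypothetical corpus; "date" appears in only one document.
docs = [
    "apple banana",
    "apple cherry",
    "apple banana cherry",
    "apple date",
]

# min_df=2 means: ignore terms that appear in fewer than 2 documents.
cv = CountVectorizer(min_df=2)
cv.fit(docs)
print(sorted(cv.vocabulary_))  # ['apple', 'banana', 'cherry'] -- 'date' is pruned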

Answer by Ffisegydd

As per the CountVectorizer documentation here.

When using a float in the range [0.0, 1.0], they refer to the document frequency. That is, the percentage of documents that contain the term.

When using an int, it refers to the absolute number of documents that hold this term.

Consider the example where you have 5 text files (or documents). If you set max_df = 0.6, that would translate to 0.6*5 = 3 documents. If you set max_df = 2, that would simply translate to 2 documents.
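
A quick sketch of that conversion, mirroring the scikit-learn logic quoted in the snippet below (the doc_count helper is just for illustration):

import numbers

def doc_count(max_df, n_doc):
    # A float threshold is scaled by the corpus size; an int is taken as-is.
    return max_df if isinstance(max_df, numbers.Integral) else max_df * n_doc

print(doc_count(0.6, 5))  # 3.0 -- float scaled by the 5-document corpus
print(doc_count(2, 5))    # 2   -- int used directly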

The source code example below is copied from GitHub here and shows how the max_doc_count is constructed from the max_df. The code for min_df is similar and can be found on the GH page.

max_doc_count = (max_df
                 if isinstance(max_df, numbers.Integral)
                 else max_df * n_doc)

The defaults for min_df and max_df are 1 and 1.0, respectively. This basically says "If my term is found in only 1 document, then it's ignored. Similarly, if it's found in all documents (100% or 1.0), then it's ignored."

max_df and min_df are both used internally to calculate max_doc_count and min_doc_count, the maximum and minimum number of documents that a term must be found in. These are then passed to self._limit_features as the keyword arguments high and low respectively; the docstring for self._limit_features is

"""Remove too rare or too common features.

Prune features that are non zero in more samples than high or less
documents than low, modifying the vocabulary, and restricting it to
at most the limit most frequent.

This does not prune samples with zero features.
"""

Answer by Monica Heddneck

The defaults for min_df and max_df are 1 and 1.0, respectively. These defaults really don't do anything at all.


That being said, I believe the currently accepted answer by @Ffisegydd isn't quite correct.

For example, run this using the defaults to see that, when min_df=1 and max_df=1.0:

1) all tokens that appear in at least one document are used (i.e., all tokens!)

2) all tokens that appear in all documents are used (we'll test with one candidate: 'everywhere').

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=1, max_df=1.0, lowercase=True)
# here is just a simple list of 3 documents.
corpus = ['one two three everywhere', 'four five six everywhere', 'seven eight nine everywhere']
# below we call fit_transform on the corpus and get the feature names.
X = cv.fit_transform(corpus)
vocab = cv.get_feature_names()  # on newer scikit-learn, use get_feature_names_out()
print(vocab)
print(X.toarray())
print(cv.stop_words_)  # terms pruned by min_df/max_df, if any

We get:


['eight', 'everywhere', 'five', 'four', 'nine', 'one', 'seven', 'six', 'three', 'two']
[[0 1 0 0 0 1 0 0 1 1]
 [0 1 1 1 0 0 0 1 0 0]
 [1 1 0 0 1 0 1 0 0 0]]
set()

All tokens are kept. There are no stopwords.


Further messing around with the arguments will clarify other configurations.


For fun and insight, I'd also recommend playing around with stop_words = 'english' and seeing that, peculiarly, all the words except 'seven' are removed! Including 'everywhere'.
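
A sketch of that experiment, reusing the corpus above (the expected result follows the claim in this answer):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['one two three everywhere', 'four five six everywhere', 'seven eight nine everywhere']
cv = CountVectorizer(stop_words='english')
cv.fit(corpus)
# Per the observation above, every token except 'seven' sits in
# scikit-learn's built-in English stop word list, so we expect ['seven'].
print(sorted(cv.vocabulary_))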

Answer by Amirabbas Askary

I would also add this point for a better understanding of min_df and max_df in tf-idf.

If you go with the default values, meaning all terms are considered, you will definitely generate more tokens, so your clustering process (or anything else you want to do with those terms later) will take longer.

BUT the quality of your clustering should NOT be reduced.


One might think that allowing all terms (e.g. overly frequent terms or stop words) to be present would lower the quality, but with tf-idf it doesn't: the tf-idf measure naturally gives those terms a low score, effectively making them uninfluential (as they appear in many documents).
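
A small sketch of that down-weighting effect (corpus made up; idf values depend on scikit-learn's default smoothed formula):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['the cat sat', 'the dog ran', 'the bird flew']
tfidf = TfidfVectorizer()  # no min_df/max_df pruning
tfidf.fit(corpus)
for term, idx in sorted(tfidf.vocabulary_.items()):
    print(term, round(tfidf.idf_[idx], 3))
# 'the' appears in all 3 documents and gets the smallest idf (1.0 here),
# so it carries little weight even though it was never pruned.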

So to sum it up, pruning terms via min_df and max_df is meant to improve performance, not the quality of the clusters (to take clustering as an example).

And the crucial point is that if you set min_df and max_df mistakenly, you will lose some important terms and thus lower the quality. So if you are unsure about the right threshold (it depends on your document set), or if you are confident in your machine's processing capabilities, leave the min_df and max_df parameters unchanged.

Answer by kavgan

The goal of MIN_DF is to ignore words that have too few occurrences to be considered meaningful. For example, your text may contain names of people that appear in only 1 or 2 documents. In some applications this may qualify as noise and could be eliminated from further analysis. Similarly, you can ignore words that are too common with MAX_DF.

Instead of using a minimum/maximum term frequency (total occurrences of a word) to eliminate words, MIN_DF and MAX_DF look at how many documents contain a term, better known as the document frequency. The threshold values can be an absolute count (e.g. 1, 2, 3, 4) or a proportion of documents (e.g. 0.25, meaning: ignore words that have appeared in 25% of the documents).
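
To make the term-frequency vs. document-frequency distinction concrete (a made-up example):

from sklearn.feature_extraction.text import CountVectorizer

# 'llama' occurs 5 times overall (high term frequency) but in only
# 1 of 3 documents (document frequency = 1), so min_df=2 prunes it
# no matter how often it repeats within that single document.
docs = [
    "llama llama llama llama llama",
    "cat dog",
    "cat bird",
]
cv = CountVectorizer(min_df=2)
cv.fit(docs)
print(sorted(cv.vocabulary_))  # ['cat'] -- the only term in >= 2 documents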

See some usage examples here.
