Python: Understanding the `ngram_range` argument in a CountVectorizer in sklearn

Note: this page is a translation of a popular Stack Overflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and author information, and attribute it to the original authors (not me): Stack Overflow
Original URL: http://stackoverflow.com/questions/24005762/

Time: 2020-08-19 03:49:32  Source: igfitidea

Understanding the `ngram_range` argument in a CountVectorizer in sklearn

python scikit-learn n-gram feature-selection

Asked by tumultous_rooster

I'm a little confused about how to use n-grams in the scikit-learn library in Python, specifically, how the `ngram_range` argument works in a CountVectorizer.

Running this code:

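from sklearn.feature_extraction.text import CountVectorizer
vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
# Recent scikit-learn releases only set vocabulary_ during fit/transform,
# so fit on a placeholder document before reading the attribute.
cv.fit(['placeholder'])
print(cv.vocabulary_)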

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2}

I was under the (obviously mistaken) impression that I would get unigrams and bigrams, like this:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4}

I am working with the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html

Clearly there is something terribly wrong with my understanding of how to use ngrams. Perhaps the argument is having no effect or I have some conceptual issue with what an actual bigram is! I'm stumped. If anyone has a word of advice to throw my way, I'd be grateful.

UPDATE:
I have realized the folly of my ways. I was under the impression that `ngram_range` would affect the vocabulary, not the corpus.
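
The point in the update can be checked directly. The following is a minimal sketch, assuming a recent scikit-learn release; the sample document and variable names are illustrative and not from the original post, and the vocabulary entry `'hi'` is written without the trailing space used in the question. With the same fixed vocabulary, `ngram_range` only changes which n-grams are generated from the document being counted:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi', 'bye', 'run away']
docs = ['hi there run away bye']  # illustrative document, not from the post

# Unigrams only: the bigram 'run away' is never generated, so its count is 0.
unigrams_only = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 1))
counts_unigram = unigrams_only.fit_transform(docs).toarray()

# Unigrams and bigrams: 'run away' is now generated and matched.
with_bigrams = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
counts_bigram = with_bigrams.fit_transform(docs).toarray()

# Columns follow the order of `vocabulary`: 'hi', 'bye', 'run away'.
print(counts_unigram)
print(counts_bigram)
```

In both cases the vocabulary stays exactly as supplied; only the counts for the bigram column differ.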

Accepted answer by Fred Foo

Setting the `vocabulary` explicitly means no vocabulary is learned from data. If you don't set it, you get:

>>> from pprint import pprint
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{'an': 0,
 'an apple': 1,
 'apple': 2,
 'apple day': 3,
 'away': 4,
 'day': 5,
 'day keeps': 6,
 'doctor': 7,
 'doctor away': 8,
 'keeps': 9,
 'keeps the': 10,
 'the': 11,
 'the doctor': 12}
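
To see which terms are generated before they are indexed into a vocabulary, the analyzer can be called on its own. This is a small sketch, assuming a recent scikit-learn release; the variable names are illustrative. `build_analyzer()` returns the callable that the vectorizer applies to each document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer performs preprocessing, tokenization and n-gram generation.
analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
terms = analyzer("an apple a day keeps the doctor away")

# Split the generated terms by whether they contain a space.
unigrams = [t for t in terms if ' ' not in t]
bigrams = [t for t in terms if ' ' in t]

print(unigrams)
print(bigrams)
```

The learned `vocabulary_` is the set of these terms, each mapped to a column index.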

An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

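(Note that token filtering happens before n-gram extraction, hence "apple day": with the default settings, the single-character token "a" is dropped by the default `token_pattern`, and stop words, when a `stop_words` list is set, are likewise removed before n-grams are built.)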
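
The origin of "apple day" can be explored with the analyzer. This is a hedged sketch, assuming a recent scikit-learn release: the alternative `token_pattern` and the one-word stop list are illustrative choices, not part of the original answer. With default settings the one-character token "a" is not kept, so "apple" and "day" become adjacent before bigrams are formed; a custom stop-word list is applied at the same stage:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "an apple a day keeps the doctor away"

# Default settings: single-character tokens such as "a" are not kept.
default_terms = CountVectorizer(ngram_range=(2, 2)).build_analyzer()(text)

# Assumed alternative pattern that also keeps one-character tokens.
keep_short = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
short_terms = keep_short.build_analyzer()(text)

# An illustrative stop-word list: "the" is removed before bigrams are built.
no_the = CountVectorizer(ngram_range=(2, 2), stop_words=["the"])
no_the_terms = no_the.build_analyzer()(text)

print(default_terms)
print(short_terms)
print(no_the_terms)
```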