
Disclaimer: this page is a rendering of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute the original authors (not the translator). Original: http://stackoverflow.com/questions/22920801/

Date: 2020-08-19 01:59:03 · Source: igfitidea

Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?

Tags: python, machine-learning, scikit-learn, tf-idf

Asked by tumultous_rooster

I have been working with the CountVectorizer class in scikit-learn.


I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.


These tokens are extracted from a set of keywords, i.e.


tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

The next step is:


from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):  # simple comma-splitting tokenizer (assumed by the original post)
    return [t.strip() for t in text.split(",")]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)

Where we get


[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

This is fine, but my situation is just a little bit different.


I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.


In other words, how can I get counts of another set of documents, say,


list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]
]

And get:


[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]

I have read the documentation for the CountVectorizer class, and came across the vocabulary argument, which is a mapping of terms to feature indices. I can't seem to get this argument to help me, however.


Any advice is appreciated.
PS: all credit due to Matthias Friedrich's Blog for the example I used above.


Accepted answer by BrenBarn

You're right that vocabulary is what you want. It works like this:


>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

So you pass it your desired features, either as an iterable of terms (as above) or as a dict mapping each term to its column index.

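For the dict form, a quick sketch in the same session (the mapping here is illustrative):

>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary={'hot': 0, 'cold': 1, 'old': 2})
>>> cv.transform(['pease porridge hot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 0, 1]], dtype=int64)

Because the vocabulary is fixed up front, you can call transform directly without fitting first.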

If you used CountVectorizer on one set of documents and then you want to use the set of features from those documents for a new set, use the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do


newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer that uses the vocabulary from your first one.

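Putting it together, a sketch assuming the vec and tokenize defined in the question above:

new_docs = ["python, chicken", "linux, cow, ubuntu", "machine learning, bird, fish, pig"]
newVec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)
print(newVec.transform(new_docs).toarray())
# [[0 0 0 1 0 0]
#  [0 1 0 0 0 1]
#  [0 0 0 0 0 0]]

Note that the new vectorizer should use the same tokenizer as the original one, so that the incoming text is split into tokens the vocabulary can actually match.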

Answer by Dhruv Ghulati

You should call fit_transform or just fit on your original vocabulary source so that the vectorizer learns a vocab.


Then you can use this fitted vectorizer on any new data source via the transform() method.


You can obtain the vocabulary produced by the fit (i.e. the mapping of word to token ID) via vectorizer.vocabulary_ (assuming you named your CountVectorizer vectorizer).

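A minimal sketch of that fit-then-transform pattern (the document strings here are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["pease porridge hot", "pease porridge cold"])  # learn the vocabulary
print(vectorizer.vocabulary_)  # {'pease': 2, 'porridge': 3, 'hot': 1, 'cold': 0}
print(vectorizer.transform(["porridge in the pot"]).toarray())  # [[0 0 0 1]]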

Answer by user2476665

>>> tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

>>> list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]

]

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vect = CountVectorizer()
>>> tags = vect.fit_transform(tags)

# vocabulary learned by CountVectorizer (vect)
>>> print(vect.vocabulary_)
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2}

# counts for tags
>>> tags.toarray()
array([[0, 0, 0, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 1, 0]], dtype=int64)

# to use `transform`, `list_of_new_documents` should be a list of strings 
# `itertools.chain.from_iterable` flattens the nested list into a flat list of strings

>>> from itertools import chain
>>> new_docs = list(chain.from_iterable(list_of_new_documents))
>>> new_docs = vect.transform(new_docs)

# finally, counts for new_docs!
>>> new_docs.toarray()
array([[0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0]])

To verify that CountVectorizer is using the vocabulary learned from tags on new_docs: print vect.vocabulary_ again, or compare the output of new_docs.toarray() to that of tags.toarray().

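For instance, one quick check in the same session:

>>> print(vect.vocabulary_)  # unchanged by transform
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2}
>>> new_docs.toarray().shape == tags.toarray().shape  # same 7 vocabulary columns
True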