Python: Using Sklearn's TfidfVectorizer transform

Notice: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must do so under the same CC BY-SA license, cite the original URL and author information, and attribute it to the original authors (not me). StackOverflow original: http://stackoverflow.com/questions/20132070/

Time: 2020-08-18 19:39:10  Source: igfitidea

Using Sklearn's TfidfVectorizer transform

Tags: python, document, text-mining, tf-idf

Asked by Sterling

I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document.

from sklearn.feature_extraction.text import TfidfVectorizer

self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', 
                 stop_words='english')
self.vect.fit_transform(self.vocabulary)

...

doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)

The problem is that this returns a matrix with n rows where n is the size of my doc string. I want it to return just a single vector representing the tf-idf for the entire string. How can I make this see the string as a single document, rather than each character being a document? Also, I am very new to text mining so if I am doing something wrong conceptually, that would be great to know. Any help is appreciated.

Accepted answer by alko

If you want to compute tf-idf only for a given vocabulary, pass it as the vocabulary argument to the TfidfVectorizer constructor:

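from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "a list of words I want to look for in the documents".split()
# Note: max_df is ignored when an explicit vocabulary is supplied.
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)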

Then, to fit the vectorizer on a given corpus (an iterable of documents), i.e. to compute the counts behind the idf weights, use fit:

vect.fit(corpus)
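
As a minimal sketch of this step, assuming a small made-up corpus (the three strings below are illustrative and not from the question) and leaving every constructor argument except vocabulary at its default, fitting fills in the state that transform uses later:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "a list of words I want to look for in the documents".split()

# Hypothetical training corpus: an iterable of strings, one string per document.
corpus = [
    "a list of words to look for",
    "the documents I want",
    "look in the list of documents",
]

# Other constructor arguments are left at their defaults for brevity.
vect = TfidfVectorizer(vocabulary=vocabulary)
vect.fit(corpus)

# With a fixed vocabulary, each term keeps its position in the supplied list.
print(vect.vocabulary_["list"])  # column index of the term "list"

# fit learns one idf weight per vocabulary term.
print(vect.idf_.shape)
```

Note that the corpus here is a list of whole documents, not a list of individual words as in the question.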

The fit_transform method is shorthand for

vect.fit(corpus)
corpus_tf_idf = vect.transform(corpus) 
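
A quick way to convince yourself of this equivalence is to compare both routes on the same input. The sketch below assumes a tiny made-up corpus and default vectorizer settings; the variable names are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, one string per document.
corpus = [
    "a list of words to look for",
    "the documents I want",
    "look in the list of documents",
]

# Route 1: fit and transform in a single call.
one_step = TfidfVectorizer().fit_transform(corpus)

# Route 2: fit first, then transform the same corpus.
two_step_vect = TfidfVectorizer()
two_step_vect.fit(corpus)
two_step = two_step_vect.transform(corpus)

# Both routes return sparse matrices; compare them as dense arrays.
same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```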

Last, the transform method accepts a corpus, so a single document should be passed as a one-element list; otherwise the string is treated as an iterable of symbols, each symbol being a document.

doc_tfidf = vect.transform([doc])
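
To make the difference concrete, here is a small sketch, again assuming a made-up training corpus and default settings apart from vocabulary. It shows why a bare string looks like many documents and that wrapping it in a list yields a single row:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary = "a list of words I want to look for in the documents".split()

# Hypothetical training corpus, one string per document.
corpus = [
    "a list of words to look for",
    "the documents I want",
    "look in the list of documents",
]
vect = TfidfVectorizer(vocabulary=vocabulary).fit(corpus)

doc = "some string I want to get tf-idf vector for"

# Iterating over a bare string yields its characters, one "document" each.
print(len(list(doc)))
# Wrapping the string in a list gives an iterable with exactly one document.
print(len([doc]))

# One row for the single document, one column per vocabulary term.
doc_tfidf = vect.transform([doc])
print(doc_tfidf.shape)
```

The result is a sparse matrix with one row; doc_tfidf.toarray()[0] gives it as a plain 1-D vector. As far as I know, recent scikit-learn releases reject a bare string passed to transform with a ValueError instead of silently splitting it into characters, but the one-element list is the right call either way.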