Python: How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27488446/


How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

Tags: python, scikit-learn

Asked by Adrien

I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.


import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird","bird"]
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.vocabulary_)
#{u'bird': 0, u'cat': 1, u'dog': 2, u'fish': 3}

I was expecting it to return {u'bird': 2, u'cat': 3, u'dog': 2, u'fish': 2}.


Accepted answer by Ffisegydd

cv.vocabulary_ in this instance is a dict, where the keys are the words (features) that were found and the values are their column indices, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)


You need to work with the cv_fit object to get the counts:


from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.get_feature_names())
print(cv_fit.toarray())
#['bird', 'cat', 'dog', 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

Each row in the array corresponds to one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. You can see that if you sum each column you get the correct numbers:


print(cv_fit.toarray().sum(axis=0))
#[2 3 2 2]
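
A quick sketch building on this (reusing the cv and cv_fit objects above): because the values in cv.vocabulary_ are exactly these column indices, the column sums map straight back to the words:

counts = cv_fit.toarray().sum(axis=0)
print({word: int(counts[idx]) for word, idx in sorted(cv.vocabulary_.items())})
#{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}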

Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.

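A minimal sketch of that Counter-based approach, assuming a plain str.split() tokenizer is good enough for the texts above:

from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
counter = Counter(word for text in texts for word in text.split())
print(counter)
#Counter({'cat': 3, 'dog': 2, 'fish': 2, 'bird': 2})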

Answer by pieterbons

cv_fit.toarray().sum(axis=0) definitely gives the correct result, but it will be much faster to perform the sum on the sparse matrix and then convert it to an array:


np.asarray(cv_fit.sum(axis=0))
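
As a follow-up sketch (assuming the cv and cv_fit objects from the question are in scope): the sparse sum comes back as a 1 x n matrix, so it is usually flattened before pairing it with the feature names:

import numpy as np

counts = np.asarray(cv_fit.sum(axis=0)).ravel()  # 1 x n matrix -> flat array
print({word: int(count) for word, count in zip(cv.get_feature_names(), counts)})
# cv.get_feature_names_out() on scikit-learn >= 1.0
#{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}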

Answer by YASH GUPTA

We are going to use zip to build a dict from the list of words and the list of their counts:


import pandas as pd
import numpy as np    
from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird","bird"]    

cv = CountVectorizer()   
cv_fit=cv.fit_transform(texts)    
word_list = cv.get_feature_names()
count_list = cv_fit.toarray().sum(axis=0)

print(word_list)
#['bird', 'cat', 'dog', 'fish']
print(count_list)
#[2 3 2 2]
print(dict(zip(word_list, count_list)))
#{'fish': 2, 'dog': 2, 'bird': 2, 'cat': 3}


Answer by Pradeep Singh

Combining everyone else's views and some of my own :) here is what I have for you:


from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords
# word_tokenize and stopwords need the 'punkt' and 'stopwords' NLTK data packages

text='''Note that if you use RegexpTokenizer option, you lose 
natural language features special to word_tokenize 
like splitting apart contractions. You can naively 
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common

Output


(All counts are 1, since no non-stopword appears twice in the sample text)

[('note', 1),
 ('use', 1),
 ('regexptokenizer', 1),
 ('option', 1),
 ('lose', 1),
 ('natural', 1),
 ('language', 1),
 ('features', 1),
 ('special', 1),
 ('word', 1),
 ('tokenize', 1),
 ('like', 1),
 ('splitting', 1),
 ('apart', 1),
 ('contractions', 1),
 ('naively', 1),
 ('split', 1),
 ('regex', 1),
 ('without', 1),
 ('need', 1)]

One can do better than this in terms of efficiency, but if you are not too worried about that, this code does the job.
