Python: How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

Disclaimer: this page is a translation of a popular StackOverflow question and is provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license and attribute it to the original authors (not me). Original question: http://stackoverflow.com/questions/27488446/


How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

Tags: python, scikit-learn

Asked by Adrien

I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.


import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird","bird"]
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.vocabulary_)
#{u'bird': 0, u'cat': 1, u'dog': 2, u'fish': 3}

I was expecting it to return {u'bird': 2, u'cat': 3, u'dog': 2, u'fish': 2}.


Accepted answer by Ffisegydd

cv.vocabulary_ in this instance is a dict, where the keys are the words (features) that were found and the values are their column indices, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)


You need to work with the cv_fit object to get the counts:


from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.get_feature_names())
print(cv_fit.toarray())
#['bird', 'cat', 'dog', 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

Each row in the array corresponds to one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. You can see that if you sum each column you get the correct numbers:


print(cv_fit.toarray().sum(axis=0))
#[2 3 2 2]
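
A quick sketch building on this (reusing the cv and cv_fit objects above): because the values in cv.vocabulary_ are exactly these column indices, the column sums map straight back to the words:

counts = cv_fit.toarray().sum(axis=0)
print({word: int(counts[idx]) for word, idx in sorted(cv.vocabulary_.items())})
#{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}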

Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.

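A minimal sketch of that Counter-based approach, assuming a plain str.split() tokenizer is good enough for the texts above:

from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
counter = Counter(word for text in texts for word in text.split())
print(counter)
#Counter({'cat': 3, 'dog': 2, 'fish': 2, 'bird': 2})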

Answer by pieterbons

cv_fit.toarray().sum(axis=0) definitely gives the correct result, but it will be much faster to perform the sum on the sparse matrix and then convert it to an array:


np.asarray(cv_fit.sum(axis=0))
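
As a follow-up sketch (assuming the cv and cv_fit objects from the question are in scope): the sparse sum comes back as a 1 x n matrix, so it is usually flattened before pairing it with the feature names:

import numpy as np

counts = np.asarray(cv_fit.sum(axis=0)).ravel()  # 1 x n matrix -> flat array
print({word: int(count) for word, count in zip(cv.get_feature_names(), counts)})
# cv.get_feature_names_out() on scikit-learn >= 1.0
#{'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}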

Answer by YASH GUPTA

We are going to use zip to build a dict from the list of words and the list of their counts:


import pandas as pd
import numpy as np    
from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird","bird"]    

cv = CountVectorizer()   
cv_fit=cv.fit_transform(texts)    
word_list = cv.get_feature_names()
count_list = cv_fit.toarray().sum(axis=0)

print(word_list)
#['bird', 'cat', 'dog', 'fish']
print(count_list)
#[2 3 2 2]
print(dict(zip(word_list, count_list)))
#{'fish': 2, 'dog': 2, 'bird': 2, 'cat': 3}


Answer by Pradeep Singh

Combining everyone else's views and some of my own :) here is what I have for you:


from collections import Counter
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.corpus import stopwords
# word_tokenize and stopwords need the 'punkt' and 'stopwords' NLTK data packages

text='''Note that if you use RegexpTokenizer option, you lose 
natural language features special to word_tokenize 
like splitting apart contractions. You can naively 
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
most_common

Output


(All counts are 1, since no non-stopword appears twice in the sample text)

[('note', 1),
 ('use', 1),
 ('regexptokenizer', 1),
 ('option', 1),
 ('lose', 1),
 ('natural', 1),
 ('language', 1),
 ('features', 1),
 ('special', 1),
 ('word', 1),
 ('tokenize', 1),
 ('like', 1),
 ('splitting', 1),
 ('apart', 1),
 ('contractions', 1),
 ('naively', 1),
 ('split', 1),
 ('regex', 1),
 ('without', 1),
 ('need', 1)]

One can do better than this in terms of efficiency, but if you are not too worried about that, this code does the job.
