Original question: http://stackoverflow.com/questions/36572221/
Note: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors on Stack Overflow.
How to find ngram frequency of a column in a pandas dataframe?
Asked by GeorgeOfTheRF
Below is the input pandas dataframe I have.
I want to find the frequency of unigrams and bigrams. A sample of what I am expecting is shown below.
How can I do this using nltk or scikit-learn?
I wrote the code below, which takes a string as input. How can I extend it to a Series/DataFrame?
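import nltk
from nltk.collocations import BigramCollocationFinder

desc = 'john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)  # needs NLTK's tokenizer data to be downloaded
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.items()  # (bigram, count) pairs; viewitems() exists only in Python 2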
Answered by Till
If your data looks like this:
import pandas as pd
df = pd.DataFrame([
'must watch. Good acting',
'average movie. Bad acting',
'good movie. Good acting',
'pathetic. Avoid',
'avoid'], columns=['description'])
You could use CountVectorizer from the sklearn package:
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names_out(), columns=['frequency'])
Which gives you:
                frequency
acting                  3
average                 1
average movie           1
avoid                   2
bad                     1
bad acting              1
good                    3
good acting             2
good movie              1
movie                   2
movie bad               1
movie good              1
must                    1
must watch              1
pathetic                1
pathetic avoid          1
watch                   1
watch good              1
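If you want to sanity-check the counts without scikit-learn, here is a minimal sketch using only the standard library. It assumes the same tokenization as CountVectorizer's defaults (lowercased text, tokens of two or more word characters); the helper name ngram_counts is made up for this example.

```python
import re
from collections import Counter

docs = [
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid',
]

# Assumption: mimic CountVectorizer's default tokenization
# (lowercase, tokens of two or more word characters).
TOKEN = re.compile(r"\b\w\w+\b")


def ngram_counts(texts, n_min=1, n_max=2):
    """Count every n-gram with n_min <= n <= n_max over all texts (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN.findall(text.lower())
        for n in range(n_min, n_max + 1):
            # Slide a window of n tokens over the document.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts(docs)
print(counts["acting"], counts["good acting"], counts["avoid"])  # prints: 3 2 2
```

Note that n-grams are built after punctuation is dropped, so "watch good" counts as a bigram even though a period separates the two words in the text.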
EDIT
fit will just "train" your vectorizer: it splits your corpus into words and builds a vocabulary from them. transform can then take a new document and create a vector of frequencies based on the vectorizer's vocabulary.
Here your training set is also your output set, so you can do both at the same time with fit_transform. Because you have 5 documents, it creates 5 vectors, returned as the rows of a matrix. You want a single global vector, so you have to sum the rows.
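To make the fit / transform split concrete, here is a small sketch. The two-document corpus and the new document are made up for illustration; only the CountVectorizer calls come from the answer above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus (an assumption, not the asker's data).
corpus = ['good movie. Good acting', 'pathetic. Avoid']

vectorizer = CountVectorizer(ngram_range=(1, 2), analyzer='word')
vectorizer.fit(corpus)  # only learns the vocabulary, produces no counts yet

# transform counts just the n-grams already in the vocabulary;
# unseen terms such as "great" or "great movie" are ignored.
new_doc = ['good acting, great movie']
row = vectorizer.transform(new_doc).toarray()[0]

# Map each vocabulary term that occurs in new_doc to its count.
seen = {term: int(row[idx])
        for term, idx in vectorizer.vocabulary_.items()
        if row[idx]}
print(seen)
```

Calling fit_transform(corpus) is equivalent to fit(corpus) followed by transform(corpus), which is why the answer can do both in one step.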
EDIT 2
For big dataframes, you can speed up the frequency computation by using:
frequencies = sum(sparse_matrix).data