Counting n-gram frequency in Python NLTK

Disclaimer: this page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use/share it, but you must attribute it to the original authors (not me): StackOverflow. Original URL: http://stackoverflow.com/questions/14364762/



Tags: python, nltk, n-gram

Asked by Rkz

I have the following code. I know that I can use the apply_freq_filter function to filter out collocations that occur fewer times than a given frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (bigrams, in my case) in a document before I decide what frequency threshold to set for filtering. As you can see, I am using the nltk collocations class.


import nltk
from nltk.collocations import *

# Read the whole file into one string, then split on whitespace.
line = ""
open_file = open('a_text_file', 'r')
for val in open_file:
    line += val
tokens = line.split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)
print(finder.nbest(bigram_measures.pmi, 100))

Accepted answer by Rkz

The finder.ngram_fd.viewitems() function works.

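For example, a minimal sketch in Python 3 (where the Python 2 viewitems() view is exposed as items() instead), reusing the finder from the question:

# ngram_fd is a FreqDist mapping each bigram tuple to its raw count
for bigram, freq in finder.ngram_fd.items():
    print(bigram, freq)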

Answered by Ram Narasimhan

NLTK comes with its own bigrams generator, as well as a convenient FreqDist() function.


import nltk

f = open('a_text_file')
raw = f.read()

tokens = nltk.word_tokenize(raw)

# Create your bigrams
bgs = nltk.bigrams(tokens)

# Compute the frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k, v in fdist.items():
    print(k, v)

Once you have access to the bigrams and their frequency distribution, you can filter according to your needs, as sketched below.

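For instance, a minimal sketch of such a filter, assuming an illustrative cutoff of 3 occurrences:

# Keep only the bigrams that occur at least 3 times (the cutoff is an assumption)
frequent = [(bg, count) for bg, count in fdist.items() if count >= 3]
print(frequent)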

Hope that helps.


Answered by Vahab

from nltk import FreqDist
from nltk.util import ngrams

def compute_freq():
    textfile = open('corpus.txt', 'r')

    bigramfdist = FreqDist()
    threeramfdist = FreqDist()

    for line in textfile:
        # Skip empty lines (a bare newline has length 1)
        if len(line) > 1:
            tokens = line.strip().split(' ')

            bigrams = ngrams(tokens, 2)
            bigramfdist.update(bigrams)

            # Update the trigram counts as well
            trigrams = ngrams(tokens, 3)
            threeramfdist.update(trigrams)

    textfile.close()
    return bigramfdist, threeramfdist

compute_freq()
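A quick way to inspect the result, assuming the function returns the two distributions as in the sketch above:

bigram_fd, trigram_fd = compute_freq()
print(bigram_fd.most_common(10))  # the ten most frequent bigrams with counts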

Answered by avinash nahar

I tried all the above and found a simpler solution. NLTK's FreqDist comes with a most_common() method that returns the most frequent ngrams directly.


Here, filtered_sentence is my list of word tokens.


import nltk

# Frequency distribution over single tokens (optional, for reference)
word_fd = nltk.FreqDist(filtered_sentence)
# Frequency distribution over bigrams
bigram_fd = nltk.FreqDist(nltk.bigrams(filtered_sentence))

bigram_fd.most_common()

This should give output like:


[(('working', 'hours'), 31),
 (('9', 'hours'), 14),
 (('place', 'work'), 13),
 (('reduce', 'working'), 11),
 (('improve', 'experience'), 9)]
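
Since FreqDist inherits from collections.Counter, you can also pass a count, e.g. bigram_fd.most_common(5), to get only the top five pairs.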