Generating Ngrams (Unigrams, Bigrams etc) from a large corpus of .txt files and their Frequency

Disclaimer: this page is based on a popular StackOverflow question and its answers and is provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA terms and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/32441605/

python nltk

Asked by Arash

I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to input my files into the program.

The input is 300 .txt files written in English, and I want the output in the form of n-grams, especially their frequency counts.

I know that NLTK has Bigram and Trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html

but I am not advanced enough to incorporate them into my program.

input: txt files NOT single sentences

output example:

Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

My code up to now is:

from nltk.corpus import PlaintextCorpusReader

corpus = 'C:/Users/Hyman3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))

for file in files.fileids():
    generate(file, ngrams)

Any help on what should be done next?

Accepted answer by hellpanderr

Just use nltk.ngrams.

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. \
I need to write a program in NLTK that breaks a corpus"
token = nltk.word_tokenize(text)
bigrams = ngrams(token, 2)
trigrams = ngrams(token, 3)
fourgrams = ngrams(token, 4)
fivegrams = ngrams(token, 5)

print(Counter(bigrams))

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
 ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
 ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
 ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1,
 ('unigrams', ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1,
 ('trigrams', ','): 1, (',', 'bigrams'): 1, ('large', 'collection'): 1,
 ('bigrams', ','): 1, ('of', 'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1,
 ('fivegrams', '.'): 1, ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1,
 ('.', 'I'): 1, ('collection', 'of'): 1, ('files', ')'): 1})
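
If you also want the pairs sorted by how often they occur, Counter.most_common does that. A minimal follow-up to the snippet above; note that in recent NLTK versions ngrams returns a lazy iterator, so it is rebuilt here instead of reusing the one that was just printed:

freq = Counter(ngrams(token, 2))  # rebuild the bigram iterator before counting
print(freq.most_common(5))        # the five most frequent bigrams with their counts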

UPDATE (with pure python):

import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
# collect the text of every .txt file in the directory
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as f:
            corpus.append(f.read())

# accumulate bigram counts across all documents
frequencies = Counter()
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)
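
The question asks for every order from unigrams up to fivegrams, plus the counts written out per order. A hedged sketch extending the loop above; the output filename pattern corpus_<n>_grams.txt is only an illustration, not anything prescribed by NLTK:

# count every order from 1-grams to 5-grams over the whole corpus
all_frequencies = {n: Counter() for n in range(1, 6)}
for text in corpus:
    token = nltk.word_tokenize(text)
    for n in range(1, 6):
        all_frequencies[n] += Counter(ngrams(token, n))

# write one tab-separated frequency file per order, e.g. corpus_2_grams.txt
for n, counts in all_frequencies.items():
    with open('corpus_%d_grams.txt' % n, 'w') as out:
        for gram, count in counts.most_common():
            out.write(' '.join(gram) + '\t' + str(count) + '\n')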

Answered by Montmons

OK, so since you asked for an NLTK solution, this might not be exactly what you were looking for, but have you considered TextBlob? It has an NLTK backend but a simpler syntax. It would look something like this:

from textblob import TextBlob

text = "Paste your text or text-containing variable here" 
blob = TextBlob(text)
ngram_var = blob.ngrams(n=3)
print(ngram_var)

Output:
[WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]

You would of course still need to use Counter or some other method to add a count per n-gram, as sketched below.
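
A minimal sketch of that counting step, assuming the blob object from the snippet above; each WordList is turned into a tuple so it can be used as a Counter key:

from collections import Counter

# WordList objects are list subclasses and not hashable, so convert each n-gram to a tuple
ngram_counts = Counter(tuple(gram) for gram in blob.ngrams(n=3))
print(ngram_counts.most_common(3))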

However, the fastest approach (by far) I have been able to find to both create any n-gram you'd like and also count them in a single function stems from this post from 2012 and uses itertools. It's great.
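
That post is not reproduced here, but a short sketch in the same spirit, building and counting n-grams in one pass with zip and itertools.islice (the count_ngrams name is mine, not the post's):

from collections import Counter
from itertools import islice

def count_ngrams(tokens, n):
    # zip together n shifted views of the token list and count the resulting tuples
    return Counter(zip(*(islice(tokens, i, None) for i in range(n))))

counts = count_ngrams("Hi How are you ? i am fine and you".split(), 2)
print(counts.most_common(3))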

Answered by Aziz Alto

Here is a simple example using pure Python to generate any ngram:

>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'

>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]

>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]

>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest', 'languages']]
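
To turn this into frequencies you still need a counter. Note that the generator yields list slices, which are not hashable, so each one is converted to a tuple first (a small follow-up of mine, not part of the original answer):

>>> from collections import Counter
>>> Counter(tuple(g) for g in ngrams(txt.split(), n=2))
Counter({('Python', 'is'): 1, ('is', 'one'): 1, ('one', 'of'): 1, ('of', 'the'): 1, ('the', 'awesomest'): 1, ('awesomest', 'languages'): 1})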

Answered by Yann Dubois

If efficiency is an issue and you have to build multiple different n-grams, but you want to use pure Python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.

Answered by madjardi

Maybe it helps. See the link.

import spacy

nlp_en = spacy.load("en_core_web_sm")
doc = nlp_en("Hi How are you ? i am fine and you")  # run the pipeline on some text
tokens = [x.text for x in doc]  # the individual token strings
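
As far as I know spaCy itself does not ship an n-gram counter, so the tokens still have to be combined and counted by hand; a minimal sketch building on the tokens list above (my own addition, not from the linked answer):

from collections import Counter

# pair each token with the next one to form bigrams, then count them
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(5))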

Answered by A. Dew

The answer by @hellpander above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably every time the frequencies are updated, due to the expensive dictionary lookups as the content grows. So you need an additional buffer variable to cache the frequencies Counter from @hellpander's answer. Hence, instead of doing a key lookup against a very large frequencies Counter (dictionary) every time a new document is iterated, you add the counts to a temporary, smaller Counter dict. Then, after some iterations, it is added to the global frequencies. This way it is much faster, because the huge dictionary lookup is done far less frequently.

import os
import nltk
from nltk.util import ngrams
from collections import Counter

corpus = []
path = '.'
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as fh:
            corpus.append(fh.read())

frequencies = Counter()  # global counts over the whole corpus
f = Counter()            # small buffer counter

for i in range(0, len(corpus)):
    token = nltk.word_tokenize(corpus[i])
    bigrams = ngrams(token, 2)
    f += Counter(bigrams)
    if i % 10000 == 0:
        # merge the buffer into the global frequencies and clear it every 10000 docs
        frequencies += f
        f = Counter()

frequencies += f  # merge whatever is left in the buffer