python中的n-gram,四、五、六克?

声明:本页面是StackOverFlow热门问题的中英对照翻译,遵循CC BY-SA 4.0协议,如果您需要使用它,必须同样遵循CC BY-SA许可,注明原文地址和作者信息,同时你必须将它归于原作者(不是我):StackOverFlow 原文地址: http://stackoverflow.com/questions/17531684/
Warning: these are provided under cc-by-sa 4.0 license. You are free to use/share it, But you must attribute it to the original authors (not me): StackOverFlow

提示:将鼠标放在中文语句上可以显示对应的英文。显示中英文
时间:2020-08-19 08:26:35  来源:igfitidea点击:

n-grams in python, four, five, six grams?

pythonstringnltkn-gram

提问by Shifu

I'm looking for a way to split a text into n-grams. Normally I would do something like:

我正在寻找一种将文本拆分为 n-gram 的方法。通常我会做这样的事情:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print string_bigrams

I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams?

我知道 nltk 只提供二元组和三元组,但是有没有办法将我的文本分成四克、五克甚至一百克?

Thanks!

谢谢!

采纳答案by alvas

Great native python based answers given by other users. But here's the nltkapproach (just in case, the OP gets penalized for reinventing what's already existing in the nltklibrary).

其他用户给出的基于 Python 的优秀答案。但这是nltk方法(以防万一,OP 因重新发明nltk库中已有的内容而受到惩罚)。

There is an ngram modulethat people seldom use in nltk. It's not because it's hard to read ngrams, but training a model base on ngrams where n > 3 will result in much data sparsity.

有一个人们很少使用的ngram 模块nltk。这不是因为很难阅读 ngram,而是基于 ngram 训练模型,其中 n > 3 会导致大量数据稀疏。

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
  print grams

回答by Nik

I have never dealt with nltk but did N-grams as part of some small class project. If you want to find the frequency of all N-grams occurring in the string, here is a way to do that. Dwould give you the histogram of your N-words.

我从未处理过 nltk,但将 N-gram 作为一些小班项目的一部分。如果您想找到字符串中出现的所有 N-gram 的频率,这里有一种方法可以做到这一点。D会给你你的 N 字的直方图。

D = dict()
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts)-N): # N-grams
    try:
        D[tuple(strparts[i:i+N])] += 1
    except:
        D[tuple(strparts[i:i+N])] = 1

回答by tzaman

You can easily whip up your own function to do this using itertools:

您可以使用以下方法轻松创建自己的函数来执行此操作itertools

from itertools import izip, islice, tee
s = 'spam and eggs'
N = 3
trigrams = izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]

回答by inspectorG4dget

I'm surprised that this hasn't shown up yet:

我很惊讶这还没有出现:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in xrange(len(sentence)-N+1)]

In [37]: for gram in grams: print gram
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']

回答by M.A.Hassan

here is another simple way for do n-grams

这是做 n-gram 的另一种简单方法

>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]

回答by sel

For four_grams it is already in NLTK, here is a piece of code that can help you toward this:

对于four_grams它已经在NLTK中,这里有一段代码可以帮助你解决这个问题:

 from nltk.collocations import *
 import nltk
 #You should tokenize your text
 text = "I do not like green eggs and ham, I do not like them Sam I am!"
 tokens = nltk.wordpunct_tokenize(text)
 fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
 for fourgram, freq in fourgrams.ngram_fd.items():  
       print fourgram, freq

I hope it helps.

我希望它有帮助。

回答by Franck Dernoncourt

You can use sklearn.feature_extraction.text.CountVectorizer:

您可以使用sklearn.feature_extraction.text.CountVectorizer

import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size,ngram_size))
vect.fit(string)
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))

outputs:

输出:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

You can set to ngram_sizeto any positive integer. I.e. you can split a text in four-grams, five-grams or even hundred-grams.

您可以设置为ngram_size任何正整数。即您可以将文本拆分为四克、五克甚至一百克。

回答by Daniel Pérez Rada

Nltk is great, but sometimes is a overhead for some projects:

Nltk 很棒,但有时对某些项目来说是开销:

import re
def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in xrange(len(tokens)-ngrams+1)]

Example use:

使用示例:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]

回答by Δημητρη? Παππ??

Using only nltk tools

仅使用 nltk 工具

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n ):
    n_grams = ngrams(word_tokenize(text), n)
    return [ ' '.join(grams) for grams in n_grams]

Example output

示例输出

get_ngrams('This is the simplest text i could think of', 3 )

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

In order to keep the ngrams in array format just remove ' '.join

为了保持数组格式的 ngrams 只需删除 ' '.join

回答by Serendipity

A more elegant approach to build bigrams with python's builtin zip(). Simply convert the original string into a list by split(), then pass the list once normally and once offset by one element.

使用 python 的 builtin 构建二元组的更优雅的方法zip()。只需将原始字符串转换为列表split(),然后正常传递列表一次,并偏移一个元素。

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
  input_list = s.split(" ")
  return zip(*[input_list[i:] for i in range(n)])

find_bigrams(string)

[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]