Python NLTK: Bigrams, Trigrams, Fourgrams

Question

Asked by M.A.Hassan

I have this example and I want to know how to get the result described below. I tokenize some text, then collect the bigrams, trigrams, and fourgrams like this:

from nltk import word_tokenize
from nltk.util import ngrams

text = "Hi How are you? i am fine and you"
token = word_tokenize(text)
bigrams = list(ngrams(token, 2))  # ngrams() returns an iterator; list() materializes it

bigrams: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

trigrams = list(ngrams(token, 3))

trigrams: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

bigram [(a,b) (b,c) (c,d)]
trigram [(a,b,c) (b,c,d) (c,d,f)]
I want the new trigram list to be [(c,d,f)], that is, only the trigrams that start where a previous one ended. For the example above, that means:
newtrigram = [('are', 'you', '?'), ('?', 'i', 'am'), ...]

Any ideas would be helpful.

Answer 1

Accepted answer by prooffreader

If I'm interpreting your question correctly, the trigrams you want are simply the slices [2:5], [4:7], [6:9], etc. of the token list: each one is three tokens long and starts two positions after the previous one.

You could generate them like this:

>>> new_trigrams = []
>>> c = 2
>>> while c < len(token) - 2:
...     new_trigrams.append((token[c], token[c+1], token[c+2]))
...     c += 2
>>> print(new_trigrams)
[('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
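
The same slicing idea can be written with the start offset, n-gram size, and step as parameters. The following is a minimal sketch in plain Python; the name `strided_ngrams` and its parameters are made up for illustration and are not part of NLTK. The token list is typed out by hand to match the tokenizer output shown in the question.

```python
def strided_ngrams(tokens, n, step=1, start=0):
    """Return n-grams of `tokens` as tuples, taking one every `step` positions.

    Hypothetical helper for illustration; not an NLTK function.
    """
    # The last valid start index is len(tokens) - n, so range stops one past it.
    return [tuple(tokens[i:i + n]) for i in range(start, len(tokens) - n + 1, step)]


# Tokens as shown in the question, written out by hand.
token = ['Hi', 'How', 'are', 'you', '?', 'i', 'am', 'fine', 'and', 'you']

new_trigrams = strided_ngrams(token, 3, step=2, start=2)
print(new_trigrams)
# [('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
```

With `step=1` and `start=0` this reduces to ordinary overlapping n-grams, so the same helper covers both cases.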

Answer 2

Answer by Lewistrick

I do it like this:

def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]

This takes a list of words as input and returns a list of n-grams (for the given n), each joined into a single string with sep (a space by default).
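
As a quick check of the function above, here is what it returns for a short sentence. The second function, `words_to_ngram_tuples`, is a hypothetical variant added here for cases where tuples (as in NLTK's output) are preferred over joined strings; it uses `zip` over shifted copies of the list, which is a different technique from the one in the answer.

```python
def words_to_ngrams(words, n, sep=" "):
    # Same function as in the answer: each n-gram is joined into one string.
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]


def words_to_ngram_tuples(words, n):
    # Hypothetical variant: zip n shifted views of the list to get tuples.
    return list(zip(*(words[i:] for i in range(n))))


words = "Hi How are you".split()
print(words_to_ngrams(words, 2))        # ['Hi How', 'How are', 'are you']
print(words_to_ngrams(words, 3, "_"))   # ['Hi_How_are', 'How_are_you']
print(words_to_ngram_tuples(words, 2))  # [('Hi', 'How'), ('How', 'are'), ('are', 'you')]
```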

Answer 3

Answer by alvas

Try everygrams:

from nltk import everygrams
list(everygrams('hello', 1, 5))

[out]:

[('h',),
 ('e',),
 ('l',),
 ('l',),
 ('o',),
 ('h', 'e'),
 ('e', 'l'),
 ('l', 'l'),
 ('l', 'o'),
 ('h', 'e', 'l'),
 ('e', 'l', 'l'),
 ('l', 'l', 'o'),
 ('h', 'e', 'l', 'l'),
 ('e', 'l', 'l', 'o'),
 ('h', 'e', 'l', 'l', 'o')]

Word tokens:

from nltk import everygrams

list(everygrams('hello word is a fun program'.split(), 1, 5))

[out]:

[('hello',),
 ('word',),
 ('is',),
 ('a',),
 ('fun',),
 ('program',),
 ('hello', 'word'),
 ('word', 'is'),
 ('is', 'a'),
 ('a', 'fun'),
 ('fun', 'program'),
 ('hello', 'word', 'is'),
 ('word', 'is', 'a'),
 ('is', 'a', 'fun'),
 ('a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a'),
 ('word', 'is', 'a', 'fun'),
 ('is', 'a', 'fun', 'program'),
 ('hello', 'word', 'is', 'a', 'fun'),
 ('word', 'is', 'a', 'fun', 'program')]
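
To make it clearer what is being collected, here is a plain-Python sketch that builds the same set of n-grams, grouped by length as in the output above. This is only an illustration and not NLTK's implementation; the name `everygrams_by_length` is made up. The order NLTK itself returns may differ between versions, so it is probably safer not to rely on it.

```python
def everygrams_by_length(sequence, min_len=1, max_len=None):
    """Collect all n-grams with min_len <= n <= max_len, shortest first.

    Illustrative sketch only; this is not NLTK's implementation.
    """
    items = list(sequence)
    if max_len is None:
        max_len = len(items)
    result = []
    for n in range(min_len, max_len + 1):
        # Slide a window of width n across the items.
        for i in range(len(items) - n + 1):
            result.append(tuple(items[i:i + n]))
    return result


grams = everygrams_by_length('hello', 1, 5)
print(len(grams))   # 15
print(grams[:6])    # [('h',), ('e',), ('l',), ('l',), ('o',), ('h', 'e')]
```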

Answer 4

Answer by python_user

from nltk.util import ngrams

text = "Hi How are you? i am fine and you"
n = int(input("ngram value = "))
n_grams = ngrams(text.split(), n)

for grams in n_grams:
    print(grams)
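
Note that `text.split()` splits on whitespace only, so the tokens differ slightly from the ones in the question, where '?' is a separate token. The sketch below shows the difference in plain Python. The regular expression is only a rough stand-in chosen for illustration; it is an assumption and not how `nltk.word_tokenize` works internally.

```python
import re

text = "Hi How are you? i am fine and you"

# str.split() only splits on whitespace, so '?' stays attached to 'you'.
split_tokens = text.split()

# Rough stand-in for a tokenizer: runs of word characters, or single
# punctuation characters. Assumption for illustration only.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(split_tokens)
# ['Hi', 'How', 'are', 'you?', 'i', 'am', 'fine', 'and', 'you']
print(regex_tokens)
# ['Hi', 'How', 'are', 'you', '?', 'i', 'am', 'fine', 'and', 'you']
```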
