Python NLTK: Bigrams, trigrams, fourgrams
原文地址: http://stackoverflow.com/questions/24347029/
Warning: this content is provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must attribute it to the original authors (not me): StackOverflow
Asked by M.A.Hassan
I have this example and I want to know how to get this result. I have a text, which I tokenize, and then I collect the bigrams, trigrams, and fourgrams like this:
from nltk import word_tokenize
from nltk.util import ngrams
text = "Hi How are you? i am fine and you"
token = word_tokenize(text)
bigrams = list(ngrams(token, 2))  # ngrams() returns a generator; list() materializes it
bigrams: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]
trigrams = list(ngrams(token, 3))
trigrams: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]
bigram [(a,b) (b,c) (c,d)]
trigram [(a,b,c) (b,c,d) (c,d,f)]
I want the new trigram list to be [(c,d,f)],
which means
newtrigram = [('are', 'you', '?'),('?', 'i','am'),...etc
Any ideas would be helpful.
Accepted answer by prooffreader
If you apply some set theory (if I'm interpreting your question correctly), you'll see that the trigrams you want are simply elements [2:5], [4:7], [6:9], etc. of the token list.
You could generate them like this:
>>> new_trigrams = []
>>> c = 2
>>> while c < len(token) - 2:
...     new_trigrams.append((token[c], token[c+1], token[c+2]))
...     c += 2
...
>>> print(new_trigrams)
[('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
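The same stride-of-two selection can also be written as a single list comprehension. This is only a sketch: the helper name strided_ngrams and its start/step parameters are made up for illustration, and the token list is hard-coded to what word_tokenize is assumed to return for the question's text, so the snippet runs without NLTK.

```python
def strided_ngrams(tokens, n=3, start=2, step=2):
    """Return the n-grams that begin at positions start, start+step, ...

    Hypothetical helper, not part of NLTK.
    """
    last_start = len(tokens) - n  # last index at which a full n-gram still fits
    return [tuple(tokens[i:i + n]) for i in range(start, last_start + 1, step)]


# Assumed output of word_tokenize("Hi How are you? i am fine and you")
tokens = ['Hi', 'How', 'are', 'you', '?', 'i', 'am', 'fine', 'and', 'you']
new_trigrams = strided_ngrams(tokens)
print(new_trigrams)
# [('are', 'you', '?'), ('?', 'i', 'am'), ('am', 'fine', 'and')]
```

Changing start or step picks a different subset of the trigrams, which may be handy if the intended pattern turns out to be different from the one guessed here.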
Answer by Lewistrick
I do it like this:
def words_to_ngrams(words, n, sep=" "):
    return [sep.join(words[i:i+n]) for i in range(len(words)-n+1)]
This takes a list of words as input and returns a list of n-grams (for the given n), each joined by sep (in this case a space).
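For example, here is a quick usage sketch. The function is repeated so the snippet runs on its own; the sample sentence and the "_" separator are just illustrative choices.

```python
def words_to_ngrams(words, n, sep=" "):
    # Join every window of n consecutive words into one string
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "Hi How are you".split()
bigrams = words_to_ngrams(words, 2)
trigrams = words_to_ngrams(words, 3, sep="_")
print(bigrams)   # ['Hi How', 'How are', 'are you']
print(trigrams)  # ['Hi_How_are', 'How_are_you']
```

Note that the results are strings rather than tuples, and that asking for an n larger than the number of words simply returns an empty list.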
Answer by alvas
Try everygrams:
from nltk import everygrams
list(everygrams('hello', 1, 5))
[out]:
[('h',),
('e',),
('l',),
('l',),
('o',),
('h', 'e'),
('e', 'l'),
('l', 'l'),
('l', 'o'),
('h', 'e', 'l'),
('e', 'l', 'l'),
('l', 'l', 'o'),
('h', 'e', 'l', 'l'),
('e', 'l', 'l', 'o'),
('h', 'e', 'l', 'l', 'o')]
Word tokens:
from nltk import everygrams
list(everygrams('hello word is a fun program'.split(), 1, 5))
[out]:
[('hello',),
('word',),
('is',),
('a',),
('fun',),
('program',),
('hello', 'word'),
('word', 'is'),
('is', 'a'),
('a', 'fun'),
('fun', 'program'),
('hello', 'word', 'is'),
('word', 'is', 'a'),
('is', 'a', 'fun'),
('a', 'fun', 'program'),
('hello', 'word', 'is', 'a'),
('word', 'is', 'a', 'fun'),
('is', 'a', 'fun', 'program'),
('hello', 'word', 'is', 'a', 'fun'),
('word', 'is', 'a', 'fun', 'program')]
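If it helps to see what is going on, the listing above can be reproduced with two nested loops in plain Python. This is a rough sketch, not NLTK's implementation: all_ngrams is a made-up name, and it emits the n-grams grouped by length to match the output shown here, whereas the order everygrams itself yields may depend on the NLTK version.

```python
def all_ngrams(sequence, min_len=1, max_len=None):
    """Collect every n-gram with min_len <= n <= max_len, shortest first.

    Hypothetical stand-in for illustration only.
    """
    items = list(sequence)
    if max_len is None:
        max_len = len(items)
    result = []
    for n in range(min_len, max_len + 1):      # each n-gram length
        for i in range(len(items) - n + 1):    # each start position
            result.append(tuple(items[i:i + n]))
    return result


char_grams = all_ngrams('hello', 1, 5)
word_grams = all_ngrams('hello word is a fun program'.split(), 1, 5)
print(len(char_grams), len(word_grams))  # 15 20
```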
Answer by python_user
from nltk.util import ngrams
text = "Hi How are you? i am fine and you"
n = int(input("ngram value = "))
n_grams = ngrams(text.split(), n)
for grams in n_grams:
    print(grams)
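One thing to keep in mind with text.split() is that it only splits on whitespace, so punctuation stays attached ("you?" is one token), unlike the word_tokenize output in the question. The sketch below shows this with a plain-Python n-gram helper built on zip; simple_ngrams is a hypothetical name used for illustration, not an NLTK function.

```python
def simple_ngrams(tokens, n):
    """Return the n-grams of tokens as tuples by zipping n shifted copies.

    Hypothetical helper for illustration only.
    """
    return list(zip(*(tokens[i:] for i in range(n))))


text = "Hi How are you? i am fine and you"
tokens = text.split()  # whitespace split: 'you?' keeps its question mark
bigrams = simple_ngrams(tokens, 2)
for grams in bigrams:
    print(grams)
```

With this input the fourth bigram is ('you?', 'i') rather than ('you', '?'), so the tokenizer choice changes which n-grams come out.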