Python: generating bigrams with NLTK

Notice: this page reproduces a popular StackOverflow question and its answers under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same CC BY-SA license, cite the original URL, and attribute it to the original authors (not me). Original StackOverflow question: http://stackoverflow.com/questions/37651057/

Date: 2020-08-19 19:43:47  Source: igfitidea

Generate bigrams with NLTK

python, nltk, n-gram

Asked by Nikhil Raghavendra

I am trying to produce a list of bigrams for a given sentence. For example, if I type

    To be or not to be

I want the program to generate

     to be, be or, or not, not to, to be

I tried the following code, but it just gives me

    <generator object bigrams at 0x0000000009231360>

This is my code:

    import nltk
    bigrm = nltk.bigrams(text)
    print(bigrm)

So how do I get what I want? I want a list of combinations of the words like above (to be, be or, or not, not to, to be).

Answered by Ilja Everilä

nltk.bigrams() returns an iterator (specifically, a generator) of bigrams. If you want a list, pass the iterator to list(). It also expects a sequence of items to generate bigrams from, so you have to split the text before passing it in (if you have not done so already):

    bigrm = list(nltk.bigrams(text.split()))
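
As a rough illustration of why the split matters, the sketch below feeds both the raw string and the split word list to bigrams(). It uses the real NLTK function when NLTK is installed; the fallback defined in the except branch is a hypothetical stand-in written only for this example, assumed to behave the same way for plain sequences.

```python
try:
    from nltk import bigrams  # the real NLTK function, if NLTK is installed
except ImportError:
    def bigrams(sequence):
        """Hypothetical stand-in for nltk.bigrams: lazily yield adjacent pairs."""
        items = list(sequence)
        for left, right in zip(items, items[1:]):
            yield (left, right)

text = "to be or not to be"

# A string is a sequence of characters, so this pairs up characters.
char_pairs = list(bigrams(text))

# Splitting first gives a sequence of words, so this pairs up words.
word_pairs = list(bigrams(text.split()))

print(char_pairs[:3])  # first three character pairs
print(word_pairs)      # word-level bigrams as a list of tuples
```

Passing the unsplit string does not raise an error; it just produces character pairs, which is easy to overlook.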

To print them separated by commas, in Python 3 you could use:

    print(*map(' '.join, bigrm), sep=', ')

On Python 2, you could use, for example:

    print ', '.join(' '.join((a, b)) for a, b in bigrm)

Note that if you only want to print the bigrams, you do not need to build a list; you can use the iterator directly.
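
One caveat when skipping the list is that a generator can be consumed only once. The sketch below shows this with a small pure-Python generator; adjacent_pairs is a hypothetical helper written for this example, not part of NLTK, and is only assumed to mimic the lazy behaviour described above.

```python
def adjacent_pairs(tokens):
    """Hypothetical helper: lazily yield adjacent token pairs."""
    for left, right in zip(tokens, tokens[1:]):
        yield left, right

tokens = "to be or not to be".split()
pairs = adjacent_pairs(tokens)

# The first pass consumes the generator while building the output string.
first_pass = ', '.join(' '.join(pair) for pair in pairs)

# The second pass finds the generator already exhausted.
second_pass = ', '.join(' '.join(pair) for pair in pairs)

print(first_pass)         # to be, be or, or not, not to, to be
print(repr(second_pass))  # ''
```

If the bigrams are needed more than once, wrapping the generator in list() is the simpler choice.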

Answered by Ashok Kumar Jayaraman

The following code produces a bigram list for a given sentence:

    >>> import nltk
    >>> from nltk.tokenize import word_tokenize
    >>> text = "to be or not to be"
    >>> tokens = word_tokenize(text)
    >>> bigrm = nltk.bigrams(tokens)
    >>> print(*map(' '.join, bigrm), sep=', ')
    to be, be or, or not, not to, to be