Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you reuse or share it, you must follow the same license and attribute it to the original authors (not me). Original: http://stackoverflow.com/questions/1150144/

Generating random sentences from custom text in Python's NLTK?

Tags: python, random, nltk

Asked by James McMahon

I'm having trouble with the NLTK under Python, specifically the .generate() method.

generate(self, length=100)

Print random text, generated using a trigram language model.

Parameters:

   * length (int) - The length of text to generate (default=100)

Here is a simplified version of what I am attempting.

import nltk

words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print(text.generate(3))

This will always generate

Building ngram index...
The quick brown
None

As opposed to building a random phrase out of the words.

Here is my output when I do

print(text.generate())

Building ngram index...
The quick brown fox jumps over the lazy dog fox jumps over the lazy
dog dog The quick brown fox jumps over the lazy dog dog brown fox
jumps over the lazy dog over the lazy dog The quick brown fox jumps
over the lazy dog fox jumps over the lazy dog lazy dog The quick brown
fox jumps over the lazy dog the lazy dog The quick brown fox jumps
over the lazy dog jumps over the lazy dog over the lazy dog brown fox
jumps over the lazy dog quick brown fox jumps over the lazy dog The
None

Again starting out with the same text, but then varying it. I've also tried using the first chapter from Orwell's 1984. Again that always starts with the first 3 tokens (one of which is a space in this case) and then goes on to randomly generate text.

What am I doing wrong here?

Answered by Lakshman Prasad

To generate random text, you need to use Markov chains.

Code to do that, from here:

import random

class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """Generates triples from the given data string. So if our string
        were "What a lovely day", we'd generate (What, a, lovely) and
        then (a, lovely, day).
        """
        if len(self.words) < 3:
            return

        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i + 1], self.words[i + 2])

    def database(self):
        # Map each word pair to the list of words observed after it.
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        # Pick a random word pair as the starting state.
        seed = random.randint(0, self.word_size - 3)
        seed_word, next_word = self.words[seed], self.words[seed + 1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in range(size):
            gen_words.append(w1)
            # Note: raises KeyError if the walk reaches the file's final
            # word pair, which has no recorded successor.
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)
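A hypothetical usage sketch (the file name is a placeholder):

with open('corpus.txt') as f:
    markov = Markov(f)
    print(markov.generate_markov_text(size=25))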

Explanation: Generating pseudo-random text with Markov chains using Python

Answered by drxzcl

You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.

In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
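A rough sketch of that approach, assuming NLTK's punkt tokenizer data is installed; the function names and the simple dict-based chain here are mine, not NLTK's:

import random
import nltk

def build_chain(raw_text):
    # Tokenize into sentences first, then words, and record
    # sentence-initial bigrams (the starting states, "pi")
    # separately from the trigram continuation table.
    chain = {}
    starts = []
    for sent in nltk.sent_tokenize(raw_text):
        tokens = nltk.word_tokenize(sent)
        if len(tokens) < 2:
            continue
        starts.append((tokens[0], tokens[1]))
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            chain.setdefault((w1, w2), []).append(w3)
    return chain, starts

def generate(chain, starts, max_words=25):
    # Sample a starting state instead of always beginning at token 0.
    w1, w2 = random.choice(starts)
    out = [w1, w2]
    while len(out) < max_words and (w1, w2) in chain:
        w1, w2 = w2, random.choice(chain[(w1, w2)])
        out.append(w2)
    return ' '.join(out)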

Answered by Mastermind

Your sample corpus is most likely too small. I don't know exactly how nltk builds its trigram model, but it is common practice to handle the beginnings and ends of sentences in some special way. Since there is only one sentence beginning in your corpus, this might be the reason why every sentence has the same beginning.
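One common way of handling this (not necessarily what nltk does internally) is to pad each sentence with boundary markers before counting trigrams, so that sentence starts become ordinary states the model can learn:

def padded_trigrams(sentences):
    # sentences: a list of token lists, one list per sentence
    for tokens in sentences:
        padded = ['<s>', '<s>'] + tokens + ['</s>']
        for i in range(len(padded) - 2):
            yield tuple(padded[i:i + 3])

# With a single-sentence corpus, ('<s>', '<s>') has exactly one
# continuation, so generation always begins with the same words.
print(list(padded_trigrams([['The', 'quick', 'brown', 'fox']])))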

Answered by Geo

Maybe you can shuffle the token array randomly before generating a sentence.
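For instance, reusing the tokens from the question's snippet (a hack rather than a fix, since shuffling also destroys the trigram statistics the model is built from):

import random

random.shuffle(tokens)  # shuffle in place before building the Text
text = nltk.Text(tokens)
print(text.generate())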

Answered by Mark Rushakoff

Are you sure that using word_tokenize is the right approach?

This Google Groups page has the example:

>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate() 

But I've never used nltk, so I can't say whether that works the way you want.
