Generating random sentences from custom text in Python's NLTK?
Disclaimer: this page is a bilingual (Chinese-English) translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same license, link to the original, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/1150144/
Asked by James McMahon
I'm having trouble with the NLTK under Python, specifically the .generate() method.
generate(self, length=100)
Print random text, generated using a trigram language model.
Parameters:
* length (int) - The length of text to generate (default=100)
Here is a simplified version of what I am attempting.
import nltk
words = 'The quick brown fox jumps over the lazy dog'
tokens = nltk.word_tokenize(words)
text = nltk.Text(tokens)
print text.generate(3)
This will always generate
Building ngram index...
The quick brown
None
As opposed to building a random phrase out of the words.
Here is my output when I do
print text.generate()
Building ngram index...
The quick brown fox jumps over the lazy dog fox jumps over the lazy
dog dog The quick brown fox jumps over the lazy dog dog brown fox
jumps over the lazy dog over the lazy dog The quick brown fox jumps
over the lazy dog fox jumps over the lazy dog lazy dog The quick brown
fox jumps over the lazy dog the lazy dog The quick brown fox jumps
over the lazy dog jumps over the lazy dog over the lazy dog brown fox
jumps over the lazy dog quick brown fox jumps over the lazy dog The
None
Again starting out with the same text, but then varying it. I've also tried using the first chapter from Orwell's 1984. Again, it always starts with the first 3 tokens (one of which is a space in this case) and then goes on to randomly generate text.
What am I doing wrong here?
Answered by Lakshman Prasad
To generate random text, you need to use Markov chains.
Code to do that (from here):
import random

class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """ Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
        """
        if len(self.words) < 3:
            return
        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        seed = random.randint(0, self.word_size-3)
        seed_word, next_word = self.words[seed], self.words[seed+1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in xrange(size):
            gen_words.append(w1)
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)
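For illustration, a minimal usage sketch of the class above (the filename is a placeholder; any plain-text file with enough words will do):

# Hypothetical usage of the Markov class defined above.
with open('corpus.txt') as f:          # 'corpus.txt' is a placeholder filename
    markov = Markov(f)
    print(markov.generate_markov_text(30))  # emit roughly 30 words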
Explanation: Generating pseudo random text with Markov chains using Python
Answered by drxzcl
You should be "training" the Markov model with multiple sequences, so that you accurately sample the starting state probabilities as well (called "pi" in Markov-speak). If you use a single sequence then you will always start in the same state.
In the case of Orwell's 1984 you would want to use sentence tokenization first (NLTK is very good at it), then word tokenization (yielding a list of lists of tokens, not just a single list of tokens) and then feed each sentence separately to the Markov model. This will allow it to properly model sequence starts, instead of being stuck on a single way to start every sequence.
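A minimal sketch of that preprocessing, assuming the chapter has been saved to a local file (the path and variable names are illustrative):

import nltk

raw = open('1984_ch1.txt').read()                # placeholder path
sentences = nltk.sent_tokenize(raw)              # split raw text into sentences
tokenized = [nltk.word_tokenize(s) for s in sentences]  # one token list per sentence
# Feed each inner list to the Markov model as a separate sequence, so the
# starting-state probabilities ("pi") are estimated from every sentence opening.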
Answered by Mastermind
Your sample corpus is most likely too small. I don't know exactly how NLTK builds its trigram model, but it is common practice to handle sentence beginnings and endings in some special way. Since there is only one sentence beginning in your corpus, this might be why every generated sentence starts the same way.
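One common convention (not necessarily what NLTK does internally) is to pad each sentence with artificial boundary symbols, so the model sees a start-of-sentence context for every sentence instead of just one; a rough illustration:

def pad_sentence(tokens, n=3):
    # Surround a token list with boundary markers for an n-gram model.
    return ['<s>'] * (n - 1) + tokens + ['</s>']

print(pad_sentence(['The', 'quick', 'brown', 'fox']))
# ['<s>', '<s>', 'The', 'quick', 'brown', 'fox', '</s>']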
Answered by Geo
Maybe you can sort the tokens array randomly before generating a sentence.
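A sketch of that suggestion (note that shuffling destroys the word order the trigram model is meant to learn, so it varies the output rather than fixing the underlying issue):

import random
import nltk

tokens = nltk.word_tokenize('The quick brown fox jumps over the lazy dog')
random.shuffle(tokens)    # randomize the token order in place
text = nltk.Text(tokens)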
Answered by Mark Rushakoff
Are you sure that using word_tokenize is the right approach?
This Google groups page has the example:
>>> import nltk
>>> text = nltk.Text(nltk.corpus.brown.words()) # Get text from brown
>>> text.generate()
But I've never used nltk, so I can't say whether that works the way you want.