Tokenize a paragraph into sentences and then into words in NLTK
Original question: http://stackoverflow.com/questions/37605710/
Note: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use and share it, but you must follow the same license, cite the original URL and authors, and attribute it to the original authors (not me): StackOverflow
Asked by Nikhil Raghavendra
I am trying to input an entire paragraph into my word processor to be split into sentences first and then into words.
I tried the following code, but it does not work:
#text is the paragraph input
sent_text = sent_tokenize(text)
tokenized_text = word_tokenize(sent_text.split)
tagged = nltk.pos_tag(tokenized_text)
print(tagged)
However, this is not working and gives me errors. So how do I tokenize a paragraph into sentences and then into words?
An example paragraph:
This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.
**WARNING:** This is just random text from the internet; I do not own the above content.
Answer by slider
You probably intended to loop over sent_text:
import nltk
sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)
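For context on why the attempt in the question fails: sent_tokenize returns a list of sentence strings, and a list has no split method, so sent_text.split raises an AttributeError before word_tokenize is even reached. The sketch below reproduces this with a hand-written list standing in for the sent_tokenize output. The list contents and variable names are illustrative assumptions, so it runs without NLTK.

```python
# Stand-in for the result of nltk.sent_tokenize(text): a plain list of strings.
# (Hand-written here for illustration; not produced by NLTK.)
sent_text = [
    "He sank down in despair at the child's feet.",
    "At the same time he offered a small prayer to the child.",
]

# Lists have no .split attribute, so the expression used in the question fails.
try:
    sent_text.split
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# Each element, however, is a str, which is why looping over the list works.
element_types = [type(sentence).__name__ for sentence in sent_text]
print(element_types)
```

Since every element is an ordinary string, passing the elements to a word tokenizer one at a time, as in the loop above, avoids the error.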
Answer by Brian Cugelman
Here's a shorter version. It gives you a data structure containing each sentence and each token within that sentence. I prefer the TweetTokenizer for messy, real-world language. The sentence tokenizer is considered decent, but be careful not to lowercase your text until after this step, as doing so may reduce the accuracy of sentence-boundary detection in messy text.
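from nltk.tokenize import TweetTokenizer, sent_tokenize

# input_text is the paragraph to split
tokenizer_words = TweetTokenizer()
# split into sentences first, then tokenize each sentence into words
tokens_sentences = [tokenizer_words.tokenize(t) for t in sent_tokenize(input_text)]
print(tokens_sentences)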
Here's what the output looks like, which I cleaned up so the structure stands out:
[
['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to', 'the', 'heart', '.'],
['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
]
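Because the result is a list of token lists, ordinary list operations apply to it. As a small sketch, assuming the nested structure shown above (abbreviated here to two sentences, with variable names chosen for illustration), one might count the tokens per sentence or flatten everything into a single token list:

```python
# Two of the tokenized sentences from the output above, copied by hand.
tokens_sentences = [
    ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
    ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes',
     'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
]

# Number of tokens in each sentence (punctuation counts as a token).
tokens_per_sentence = [len(sentence) for sentence in tokens_sentences]

# Flatten the list of lists into one list of tokens.
all_tokens = [token for sentence in tokens_sentences for token in sentence]

print(tokens_per_sentence)
print(len(all_tokens))
```

Keeping the nested form is usually preferable when later steps, such as POS tagging, work sentence by sentence.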
Answer by Sripathi
import nltk
textsample ="This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child."
sentences = nltk.sent_tokenize(textsample)
words = nltk.word_tokenize(textsample)
print(sentences)
print([w for w in words if w.isalpha()])
The last line above ensures that only purely alphabetic tokens appear in the output, not special characters. The sentence output is as follows:
['This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart.', "He sank down in despair at the child's feet.", 'When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner.', 'At the same time with his ears and his eyes he offered a small prayer to the child.']
After removing special characters, the word output is as follows:
['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dog', 'and', 'wounded', 'him', 'to', 'the', 'heart', 'He', 'sank', 'down', 'in', 'despair', 'at', 'the', 'child', 'feet', 'When', 'the', 'blow', 'was', 'repeated', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', 'he', 'turned', 'over', 'upon', 'his', 'back', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', 'At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child']
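Note that isalpha() is stricter than it may look: it rejects any token containing a non-letter character, which is why 'dark-brown' is missing from the word list above. A minimal sketch of the difference, using a hand-written token list rather than real word_tokenize output (the tokens and the has_letter helper are illustrative assumptions):

```python
# Hand-written tokens for illustration (not actual NLTK output).
tokens = ['the', 'little', 'dark-brown', 'dog', ',', 'child', "'s", 'feet', '.']

# Filter used in the answer: keeps only tokens made up entirely of letters.
alpha_only = [w for w in tokens if w.isalpha()]


def has_letter(token):
    """Hypothetical helper: True if the token contains at least one letter."""
    return any(ch.isalpha() for ch in token)


# Looser alternative: drops pure punctuation but keeps hyphenated words.
with_letters = [w for w in tokens if has_letter(w)]

print(alpha_only)
print(with_letters)
```

Which filter is appropriate depends on whether hyphenated words and clitics such as "'s" should count as words in the downstream task.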