Use of PunktSentenceTokenizer in Python NLTK
Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35275001/
Use of PunktSentenceTokenizer in NLTK
Asked by arqam
I am learning Natural Language Processing using NLTK. I came across some code using PunktSentenceTokenizer whose actual use I cannot understand. The code is given:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A
tokenized = custom_sent_tokenizer.tokenize(sample_text) #B
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
So, why do we use PunktSentenceTokenizer? And what is going on in the lines marked A and B? I mean, there is a training text and a sample text, but why are two data sets needed to get the part-of-speech tagging?
It is the lines marked A and B that I am not able to understand.
PS: I did try to look in the NLTK book, but I could not understand what the real use of PunktSentenceTokenizer is.
Accepted answer by alvas
PunktSentenceTokenizer is the abstract class for the default sentence tokenizer, i.e. sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
Given a paragraph with multiple sentences, e.g.:
>>> from nltk.corpus import state_union
>>> from nltk.tokenize import sent_tokenize
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
You can use sent_tokenize():
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
... print sent
... print '--------'
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------
sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the list of available languages with pre-trained models in NLTK is:
alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle finnish.pickle norwegian.pickle slovene.pickle
danish.pickle french.pickle polish.pickle spanish.pickle
dutch.pickle german.pickle portuguese.pickle swedish.pickle
english.pickle greek.pickle PY3 turkish.pickle
estonian.pickle italian.pickle README
Given a text in another language, do this:
>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
... print sent
... print '---------'
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------
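Under the hood, sent_tokenize(text, language='german') loads the corresponding pre-trained pickle. As a rough sketch, assuming the Punkt models have already been downloaded (e.g. via nltk.download('punkt')), you can also load one of these pickles directly with nltk.data.load:
>>> import nltk.data
>>> # Load the pre-trained German model directly (the german.pickle file listed above).
>>> german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
>>> german_tokenizer.tokenize(u'Heute ist ein guter Tag. Morgen regnet es vielleicht.')
[u'Heute ist ein guter Tag.', u'Morgen regnet es vielleicht.']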
To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the training data format for nltk punkt.
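A minimal training sketch, assuming the state_union corpus is available (e.g. via nltk.download('state_union')); any sizeable plain text in the target language can serve as training material:

from nltk.corpus import state_union
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Unsupervised training: plain text is enough, no sentence-boundary labels are needed.
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations, not just abbreviations
trainer.train(state_union.raw("2005-GWBush.txt"))

# Build a tokenizer from the learned parameters and apply it to new text.
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(custom_tokenizer.tokenize(state_union.raw("2006-GWBush.txt"))[:3])

This is essentially what PunktSentenceTokenizer(train_text) in the question does in a single step: it trains on the raw text and returns a ready-to-use tokenizer.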
Answered by CentAu
PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of the PunktSentenceTokenizer.
So if you initialize the tokenizer without any arguments, it will default to the pre-trained version:
In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']
You can also provide your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, meaning you just train it with regular text.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
In most cases, it is totally fine to use the pre-trained version, so you can simply initialize the tokenizer without providing any arguments.
So "what all this has to do with POS tagging"? The NLTK POS tagger works with tokenized sentences, so you need to break your text into sentences and word tokens before you can POS tag.
[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"
Answered by Ranjeet Singh
You can refer to the link below to get more insight into the usage of PunktSentenceTokenizer. It vividly explains why PunktSentenceTokenizer is used instead of sent_tokenize() in your case.
Answered by ashirwad
import nltk
from nltk.tokenize import PunktSentenceTokenizer

def process_content(corpus):
    tokenized = PunktSentenceTokenizer().tokenize(corpus)
    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content(train_text)  # train_text as defined in the question
Even without training it on other text data, it works the same, since it comes pre-trained.