Use of PunktSentenceTokenizer in Python NLTK
Disclaimer: This page is a Chinese-English translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. You are free to use or share it, but you must do so under the same CC BY-SA license, link to the original, and attribute it to the original authors (not me): StackOverflow
Original URL: http://stackoverflow.com/questions/35275001/
Use of PunktSentenceTokenizer in NLTK
Asked by arqam
I am learning Natural Language Processing using NLTK. I came across some code using PunktSentenceTokenizer whose actual use I cannot understand. The code is given:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text) #A
tokenized = custom_sent_tokenizer.tokenize(sample_text) #B
def process_content():
    try:
        for i in tokenized[:5]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
So, why do we use PunktSentenceTokenizer? And what is going on in the lines marked A and B? I mean, there is a training text and a sample text, but why are two data sets needed to get the part-of-speech tagging?
It is the lines marked A and B that I am not able to understand.
PS: I did try to look in the NLTK book, but I could not understand what the real use of PunktSentenceTokenizer is.
Accepted answer by alvas
PunktSentenceTokenizer is the abstract class for the default sentence tokenizer, i.e. sent_tokenize(), provided in NLTK. It is an implementation of Unsupervised Multilingual Sentence Boundary Detection (Kiss and Strunk, 2005). See https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L79
Given a paragraph with multiple sentences, e.g.:
>>> from nltk.corpus import state_union
>>> from nltk.tokenize import sent_tokenize
>>> train_text = state_union.raw("2005-GWBush.txt").split('\n')
>>> train_text[11]
u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all. This evening I will set forth policies to advance that ideal at home and around the world. '
You can use sent_tokenize():
>>> sent_tokenize(train_text[11])
[u'Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.', u'This evening I will set forth policies to advance that ideal at home and around the world. ']
>>> for sent in sent_tokenize(train_text[11]):
... print sent
... print '--------'
...
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
--------
This evening I will set forth policies to advance that ideal at home and around the world.
--------
sent_tokenize() uses a pre-trained model from nltk_data/tokenizers/punkt/english.pickle. You can also specify other languages; the list of available languages with pre-trained models in NLTK is:
alvas@ubi:~/nltk_data/tokenizers/punkt$ ls
czech.pickle finnish.pickle norwegian.pickle slovene.pickle
danish.pickle french.pickle polish.pickle spanish.pickle
dutch.pickle german.pickle portuguese.pickle swedish.pickle
english.pickle greek.pickle PY3 turkish.pickle
estonian.pickle italian.pickle README
Given a text in another language, do this:
>>> german_text = u"Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter. über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten. "
>>> for sent in sent_tokenize(german_text, language='german'):
... print sent
... print '---------'
...
Die Orgellandschaft Südniedersachsen umfasst das Gebiet der Landkreise Goslar, Göttingen, Hameln-Pyrmont, Hildesheim, Holzminden, Northeim und Osterode am Harz sowie die Stadt Salzgitter.
---------
über 70 historische Orgeln vom 17. bis 19. Jahrhundert sind in der südniedersächsischen Orgellandschaft vollständig oder in Teilen erhalten.
---------
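Under the hood, sent_tokenize(text, language='german') loads the corresponding pre-trained pickle. As a rough sketch, assuming the Punkt models have already been downloaded (e.g. via nltk.download('punkt')), you can also load one of these pickles directly with nltk.data.load:
>>> import nltk.data
>>> # Load the pre-trained German model directly (the german.pickle file listed above).
>>> german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
>>> german_tokenizer.tokenize(u'Heute ist ein guter Tag. Morgen regnet es vielleicht.')
[u'Heute ist ein guter Tag.', u'Morgen regnet es vielleicht.']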
To train your own punkt model, see https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py and the training data format for nltk punkt.
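A minimal training sketch, assuming the state_union corpus is available (e.g. via nltk.download('state_union')); any sizeable plain text in the target language can serve as training material:

from nltk.corpus import state_union
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Unsupervised training: plain text is enough, no sentence-boundary labels are needed.
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn collocations, not just abbreviations
trainer.train(state_union.raw("2005-GWBush.txt"))

# Build a tokenizer from the learned parameters and apply it to new text.
custom_tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(custom_tokenizer.tokenize(state_union.raw("2006-GWBush.txt"))[:3])

This is essentially what PunktSentenceTokenizer(train_text) in the question does in a single step: it trains on the raw text and returns a ready-to-use tokenizer.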
Answered by CentAu
PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before it can be used [1]. NLTK already includes a pre-trained version of the PunktSentenceTokenizer.
So if you initialize the tokenizer without any arguments, it will default to the pre-trained version:
In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']
You can also provide your own training data to train the tokenizer before using it. The Punkt tokenizer uses an unsupervised algorithm, meaning you just train it with regular text.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
In most cases, it is totally fine to use the pre-trained version, so you can simply initialize the tokenizer without providing any arguments.
So "what all this has to do with POS tagging"? The NLTK POS tagger works with tokenized sentences, so you need to break your text into sentences and word tokens before you can POS tag.
[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"
Answered by Ranjeet Singh
You can refer to the link below to get more insight into the usage of PunktSentenceTokenizer. It vividly explains why PunktSentenceTokenizer is used instead of sent_tokenize() in your case.
Answered by ashirwad
import nltk
from nltk.tokenize import PunktSentenceTokenizer

def process_content(corpus):
    tokenized = PunktSentenceTokenizer().tokenize(corpus)
    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content(train_text)  # train_text as defined in the question
Even without training it on other text data, it works the same, since it comes pre-trained.