What to download in order to make nltk.tokenize.word_tokenize work?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/37101114/



python, nltk

Asked by petrbel

I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by a space quota. At home, I downloaded all nltk resources with nltk.download() but, as I found out, it takes ~2.5GB.


This seems a bit overkill to me. Could you suggest what the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize are? So far, I've seen nltk.download('punkt'), but I am not sure whether it is sufficient and what its size is. What exactly should I run in order to make it work?


Answered by Tulio Casagrande

You are right. You need the Punkt Tokenizer Models. They are about 13 MB, and nltk.download('punkt') should do the trick.

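For reference, here is a minimal sketch of what that looks like in practice. The download_dir below is just a placeholder for whatever writable location fits your quota; if the default ~/nltk_data works for you, you can drop both the download_dir argument and the path tweak.

import nltk

# Fetch only the Punkt sentence tokenizer models instead of everything.
# download_dir is optional; the path here is only an example.
nltk.download('punkt', download_dir='/home/you/nltk_data')

# Tell NLTK where to look (unnecessary if you used the default location).
nltk.data.path.append('/home/you/nltk_data')

from nltk.tokenize import word_tokenize
print(word_tokenize('This is a sentence.'))
# ['This', 'is', 'a', 'sentence', '.']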

Answered by alvas

In short:


nltk.download('punkt')

would suffice.




In long:

You don't necessarily need to download all the models and corpora available in NLTK if you're just going to use NLTK for tokenization.


Actually, if you're just using word_tokenize(), then you won't really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize(), which is basically the TreebankWordTokenizer, shouldn't use any additional resources:


alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']

But:


alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

But it looks like that's not the case. If we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93, it seems that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.

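That matches the tracebacks above. Simplified, the version of word_tokenize() linked there boiled down to something like the sketch below (an illustration, not the exact library code): sentence-split first, which is what loads the punkt pickle, then run the Treebank tokenizer on each sentence.

from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

_treebank_word_tokenize = TreebankWordTokenizer().tokenize

def word_tokenize(text, language='english'):
    # sent_tokenize() is the call that needs tokenizers/punkt/english.pickle;
    # the word-level tokenization itself needs no downloaded data.
    return [token
            for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]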

I am not sure whether this is a bug or a feature, but it seems like the old idiom might be outdated given the current code:


>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]

It can simply be:


>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

But we see that word_tokenize() flattens the list of lists of strings into a single list of strings.




Alternatively, you can try the new tokenizer that was added to NLTK, toktok.py (based on https://github.com/jonsafari/tok-tok), which requires no pre-trained models.

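A rough sketch of using it (ToktokTokenizer ships with NLTK itself, so nothing needs to be downloaded):

from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()
# Works straight away with no nltk_data on disk; note that tok-tok's
# punctuation handling differs slightly from the Treebank tokenizer's.
print(toktok.tokenize('This is a sentence.'))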