What to download in order to make nltk.tokenize.word_tokenize work?

Disclaimer: this page is a translation of a popular StackOverflow question and its answers, provided under the CC BY-SA 4.0 license. If you use or share it, you must do so under the same license and attribute it to the original authors (not me). Original source: http://stackoverflow.com/questions/37101114/



python, nltk

Asked by petrbel

I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by a space quota. At home, I downloaded all nltk resources with nltk.download() but, as I found out, it takes ~2.5GB.


This seems a bit overkill to me. Could you suggest what the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize are? So far, I've seen nltk.download('punkt'), but I am not sure whether it is sufficient and what its size is. What exactly should I run in order to make it work?


Answered by Tulio Casagrande

You are right. You need the Punkt Tokenizer Models. They are about 13 MB, and nltk.download('punkt') should do the trick.

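For reference, here is a minimal sketch of what that looks like in practice. The download_dir below is just a placeholder for whatever writable location fits your quota; if the default ~/nltk_data works for you, you can drop both the download_dir argument and the path tweak.

import nltk

# Fetch only the Punkt sentence tokenizer models instead of everything.
# download_dir is optional; the path here is only an example.
nltk.download('punkt', download_dir='/home/you/nltk_data')

# Tell NLTK where to look (unnecessary if you used the default location).
nltk.data.path.append('/home/you/nltk_data')

from nltk.tokenize import word_tokenize
print(word_tokenize('This is a sentence.'))
# ['This', 'is', 'a', 'sentence', '.']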

Answered by alvas

In short:


nltk.download('punkt')

would suffice.




In long:

You don't necessarily need to download all the models and corpora available in NLTK if you're just going to use NLTK for tokenization.


Actually, if you're just using word_tokenize(), then you won't really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize(), which is basically the TreebankWordTokenizer, shouldn't use any additional resources:


alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']

But:


alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

But it looks like that's not the case. If we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93, it seems that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.

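That matches the tracebacks above. Simplified, the version of word_tokenize() linked there boiled down to something like the sketch below (an illustration, not the exact library code): sentence-split first, which is what loads the punkt pickle, then run the Treebank tokenizer on each sentence.

from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

_treebank_word_tokenize = TreebankWordTokenizer().tokenize

def word_tokenize(text, language='english'):
    # sent_tokenize() is the call that needs tokenizers/punkt/english.pickle;
    # the word-level tokenization itself needs no downloaded data.
    return [token
            for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]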

I am not sure whether this is a bug or a feature, but it seems like the old idiom might be outdated given the current code:


>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]

It can simply be:


>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

But we see that word_tokenize() flattens the list of lists of strings into a single list of strings.




Alternatively, you can try the new tokenizer that was added to NLTK, toktok.py (based on https://github.com/jonsafari/tok-tok), which requires no pre-trained models.

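A rough sketch of using it (ToktokTokenizer ships with NLTK itself, so nothing needs to be downloaded):

from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()
# Works straight away with no nltk_data on disk; note that tok-tok's
# punctuation handling differs slightly from the Treebank tokenizer's.
print(toktok.tokenize('This is a sentence.'))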