Python: What to download in order to make nltk.tokenize.word_tokenize work?
Disclaimer: This page is a translation of a popular StackOverflow question, provided under the CC BY-SA 4.0 license. If you use or share it, you must follow the same CC BY-SA license, keep the link to the original, and attribute it to the original authors (not me): StackOverflow.
Original URL: http://stackoverflow.com/questions/37101114/
What to download in order to make nltk.tokenize.word_tokenize work?
Asked by petrbel
I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by space quota. At home, I downloaded all nltk resources with nltk.download() but, as I found out, that takes ~2.5GB.
This seems a bit overkill to me. Could you suggest what the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize are? So far, I've seen nltk.download('punkt'), but I am not sure whether it is sufficient and how large it is. What exactly should I run in order to make it work?
Answered by Tulio Casagrande
You are right. You need the Punkt Tokenizer Models. They take about 13 MB, and nltk.download('punkt') should do the trick.
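For example, a minimal sketch (the sample sentence is made up for illustration): download just the Punkt model and check that word_tokenize works afterwards.

import nltk

# Download only the Punkt sentence tokenizer models (~13 MB),
# instead of the full ~2.5 GB fetched by a plain nltk.download().
nltk.download('punkt')

from nltk.tokenize import word_tokenize
print(word_tokenize('This is a sentence.'))
# expected output: ['This', 'is', 'a', 'sentence', '.']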
Answered by alvas
In short:
nltk.download('punkt') would suffice.
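As a side note for the space-limited cluster account in the question, a hedged sketch (the directory path below is hypothetical): you can point the download at a directory of your choice and add it to NLTK's search path.

import nltk

# Hypothetical directory inside the quota-limited account.
custom_dir = '/home/me/nltk_data_min'

# Fetch only the punkt package into that directory (~13 MB).
nltk.download('punkt', download_dir=custom_dir)

# Make sure NLTK searches the custom directory when loading resources.
nltk.data.path.append(custom_dir)

from nltk.tokenize import word_tokenize
word_tokenize('Only the punkt model is needed for this call.')

Alternatively, pointing the NLTK_DATA environment variable at that directory should have the same effect.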
In long:
You don't necessarily need to download all the models and corpora available in NLTK if you're just going to use NLTK for tokenization.
Actually, if you're just using word_tokenize(), then you won't really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize(), which is basically the TreebankWordTokenizer, shouldn't use any additional resources:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help models stemmers taggers tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']
But:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help models stemmers taggers tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
return find(path_, path + ['']).open()
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
return find(path_, path + ['']).open()
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
But it looks like that's not the case, if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93. It seems that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.
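Roughly, a simplified sketch of what the linked code does (not a verbatim copy of NLTK's source, and word_tokenize_sketch is just an illustrative name): word_tokenize() first runs sent_tokenize(), which loads tokenizers/punkt/english.pickle, and only then applies the Treebank word tokenizer to each sentence.

from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

_treebank_word_tokenize = TreebankWordTokenizer().tokenize

def word_tokenize_sketch(text, language='english'):
    # sent_tokenize() is what pulls in the punkt model; the Treebank
    # tokenizer itself needs no downloaded resources.
    return [token
            for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]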
I am not sure whether this is a bug or a feature, but it seems like the old idiom might be outdated given the current code:
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]
It can simply be:
>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']
But we see that word_tokenize() flattens the list of lists of strings into a single list of strings.
Alternatively, you can try to use a new tokenizer that was added to NLTK, toktok.py, based on https://github.com/jonsafari/tok-tok, which requires no pre-trained models.
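A minimal usage sketch, assuming an NLTK version that already ships toktok.py (the sample sentence is made up); no downloaded data is required:

from nltk.tokenize.toktok import ToktokTokenizer

# ToktokTokenizer is purely rule-based, so it works without any nltk_data.
toktok = ToktokTokenizer()
print(toktok.tokenize('This is a foo bar sentence.'))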